Thursday, May 28, 2015

snps calling

/Bioinformatics/Users/zfu/packages/freebayes/bin/freebayes -F 0.01 -p 1 --fasta-reference /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/clone_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/05.Clone_GFP/L80_clone_GFP_sorted.bam > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_sorted.vcf

/Bioinformatics/Users/zfu/packages/vcflib/bin/vcffilter -f "QUAL > 20" /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_sorted.vcf > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_filtered.vcf

cd /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L51/src/

python generateConsensusData_Ori.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_sorted.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus_noGap.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_snpInfo.tsv

python generateConsensusData_Freq.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_filtered.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus_noGap.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_snpInfo.filtered.tsv /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_snpInfo_filtered_Freq.vcf

Wednesday, May 20, 2015

freebayes

/Bioinformatics/Users/zfu/packages/freebayes/bin/freebayes -F 0.01 -p 1 --fasta-reference /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/clone_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/05.Clone_GFP/L74_clone_GFP_sorted.bam > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_sorted.vcf

/Bioinformatics/Users/zfu/packages/vcflib/bin/vcffilter -f "QUAL > 20" /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_sorted.vcf > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_filtered.vcf

cd /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L51/src/

python generateConsensusData_Ori.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_sorted.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus_noGap.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_snpInfo.tsv

python generateConsensusData_Freq.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_filtered.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus_noGap.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_snpInfo.filtered.tsv /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_snpInfo_filtered_Freq.vcf

Wednesday, May 13, 2015

RseqC

/Bioinformatics/Users/zfu/packages/RSeQC/share/apps/python/python-2.7.6/bin/RPKM_saturation.py -r Clone.bed -i ../L74/04.Clone/L74_clone_sorted.bam -o L74 &
/Bioinformatics/Users/zfu/packages/RSeQC/share/apps/python/python-2.7.6/bin/RPKM_saturation.py -r Clone.bed -i ../L80/04.Clone/L80_clone_sorted.bam -o L80 &

Tuesday, May 12, 2015

vcf to fasta

4.1 years ago by

Francisco Roque •110

Bergen, Norway

This will not apply directly to the vcf file, but using extra annotation from bcftools (assuming you are using it for snp and indel calling):

samtools mpileup -uf reference.fasta alignment.bam | \
bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fastq

After that just convert the fastq into a consensus fasta using seqtk, converting all bases with quality lower than 20 to lowercase

seqtk fq2fa consensus.fastq 20 > consensus.fasta

Freebayes

Calling variants with freebayes

Erik Garrison

erik.garrison@bc.edu / @erikgarrison

Summary

This tutorial is a brief walkthrough demonstrating how to use FreeBayes to detect short sequence variants in read alignments generated from a resequencing process.

Getting started

You've sequenced a sample. You've aligned your reads to a reference genome. You've sorted your alignments. You've marked PCR duplicates. You've indexed your alignments.

Now, you want to determine the sequence variants present in your sample. FreeBayes lets you do this very easily!

You first need to download and install freebayes:

git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes
make

Alternatively, you can use an existing tarball:

wget http://clavius.bc.edu/~erik/freebayes/freebayes-5d5b8ac0.tar.gz
tar xzvf freebayes-5d5b8ac0.tar.gz
cd freebayes
make

(These tarballs aren't built regularly, so your best bet is to set up and use git to download the code. It's really useful!)

Now that freebayes is built, we can run it:

bin/freebayes        # default summary
bin/freebayes -h     # verbose help text, summary of options

If all is well we should see a description of the method. You can install freebayes by moving it into your path, or adding freebayes/bin to your path.

Working with test data

To call variants we need an alignment file (in BAM format). We also need a reference genomein (uncompressed) FASTA format.

An easy way to get started is to run tests on publicly-available data sets. For this we can use the ever-popular CEU hapmap sample NA12878. This sample is a good target because it provides us a putative truth set, currently in development by the NIST Genome in a Bottle Project.

The 1000 Genomes Project has incorporated a down-sampled alignment set into the project's low-coverage (~5-6x) analyses. As such, we have readily-available alignments processed using the same pipeline used for the alignment of all the 1000 Genomes samples.

We first need to download the reference used for the alignment. Our analysis will be limited to chr20, so we'll just get a copy of that chromosome.

wget http://bioinformatics.bc.edu/marthlab/download/gkno-cshl-2013/chr20.fa

Now, we should grab a piece of the low-coverage NA12878 alignments using samtools:

wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam.bai

These alignments are just for human chromosome 20, which amounts to about 1/50th of the genome. This will be plenty for testing.

To have a truth set to compare to, we should download the NIST Genome in a Bottle variants. Note that we are obtaining both the VCF (variants) and a target list in which the project determined it is possible to confidently make genotype calls:

wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/NISTIntegratedCalls_13datasets_130719_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.17_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed.gz
gunzip *.vcf.gz
gunzip *.bed.gz

And, we can also grab a curated set of these variants shared by the Broad Genome Sequencing and Analysis Group:

wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi

Now we have a reference, alignment data, and a set of truth sets available for NA12878.

Variant calling

To call variants, we can just use freebayes in its default configuration:

freebayes -f chr20.fa \
    NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam >NA12878.chr20.freebayes.vcf

(This should take 5-10 minutes on a typical system.)

You could target a particular region as well (for instance, to speed up testing) by providing freebayes the --region parameter:

freebayes -f chr20.fa --region 20:1000000-2000000 \
    NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam >NA12878.chr20.1mb_2mb.freebayes.vcf

Understanding our results

Take a look at the output.

less -S NA12878.chr20.freebayes.vcf

There is a header:

##fileformat=VCFv4.1
##fileDate=20131121
##source=freeBayes v0.9.9.2-23-g8a98f11-dirty
##reference=chr20.fa
##phasing=none
##commandline="freebayes -f chr20.fa NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
....

And then there are variants:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      61795   .       G       T       127.882 .       AB=0.454545;ABP=3.20771;AC=1;AF=0.5
20      63244   .       A       C       90.7756 .       AB=0;ABP=0;AC=2;AF=1
....

We can quickly examine our aggregate results using tools from vcflib, such as vcfstats:

vcfstats NA12878.chr20.freebayes.vcf

This will provide us a lot of information about small variants, such as the counts of SNPs, indels, and other variant types, the transition/transversion ratio, and a summary of the indel length-frequency spectrum. An interesting thing to note is that the ts/tv ratio is different for high and low-quality variants:

vcffilter -f "QUAL > 20" NA12878.chr20.freebayes.vcf | vcfstats | grep "ts\|bial"
# compare to
vcffilter -f "QUAL < 20" NA12878.chr20.freebayes.vcf | vcfstats | grep "ts\|bial"

Comparing our results to the Broad NA12878 Knowledge-Base

First, we should either use tabix to extract just chr20 from the NA12878-K.B.:

tabix -h NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz 20

Or, you can download a pre-made set:

wget http://clavius.bc.edu/~erik/NA12878/NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://clavius.bc.edu/~erik/NA12878/NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi

And, for this comparison we want to both exclude regions which weren't considered confidently-callable by the GiaB set, and keep only variants which are considered TRUE_POSITIVE in the Broad/GiaB NA12878 truth set:

zcat NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz \
    | vcfintersect -b union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed \
    | grep "^#\|TRUE_POSITIVE" >NA12878.chr20.broad_truth_set.TP.callable_regions.vcf

We'll do the same for our calls, restricting them to confidently-callable regions:

cat NA12878.chr20.freebayes.vcf \
    | vcfintersect -b \
        union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed \
        >NA12878.chr20.freebayes.callable_regions.vcf

Using this result, we can generate a kind of receiver-operator characteristic using a tool in vcflib, vcfroc:

vcfroc -r chr20.fa \
    -t NA12878.chr20.broad_truth_set.callable_regions.vcf \
    NA12878.chr20.freebayes.vcf >broad_truth_set.roc.tsv

We can generate a plot of this using some tools in vcflib:

( vcfroc -r chr20.fa -t \
    NA12878.chr20.broad_truth_set.TP.callable_regions.vcf \
    NA12878.chr20.broad_truth_set.TP.callable_regions.vcf \
    | prependcol set answers; \
    cat broad_truth_set.roc.tsv \
    | prependcol set freebayes | grep -v set ) >roc.tsv
plot_roc.r freebayesNA12878 answers roc.tsv 0 1 0 1

Here, prependcol is a script that prepends a column named "set" with each row set to "answers" or "freebayes."

The resulting image freebayesNA12878.both.png, shows that we can achieve < 5% FDR in this sample (assuming perfect sensitivity in the truth set--- which is unlikely), and that our overall sensitivity is rather low. This matches our expectations given the low coverage of the sample.

SNP

Question: What Methods Do You Use For In/Del/Snp Calling?

5.1 years ago by

Pierre Lindenbaum ♦70k

France

After you've mapped your short reads on a reference sequence, what is your favorite method/workflow to detect some (new) SNP/insertions/deletions. Did you compare various strategies ?

On my side, I played with MAQ/SAM/BWA but I wonder if there is 'widely' adopted (robust) workflow to find some new SNPs.

Thanks Pierre

ADD COMMENT • link •

modified 18 months ago by Biostar ♦♦ 0 • written 5.1 years ago by Pierre Lindenbaum ♦ 70k

I am a researcher in the Marth Lab at Boston College, which has developed Mosaik and GigaBayes. Over the past six months, I have completely rewritten GigaBayes to enable its efficient use on the >1k sample datasets which are presently being generated by the 1000 Genomes Project. I suggest you look into using it if you are interested in small indel calling.

The new program is called FreeBayes. In addition to a host of improvements in interface, reliability, and algorithmic flexibility, FreeBayes should provide several orders of magnitude improvement in runtime performance over GigaBayes and BamBayes. It uses the same basic population-based Bayesian framework as its predecessors to segregate true variant events from sequencing and alignment artifacts. We have provided it under a liberal open source (MIT) license. We haven't yet submitted a publication describing the work, but we would love to provide the system to users for testing.

In its simplest operation, FreeBayes uses BAM alignment file(s) and the corresponding FASTA reference sequence to generate a VCF report describing variant individuals and sites:

% freebayes -f h.sapiens.fasta -v variants.vcf NA20504.bam

(This would analyze all positions in h.sapiens.fasta for which NA20504.bam has coverage.)

FreeBayes now detects insertion, deletion, MNP (multi-base mismatches), and "complex" allelic variants by default.

The core insight of the algorithm is that a neutral model of the likely distribution of alleles in a population can be used to improve our detection efficiency of true events within a population of individuals. (FreeBayes models this distribution using the Ewens Sampling Formula.) Thus, whenever possible, multiple samples from the same or closely related species should be used. Where such information is not available, the reference may be counted as an additional sample using the --use-reference-allele flag.

To my knowledge, FreeBayes is significantly different than other variant detection systems in common use in that it is not limited to the analysis of haploid or diploid individuals. The assumed ploidy or copy number of the samples is not fixed, and can be set to any number (via the --ploidy flag). This enables the extension of the algorithm to variant calling in species with more than 2 copies of each locus. Additionally, the results of pooled sequencing experiments may be analyzed by setting ploidy equal to the number of alleles per site in the pooled population. I plan to enable sequence and region-specific configuration of copy number in the very near future, as this is directly applicable to variant calling in the 1000 Genomes.

Another useful feature is that FreeBayes can read BAM on its standard input. This allows the application of custom input filters or base-quality adjustment methods to alignments without requiring that the alignment files be rewritten. For instance, this command will applysamtools BAQ adjustment to aln.bam and then call SNPs, writing VCF output to standard output:

% samtools fillmd -br aln.bam | freebayes -f reference.fasta

As a core component in the NCBI variant calling pipeline, FreeBayes is currently under very active development. Please contact me with any questions, feature requests, or bug reports. My email is listed on the FreeBayes github page.

I'd also love to collaborate with anyone working on interesting variant detection problems!

4.5 years ago by

lh3 ♦ 24k

United States

For cancer, SNVMix2 is probably better as it considers the specific issues in cancer resequencing. For other resequencing, both GATK and samtools (especially with the recent improvement) are good. For indels, Dindel. GATK is reimplementing Dindel, too. In addition, whatever SNP caller you use, remember to apply BAQ (the last example). GATK/FreeBayes/SAMtools calls will all be significantly improved. (I think Erik would also agree.)

ADD COMMENT • linkwritten 4.5 years ago by lh3 ♦ 24k

I didn't mention this, but FreeBayes can read BAM input on stdin. This allows the application of BAQ without having to rewrite the input files prior to SNP calling. This benefits from a fast BAQ algorithm :). (I'll edit my response to note this feature. At some point I need to update the README to reflect all these changes.)

ADD REPLY • linkwritten 4.5 years ago by Erik Garrison • 1.5k

Also, I agree that the base quality adjustment goes a long way to improving the quality variant detection.

ADD REPLY • linkwritten 4.5 years ago by Erik Garrison • 1.5k

So you mean that now we do not need to do BAQ before freebayes snp calling?

ADD REPLY • linkwritten 8 months ago by Chen Sun • 240

5.1 years ago by

Brad Chapman ♦ 8.6k

Boston, MA

Mosaik for alignment and GigaBayes(PbShort) has worked well for me in the past:

http://bioinformatics.bc.edu/marthlab/Mosaik

http://bioinformatics.bc.edu/marthlab/PbShort

For my next SNP project will be using Broad's Genome Analysis Toolkit (GATK):

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

ADD COMMENT • linkwritten 5.1 years ago by Brad Chapman ♦ 8.6k

GATK seems promising. The unified genotyper considers indels in the likelihood model? Finding variation is pretty straighforward, but calling it a SNP still is quite tricky.

ADD REPLY • linkwritten 5.1 years ago by Jarretinha ♦ 3.0k

As noted elsewhere on BioStar, FreeBayes is the successor to GigaBayes. GigaBayes is no longer actively maintained.

ADD REPLY • linkwritten 3.7 years ago by Erik Garrison • 1.5k

5.1 years ago by

Louis Letourneau •790

Montreal

We use samtools pileup+varfilter with SNVmix to filter 'valid' SNPs (it all depends on coverage and experiment.

ADD COMMENT • linkwritten 5.1 years ago by Louis Letourneau • 790

4.5 years ago by

Rm • 6.9k

Danville, PA

I use bowtie/BWA --> Samtools pileup > Varfilter

ADD COMMENT • linkwritten 4.5 years ago by Rm • 6.9k

3.3 years ago by

Malachi Griffith ♦ 10k

Washington University School of Medicine, St. Louis, USA

We use SomaticSniper and VarScan. In addition to SNVMix2, FreeBayes, and GATK mentioned in other answers I would also add SOAPindel, SOAPsnv, SOAPsnp, Atlas2, SNVer, TREAT, and SeqEM.

Some are geared towards SNPs others toward SNVs or Indels.

ADD COMMENT • linkwritten 3.3 years ago by Malachi Griffith ♦ 10k

Hello, I am novice on SomaticSniper, bam-somaticsniper -q 1 -Q 40 -f ucsc.hg19.fasta ERR031023.bam ERR031024.bam ERR031024.snp.vcf, this is the command line I used to call different snps between one pair of cancer and normal samples. Unfortunately, I got a large number of machine artifact. May you send me your command line of Somaticsniper? Any suggestion is appreciated.

ADD REPLY • linkwritten 2.3 years ago by jiagehao • 10

4.1 years ago by

Lhl • 660

United States

Have you compared the difference between variants called by different variants callers?

ADD COMMENT • linkwritten 4.1 years ago by Lhl • 660

I guess you should better open a new thread for this question, to optimize your chances of detailed answers.

ADD REPLY • linkwritten 4.1 years ago by toni • 2.0k

3.7 years ago by

Larry_Parnell ♦ 15k

Boston, MA USA

Don Conrad describes here tools and approaches he uses to identify CNVs from SNP genotype data. The program looks for stretches of homozygous genotypes interspersed with Mendelian errors, which might indicate the transmission of a large deletion. Details are at the above link.

ADD COMMENT • linkwritten 3.7 years ago by Larry_Parnell ♦ 15k

Please log in to add an answer.

Lord's Lamb

Thursday, May 28, 2015

snps calling

Wednesday, May 20, 2015

freebayes

Wednesday, May 13, 2015

RseqC

Tuesday, May 12, 2015

vcf to fasta

Freebayes

Calling variants with freebayes

Erik Garrison

erik.garrison@bc.edu / @erikgarrison

Summary

Getting started

Working with test data

Variant calling

Understanding our results

Comparing our results to the Broad NA12878 Knowledge-Base

Similar posts • Search »

Blog Archive

About Me