Tuesday, November 3, 2015

substitute multiple files

grep -rl 'Project_GrSe42_PG_OW_Eva_22OCT15' ./ | xargs sed -i 's/Project_GrSe42_PG_OW_Eva_22OCT15/Project_Eva_smallRNA_22OCT15/g'
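A slightly safer variant of the same recursive substitution, shown against throwaway files (directory and OLD/NEW names invented for the demo): dry-run with grep -rl first, and use -Z/-0 so filenames with spaces survive the pipe.

```shell
# Throwaway demo of the recursive in-place substitution above.
tmp=$(mktemp -d)
printf 'Project_OLD data\n' > "$tmp/a.txt"
printf 'no match here\n'    > "$tmp/b.txt"

grep -rl 'Project_OLD' "$tmp"        # dry run: list files that would change

# NUL-delimited pipe (-Z / -0) keeps odd filenames intact; sed -i edits in place.
grep -rlZ 'Project_OLD' "$tmp" | xargs -0 sed -i 's/Project_OLD/Project_NEW/g'

grep -rl 'Project_NEW' "$tmp"        # only a.txt matches now
rm -rf "$tmp"
```

Note that sed -i with no backup suffix is GNU sed; BSD/macOS sed wants `sed -i ''`.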

Friday, October 23, 2015

merging fastq

cd Sample_AS-DKO-CD4T_a-CD3CD28
cat AS-DKO-LCD3_AGGTTCAT_L002_R1_001.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R1_002.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R1_003.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R1_004.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R1_005.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R1_006.fastq.gz > tempR1.gz
cat AS-DKO-LCD3_AGGTTCAT_L002_R2_001.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R2_002.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R2_003.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R2_004.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R2_005.fastq.gz AS-DKO-LCD3_AGGTTCAT_L002_R2_006.fastq.gz > tempR2.gz
rm *.fastq.gz
mv tempR1.gz AS-DKO-LCD3_AGGTTCAT_L002_R1_001.fastq.gz
mv tempR2.gz AS-DKO-LCD3_AGGTTCAT_L002_R2_001.fastq.gz
cd ..

cd Sample_AS-DKO-CD4T_P+I
cat AS-DKO-CD4T_ATATAAGA_L002_R1_001.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R1_002.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R1_003.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R1_004.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R1_005.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R1_006.fastq.gz > tempR1.gz
cat AS-DKO-CD4T_ATATAAGA_L002_R2_001.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R2_002.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R2_003.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R2_004.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R2_005.fastq.gz AS-DKO-CD4T_ATATAAGA_L002_R2_006.fastq.gz > tempR2.gz
rm *.fastq.gz
mv tempR1.gz AS-DKO-CD4T_ATATAAGA_L002_R1_001.fastq.gz
mv tempR2.gz AS-DKO-CD4T_ATATAAGA_L002_R2_001.fastq.gz
cd ..

cd Sample_AS-WT-CD4T_a-CD3CD28
cat AS-WT-LCD3_AGCGCTGG_L002_R1_001.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R1_002.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R1_003.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R1_004.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R1_005.fastq.gz > tempR1.gz
cat AS-WT-LCD3_AGCGCTGG_L002_R2_001.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R2_002.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R2_003.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R2_004.fastq.gz AS-WT-LCD3_AGCGCTGG_L002_R2_005.fastq.gz > tempR2.gz
rm *.fastq.gz
mv tempR1.gz AS-WT-LCD3_AGCGCTGG_L002_R1_001.fastq.gz
mv tempR2.gz AS-WT-LCD3_AGCGCTGG_L002_R2_001.fastq.gz
cd ..

cd Sample_AS-WT-CD4T_P+I
cat AS-WT-CD4T_AGTTATAG_L002_R1_001.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R1_002.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R1_003.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R1_004.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R1_005.fastq.gz > tempR1.gz
cat AS-WT-CD4T_AGTTATAG_L002_R2_001.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R2_002.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R2_003.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R2_004.fastq.gz AS-WT-CD4T_AGTTATAG_L002_R2_005.fastq.gz > tempR2.gz
rm *.fastq.gz
mv tempR1.gz AS-WT-CD4T_AGTTATAG_L002_R1_001.fastq.gz
mv tempR2.gz AS-WT-CD4T_AGTTATAG_L002_R2_001.fastq.gz
cd ..
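The four blocks above are the same operation with different prefixes, so a loop version is worth keeping. This is a sketch under the assumption that each sample directory holds chunks named `<prefix>_R[12]_0NN.fastq.gz`; the Sample_demo data below is invented to make it self-checking. Plain cat works on .fastq.gz chunks because concatenated gzip members form a valid gzip stream.

```shell
# Demo in a throwaway directory; the sample name and reads are invented.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir Sample_demo
printf '@r1\nACGT\n+\nIIII\n' | gzip > Sample_demo/demo_L001_R1_001.fastq.gz
printf '@r2\nTTTT\n+\nIIII\n' | gzip > Sample_demo/demo_L001_R1_002.fastq.gz

for dir in Sample_*/; do
  (
    cd "$dir" || exit
    for read in R1 R2; do
      set -- *_"${read}"_0*.fastq.gz
      [ -e "$1" ] || continue          # no chunks for this read: skip
      cat "$@" > "temp${read}.gz"      # globs sort lexically: 001, 002, ...
      rm -- "$@"
      mv "temp${read}.gz" "$1"         # keep the _001 name, as in the notes
    done
  )
done

zcat Sample_demo/demo_L001_R1_001.fastq.gz | head -n 2   # @r1 then ACGT
cd / && rm -rf "$tmp"
```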


Tuesday, October 20, 2015

QC External

/share/apps/perl/perl-5.18.1/bin/perl QCReads_External.pl -u zfu -e zfu@liai.org -i External -dir /Bioinformatics/NGS_raw_data/external/20150729_Bethany_SuperEnhancer_Merged -t Bethany_SuperEnhancer_Merged -p 0


/share/apps/perl/perl-5.18.1/bin/perl /home/zfu/NGS_Pipeline/src/QCReads_External.pl -u zfu -e zfu@liai.org -i External -dir /Bioinformatics/NGS_raw_data/LIAI/Hiseq2500/151019_External_C4NM1ANXX_10_19_15_CD4_DKO_YunCai -t CD4_DKO_YunCai -p 0 -r 151019_External_C4NM1ANXX_10_19_15_CD4_DKO_YunCai

Tuesday, October 6, 2015

https://www.biostars.org/p/65147/

Question: Binning Over Genes And Calculating The Coverage [Bedops/Bedmap]
Asked by Sukhdeep Singh:
I have a list of genes and a coverage track (generated with a BEDOPS-based binning script called binReads.sh, which I modified to take a BED file as input).
What I want to do is divide each gene into 100 bins and then calculate the mean overlap from my coverage file, so that the respective bins can later be summed across all genes to give an average protein binding/coverage per bin.
I couldn't work out how to make a fixed number of bins per gene, though I can set the binSize with binReads.sh. I wrote custom R code to take my list of genes and divide each of them into 100 bins.
Then I used this list as a reference (adding a fake 4th column, which is also not very intuitive) to find the mean overlap for my coverage file.
Problems:
1) This method is not very intuitive (because of the R code, it takes ~4-6 minutes to process a list of 30.2K genes, producing 30.2K*100 lines as a reference BED file). Is there a way in the BEDOPS suite to generate the bins on the fly?
2) The output of bedmap --mean genes_bins.bed file.cov > means.tsv is a valid number for the first 42K reference lines out of 3.02M (30.2K*100); for the rest it produces "NAN". To cross-validate, I took some specific binned genes and got valid output. Is this because of the size of the files? (It shouldn't be.)
Can you suggest another way to approach this problem?
genes_bins.bed
chr1    3214481    3219051    Xkr4-bin1
chr1    3219051    3223621    Xkr4-bin2
chr1    3223621    3228192    Xkr4-bin3
chr1    3228192    3232762    Xkr4-bin4
chr1    3232762    3237332    Xkr4-bin5
chr1    3237332    3241902    Xkr4-bin6

coverage.bed
chr1    3001500    3001520    1    1
chr1    3001520    3001540    1    1
chr1    3001540    3001560    1    1
chr1    3001560    3001580    1    1
chr1    3001580    3001600    1    1
chr1    3001600    3001620    1    1
chr1    3001620    3001640    1    1
P.S. This started as a comment under "Converting BAM to bedGraph for viewing on UCSC" and I thought it better to ask as a separate question. This can probably be answered by Alex Reynolds himself. The final aim is to make composite profiles of a protein of interest from ChIP-seq datasets.
Thanks
Comment (Giovanni M Dall'Olio): Sorry, I have a couple of questions. In the genes_bins.bed example file, the gene Xkr4 is divided into 6 bins. Is this correct? Second question: if you want to split each gene into 100 bins, how do you handle the fact that different genes may have different lengths?
Reply (Sukhdeep Singh): Giovanni, that's just the head of the file; I will mention that. Secondly, that's the whole point of binning: the genes have unequal lengths. If you want to plot read density/distribution/composite profiles [ChIP-seq], the genes should be binned, the average binding per bin calculated, and then summed per bin across all genes, to get an average binding over the whole length of the gene. The effect is that some genes contribute more density to a bin and some less. Can you suggest some other strategy?
Answer by Alex Reynolds:
The bedops and bedmap utilities in the BEDOPS suite do not generate bins, but you could use awk to do this, which might be faster than using R.
To demonstrate how to solve your problem, let's start with a couple example genes foo and bar:
$ more genes.bed
chr1    1000    4000    foo
chr1    6000    9000    bar
Here's how we could partition them into, say, five bins for each gene:
$ awk ' \
BEGIN { \
    binNum=5; \
} \
{ \
    chr=$1; \
    start=$2; \
    stop=$3; \
    binIdx=1; \
    binSize=(stop-start)/binNum; \
    for (i=start;i<stop;i+=binSize) { \
        print chr"\t"i"\t"(i+binSize)"\t"binIdx; \
        binIdx++; \
    } \
}' \
genes.bed > bins.bed
Here's what our example bins.bed looks like:
$ more bins.bed    
chr1    1000    1600    1
chr1    1600    2200    2
chr1    2200    2800    3
chr1    2800    3400    4
chr1    3400    4000    5
chr1    6000    6600    1
chr1    6600    7200    2
chr1    7200    7800    3
chr1    7800    8400    4
chr1    8400    9000    5
Note that we're making the ID field the bin index — we're not including the name of the gene in the bin index.
If I understand your question correctly (and the rest of this answer depends on this understanding), the actual name of the gene doesn't matter for the final answer — we just want an average signal for a bin across all genes in the set, which are divided into equal numbers of partitions regardless of size.
(Also, for the purposes of demonstration, I'm "cheating" with my choice of gene BED regions and bins, so that everything partitions cleanly. I'll leave it to you to handle edge cases where things aren't so neatly divisible.)
Next, let's set up an example signal file (coverage data). It doesn't matter what's in the ID field — only that something is there, so that the score or signal value is in the fifth column, thus ensuring that our tools are working with a properly UCSC-formatted BED file:
$ more signal.bed
chr1    1000    1500    id-1    1
chr1    1500    2000    id-2    3
chr1    2000    2500    id-3    4
chr1    2500    3000    id-4    4
chr1    3000    3500    id-5    8
chr1    3500    4000    id-6    20
chr1    4000    4500    id-7    15
chr1    4500    5000    id-8    12
chr1    5000    5500    id-9    30
chr1    5500    6000    id-10   35
chr1    6000    6500    id-11   20
chr1    6500    7000    id-12   10
chr1    7000    7500    id-13   15
chr1    7500    8000    id-14   12
chr1    8000    8500    id-15   23
chr1    8500    9000    id-16   26
Let's first do a naïve bedmap with the --mean operator to see what we get. We'll include the --echo operator, so we can see what mean signal is associated with what reference element (bin):
$ bedmap --echo --mean bins.bed signal.bed
chr1    1000    1600    1|2.000000
chr1    1600    2200    2|3.500000
chr1    2200    2800    3|4.000000
chr1    2800    3400    4|6.000000
chr1    3400    4000    5|14.000000
chr1    6000    6600    1|15.000000
chr1    6600    7200    2|12.500000
chr1    7200    7800    3|13.500000
chr1    7800    8400    4|17.500000
chr1    8400    9000    5|23.000000
Firstly, the precision of the mean result needs adjusting, using the --prec operator. Secondly, we'll remove the --echo operator, because we don't need it for the next step:
$ bedmap --mean --prec 0 bins.bed signal.bed
2
4
4
6
14
15
12
14
18
23
As a first pass, to demonstrate how we'll get a final answer, we'll pass our bedmap-ed stream of mean values to another awk statement to track cumulative sums of means over an individual bin, while counting the number of genes we traverse:
$ bedmap --mean --prec 0 bins.bed signal.bed \
    | awk ' \
        BEGIN { \
            binNum=5; \
            binCount=0; \
            binCumulSums[$binNum]; \
            geneCount=1; \
        } \
        { \
            binIdx=(binCount%binNum)+1; \
            binCumulSums[binIdx]+=$0; \
            print binIdx"\t"$0"\t"binCumulSums[binIdx]"\t"geneCount; \
            binCount++; \
            if (binIdx==binNum) { \
                geneCount++; \
            } \
        }' \
        - > intermediateResults.tsv
Let's take a look at these results:
$ more intermediateResults.tsv
1   2   2   1
2   4   4   1
3   4   4   1
4   6   6   1
5   14  14  1
1   15  17  2
2   12  16  2
3   14  18  2
4   18  24  2
5   23  37  2
Let's break this table of intermediate values down by column:
  • The first column is the bin ID value (1 through 5).
  • The second column is the mean signal across that individual bin.
  • The third column is the cumulative sum of signal across bins of that bin ID (for example, for bin 4, the cumulative sum is 6 + 18, or 24).
  • The fourth column is the gene index associated with the listed bin, irrespective of the gene name. Here we have just two genes, foo and bar (see genes.bed), so we have two gene indices, 1 and 2. (In your case, with 30.2K genes, you'd see indices 1..30200.)
We now have enough information to calculate a final answer. We'll remove the print statement in the awk command above that shows these intermediate values, and replace it with an END block that calculates and prints a final result, the average signal over an individual bin:
$ bedmap --mean --prec 0 bins.bed signal.bed \
    | awk ' \
        BEGIN { \
            binNum=5; \
            binCount=0; \
            binCumulSums[$binNum]; \
            geneCount=1; \
        } \
        { \
            binIdx=(binCount%binNum)+1; \
            binCumulSums[binIdx]+=$0; \
            binCount++; \
            if (binIdx==binNum) { \
                geneCount++; \
            } \
        } \
        END { \
            for (idx=1;idx<=binNum;idx++) { \
                print idx"\t"binCumulSums[idx]/(geneCount-1); \
            } \
        }' \
        - > answer.tsv
Let's look at the answer we'd expect from our sample data:
$ more answer.tsv
1    8.5
2    8
3    9
4    12
5    18.5
How well will this scale to your dataset? I honestly don't know, but since we're streaming through bedmap results one line at a time, I think the memory overhead is going to be perfectly reasonable on a typical, modern workstation.
If you want to optimize this further, you could even avoid the file I/O expense of creating the intermediate bins.bed file and just pipe the initial bin-creating awk statement into bedmap (as well as taking out the bin index variable, which is unused by bedmap --mean) e.g.:
$ awk ' \
    BEGIN { \
        binNum=5; \
    } \
    { \
        chr=$1; \
        start=$2; \
        stop=$3; \
        binSize=(stop-start)/binNum; \
        for (i=start;i<stop;i+=binSize) { \
            print chr"\t"i"\t"(i+binSize); \
        } \
    }' genes.bed \
    | bedmap --mean --prec 0 - signal.bed \
    | awk ' \
        BEGIN { \
            binNum=5; \
            binCount=0; \
            binCumulSums[$binNum]; \
            geneCount=1; \
        } \
        { \
            binIdx=(binCount%binNum)+1; \
            binCumulSums[binIdx]+=$0; \
            binCount++; \
            if (binIdx==binNum) { \
                geneCount++; \
            } \
        } \
        END { \
            for (idx=1;idx<=binNum;idx++) { \
                print idx"\t"binCumulSums[idx]/(geneCount-1); \
            } \
        }' \
        - > probablyFasterAnswer.tsv
Here, we replaced bins.bed in the bedmap statement with the hyphen character (-), which denotes standard input. Eliminating disk-based I/O should make this analysis run even faster. Handling standard input and output in this way is one of the major strengths of BEDOPS tools.
Finally, this is a problem that is one of those "embarrassingly parallel" types, if you think about splitting bins by ID into separate analysis tasks (which I'll leave to you to implement).
Reply (Sukhdeep Singh): Thanks for a detailed answer, Alex. Great tool. It works great and finishes in under 2 minutes. I played with the script to test a few things. Just a small query: in my reference file some coordinates overlap (same genes with multiple isoforms), so bedmap returns NAN for the second overlapping set. That's what I was reporting earlier as well. How do I control this default behaviour? There is coverage (reads) in the mapped file for that region, which I have checked separately; it just doesn't get reported together.
Another thing I noticed after plotting: the bins should be added strand-specifically. In the gene reference file we always write the smaller coordinate in the start (2nd column) and the bigger at the end (3rd column), but biologically it doesn't necessarily work like that: the 3rd column would be the start of a gene on the minus strand. I can write an R script to handle that, or just reverse the coordinates in my gene reference file (bigger coordinate first for genes on the minus strand, and vice versa). This was not possible with bedtools, which gives an error, but with bedmap it seemed possible (actually not really, as the test below shows).
Tried with:
map.bed

chr1    5    15      1     1
chr1    25    44      2     2
chr1    46    64      3     3

ref.bed
chr1      45      14      a
bedmap --echo --mean map.bed ref.bed
chr1      45      14      a|NAN
So I will just write a script to add the bins strand-specifically.
Cheers
Reply (Alex Reynolds): So there are two problems with your test:
  1. You need to reverse the order of map.bed and ref.bed in the bedmap call
  2. You have a start coordinate that is greater than the stop coordinate in ref.bed
Here is an example of what comes from using the --ec option with bedmap, using your inputs:
$ bedmap --ec --echo --mean ref.bed map.bed
May use bedmap --help for more help.

Error: in ref.bed
End coordinates must be greater than start coordinates.
See row: 1
Using --ec on a small subset of your datasets is useful for catching these sorts of errors. It can double the runtime, however, so you generally don't want to use it on a full analysis, unless you can afford the added time expense.
Reply (Sukhdeep Singh): After a week of analysis and script building, I can say everything is working fine. For cross-sample comparison (different ChIP-seq profiles), I will just normalize the calculated average column by the read density of the sequenced sample.
Another question was why it reports NAN; the answer is that the file was not sorted properly. The --ec flag also helps to track down the problem. So I made two separate lists of the overlapping genes and then calculated the mean abundance in each bin.
Thanks, Alex, for the support. I am reading the table file in R, reversing the bins for the genes on minus strands, adding the output from the two different gene lists, and plotting with ggplot.
Just for example, this is how H3K4me3 looks over all genes: [image: composite H3K4me3 profile over all genes]
Cheers
Reply (Alex Reynolds): I don't think that is a valid BED file, if the start coordinate is greater than the stop coordinate. Use the --ec option to do error checking with bedmap and you should see an error message along these lines, I think.
Reply (Sukhdeep Singh): Hi Alex, I am coming back to the same point (overlapping coordinates in the same BED file) from a different perspective.
I made a list of alternate isoforms of a gene (different transcription starts or ends, but overlapping) as a text file, splitting each of them into 100 bins. They are lined up according to their respective coordinates and thus not sorted.
When I try to calculate mean coverage using bedmap, it reports NAN for the second overlapping set of coordinates, since it already reported the first. Using --ec we learn that the file is not sorted. This is making it hard for me.
Example (in the form of a 10-bin file), using bedmap --echo --mean --prec 1:
chr1    4854693    4854752    Tcea1-1    +|8.5
chr1    4854752    4854811    Tcea1-2    +|9.0
chr1    4854811    4854871    Tcea1-3    +|9.0
chr1    4854871    4854930    Tcea1-4    +|6.5
chr1    4854930    4854990    Tcea1-5    +|3.0
chr1    4854990    4855049    Tcea1-6    +|6.5
chr1    4855049    4855108    Tcea1-7    +|12.0
chr1    4855108    4855168    Tcea1-8    +|12.5
chr1    4855168    4855227    Tcea1-9    +|12.0
chr1    4855227    4855287    Tcea1-10    +|11.0
chr1    4855327    4855386    Tcea1.1-1    +|NAN
chr1    4855386    4855445    Tcea1.1-2    +|NAN
chr1    4855445    4855505    Tcea1.1-3    +|NAN
chr1    4855505    4855564    Tcea1.1-4    +|NAN
chr1    4855564    4855624    Tcea1.1-5    +|NAN
chr1    4855624    4855683    Tcea1.1-6    +|NAN
chr1    4855683    4855742    Tcea1.1-7    +|NAN
chr1    4855742    4855802    Tcea1.1-8    +|NAN
chr1    4855802    4855861    Tcea1.1-9    +|NAN
chr1    4855861    4855921    Tcea1.1-10    +|NAN
My possible solution would be either to split each line into a separate file, or to loop over each line with a shell while loop. But can't we simply change bedmap's default behaviour so that it doesn't care about input sorting and just reports the overlap as requested? Thanks
Reply (Alex Reynolds): BEDOPS tools require sorted input, but it is not clear to me why you can't sort to get the answer you want, simply by getting the --mean values from sorted bins (regardless of ID value) and then post-processing to split into groups by ID. I must be missing something; can you explain in more detail why this is an issue?
Reply (Sukhdeep Singh): Thanks, Alex, for your reply. Actually it was a small problem with my approach. If I sort the BED file, it messes up my unique IDs, and then they start overlapping because of the overlapping coordinates. What I do is split the whole file, after calculating the mean, into a list of matrices of 100 lines each. As the file is not synchronised (each block of 100 lines does not refer to the same gene), I didn't want to sort it. But as a workaround, I added a serial-number column to my original file, sorted with sort-bed, calculated the mean, and then re-sorted by the serial number. It's working. Thanks!
Answer by James Ashmore:
This can also be done using BEDTools.
# Split each gene into 100 windows and label windows by window number
bedtools makewindows -b genes.bed -n 100 -i winnum  > windows.bed

# Compute coverage for each window, BAM input...
bedtools coverage -a windows.bed -b sample.bam -counts > coverage.bedgraph

# Compute coverage for each window, bedGraph input...
bedtools map -a windows.bed -b sample.bedgraph  -c 4 -o mean -null 0 > coverage.bedgraph

# Sort bedGraph by window number
sort -n -k4 coverage.bedgraph > coverage.sorted.bedgraph

# Calculate average coverage for each window number
bedtools groupby -i coverage.sorted.bedgraph -g 4 -c 5 -o mean > average.tsv
Reply (Sukhdeep Singh): Thanks, very functional code; I haven't tested it yet.

Friday, October 2, 2015

find . -type f -name "*.err" -exec rm {} \;
find . -type f -name "*.out" -exec rm {} \;
find . -type f -name "*.content" -exec rm {} \;
find . -type f -name "*.header" -exec rm {} \;
find . -type f -name "*_bed2wig2bw*" -exec rm {} \;
find . -type f -name "*.wig" -exec rm {} \;
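The six find invocations above can be collapsed into a single pass by grouping the patterns with -o and using find's own -delete (both GNU and BSD find have it). Demoed here against a throwaway directory rather than `.` so it is safe to paste:

```shell
# Throwaway demo; the same \( ... -o ... \) group works with `find .` in place.
tmp=$(mktemp -d)
touch "$tmp/job.err" "$tmp/job.out" "$tmp/track.wig" "$tmp/keep.txt"
find "$tmp" -type f \( -name '*.err' -o -name '*.out' -o -name '*.content' \
    -o -name '*.header' -o -name '*_bed2wig2bw*' -o -name '*.wig' \) -delete
ls "$tmp"    # keep.txt
rm -rf "$tmp"
```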

Tuesday, September 8, 2015

heatmap

data<-read.csv("/Bioinformatics/NGS_analyses/automated/RNA-Seq/Differential_Expression/20150905_DE_Analysis_ShinnyCho/comparison_1/NM_Vs_N_Filtered_Heatmap.csv",header=TRUE, row.names=1)

head(data)
data_matrix<-data.matrix(data)
head(data_matrix)
library("gplots")
library("RColorBrewer")


pdf("/Bioinformatics/NGS_analyses/automated/RNA-Seq/Differential_Expression/20150905_DE_Analysis_ShinnyCho/comparison_1/NM_Vs_N_Filtered_Heatmap.pdf")
heatmap.2(data_matrix, col=brewer.pal(11,"RdBu"),scale="row",trace="none")
dev.off()

data<-read.csv("/Bioinformatics/NGS_analyses/automated/RNA-Seq/Differential_Expression/20150905_DE_Analysis_ShinnyCho/comparison_1/NM_Vs_N_All_Heatmap.csv",header=TRUE, row.names=1)

head(data)
data_matrix<-data.matrix(data)
head(data_matrix)
library("gplots")
library("RColorBrewer")


pdf("/Bioinformatics/NGS_analyses/automated/RNA-Seq/Differential_Expression/20150905_DE_Analysis_ShinnyCho/comparison_1/NM_Vs_N_All_Heatmap.pdf")
heatmap.2(data_matrix, col=brewer.pal(11,"RdBu"),scale="row",trace="none")
dev.off()

Friday, September 4, 2015

kill user jobs

Delete all your jobs (either will work, check different versions of PBS)
qstat -u sudipto | awk -F. '/sudipto/{print $1}' | xargs qdel
qstat | grep sudipto | awk -F. '{print $1}' | xargs qdel 
 
Delete all your queued jobs only. Leaves all running jobs alone.
qstat -u sudipto | awk -F. '/00 Q /{printf $1" "}' | xargs qdel
qstat -u sudipto | awk -F. '/ Q   --/{printf $1" " }' | xargs qdel
 
List all working nodes in the queue
pbsnodes | egrep '^node|state =' | grep -B 1 'state = free' | grep ^node
set NODELIST = `pbsnodes | egrep '^node|state =' | grep -B 1 'state = free' | grep ^node`
foreach node ($NODELIST)
    /usr/local/torque-2.1.6/bin/qsub -l nodes=${node} ${HOME}/get.nodes.stats.csh
end

Run on a particular type of node
 qsub -l nodes=1:beta:ppn=1 ${script}

Monday, June 29, 2015

Convert NCBI feature file

python ExtractingFeature.py /Bioinformatics/Users/zfu/2015_Nadine_Strep/00.RawData/D39_Feature.txt /Bioinformatics/Users/zfu/2015_Nadine_Strep/00.RawData/D39_Feature.bed 'gi|116075884|gb|CP000410.1|'

python ExtractingFeature.py /Bioinformatics/Users/zfu/2015_Nadine_Strep/00.RawData/TIGR4_Feature.txt /Bioinformatics/Users/zfu/2015_Nadine_Strep/00.RawData/TIGR4_Feature.bed 'gi|193804931|gb|AE005672.3|'

python ExtractingFeature.py /Bioinformatics/Users/zfu/2015_Nadine_Strep/00.RawData/R6_Feature.txt /Bioinformatics/Users/zfu/2015_Nadine_Strep/00.RawData/R6_Feature.bed 'gi|15902044|ref|NC_003098.1|'

Monday, June 22, 2015

QC strep reads

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/URF918-2/02.Mapping/URF918-2.unaligned.fastq.gz &

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/URF918-2/02.Mapping/URF918-2.aligned.fastq &

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/URF918-1/02.Mapping/URF918-1.unaligned.fastq.gz &

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/URF918-1/02.Mapping/URF918-1.aligned.fastq &

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/9-2/02.Mapping/9-2.unaligned.fastq.gz &

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/9-2/02.Mapping/9-2.aligned.fastq &

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/9-1/02.Mapping/9-1.unaligned.fastq.gz &

/share/apps/FastQC/FastQC-0.11.2/fastqc --contaminants /share/apps/FastQC/FastQC-0.11.2/Configuration/contaminants_LJI.txt --casava -o /Bioinformatics/Users/zfu/2015_Nadine_Strep/ -t 2 /Bioinformatics/Users/zfu/2015_Nadine_Strep/9-1/02.Mapping/9-1.aligned.fastq &

Monday, June 8, 2015

VCF UNIQ DECOMPOSITION

cd /Bioinformatics/Users/zfu/2015_Nadine_Strep

/Bioinformatics/Users/zfu/packages/vt/vt decompose ./9-1/03.Freebayes/9-1_sorted.vcf -o ./9-1/03.Freebayes/9-1.decomposed.vcf

/Bioinformatics/Users/zfu/packages/vt/vt normalize ./9-1/03.Freebayes/9-1.decomposed.vcf -r ./00.RawData/Strep.fasta | /Bioinformatics/Users/zfu/packages/vt/vt uniq - -o ./9-1/03.Freebayes/9-1.normalized_uniq.vcf

/Bioinformatics/Users/zfu/packages/vt/vt decompose_blocksub ./9-1/03.Freebayes/9-1.normalized_uniq.vcf -o ./9-1/03.Freebayes/9-1.decomposed_blocksub.vcf

 ***********************************************

/Bioinformatics/Users/zfu/packages/freebayes/bin/freebayes -F 0.01 -p 1 --fasta-reference /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/clone.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/2014/04.Clone/2014_clone_sorted.bam > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/2014/06.Freebayes/2014_clone_sorted.vcf

/Bioinformatics/Users/zfu/packages/vcflib/bin/vcffilter -f "QUAL > 20" /Bioinformatics/Users/zfu/2015_Stephanie_BAC/2014/06.Freebayes/2014_clone_sorted.vcf > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/2014/06.Freebayes/2014_clone_filtered.vcf

cd /Bioinformatics/Users/zfu/2015_Stephanie_BAC/2014/src/

python generateConsensusData_Ori.py ../06.Freebayes/2014_clone_sorted.vcf ../../00.RawData/oneLine.fasta/clone_oneLine.fasta ../07.Consensus.Fasta/2014_clone_Consensus.fa ../07.Consensus.Fasta/2014_clone_noGap_Consensus.fa  ../06.Freebayes/2014_clone_snpInfo.tsv


python generateConsensusData_Freq.py ../06.Freebayes/2014_clone_filtered.vcf ../../00.RawData/oneLine.fasta/clone_oneLine.fasta ../07.Consensus.Fasta/2014_clone_Consensus.filtered.fa ../07.Consensus.Fasta/2014_clone_noGap_Consensus.filtered.fa  ../06.Freebayes/2014_clone_snpInfo.filtered.tsv ../06.Freebayes/2014_clone_snpInfo.filtered_Freq.vcf


merge many files, dropping the first (header) line of each

find . -name "*.extension" | xargs -n 1 tail -n +2
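A self-contained sketch of the one-liner above (demo files invented): xargs -n 1 runs `tail -n +2` once per file, so each file loses exactly its header line; redirect the combined stream to capture the merge.

```shell
tmp=$(mktemp -d)
printf 'header\n1\n' > "$tmp/a.csv"
printf 'header\n2\n' > "$tmp/b.csv"
# sort makes the file order deterministic; find alone does not guarantee it
find "$tmp" -name '*.csv' | sort | xargs -n 1 tail -n +2 > "$tmp/merged.txt"
cat "$tmp/merged.txt"    # 1 then 2, headers gone
rm -rf "$tmp"
```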

Friday, June 5, 2015

BIIB and ARE and INTC

BIIB:

Tested the 200-day moving average and then bounced back quickly; next time the stop-loss should be set a little below the 20/50/200-day moving averages.

INTC:

Bought after the good news came out; it then dropped quickly.

ARE:

Did not buy after the good news came out; it then dropped quickly.

Thursday, May 28, 2015

snps calling

/Bioinformatics/Users/zfu/packages/freebayes/bin/freebayes -F 0.01 -p 1 --fasta-reference /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/clone_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/05.Clone_GFP/L80_clone_GFP_sorted.bam > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_sorted.vcf

/Bioinformatics/Users/zfu/packages/vcflib/bin/vcffilter -f "QUAL > 20" /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_sorted.vcf > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_filtered.vcf

cd /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L51/src/

python generateConsensusData_Ori.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_sorted.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus_noGap.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_snpInfo.tsv

python generateConsensusData_Freq.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_filtered.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/07.Consensus.Fasta/L80_clone_GFP_Consensus_noGap.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_snpInfo.filtered.tsv /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L80/06.Freebayes/L80_clone_GFP_snpInfo_filtered_Freq.vcf


Wednesday, May 20, 2015

freebayes

/Bioinformatics/Users/zfu/packages/freebayes/bin/freebayes -F 0.01 -p 1 --fasta-reference /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/clone_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/05.Clone_GFP/L74_clone_GFP_sorted.bam > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_sorted.vcf

/Bioinformatics/Users/zfu/packages/vcflib/bin/vcffilter -f "QUAL > 20" /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_sorted.vcf > /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_filtered.vcf

cd /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L51/src/

python generateConsensusData_Ori.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_sorted.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus_noGap.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_snpInfo.tsv

python generateConsensusData_Freq.py /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_filtered.vcf /Bioinformatics/Users/zfu/2015_Stephanie_BAC/00.RawData/oneLine.fasta/clone_oneLine_GFP.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/07.Consensus.Fasta/L74_clone_GFP_Consensus_noGap.filtered.fasta /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_snpInfo.filtered.tsv /Bioinformatics/Users/zfu/2015_Stephanie_BAC/L74/06.Freebayes/L74_clone_GFP_snpInfo_filtered_Freq.vcf

Wednesday, May 13, 2015

RSeQC

/Bioinformatics/Users/zfu/packages/RSeQC/share/apps/python/python-2.7.6/bin/RPKM_saturation.py -r Clone.bed -i ../L74/04.Clone/L74_clone_sorted.bam -o L74 &
/Bioinformatics/Users/zfu/packages/RSeQC/share/apps/python/python-2.7.6/bin/RPKM_saturation.py -r Clone.bed -i ../L80/04.Clone/L80_clone_sorted.bam -o L80 &

Tuesday, May 12, 2015

vcf to fasta


Answer by Francisco Roque (Bergen, Norway):
This does not operate directly on the VCF file, but uses the extra annotation from bcftools (assuming you are using it for SNP and indel calling):
samtools mpileup -uf reference.fasta alignment.bam | \
bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fastq
After that, just convert the FASTQ into a consensus FASTA using seqtk, converting all bases with quality lower than 20 to lowercase:
seqtk fq2fa consensus.fastq 20 > consensus.fasta
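The two steps chain naturally into one helper. A sketch that only prints the commands rather than running them (the mpileup/bcftools syntax above is the legacy samtools-0.1.x interface, so print-then-inspect is safer than running blind):

```shell
#!/bin/sh
# Print the legacy samtools/bcftools/seqtk consensus pipeline for a
# reference + BAM pair. Pipe the output to sh to execute.
consensus_cmds() {
    ref=$1; bam=$2; prefix=$3
    echo "samtools mpileup -uf $ref $bam | bcftools view -cg - |" \
         "vcfutils.pl vcf2fq > $prefix.fastq"
    echo "seqtk fq2fa $prefix.fastq 20 > $prefix.fasta"
}

consensus_cmds reference.fasta alignment.bam consensus
```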

Freebayes

Calling variants with freebayes

Erik Garrison

erik.garrison@bc.edu / @erikgarrison

Summary

This tutorial is a brief walkthrough demonstrating how to use FreeBayes to detect short sequence variants in read alignments generated from a resequencing process.

Getting started

You've sequenced a sample. You've aligned your reads to a reference genome. You've sorted your alignments. You've marked PCR duplicates. You've indexed your alignments.
Now, you want to determine the sequence variants present in your sample. FreeBayes lets you do this very easily!
You first need to download and install freebayes:
git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes
make
Alternatively, you can use an existing tarball:
wget http://clavius.bc.edu/~erik/freebayes/freebayes-5d5b8ac0.tar.gz
tar xzvf freebayes-5d5b8ac0.tar.gz
cd freebayes
make
(These tarballs aren't built regularly, so your best bet is to set up and use git to download the code. It's really useful!)
Now that freebayes is built, we can run it:
bin/freebayes        # default summary
bin/freebayes -h     # verbose help text, summary of options
If all is well we should see a description of the method. You can install freebayes by moving it into your path, or adding freebayes/bin to your path.
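For the PATH option, one line in the shell does it; append the same line to ~/.bashrc or equivalent to make it permanent (the clone location $HOME/freebayes is an assumption):

```shell
# Assumes freebayes was cloned and built in $HOME/freebayes.
export PATH="$HOME/freebayes/bin:$PATH"
```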

Working with test data

To call variants we need an alignment file (in BAM format). We also need a reference genome in (uncompressed) FASTA format.
An easy way to get started is to run tests on publicly-available data sets. For this we can use the ever-popular CEU HapMap sample NA12878. This sample is a good target because it provides a putative truth set, currently in development by the NIST Genome in a Bottle Project.
The 1000 Genomes Project has incorporated a down-sampled alignment set into the project's low-coverage (~5-6x) analyses. As such, we have readily-available alignments processed using the same pipeline used for the alignment of all the 1000 Genomes samples.
We first need to download the reference used for the alignment. Our analysis will be limited to chr20, so we'll just get a copy of that chromosome.
wget http://bioinformatics.bc.edu/marthlab/download/gkno-cshl-2013/chr20.fa
Now, we should grab a piece of the low-coverage NA12878 alignments using samtools:
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam.bai
These alignments are just for human chromosome 20, which amounts to about 1/50th of the genome. This will be plenty for testing.
To have a truth set to compare to, we should download the NIST Genome in a Bottle variants. Note that we are obtaining both the VCF (variants) and a target list in which the project determined it is possible to confidently make genotype calls:
wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/NISTIntegratedCalls_13datasets_130719_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.17_all_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST/union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed.gz
gunzip *.vcf.gz
gunzip *.bed.gz
And, we can also grab a curated set of these variants shared by the Broad Genome Sequencing and Analysis Group:
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130806_broad_na12878_truth_set/NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi
Now we have a reference, alignment data, and a set of truth sets available for NA12878.

Variant calling

To call variants, we can just use freebayes in its default configuration:
freebayes -f chr20.fa \
    NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam >NA12878.chr20.freebayes.vcf
(This should take 5-10 minutes on a typical system.)
You could target a particular region as well (for instance, to speed up testing) by providing freebayes the --region parameter:
freebayes -f chr20.fa --region 20:1000000-2000000 \
    NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam >NA12878.chr20.1mb_2mb.freebayes.vcf
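Because regions are processed independently, --region is also the natural unit for parallelizing: split the chromosome into fixed-size windows and run one freebayes job per window. A sketch of the window generator (the chr20 length used here, 63,025,520 bp for GRCh37, is an assumption):

```shell
#!/bin/sh
# Emit fixed-size region strings (chrom:start-end) for freebayes --region.
regions() {
    chrom=$1; chr_len=$2; step=$3
    start=0
    while [ "$start" -lt "$chr_len" ]; do
        end=$((start + step))
        if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi
        echo "${chrom}:${start}-${end}"
        start=$end
    done
}

regions 20 63025520 1000000   # 64 windows covering chr20
# Each line can then drive one job:
#   regions 20 63025520 1000000 | while read r; do
#       freebayes -f chr20.fa --region "$r" \
#           NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam \
#           > "calls.$r.vcf"
#   done
```

The per-window VCFs then need to be concatenated (header-aware) back into one file.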

Understanding our results

Take a look at the output.
less -S NA12878.chr20.freebayes.vcf
There is a header:
##fileformat=VCFv4.1
##fileDate=20131121
##source=freeBayes v0.9.9.2-23-g8a98f11-dirty
##reference=chr20.fa
##phasing=none
##commandline="freebayes -f chr20.fa NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
....
And then there are variants:
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      61795   .       G       T       127.882 .       AB=0.454545;ABP=3.20771;AC=1;AF=0.5
20      63244   .       A       C       90.7756 .       AB=0;ABP=0;AC=2;AF=1
....
We can quickly examine our aggregate results using tools from vcflib, such as vcfstats:
vcfstats NA12878.chr20.freebayes.vcf
This will provide us a lot of information about small variants, such as the counts of SNPs, indels, and other variant types, the transition/transversion ratio, and a summary of the indel length-frequency spectrum. An interesting thing to note is that the ts/tv ratio is different for high and low-quality variants:
vcffilter -f "QUAL > 20" NA12878.chr20.freebayes.vcf | vcfstats | grep "ts\|bial"
# compare to
vcffilter -f "QUAL < 20" NA12878.chr20.freebayes.vcf | vcfstats | grep "ts\|bial"
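As a cross-check on the vcfstats numbers, the ts/tv ratio can be computed directly with awk, counting only biallelic SNP records (a sketch; the third example record is invented):

```shell
#!/bin/sh
# Count transitions (A<->G, C<->T) vs transversions among biallelic SNPs
# in a VCF read from stdin, and print the ts/tv ratio.
tstv() {
    awk '/^#/ { next }
        length($4) == 1 && length($5) == 1 && $4 ~ /[ACGT]/ && $5 ~ /[ACGT]/ {
            pair = ($4 < $5) ? $4 $5 : $5 $4
            if (pair == "AG" || pair == "CT") ts++; else tv++
        }
        END { printf "ts=%d tv=%d ts/tv=%.2f\n", ts, tv, (tv ? ts / tv : 0) }'
}

# Tiny example: the two records shown above plus one invented transition.
printf '%s\n' \
    '##fileformat=VCFv4.1' \
    '20 61795 . G T 127.882 . AB=0.454545' \
    '20 63244 . A C 90.7756 . AB=0' \
    '20 65900 . C T 55.0 . .' | tstv
# -> ts=1 tv=2 ts/tv=0.50
```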

Comparing our results to the Broad NA12878 Knowledge-Base

First, we can use tabix to extract just chr20 from the whole-genome NA12878 knowledge-base file downloaded above:
tabix -h NA12878.wgs.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz 20 \
    | bgzip > NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
Or, you can download a pre-made set:
wget http://clavius.bc.edu/~erik/NA12878/NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz
wget http://clavius.bc.edu/~erik/NA12878/NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz.tbi
And, for this comparison we want to both exclude regions which weren't considered confidently-callable by the GiaB set, and keep only variants which are considered TRUE_POSITIVE in the Broad/GiaB NA12878 truth set:
zcat NA12878.chr20.broad_truth_set.20131119.snps_and_indels.genotypes.vcf.gz \
    | vcfintersect -b union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed \
    | grep "^#\|TRUE_POSITIVE" >NA12878.chr20.broad_truth_set.TP.callable_regions.vcf
We'll do the same for our calls, restricting them to confidently-callable regions:
cat NA12878.chr20.freebayes.vcf \
    | vcfintersect -b \
        union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.17.bed \
        >NA12878.chr20.freebayes.callable_regions.vcf
Using this result, we can generate a kind of receiver-operator characteristic using a tool in vcflib, vcfroc:
vcfroc -r chr20.fa \
    -t NA12878.chr20.broad_truth_set.TP.callable_regions.vcf \
    NA12878.chr20.freebayes.callable_regions.vcf >broad_truth_set.roc.tsv
We can generate a plot of this using some tools in vcflib:
( vcfroc -r chr20.fa -t \
    NA12878.chr20.broad_truth_set.TP.callable_regions.vcf \
    NA12878.chr20.broad_truth_set.TP.callable_regions.vcf \
    | prependcol set answers; \
    cat broad_truth_set.roc.tsv \
    | prependcol set freebayes | grep -v set ) >roc.tsv
plot_roc.r freebayesNA12878 answers roc.tsv 0 1 0 1
Here, prependcol is a script that prepends a column named "set" with each row set to "answers" or "freebayes."
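If prependcol isn't on hand, the behavior described here can be approximated with awk (a sketch; treating the first line as a header that receives the column name is an assumption about how the script behaves):

```shell
#!/bin/sh
# Approximate prependcol: put the column name on the first (header) row
# and a constant value on every following row, tab-separated.
prependcol() {
    awk -v name="$1" -v value="$2" 'BEGIN { OFS = "\t" }
        NR == 1 { print name, $0; next }
        { print value, $0 }'
}

printf 'threshold\tfp\ttp\n0\t10\t90\n' | prependcol set freebayes
```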
The resulting image, freebayesNA12878.both.png, shows that we can achieve < 5% FDR in this sample (assuming perfect sensitivity in the truth set, which is unlikely), and that our overall sensitivity is rather low. This matches our expectations given the low coverage of the sample.