Question: Binning Over Genes And Calculating The Coverage [Bedops/Bedmap]
I have a list of genes and a coverage track (generated using BEDOPS-based binning script also called as
binReads.sh), modified it to take a BED file as input). What I want to do is divide each gene into 100 bins and then calculate the mean overlap from my coverage file, so that later the respective bins can be added up across all genes to give the average protein binding/coverage in that bin.
I couldn't figure out how to make a specific number of bins for the genes, though I can regulate the binSize using binReads.sh. I wrote a custom R script to take my list of genes and divide each of them into 100 bins. Then I used this list as a reference (adding a fake 4th column, which is also not very intuitive) to find the mean overlap for my coverage file.
Problems:
1) This method is not very intuitive (because of the R code, it takes ~4-6 mins to complete for a list of 30.2K genes, producing 30.2K*100 lines as a bed file to be used as a reference). Is there a way in the BEDOPS suite to generate the bins on the fly?
2) The output of
bedmap --mean genes_bins.bed file.cov > means.tsv
is valid numeric output for the first 42K reference lines out of 3.2M (30.2K*100), and for the rest it produces "NAN". To cross-validate, I took some specific binned genes on their own, and it produces valid output. Is this because of the size of the files (it shouldn't be)? Can you suggest some other way to approach this problem?
genes_bins.bed
chr1 3214481 3219051 Xkr4-bin1
chr1 3219051 3223621 Xkr4-bin2
chr1 3223621 3228192 Xkr4-bin3
chr1 3228192 3232762 Xkr4-bin4
chr1 3232762 3237332 Xkr4-bin5
chr1 3237332 3241902 Xkr4-bin6
coverage.bed
chr1 3001500 3001520 1 1
chr1 3001520 3001540 1 1
chr1 3001540 3001560 1 1
chr1 3001560 3001580 1 1
chr1 3001580 3001600 1 1
chr1 3001600 3001620 1 1
chr1 3001620 3001640 1 1
P.S. This started as a comment under Converting BAM to bedGraph for viewing on UCSC, and I thought it's better to ask it as a separate question. This can probably be answered by Alex Reynolds himself. The final aim is to make composite profiles of a protein of interest from ChIP-seq datasets. Thanks
head of the file, I will mention that.
Secondly, that's the whole point of binning: the genes have unequal lengths. If you want to make a Plotting Read Density/ Distribution/ Composite Profiles [Chip-Seq]-style plot, then the genes should be binned, the average binding per bin calculated, and the per-bin values summed across all genes at the end, to get an average binding over the whole length of the gene. This does mean that some genes contribute more density to a bin and some less. Can you suggest some other strategy!!
After a week of analysis and script building, I will say
everything is working fine. For cross sample comparison (different
chip-seq profiles), I will just normalize the calculated avg column by
the read density of the sequenced sample.
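That normalization step can be sketched with a one-liner; the factor of 2.0 and the two-column (bin index, average) layout here are placeholder assumptions, not values from any real sample:

```shell
# Sketch: divide the per-bin average column by a per-sample factor.
# "norm" is a placeholder; in practice it would be derived from the
# sample's read density (e.g. mapped reads per million).
printf '1\t8.5\n2\t8\n3\t9\n' \
| awk -v norm=2.0 '{ printf "%s\t%.3f\n", $1, $2/norm }'
```

With a real per-sample factor in place of 2.0, this makes the per-bin averages comparable across different ChIP-seq profiles.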
The bedops and bedmap utilities in the BEDOPS suite do not generate bins, but you could use awk to do this, which might be faster than using R. To demonstrate how to solve your problem, let's start with a couple of example genes
foo and bar:$ more genes.bed
chr1 1000 4000 foo
chr1 6000 9000 bar
Here's how we could partition them into, say, five bins for each gene:$ awk ' \
BEGIN { \
binNum=5; \
} \
{ \
chr=$1; \
start=$2; \
stop=$3; \
binIdx=1; \
binSize=(stop-start)/binNum; \
for (i=start;i<stop;i+=binSize) { \
print chr"\t"i"\t"(i+binSize)"\t"binIdx; \
binIdx++; \
} \
}' \
genes.bed > bins.bed
Here's what our example bins.bed looks like:$ more bins.bed
chr1 1000 1600 1
chr1 1600 2200 2
chr1 2200 2800 3
chr1 2800 3400 4
chr1 3400 4000 5
chr1 6000 6600 1
chr1 6600 7200 2
chr1 7200 7800 3
chr1 7800 8400 4
chr1 8400 9000 5
Note that we're making the ID field the bin index — we're not including the name of the gene in the bin index. If I understand your question correctly (and the rest of this answer depends on this understanding), the actual name of the gene doesn't matter for the final answer — we just want an average signal for a bin across all genes in the set, which are divided into equal numbers of partitions regardless of size.
(Also, for the purposes of demonstration, I'm "cheating" with my choice of gene BED regions and bins, so that everything partitions cleanly. I'll leave it to you to handle edge cases where things aren't so neatly divisible.)
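One possible way to handle those not-so-neatly-divisible cases (a sketch, not part of the original answer) is to force integer bin boundaries and clamp the last bin to the gene end:

```shell
# Sketch: clamp each bin end to the gene end and force integer boundaries,
# so genes whose length is not a multiple of binNum still yield exactly
# binNum bins, with the final bin absorbing the remainder.
printf 'chr1\t1000\t4001\tfoo\n' \
| awk -v binNum=5 '{
    start=$2; stop=$3;
    binSize=(stop-start)/binNum;           # may be fractional
    for (b=0; b<binNum; b++) {
      s=start+int(b*binSize);
      e=(b==binNum-1) ? stop : start+int((b+1)*binSize);
      print $1"\t"s"\t"e"\t"(b+1);
    }
  }'
```

Here a 3001-base gene still partitions into five bins, with the last bin running to 4001.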
Next, let's set up an example signal file (coverage data). It doesn't matter what's in the ID field — only that something is there, so that the score or signal value is in the fifth column, thus ensuring that our tools are working with a properly UCSC-formatted BED file:
$ more signal.bed
chr1 1000 1500 id-1 1
chr1 1500 2000 id-2 3
chr1 2000 2500 id-3 4
chr1 2500 3000 id-4 4
chr1 3000 3500 id-5 8
chr1 3500 4000 id-6 20
chr1 4000 4500 id-7 15
chr1 4500 5000 id-8 12
chr1 5000 5500 id-9 30
chr1 5500 6000 id-10 35
chr1 6000 6500 id-11 20
chr1 6500 7000 id-12 10
chr1 7000 7500 id-13 15
chr1 7500 8000 id-14 12
chr1 8000 8500 id-15 23
chr1 8500 9000 id-16 26
Let's first do a naïve bedmap with the --mean operator to see what we get. We'll include the --echo operator, so we can see what mean signal is associated with what reference element (bin):$ bedmap --echo --mean bins.bed signal.bed
chr1 1000 1600 1|2.000000
chr1 1600 2200 2|3.500000
chr1 2200 2800 3|4.000000
chr1 2800 3400 4|6.000000
chr1 3400 4000 5|14.000000
chr1 6000 6600 1|15.000000
chr1 6600 7200 2|12.500000
chr1 7200 7800 3|13.500000
chr1 7800 8400 4|17.500000
chr1 8400 9000 5|23.000000
Firstly, the precision of the mean result needs adjusting, using the --prec operator. Secondly, we'll remove the --echo operator, because we don't need it for the next step:$ bedmap --mean --prec 0 bins.bed signal.bed
2
4
4
6
14
15
12
14
18
23
As a first pass, to demonstrate how we'll get a final answer, we'll pass our bedmap-ed stream of mean values to another awk statement to track cumulative sums of means over an individual bin, while counting the number of genes we traverse:$ bedmap --mean --prec 0 bins.bed signal.bed \
| awk ' \
BEGIN { \
binNum=5; \
binCount=0; \
binCumulSums[$binNum]; \
geneCount=1; \
} \
{ \
binIdx=(binCount%binNum)+1; \
binCumulSums[binIdx]+=$0; \
print binIdx"\t"$0"\t"binCumulSums[binIdx]"\t"geneCount; \
binCount++; \
if (binIdx==binNum) { \
geneCount++; \
} \
}' \
- > intermediateResults.tsv
Let's take a look at these results:$ more intermediateResults.tsv
1 2 2 1
2 4 4 1
3 4 4 1
4 6 6 1
5 14 14 1
1 15 17 2
2 12 16 2
3 14 18 2
4 18 24 2
5 23 37 2
Let's break this table of intermediate values down by column:
- The first column is the bin ID value (1 through 5).
- The second column is the mean signal across that individual bin.
- The third column is the cumulative sum of signal across bins with that bin ID (for example, for bin 4, the cumulative sum is 6+18, or 24).
- The fourth column is the gene index associated with the listed bin, irrespective of the gene name. Here, we just have two genes, foo and bar (see genes.bed), so we have two gene indices, 1 and 2. (In your case, the total number of genes is 30.2K, so you'd see indices 1..30200.)
Now we'll take out the print statement in the awk command above that shows these intermediate values, and replace it with an END block that calculates and prints a final result, the average signal over an individual bin:$ bedmap --mean --prec 0 bins.bed signal.bed
| awk ' \
BEGIN { \
binNum=5; \
binCount=0; \
binCumulSums[$binNum]; \
geneCount=1; \
} \
{ \
binIdx=(binCount%binNum)+1; \
binCumulSums[binIdx]+=$0; \
binCount++; \
if (binIdx==binNum) { \
geneCount++; \
} \
} \
END { \
for (idx=1;idx<=binNum;idx++) { \
print idx"\t"binCumulSums[idx]/(geneCount-1); \
} \
}' \
- > answer.tsv
Let's look at the answer we'd expect from our sample data:$ more answer.tsv
1 8.5
2 8
3 9
4 12
5 18.5
How well will this scale to your dataset? I honestly don't know, but since we're streaming through bedmap results one line at a time, I think the memory overhead is going to be perfectly reasonable on a typical, modern workstation. If you want to optimize this further, you could even avoid the file I/O expense of creating the intermediate
bins.bed file and just pipe the initial bin-creating awk statement into bedmap (as well as taking out the bin index variable, which is unused by bedmap --mean) e.g.:$ awk ' \
BEGIN { \
binNum=5; \
} \
{ \
chr=$1; \
start=$2; \
stop=$3; \
binSize=(stop-start)/binNum; \
for (i=start;i<stop;i+=binSize) { \
print chr"\t"i"\t"(i+binSize); \
} \
}' genes.bed \
| bedmap --mean --prec 0 - signal.bed \
| awk ' \
BEGIN { \
binNum=5; \
binCount=0; \
binCumulSums[$binNum]; \
geneCount=1; \
} \
{ \
binIdx=(binCount%binNum)+1; \
binCumulSums[binIdx]+=$0; \
binCount++; \
if (binIdx==binNum) { \
geneCount++; \
} \
} \
END { \
for (idx=1;idx<=binNum;idx++) { \
print idx"\t"binCumulSums[idx]/(geneCount-1); \
} \
}' \
- > probablyFasterAnswer.tsv
Here, we replaced bins.bed in the bedmap statement with the hyphen character (-),
which denotes standard input. Eliminating disk-based I/O should make
this analysis run even faster. Handling standard input and output in
this way is one of the major strengths of BEDOPS tools.Finally, this is a problem that is one of those "embarrassingly parallel" types, if you think about splitting bins by ID into separate analysis tasks (which I'll leave to you to implement).
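As a sketch of that parallel split (file names are hypothetical, and GNU Parallel is assumed to be available, so its invocation is shown only as a comment):

```shell
# Sketch: write each chromosome's bins to its own chunk file; each chunk can
# then be bedmap-ed independently, e.g. with GNU Parallel (assumed installed):
#   ls chunk.*.bed | parallel "bedmap --mean --prec 0 {} signal.bed > {.}.means"
printf 'chr1\t0\t10\t1\nchr1\t10\t20\t2\nchr2\t0\t10\t1\n' > bins.bed
awk '{ f = "chunk." $1 ".bed"; print > f }' bins.bed
wc -l chunk.*.bed
```

Splitting by chromosome keeps each chunk internally sorted, so the per-chunk bedmap calls remain valid.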
Thanks for a detailed answer Alex. Great tool. It works great and finishes under 2 mins. I played with the script to test a few things. Just a small query: in my reference file some coordinates overlap (same genes with multiple isoforms), so bedmap returns NAN for the second overlap set. That's what I was reporting earlier as well. How do I control this default behaviour? There is coverage (reads) in the mapped file for that region, I have checked separately; it just doesn't report it together.
Another thing I noticed after plotting: when adding the bins, they should go strand-specifically. In the gene reference file we always write the smaller co-ordinate as the start (2nd column) and the bigger one as the end (3rd column), but it doesn't necessarily mean that: the 3rd column would be the start of a gene on the -ve strand. I can write a script in R to handle that, or just reverse the co-ordinates in my gene reference file, putting the bigger co-ordinate first for genes on the -ve strand and vice versa. This was not possible with bedtools, as it gives an error, but with bedmap it seemed possible (actually not really, given the test below).
Tried with:
map.bed
chr1 5 15 1 1
chr1 25 44 2 2
chr1 46 64 3 3
ref.bed
chr1 45 14 a
bedmap --echo --mean map.bed ref.bed
chr1 45 14 a|NAN
So, I will just write a script to add the bins strand-specifically.
Cheers
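A strand-aware renumbering script could be sketched like this; the five-column layout (chrom, start, stop, binIdx, strand) and the binNum value are assumptions for illustration:

```shell
# Sketch: make bin numbering strand-aware by flipping the index on minus-
# strand genes, so bin 1 is always at the 5' end of the gene.
# Assumed columns (hypothetical layout): chrom, start, stop, binIdx, strand.
printf 'chr1\t0\t10\t1\t-\nchr1\t10\t20\t2\t-\nchr1\t30\t40\t1\t+\n' \
| awk -v binNum=2 'BEGIN { OFS="\t" }
    { $4 = ($5 == "-") ? binNum - $4 + 1 : $4; print }'
```

This keeps all start/stop co-ordinates in the usual small-to-large order (so sorting tools stay happy) and only changes which bin index each interval carries.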
So there are two problems with your test:
- You need to reverse the order of map.bed and ref.bed in the bedmap call
- You have a start coordinate that is greater than the stop coordinate in ref.bed
Here's what happens when you run with the --ec option with bedmap, using your inputs:
$ bedmap --ec --echo --mean ref.bed map.bed
May use bedmap --help for more help.
Error: in ref.bed
End coordinates must be greater than start coordinates.
See row: 1
Using --ec on a small subset of your datasets is useful
for catching these sorts of errors. It can double the runtime, however,
so you generally don't want to use it on a full analysis, unless you can
afford the added time expense.
Another question was why it reports NAN; the answer is because the file was not sorted properly. The --ec flag also helps to track down the problem. So, I made two separate lists of the overlapping genes and then calculated the mean abundance in each bin.
Thanks Alex for the support. I am reading the table file in R, reversing the bins for the genes on -ve strands, adding the output from the two different gene lists, and then plotting using ggplot.
Just for example: this is how H3K4me3 looks over all genes
Cheers
Hi Alex,
I am coming back to the same point (about overlapping co-ordinates in the same bed file), here from a different perspective.
I made a list of alternate isoforms of a gene (different transcription start or end, but overlapping) as a text file, splitting each of them into 100 bins. They are lined up according to their respective co-ordinates and thus not sorted.
When I try to calculate mean coverage using bedmap, it reports NAN for the second overlapping co-ordinate, since it was already reported for the first one. Using --ec, we know that it says the file is not sorted. This is making it hard for me.
Example (in the form of a 10-bin file), using bedmap --echo --mean --prec 1:
chr1 4854693 4854752 Tcea1-1 +|8.5
chr1 4854752 4854811 Tcea1-2 +|9.0
chr1 4854811 4854871 Tcea1-3 +|9.0
chr1 4854871 4854930 Tcea1-4 +|6.5
chr1 4854930 4854990 Tcea1-5 +|3.0
chr1 4854990 4855049 Tcea1-6 +|6.5
chr1 4855049 4855108 Tcea1-7 +|12.0
chr1 4855108 4855168 Tcea1-8 +|12.5
chr1 4855168 4855227 Tcea1-9 +|12.0
chr1 4855227 4855287 Tcea1-10 +|11.0
chr1 4855327 4855386 Tcea1.1-1 +|NAN
chr1 4855386 4855445 Tcea1.1-2 +|NAN
chr1 4855445 4855505 Tcea1.1-3 +|NAN
chr1 4855505 4855564 Tcea1.1-4 +|NAN
chr1 4855564 4855624 Tcea1.1-5 +|NAN
chr1 4855624 4855683 Tcea1.1-6 +|NAN
chr1 4855683 4855742 Tcea1.1-7 +|NAN
chr1 4855742 4855802 Tcea1.1-8 +|NAN
chr1 4855802 4855861 Tcea1.1-9 +|NAN
chr1 4855861 4855921 Tcea1.1-10 +|NAN
My possible solution would be to either split each line into a separate file, or loop over each line using while in the shell or something, but can't we simply change the default behaviour of bedmap so that it doesn't care about input file sorting and just reports the overlaps as asked?
Thanks

BEDOPS tools require sorted input, but it is not clear to me why you can't sort to get the answer you want, simply by getting the --mean values from sorted bins (regardless of ID value), then post-processing to split into groups-by-ID. I must be missing something. Can you explain in more detail why this is an issue?
Thanks Alex for your reply; actually it was a small problem with the way I was doing it. If I sort the bed file, it messes up my unique IDs, and then they start overlapping because of the overlapping co-ordinates. What I do is split the whole file, after calculating the mean, into a list of matrices with 100 lines each. As the file was not synchronised (each 100 lines did not refer to the same gene), I didn't want to sort the file. But as a workaround, I added a serial-number column to my original file, sorted using sort-bed, calculated the mean, and then re-sorted the file according to the serial number. It's working, thanks!!
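The serial-column round trip described above can be sketched as follows (toy coordinates; in the real pipeline, bedmap would run between the positional sort and the re-sort):

```shell
# Sketch: tag lines with their original order, sort positionally (as bedmap
# requires), then restore the original order and drop the tag afterwards.
printf 'chr1\t20\t30\tb\nchr1\t0\t10\ta\n' \
| awk 'BEGIN { OFS="\t" } { print $0, NR }' \
| sort -k1,1 -k2,2n \
| sort -k5,5n \
| cut -f1-4
```

Because the serial column survives the positional sort, the final numeric re-sort recovers the original line order exactly.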
1
This can also be done using BEDTools.
# Split each gene into 100 windows and label windows by window number
bedtools makewindows -b genes.bed -n 100 -i winnum > windows.bed
# Compute coverage for each window, BAM input...
bedtools coverage -a windows.bed -b sample.bam -counts > coverage.bedgraph
# Compute coverage for each window, bedGraph input...
bedtools map -a windows.bed -b sample.bedgraph -c 4 -o mean -null 0 > coverage.bedgraph
# Sort bedGraph by window number
sort -n -k4 coverage.bedgraph > coverage.sorted.bedgraph
# Calculate average coverage for each window number
bedtools groupby -i coverage.sorted.bedgraph -g 4 -c 5 -o mean > average.tsv
Powered by Biostar version 2.3.0