Thursday, September 26, 2013

Explanation of different options for normalizations in Cufflinks

This is the best explanation I have seen so far of the different normalization schemes available in Cufflinks and how each one affects the FPKM calculation. I am copying it directly from the thread, which can be seen here

“With cufflinks you can have three different normalizations: fragments mapped to the genome (in millions), fragments mapped to the transcriptome (in millions: --compatible-hits-norm), or upper quartile (-N). Regardless of the normalization, the same number of reads is quantified at each gene. I've looked into it myself. If you run cufflinks using all three of those normalizations and then look at each of the separate isoforms.fpkm_tracking files, you can confirm it. Check the coverage and FPKM columns: you should see different FPKMs but identical coverages across the three quantifications. Furthermore, if you divide the FPKMs by each other, you should see that at each gene there's a constant ratio between the FPKMs.
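To make that check concrete, here is a minimal Python sketch. It assumes you have run cufflinks three times (once per normalization) and saved each run's isoforms.fpkm_tracking under the hypothetical directory names below; the coverage and FPKM column names follow the Cufflinks tracking-file format.

import pandas as pd

# Load the isoforms.fpkm_tracking file from each of the three runs.
# (Directory names here are hypothetical.)
genome = pd.read_csv("genome_norm/isoforms.fpkm_tracking", sep="\t", index_col="tracking_id")
compat = pd.read_csv("compatible_hits_norm/isoforms.fpkm_tracking", sep="\t", index_col="tracking_id")
upperq = pd.read_csv("upper_quartile_norm/isoforms.fpkm_tracking", sep="\t", index_col="tracking_id")

# Coverage should be identical across the three quantifications...
print((genome["coverage"] == compat["coverage"]).all())
print((genome["coverage"] == upperq["coverage"]).all())

# ...while the FPKMs should differ by a constant per-sample factor
# (compare ratios wherever FPKM is nonzero).
nonzero = compat["FPKM"] > 0
print((genome.loc[nonzero, "FPKM"] / compat.loc[nonzero, "FPKM"]).describe())
nonzero = upperq["FPKM"] > 0
print((genome.loc[nonzero, "FPKM"] / upperq.loc[nonzero, "FPKM"]).describe())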

If you calculate FPKMs yourself, you can see why the numbers shift around. To be honest, the "FPKM" designation is misleading when you're using any normalization other than "mapped reads in millions". Right? Fragments per kilobase per million mapped reads is what you're used to.
So say we have a gene that's 2,500 bases (2.5 kb) long. We've got 121 fragments mapped to it and 34.7 million fragments mapped to the genome. We can get the FPKM like so:

FPKM = 121/(34.7*2.5) = 1.394813
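Here's that arithmetic as a tiny Python check (a worked example with the thread's numbers, not Cufflinks code):

fragments = 121          # fragments mapped to the gene
length_kb = 2500 / 1000  # gene length in kilobases
mapped_millions = 34.7   # fragments mapped to the genome, in millions

fpkm = fragments / (mapped_millions * length_kb)
print(fpkm)  # ~1.394813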

Say only 27.4 million fragments mapped to the transcriptome. If you used --compatible-hits-norm, then the calculation looks like this:

FPKM = 121/(27.4*2.5) = 1.766423
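And the same check with the transcriptome-mapped count swapped in:

fragments = 121          # same gene and fragment count as above
length_kb = 2.5          # gene length in kilobases
mapped_millions = 27.4   # fragments mapped to the transcriptome, in millions

fpkm = fragments / (mapped_millions * length_kb)
print(fpkm)  # ~1.766423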

Those aren't that different from one another. Now if you use upper quartile, we're talking about the upper-quartile value of fragment counts at genes in the sample. That number might be something like 12,000. Divide this value by 1e6 to put it into "millions", as you do with mapped fragments, and it becomes 0.012. So now the calculation looks like this:

FPKM = 121/(0.012*2.5) = 4033.333
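The same arithmetic once more with the upper-quartile value, including the 1000x rescaling suggested in the next paragraph:

fragments = 121
length_kb = 2.5
upper_quartile = 12000        # upper quartile of per-gene fragment counts
norm = upper_quartile / 1e6   # put it "in millions" like a library size -> 0.012

fpkm = fragments / (norm * length_kb)
print(fpkm)         # ~4033.333
print(fpkm / 1000)  # ~4.033, the rescaling suggested below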

So maybe it makes sense to scale the upper-quartile normalization value by 1000 so that the "FPKM" comes out as 4.033 instead of 4033. That's reasonable. But it really shouldn't be called an FPKM, because if you think about it, it's like someone telling you there are 14 cars outside and you assume they mean 14 in base 10... but they actually told you 14 in base 16, which would be 20 in base 10 (or like expecting a measurement in cm but being given it in inches with a cm label). It's not fragments per kilobase per million mapped reads; it's fragments per kilobase per upper quartile of read counts at genes. So FPKPUQRCG. That name sucks.

These different normalizations only matter when you're comparing samples to each other. So if your goal is to see whether gene X is expressed higher in sample A versus sample B, then regardless of the normalization used (as long as you use the same one on both samples) you'll find your answer. The upper quartile normalization has been shown to be more robust, so maybe it's better to use it when comparing samples to one another. Also, obviously, for the expression levels to make sense to other people, we all need to be using the same normalization. We should probably all be using upper quartile normalization, but that puts the numbers on a different scale than we're used to seeing.”