Benford’s Law and NGS Gene Counts

This may be more trivia than anything else, but let’s begin with the surprising part. Take any RNA-Seq sample and assign to each gene a count equal to the number of reads aligning to that gene (what we call Quantification in Avadis NGS). For the purposes of this post, ignore all genes with a count of 0; these typically make up roughly half the genes. For each remaining gene, consider the most significant digit of its count (e.g., if the count is 35, the most significant digit is 3). There are 9 possible values for this digit; how are they distributed across genes? Are all 9 equally likely, or are some digits more likely than others? The distribution computed on actual data is shown in blue below (we’ve tried several datasets and the distribution doesn’t change much from one to another).
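For readers who want to try this on their own data, here is a minimal sketch of the tally described above. The function names and the toy `gene_counts` list are hypothetical illustrations; this is not Avadis NGS output or real RNA-Seq data.

```python
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the most significant decimal digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n


def leading_digit_distribution(counts):
    """Fraction of nonzero counts whose leading digit is d, for d = 1..9."""
    nonzero = [c for c in counts if c > 0]  # genes with 0 counts are ignored
    if not nonzero:
        raise ValueError("no nonzero counts to tally")
    tally = Counter(leading_digit(c) for c in nonzero)
    return {d: tally.get(d, 0) / len(nonzero) for d in range(1, 10)}


# Hypothetical toy counts, for illustration only.
gene_counts = [0, 35, 1, 12, 0, 178, 9, 21, 1043, 0, 16, 2]
dist = leading_digit_distribution(gene_counts)
```

On a real sample, `gene_counts` would be the per-gene read counts from whatever quantification step you use.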

Surprisingly (though this won’t be surprising any more by the time you reach the end of this post), the digits are not all equally frequent. The digit 1 appears approximately 30% of the time, more often than any other digit, and each successive digit is less frequent than the one before!

This is not a new phenomenon; it has long been known as Benford’s Law, and it was first noticed through the rather alert and curious observation that the pages of a book of log tables covering numbers beginning with 1 showed more wear and tear than the pages for the other digits. The theoretical predictions of this law are shown in red in the picture above; the match is clearly very close. Benford’s Law appears in many contexts, including heights of structures and populations of countries. But why does it appear in RNA-Seq gene counts?
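The theoretical frequencies are easy to compute. The sketch below uses the standard statement of Benford’s Law, under which the leading digit d occurs with frequency log10(1 + 1/d); the function name `benford_expected` is just an illustrative choice.

```python
import math


def benford_expected():
    """Benford's Law frequency of each leading digit d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


expected = benford_expected()
for d, p in expected.items():
    print(f"digit {d}: {p:.3f}")
```

The nine frequencies sum to 1, since the product of (1 + 1/d) over d = 1..9 telescopes to 10.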

Let us take a look at the distribution of gene counts. For simplicity, consider only the frequencies of counts 1 through 99, i.e., how many genes get a count of 1, how many get a count of 2, and so on up to 99. The picture is as follows.
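A sketch of how such a frequency profile might be tabulated is given below. Normalizing by the number of genes whose count falls in the 1 to 99 range is an assumption made here for illustration; the post does not say how the plotted frequencies were scaled. The function name and toy data are hypothetical.

```python
from collections import Counter


def count_frequencies(counts, lo=1, hi=99):
    """Fraction of in-range genes having each count value lo..hi.

    Assumption: frequencies are normalized over genes with lo <= count <= hi.
    """
    in_range = [c for c in counts if lo <= c <= hi]
    if not in_range:
        raise ValueError("no counts fall in the requested range")
    tally = Counter(in_range)
    return {k: tally.get(k, 0) / len(in_range) for k in range(lo, hi + 1)}


# Hypothetical toy counts, for illustration only.
toy_counts = [1, 1, 2, 5, 5, 5, 40, 120, 0]
freq = count_frequencies(toy_counts)
```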

The profile of red dots is the actual data. The profile of green dots is the curve y = 1/(2x ln 10), essentially a hyperbola, which fits the data quite well. Given this fit, how frequently does the most significant digit take the value d? We simply need to integrate the curve over the counts that begin with d, as below. For counts between 1 and 99, that means integrating from d to d+1 and from 10d to 10(d+1); for instance, for d=3, we integrate from 3 to 4 and from 30 to 40.
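Written out in LaTeX notation, and assuming the fitted curve holds over the whole range from 1 to 100, the integral can be reconstructed as follows:

```latex
\int_{d}^{d+1} \frac{dx}{2x\ln 10} + \int_{10d}^{10(d+1)} \frac{dx}{2x\ln 10}
  = \frac{1}{2\ln 10}\left[\ln\frac{d+1}{d} + \ln\frac{10(d+1)}{10d}\right]
  = \frac{\ln\left(1 + \frac{1}{d}\right)}{\ln 10}
  = \log_{10}\left(1 + \frac{1}{d}\right)
```

The two integrals contribute equally, because the factor of 10 cancels inside the second logarithm; that is why the curve carries the factor 1/2.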

And the expression on the right, log10(1 + 1/d), is exactly the frequency prescribed by Benford’s Law. For d=1, this is log10(2), which is about 0.3. As simple as that!
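As a quick numerical sanity check, the sketch below integrates the fitted curve with a simple midpoint rule and compares the result with log10(1 + 1/d) for every digit. The helper names (`curve`, `integrate`, `digit_mass`) and the step count are illustrative choices, not part of the original analysis.

```python
import math


def curve(x):
    """The fitted hyperbola y = 1 / (2 x ln 10)."""
    return 1.0 / (2.0 * x * math.log(10))


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def digit_mass(d):
    """Area under the curve over counts in 1..99 whose leading digit is d."""
    return integrate(curve, d, d + 1) + integrate(curve, 10 * d, 10 * (d + 1))


for d in range(1, 10):
    assert math.isclose(digit_mass(d), math.log10(1 + 1 / d), rel_tol=1e-6)
```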

About Ramesh Hariharan

Dr. Ramesh Hariharan is an academic entrepreneur responsible for the software-based technology development and implementation at Strand Life Sciences. He is also the chief architect for all of Strand's products, including the award-winning Avadis® platform. He is a recipient of the TR100 Award (2002) of Young Innovators by MIT's Technology Review Magazine and in 2003 received the Global Indus Technovator Award from MIT, instituted to recognize the top 20 Indian technology innovators worldwide. Ramesh is an IIT Delhi Computer Science alumnus, has a Ph.D. in Computer Science from the Courant Institute of Mathematical Sciences, New York University, and a postdoctoral degree in Computer Science from the Max Planck Institute, Saarbrücken, Germany. His research interests are in sequence analysis, string algorithms, computational biology, computational geometry and foundations of computing.