## How Identical are Identical Twins?

We’re looking at *exome sequencing data* on whole peripheral blood DNA of *monozygotic twins* (this data was generated by our collaborators, Jan Dumanskiand his group at Uppsala University in Sweden). Monozygotic twins were earlier thought to be genetically identical; now we know that isn’t completely true. How does one identify small mutations (SNPs and small InDels) that are present in one of the twins but not in the other? Or in general, how does one compare two different samples, for instance, to find *somatic mutations* that are present in a tumor sample but not present in the paired normal sample.

The obvious approach is to call SNPs in each sample first, and then identify loci where an allele is called in *exactly one* of the two samples, e.g., allele A is called in one twin but not in the other. The number of variant alleles called in each of the two twins is about 150,000. The number of differential allele calls between the two twins is about 27,000. Seems like too many. A quick look at the results reveals many instances with the following pattern.

Here, A is the reference allele. The fraction of reads supporting variant allele G barely increases from 34.48% in twin 1 to 36.36% in twin 2; yet, that is enough to tip the SNP call from AA in twin 1 to AG in twin 2. This is clearly a false positive, resulting from the cut-offs used by the SNP calling procedure. This example suggests that instead of comparing SNP calls, one should look for a substantial difference in the fraction of reads supporting the variant allele between the two samples. Let ** DiffSuppReads** denote this difference.

Suppose, we insist on `DiffSuppReads`>50%; then the number of differential allele calls reduces to just about 450, down from 27,000! One could ask, why 50%? A rough reason: look at the distribution of the number of supporting reads over all alleles called for a particular twin; this appears to be a mixture of three separate distributions, with 25% and 75% being the transition points between successive distributions. So `DiffSuppReads`>50% ensures that the two samples are unlikely to come from the same distribution.

Can we define a cut-off for `DiffSuppReads` more rigorously? What is the probability that `DiffSuppReads` is large, given that the genotype call is the same in both twins ? The heterozygous genotype will challenge this probability the most, so let us assume that both twins are heterozygous. The distribution of `DiffSuppReads` is given by the following formula (here, n1 and n2 are the number of reads in the two samples, respectively, at this locus of interest, and `DiffSuppReads` is a fraction between 0 and 1).

We use this to calculate the critical value of `DiffSuppReads` at which the above probability becomes less than, say 10e-5, so we can offset multiple testing effects for the approximately 100,000 variant alleles being tested. This critical value of `DiffSuppReads` varies from locus to locus, because of the dependence of n1 and n2. So we compute this value for each variant allele and see which variant alleles pass. This yields about 400 differential allele calls. Another interesting phenomenon strikes though, as illustrated in the example below.

The difference in supporting reads fractions is just 3.3%, but the critical value for `DiffSuppReads` is even smaller, because of very large read counts (9317 and 8943) at this locus. For very large read counts, an underlying heterozygous genotype would yield very close to 50% of the reads on each allele, and even slight deviations from this become highly improbable. This suggests that even a 3% value of `DiffSuppReads` is very unlikely to arise if both twins had the same genotype, and therefore one can conclude that the twins have different genotypes. This conclusion is however not very believable, and likely suggests that the assumption of reads being drawn independently is at risk of violation in areas with high coverage.

To summarise, just comparing genotype calls yields many false positives where the call is different due to cut-off effects, and using a statistical approach to generate a cut-off for `DiffSuppReads` yields false positives in high coverage regions. Can one combine these approaches? Here is table with various combinations.

Which of these is the right answer? The 1000 Genomes Project estimated that a child has only around 50 new mutations relative to its parents. Monozygotic twins ought to be closer than that. And we are observing only the exomes (and some neighborhood) of these twins. So the real answer probably lies close to the bottom of the above table. However, as Jan Dumanski points out, much of the 1000 Genomes effort involved sequencing of oligoclonal/monoclonal lymphoblastoid cell lines, not quite directly comparable with whole peripheral blood.

We’ve experimented with some of the above methods for differential SNP calling in Avadis NGS. Avadis NGS 1.0 used an approach based on A) above. Avadis NGS 1.2 used an approach based on C). For the above reasons, Avadis NGS 1.3 is moving towards approach B). Hopefully, this one will last.