Ultra-sensitive variant calling and transcript quantification using Unique Molecular Identifiers


Next Generation Sequencing (NGS) technologies have transformed medical and genomics research. Steady cost reductions and rising throughput at molecular resolution have driven the adoption of NGS methods in labs and clinics worldwide. A third generation of sequencing technologies is now emerging, lending impetus to the vision of preventive, predictive, personalized, and precision (P4) medicine.

At the core of NGS technologies lie finely tuned, optimized, and sensitive molecular biology and chemistry protocols, which help capture an accurate snapshot of how cells respond at molecular resolution under varying genotypic and environmental conditions. To understand genotype-phenotype relationships, accurate quantification of sequenced reads is essential before drawing conclusions or deriving actionable insights from NGS data.

Challenges and opportunities

Accurate quantification of NGS data requires discriminating PCR duplicate reads from identical molecules that are of unique origin. Computationally, PCR duplicates are identified as sequence reads that align to the same genomic coordinates in a reference-guided alignment. However, identical molecules can arise independently during library preparation and can have distinct cellular origins. Falsely identifying these molecules as PCR duplicates can therefore lead to erroneous analysis and interpretation of NGS data.
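
The coordinate-based approach described above can be illustrated with a minimal Python sketch. The `Read` record and the `mark_duplicates_by_position` function are hypothetical names invented for this example, and the rule of keeping the first read per group is a simplifying assumption; this is not the implementation of any particular tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal alignment record, for illustration only."""
    name: str
    chrom: str
    start: int   # leftmost aligned position
    strand: str  # "+" or "-"


def mark_duplicates_by_position(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing (chrom, start, strand) are treated as copies of one
    molecule; the first read seen in each group is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.strand)].append(read)

    duplicates = set()
    for members in groups.values():
        duplicates.update(r.name for r in members[1:])
    return duplicates


reads = [
    Read("r1", "chr1", 1000, "+"),
    Read("r2", "chr1", 1000, "+"),  # PCR copy of r1
    Read("r3", "chr1", 1000, "+"),  # independent molecule, same coordinates
    Read("r4", "chr1", 1200, "-"),
]
print(sorted(mark_duplicates_by_position(reads)))  # ['r2', 'r3']
```

Here r3 comes from a different molecule than r1, yet it is discarded along with the true PCR copy r2, because coordinates alone cannot tell the two cases apart.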

On the other hand, it is unclear how much noise or bias PCR amplification introduces and how it affects the accuracy of quantification. RNA-Seq methods generally work with small starting amounts of RNA that require PCR amplification to generate libraries of sequenceable size. To assess the effects of amplification, reads that originated from the same RNA molecule (PCR duplicates) need to be identified.

In variant calling, sequencing errors make it even more difficult to distinguish true variants from sequencing artifacts. The problem is most prominent when detecting low-frequency somatic variants, such as those expected in liquid biopsy samples, where the proportion of circulating tumor DNA (ctDNA) is very low compared to normal cell-free DNA (cfDNA). Accurate detection of such variants is key to early-stage cancer diagnosis and monitoring.
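
A rough back-of-the-envelope calculation shows why this is hard. The Python sketch below compares the expected number of reads carrying a true variant with the expected number showing the same base purely through sequencing error. The depth, variant allele fraction, and per-base error rate are illustrative assumptions rather than measured values, and errors are assumed to be spread evenly over the three alternative bases.

```python
def expected_alt_reads(depth, vaf, error_rate):
    """Expected reads showing one specific alternate base at a site.

    Returns (reads from true variant molecules, reads from errors).
    Assumes an erroneous base is equally likely to be any of the
    three other bases (a simplification).
    """
    true_variant = depth * vaf
    from_errors = depth * (1 - vaf) * error_rate / 3
    return true_variant, from_errors


# Illustrative values only: 10,000x depth, 0.1% allele fraction,
# 0.3% per-base sequencing error rate.
variant, noise = expected_alt_reads(depth=10_000, vaf=0.001, error_rate=0.003)
print(f"reads from true variant: {variant:.2f}")
print(f"reads from errors:       {noise:.2f}")
```

Under these assumptions both sources contribute roughly ten reads each, so raw read counts alone cannot separate signal from noise at this allele fraction.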

Unique Molecular Identifiers

To overcome these challenges, protocols that assign Unique Molecular Identifiers (UMIs) to DNA molecules during NGS library preparation are gaining wide attention (Figure 1). UMIs are short random nucleotide sequences (4–20 bases) that are increasingly used in high-throughput NGS experiments to distinguish individual DNA/cDNA molecules [1, 2].
The following terms are synonymous with UMIs:

  • Unique Molecular Tags (UMTs)
  • Random Molecular Tags (RMTs)
  • Molecular Barcode
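
The length of the UMI determines how many distinct tags are available: a random sequence of n bases can take 4^n values. The Python sketch below computes this, along with the chance that at least two molecules sharing the same genomic position receive the same UMI by accident. It assumes UMIs are drawn uniformly at random, which real libraries only approximate; the function names are made up for this example.

```python
def umi_space(length):
    """Number of distinct UMIs of the given length (4 bases per position)."""
    return 4 ** length


def collision_probability(n_molecules, length):
    """Probability that at least two of n molecules share a UMI.

    Birthday-problem calculation, assuming uniformly random UMIs.
    """
    space = umi_space(length)
    if n_molecules > space:
        return 1.0
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct


for length in (4, 8, 12):
    p = collision_probability(100, length)
    print(f"{length:>2}-base UMI: {umi_space(length):>12,} tags, "
          f"P(collision among 100 molecules) = {p:.4f}")
```

With 100 molecules at one position, a 4-base UMI (256 tags) almost certainly produces a collision, whereas longer UMIs make accidental sharing increasingly rare.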

Figure 1: UMI-assignment to DNA fragments

UMI-tagged NGS data allow users to 1) accurately quantify gene expression levels in individual cells using single-cell RNA-Seq experiments and 2) detect low-frequency variants with better sensitivity and specificity using UMI-based DNA-Seq experiments (Figures 2 and 3). These applications are of particular interest in areas such as liquid biopsy based cancer diagnosis and monitoring using cfDNA [3] and the study of cell-to-cell heterogeneity through differential expression at the level of individual cells rather than whole tissue [4].
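
Both applications rest on the same idea: reads that share an alignment position and a UMI are treated as copies of one original molecule. A minimal Python sketch of that grouping, followed by a per-family majority-vote consensus, is given below. The tuple layout, the function names, and the 60% agreement threshold are assumptions made for illustration; UMIs are matched exactly here, and reads in a family are assumed to start at the same position and have equal length.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group read sequences into families keyed by (chrom, start, umi)."""
    families = defaultdict(list)
    for chrom, start, umi, seq in reads:
        families[(chrom, start, umi)].append(seq)
    return families


def consensus(seqs, min_fraction=0.6):
    """Column-wise majority vote over same-length sequences.

    Emits 'N' where no base reaches min_fraction of the family.
    """
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


reads = [
    ("chr1", 100, "ACGT", "TTGACA"),
    ("chr1", 100, "ACGT", "TTGACA"),
    ("chr1", 100, "ACGT", "TTGTCA"),  # sequencing error at the 4th base
    ("chr1", 100, "GGCA", "TTGCCA"),  # different molecule, carries a variant
    ("chr1", 100, "GGCA", "TTGCCA"),
]

families = group_by_umi(reads)
print("molecules:", len(families))  # 2 families from 5 reads
for key, seqs in families.items():
    print(key, consensus(seqs))
```

Counting families instead of reads gives the molecule count used for quantification. The consensus step discards the isolated error in the first family while keeping the base change that every read of the second family supports.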

Figure 2: UMI-based PCR duplicate removal and accurate read quantification

Figure 3: UMI-based ultra-sensitive variant calling

However, hardly any bioinformatics software provides an end-to-end solution that supports UMI-aware custom data import, QC metrics, consensus alignment, quantification, variant calling, and a genome browser to explore and analyze UMI-tagged NGS data.

Bioinformatics Solution

Strand NGS v3.1, released during ASHG 2017 [5, 6], supports the features needed to explore, analyze, and visualize UMI-tagged NGS data, allowing researchers to harness the potential of big data to gain deeper insights.

With open source pipelines, users have to combine multiple third-party tools and scripts to perform alignment, filtering, and post-processing of BAM files, and then use yet another tool to visualize and verify the variant reads. Most open source software also requires basic knowledge of scripting and the Linux operating system to run command-line tools.

Strand NGS, on the other hand, provides a user-friendly end-to-end solution covering data import, pre-processing, alignment, filtering, variant calling, and genome browser based visualization and variant verification (Figure 4).

Figure 4: UMI-based data analysis support in Strand NGS

More features are in the pipeline to empower researchers on the journey from reads to discoveries.

Webinar and resources

We recently delivered a webinar on “Unique Molecular Identifier-powered Ultra-sensitive Variant Calling using Strand NGS”. The recording is available at http://www.strand-ngs.com/learn/webinar-recordings. Please feel free to share it with friends, colleagues, and connections who might be interested in this topic and in recent trends in NGS.

Visit the Strand NGS website at http://www.strand-ngs.com/ to learn more about NGS data analyses with Strand NGS through recorded webinars, tutorials and reference manuals.


  1. Kivioja et al. Nature Methods 9, 72–74 (2012) doi:10.1038/nmeth.1778
  2. Islam et al. Nature Methods 11, 163–166 (2014) doi:10.1038/nmeth.2772
  3. Phallen et al. Science Translational Medicine 9(403), eaan2415 (2017) doi:10.1126/scitranslmed.aan2415
  4. Ofengeim et al. Trends in Molecular Medicine 23(6), 563–576 (2017) doi:10.1016/j.molmed.2017.04.006
  5. ASHG 2017 [http://www.ashg.org/2017meeting/]
  6. Strand Life Sciences Announces the Release of Strand NGS v3.1 at ASHG 2017 [https://www.prnewswire.com/news-releases/strand-life-sciences-announces-the-release-of-strand-ngs-v31-at-ashg-2017-300538443.html]

Contact author:

Dr. Pandurang Kolekar
Bioinformatics Engineer
Strand Life Sciences Pvt. Ltd., Bengaluru, INDIA

Send Mail to pandurang [at] strandls.com