Webinar of the Month Series: Configuring SNP detection pipelines in Avadis NGS for accurate analysis of clinical samples
Presenter: Dr Vamsi Veeramachaneni, Vice President, Strand Life Sciences
Abstract: Running a SNP detection pipeline and identifying high quality variant calls quickly is challenging. This is especially true in the case of clinical labs where multiple panels are used and kit-specific biases can result in false positive SNP predictions.
In this webinar, Dr Vamsi will show how one can use the powerful visualization features of Avadis NGS to quickly detect false positive SNP predictions, identify the cause of the errors, and fine-tune the detection pipeline for accurate analysis.
Session 1: Feb 26; Europe + Asia; 10 AM Central European Time
Session 2: Feb 26; North + South America; 9 AM Pacific Standard Time
Confirm your participation on or before 24th Feb 2014 by registering online for free.
Transcript annotations are extensively used in NGS data analysis. In RNA-Seq, they are used at every step of the pipeline – to map spliced reads against the genome, perform quantification, detect novel exons etc. In DNA-Seq, they are used to predict the effect of variants detected in the sample. Clearly accurate transcript annotations are vital for NGS work.
Many researchers prefer to work with RefSeq transcripts because they are manually curated. But there is a problem. The RefSeq transcript project provides the transcript sequence and the location of exons on the transcript sequence but does not provide the genomic coordinates for the exons. So one common strategy is to obtain the genomic coordinates from UCSC. The folks at UCSC routinely align the RefSeq transcript sequences against the genome using BLAT and make the results available as “refFlat” files on their download site.
Unfortunately, these BLAT alignments are sometimes wrong.
Shown below is the transcript track for TNNI3, a gene on the negative strand of chromosome 19. Note that the coding region of the first exon in the “RefSeq genes” picture occupies 22bp while the UCSC track at the top shows only 11bp.
The RefSeq transcript that UCSC used for alignment can be found by clicking on TNNI3 in the RefSeq gene track; it is NM_000363.4. A portion of the transcript entry is shown below.
The RefSeq entry clearly indicates that only 11 bases (144-154) at the end of the first exon represent coding bases. Moreover, the transcript has a CCDS entry indicating that there is a genomic alignment which translates to the protein sequence shown.
To get a better understanding of the problem, we looked at the UCSC and the RefSeq transcripts in more detail in the Elastic Genome Browser. The introns have been compressed so that exonic and essential splice site sequences can be seen in more detail.
Some of the observations from the above picture are:
- the alignment for the RefSeq transcript leads to a premature stop-codon very early on,
- the essential splice site signals are correct in the UCSC transcript but wrong in the RefSeq transcript alignment
These are sanity checks that any researcher using the UCSC alignments of RefSeq transcripts should incorporate before carrying out analysis.
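One such sanity check, verifying the essential GT..AG dinucleotides at every intron boundary, takes only a few lines. The sketch below is purely illustrative (the genome string and coordinates are made up, and this is not Avadis NGS code):

```python
def check_splice_sites(genome, exons, strand="+"):
    """Return the introns whose essential splice-site dinucleotides
    are not the canonical GT (donor) .. AG (acceptor).
    `exons` is a list of (start, end) genomic intervals, 0-based
    half-open, sorted by genomic position."""
    complement = str.maketrans("ACGT", "TGCA")
    revcomp = lambda s: s.translate(complement)[::-1]
    problems = []
    for (_, intron_start), (intron_end, _) in zip(exons, exons[1:]):
        donor = genome[intron_start:intron_start + 2]
        acceptor = genome[intron_end - 2:intron_end]
        if strand == "-":
            # On the minus strand the signals read in the other direction
            donor, acceptor = revcomp(acceptor), revcomp(donor)
        if (donor, acceptor) != ("GT", "AG"):
            problems.append((intron_start, intron_end, donor, acceptor))
    return problems

# Two exons separated by a canonical GT..AG intron (plus strand)
genome = "AAAAA" + "GTCCCAG" + "TTTTT"
print(check_splice_sites(genome, [(0, 5), (12, 17)]))  # []
```

A transcript whose alignment breaks this check, as the RefSeq alignment of TNNI3 does, deserves a closer look before it is used for variant-effect prediction.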
And, finally, the picture also suggests why this error happened. The incorrect extension to exon 1 in the RefSeq transcript alignment (GCATCACTCAC) is very similar to the sequence of the small exon 2 present in the UCSC transcript (GCATCGCTGCTC). It is possible that BLAT is not well suited to detecting small intermediate exons, especially when a very similar alternate alignment exists.
We’ve been using Avadis NGS to analyze a number of clinical samples and ran into an interesting case for InDel detection that could lead to false interpretation unless handled properly. The case at hand involves a child who suffered from pulmonary hypertension, pulmonary infections, and a few other abnormalities, all at birth or in the first year of life, and subsequently died in the second year. We had DNA from this child as well as from his (consanguineous) parents. All three were subjected to exome sequencing.
We used Avadis NGS to align the data and then performed variant calling. Next, we used the Find Significant Variants functionality in Avadis NGS to identify variants which were heterozygous in the parents but homozygous in the child, and then restricted these variants to those with allele frequencies below 1% using the Find Damaging Variants operation. This yielded a hundred or so gene candidates. We then picked out genes known to cause rare childhood diseases.
One such gene was ALMS1, which is involved in Alstrom Syndrome, a disease that is known to be associated with the occurrence of recurrent pulmonary infections. The mutation in ALMS1 was a CTC insertion, heterozygous in both parents and homozygous in the child. This mutation causes the insertion of an extra amino acid P (=CCT). It also appeared to be a novel mutation not reported in dbSNP, warranting further investigation.
By habit, we also look at dbSNP variants in the vicinity of our mutation of interest just so we can explain the presence of common polymorphisms, if any, close to our potentially deleterious mutation. In this case, there was indeed a dbSNP insertion close-by, as shown in the picture below.
This dbSNP variant had the exact same insertion sequence, i.e., CTC, but appeared 3 bases away from our mutation. And it so happened that the 3 intervening characters were CTC as well! In other words, our mutation and the dbSNP variant were just two different representations of the same event: CTC is repeated twice in the reference, and the insertion can be placed either to the left (our mutation) or to the right (dbSNP). Further, it so happens that the dbSNP variant is a very common one. Therefore our mutation, far from being novel, was quite common and not a candidate for the case at hand!
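This ambiguity is the classic indel-normalization problem: an insertion inside a repeat can be placed at either end of the repeat. A common remedy is to shift every indel to a canonical (leftmost) position before comparing against dbSNP. Here is a minimal sketch of left-aligning an insertion, our own illustration rather than the Avadis NGS implementation:

```python
def left_align_insertion(ref, pos, inserted):
    """Shift an insertion to its leftmost equivalent position.
    `pos` is the 0-based reference position the insertion precedes.
    While the reference base just left of the insertion point matches
    the last inserted base, the insertion can slide one base left."""
    inserted = list(inserted)
    while pos > 0 and ref[pos - 1] == inserted[-1]:
        inserted.insert(0, inserted.pop())  # rotate the allele right
        pos -= 1
    return pos, "".join(inserted)

# A CTC insertion next to an existing CTC repeat: placing it before
# position 5 or before position 2 yields the same alternate sequence.
print(left_align_insertion("AACTCGG", 5, "CTC"))  # (2, 'CTC')
```

Normalizing both the called variant and the dbSNP entry this way would have revealed immediately that they are the same allele.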
This tells us that in addition to local realignment of an InDel across multiple overlapping reads, we also need to realign to dbSNP to identify the true allele frequency of our mutation, an alteration that we will definitely seek to add to Avadis NGS in the near future.
This may be a piece of trivia more than anything else, but anyway, let’s begin with the surprising part. Take any RNA-Seq sample and assign to each gene a count corresponding to the number of reads aligning to that gene (what we call Quantification in Avadis NGS). For the purposes of this post, ignore all genes with 0 counts; this is usually roughly half the genes. For each of the remaining genes, consider the most significant digit in its count (i.e., if the count is 35, the most significant digit is 3). Over all genes, there are 9 choices for this significant digit; how are these 9 choices distributed? Are all choices equally likely? Or are some digits more likely than others? The distribution computed on actual data is shown in blue below (we’ve tried several datasets and the distribution doesn’t change much from one to another).
Surprisingly (though this won’t be surprising any more by the time you reach the end of this post), the digits are not all equally frequent. The digit 1 appears approximately 30% of the time and is more frequent than the other digits, which have successively reduced frequencies!
This is not a new phenomenon; it’s been known for a while as Benford’s Law, and it was first noticed through the alert and curious observation that the pages of a book of log tables corresponding to numbers beginning with 1 showed more wear and tear than those for other digits. The theoretical predictions of this law are shown in red in the picture above; clearly, a very close match. Benford’s Law appears in many contexts, including heights of structures and populations of countries. Why does it appear in RNA-Seq gene counts though?
Let us take a look at the distribution of gene counts. For simplicity, consider only the frequencies of counts 1 through 99, i.e., how many genes get a count of 1, how many get a count of 2, and so on through 99. The picture is as follows.
The profile of red dots is the actual data. The profile of green dots is the curve y = 1/[2x ln(10)], essentially a hyperbola, which fits the data quite well. Given this fit, how frequently does the most significant digit take on the value d? We simply integrate the fitted curve over the counts whose leading digit is d. For instance, for d=3, we need to integrate from 3 to 4, and from 30 to 40. In general:

∫ from d to d+1 of dx/[2x ln(10)] + ∫ from 10d to 10(d+1) of dx/[2x ln(10)] = (1/2) log10((d+1)/d) + (1/2) log10((d+1)/d) = log10(1 + 1/d)

And the expression on the right is exactly the frequency prescribed by Benford’s Law. For d=1, this is log10(2), which is about 0.3! As simple as that!!
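The calculation above is easy to check numerically. The sketch below draws counts whose density falls off like 1/x, in the manner of the gene-count histogram, and compares the observed leading-digit frequencies with log10(1 + 1/d); the sampling scheme is our own illustrative choice, not how the plots in this post were made.

```python
import math
import random

random.seed(0)

# Benford's Law: P(leading digit = d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Draw counts with density proportional to 1/x by sampling the
# exponent uniformly: x = 10**u with u ~ Uniform(0, 4).
sample = [int(10 ** random.uniform(0, 4)) for _ in range(100000)]

observed = {d: 0 for d in range(1, 10)}
for x in sample:
    observed[int(str(x)[0])] += 1

for d in range(1, 10):
    print(d, round(benford[d], 3), round(observed[d] / len(sample), 3))
```

With 100,000 draws the observed frequency of a leading 1 comes out very close to log10(2) ≈ 0.301, just as in the RNA-Seq data.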
This post is the first of a series where we describe some interesting algorithms used in our product Avadis NGS. The first post in this series is about string matching using the FM index, a technique that forms the core of our aligner COBWeb. We have intentionally kept bioinformatics jargon to a minimum to benefit a wider audience.
This post describes an elegant and fast algorithm for exact string matching. Why another string matching algorithm? To answer the question, let’s first understand the problem we are trying to solve.
In short, the problem is to match billions of short strings (about 50-100 characters long) to a text which is 3 billion characters long. The 3 billion character string (also called reference) is known ahead and is fixed (at least for a species). The shorter strings (also called reads) are generated as a result of an experiment. The problem arises due to the way the sequencing technology works, which in its current form, breaks the DNA into small fragments and ‘reads’ them. The information about where the fragments came from is lost and hence the need to ‘map’ them back to the reference sequence.
We need an algorithm that allows repeatedly searching on a text as fast as possible. We are allowed to perform some preprocessing on the text once if that will help us achieve this goal. FM index is one such algorithm. It requires a one-time preprocessing of the reference to build an index, after which the query time is of the order of the length of the query (instead of the reference).
The Burrows-Wheeler transform (BWT) is a reversible string transformation that has been widely used in data compression. However, the application of the BWT to string matching was discovered fairly recently by Paolo Ferragina and Giovanni Manzini in these papers – http://dl.acm.org/citation.cfm?doid=1082036.1082039, http://www.computer.org/csdl/proceedings/focs/2000/0850/00/08500390-abs.html. This technique is the topic of this post. Before we get to the searching application, a little background on how the BWT is constructed and some of its properties.
BWT for a given text T is constructed as follows:
- Append an end of string character (say ‘$’) to the text. This character should not be present in the text and is treated as the lexicographically largest character.
- Collect all the cyclic permutations of the text and sort them.
- Arrange all the sorted strings one below the other. The last column forms the BWT.
The image below shows the BWT transform for banana$.
A few things to note from the image:
- The red column, as indicated, is the BWT of BANANA$
- The yellow column, as indicated, is the Suffix Array. For every row, the element in the suffix array is the index into the original text of the suffix starting in that row. Note that if you have the suffix array, constructing BWT is straightforward. The above algorithm of performing a full sort on all suffixes to find the suffix array and BWT is very inefficient. Efficiently finding the suffix array will be the topic for our next post in this series.
- The green column is just the sorted characters in the text.
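The construction steps above translate directly into a few lines of Python. The sketch below sorts the cyclic rotations of BANANA$, with ‘$’ treated as the lexicographically largest character as in this post, and reads off the BWT (last column), the suffix array, and the sorted first column. It is the inefficient full-sort method, meant only to mirror the figure:

```python
def bwt_by_sorting(text):
    """Build the BWT, suffix array and first column of text + '$'
    by fully sorting all cyclic rotations (simple but inefficient)."""
    text += "$"
    # Sort key that makes '$' the largest character
    key = lambda s: [("\x7f" if c == "$" else c) for c in s]
    starts = sorted(range(len(text)),
                    key=lambda i: key(text[i:] + text[:i]))
    suffix_array = starts               # row -> start index of its suffix
    bwt = "".join(text[i - 1] for i in starts)   # last column
    first = "".join(sorted(text, key=key))       # first column
    return bwt, suffix_array, first

bwt, sa, first = bwt_by_sorting("BANANA")
print(bwt)    # BNN$AAA
print(sa)     # [1, 3, 5, 0, 2, 4, 6]
print(first)  # AAABNN$
```

Note how the suffix array falls out of the same sort: the rotation that ends up in a row starts exactly at the index of that row’s suffix, which is the yellow column in the figure.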
Before proceeding we need to understand one interesting property of the BWT. The rank of a character in the BWT column is the same as that of the corresponding character in the first column. This is illustrated in the image above using colors. The 2nd N in the BWT column corresponds to the 2nd N in the first column, the 3rd A in the BWT column corresponds to the 3rd A in the first column, and so on. To see why this is the case, remember that all the rows in the grid are cyclic permutations of a single string. Let AX1, AX2, AX3 be three consecutive rows in the grid where A is the first character and X1, X2, X3 denote the rest of the row. As the rows are sorted lexicographically, X1, X2, X3 are sorted too. On cyclic shifts, we get X1A, X2A, X3A which are again sorted, hence in the same relative order in the grid. Note that now the A’s are entries in the BWT in the same relative order. This is illustrated in the figure below.
Search using FM Index
Our index data structure has the following:
- BWT – Indexed access. We will denote this by BWT(i).
- Suffix Array – Indexed access. We will denote this by SA(i).
- First column – Queries: the start index and end index of a character, and the index given a character and a rank. For example, find the index of the letter ‘G’ with rank 3. This can be achieved by just storing the frequency of each character in order. For example, for BANANA$, it’s enough to store 3A1B2N1$. These queries can be answered by a simple walk over this compressed structure. We will denote the queries by Start(character), End(character) and IndexOf(character, rank).
- Ranks in BWT – We need fast access to the following query: get the number of occurrences of a particular character above a particular row in the BWT. For example, get the number of ‘A’s above row 5. One way to achieve this is to store, at each row, the frequency of each character up to that row. This can be computed in a single pass over the BWT. In practice, though, we don’t store this at every row but only at regular intervals. We will assume here that it is stored at every row for simplicity. We will denote this query by RankOf(index, character).
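The “store at regular intervals” idea behind RankOf can be sketched as follows: keep cumulative counts only at checkpoint rows and scan the short stretch of BWT in between. The class below is a toy illustration (our own sketch, not Avadis NGS internals), using the BANANA BWT from earlier in the post:

```python
class SampledRank:
    """RankOf(index, char): occurrences of `char` strictly above row
    `index` in the BWT, with cumulative counts stored only every
    `step` rows to save memory."""
    def __init__(self, bwt, step=4):
        self.bwt, self.step = bwt, step
        self.checkpoints = []
        counts = {}
        for i, c in enumerate(bwt):
            if i % step == 0:
                self.checkpoints.append(dict(counts))
            counts[c] = counts.get(c, 0) + 1

    def rank_of(self, index, char):
        cp = index // self.step
        count = self.checkpoints[cp].get(char, 0)
        # Scan the few rows between the checkpoint and `index`
        return count + self.bwt[cp * self.step:index].count(char)

rank = SampledRank("BNN$AAA", step=4)
print(rank.rank_of(5, "A"))  # 1  (one 'A' above row 5)
print(rank.rank_of(7, "A"))  # 3
```

Larger `step` values shrink the checkpoint table at the cost of a longer scan per query; real implementations tune this trade-off carefully.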
Let T be the text searched on and Q be the query string. The algorithm is as follows:

index = len(Q) - 1          # walk the query backwards
nextChar = Q[index]
index -= 1
(start, end) = (Start(nextChar), End(nextChar))
while index >= 0:
    nextChar = Q[index]
    index -= 1
    (startrank, endrank) = (RankOf(start, nextChar), RankOf(end, nextChar))
    (start, end) = (IndexOf(nextChar, startrank), IndexOf(nextChar, endrank))
for i in range(start, end):
    print SA(i)
The output of this algorithm is the set of indices into the original text where the query is present. Note that error-condition handling has been skipped in the pseudocode above for brevity; it involves checking, at every iteration of the loop, that the range found is valid.
The visualization below explains the algorithm. A few points about the visualization:
- Yellow text at the top indicates the text to search on and the yellow text at the bottom, the query
- Red column indicates BWT, orange the suffix array
- The grey part of the grid is not stored as part of the index. It is shown here just for clarity
- Rows highlighted in blue indicate the range where the search up to that point is successful. This appears once the search animation starts.
Let’s step through the visualization for text ABRACADABRA and query DAB. Follow the points below by hitting the ‘Step’ button after each point.
- Remember that the query is walked from last to first. The first character processed is ‘B’, which results in the range shown
- Next character is ‘A’. Get the rank of ‘A’ on the BWT for the first and last rows of the current range. The ranks, as shown in the tooltips, are 0 and 2. The new range is determined by the rows where ‘A’ has rank 0 and rank 2 in the first column. This is shown by the new range
- Now the final character ‘D’. Performing the same steps as above, we get the final range as the row 9. The corresponding element in the suffix array is 6, which is the index in the original text where ‘DAB’ starts
Play around with the visualization with different texts and queries to understand the algorithm better.
How does this search work? Let’s first understand what the loop invariant here is. Let ‘i’ be the index into the query string at some point in time when the range in the grid is (s,e). Remember that we are walking the search query backwards. The loop invariant is that Q[i:] is the prefix of all strings in the range (s,e). To see that this is indeed the case, let’s move one more character along the query. Consider the characters on the BWT for the indices between s and e. All these characters precede various occurrences of the prefix Q[i:] in the text. The rank data structure enables us to quickly check if any of those characters are the character we want to match. If yes, we get the start and end rank of the character. Note that if the character is not present in that range, the start and end rank of the character would be the same, ending our search. Then we use the property of the BWT described above to jump to the corresponding character in the first column and continue our search. Thus when we exhaust all the characters in the query, we are left with the range on the grid where the prefix of every row is the query.
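Putting the pieces together, here is a compact, runnable version of the whole search, with naive implementations of Start/End, IndexOf, RankOf and the suffix array (full-sort construction, no rank checkpointing); ‘$’ is again treated as the largest character. This is an illustrative sketch, not COBWeb’s actual implementation:

```python
def fm_search(text, query):
    """Return all indices in `text` where `query` occurs, using
    backward search over the BWT as described above."""
    text += "$"
    order = lambda s: s.replace("$", "\x7f")    # '$' sorts largest
    sa = sorted(range(len(text)), key=lambda i: order(text[i:]))
    bwt = "".join(text[i - 1] for i in sa)          # last column
    first = "".join(sorted(text, key=order))        # first column

    start_of = lambda c: first.index(c)             # Start(c)
    end_of = lambda c: start_of(c) + first.count(c) # End(c)
    rank_of = lambda i, c: bwt[:i].count(c)         # RankOf(i, c)
    index_of = lambda c, r: start_of(c) + r         # IndexOf(c, r)

    if any(c not in first for c in query):
        return []
    c = query[-1]                                   # walk backwards
    start, end = start_of(c), end_of(c)
    for c in reversed(query[:-1]):
        start = index_of(c, rank_of(start, c))
        end = index_of(c, rank_of(end, c))
        if start >= end:                            # empty range: no match
            return []
    return sorted(sa[i] for i in range(start, end))

print(fm_search("ABRACADABRA", "DAB"))   # [6]
print(fm_search("ABRACADABRA", "ABRA"))  # [0, 7]
```

Running it on the visualization’s example reproduces the walkthrough above: the final range is a single row whose suffix-array entry is 6, where ‘DAB’ starts in ABRACADABRA.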
In the next post in this series, we will describe how to build the BWT index of a very long string efficiently.
In this month’s webinar, we will show you how to use the Find Significant SNPs analysis workflow in Avadis NGS 1.3.
SNP verification and prioritization is one of the most time consuming tasks in variant analysis. The workflows and algorithms built into Avadis NGS 1.3 make end-to-end SNP analysis easy, intuitive, and accurate. These workflows support multiple experimental setups and can be used for quickly identifying population-specific variants, somatic mutations, and tumor-specific markers with the aid of a user-friendly graphical interface.
Session 1: 18th April – 10 AM Pacific Standard Time (San Francisco)
Session 2: 19th April – 11 AM Central European Time (Berlin)
Register online for free!
Here is a publication from Sanford-Burnham Medical Research Institute that uses Avadis NGS: http://www.sciencedirect.com/science/article/pii/S000292971200095X
The goal in this paper is to find the underlying genomic cause of CHIME syndrome, a disease so rare that there are only 8 reported cases in the world. CHIME is an abbreviation derived from the main symptoms: Colobomas (hole in the eye), congenital Heart defects, Ichthyosiform dermatosis (scaling of skin), Mental retardation, and either Ear anomalies or Epilepsy.
They performed exome sequencing (using the NimbleGen target enrichment kit) on 5 of the 8 reported cases; they also ran CGH arrays on a 6th reported case to identify gross copy number changes, and found a 1 Mb deletion on chromosome 17. They focused on the 17 genes in this region for SNP analysis of the 5 exome sequencing samples.
They found that only the PIGL gene in this region carried compound heterozygous mutations in all 5 samples. They also sequenced parents of these 5 cases and found that both parents were carriers in all cases (each parent had a mutation in one chromosomal copy).
This method provides an interesting glimpse into a future where genome sequencing is routine and the database of disease-causing variants is comprehensive: the parents of these 8 cases would have known that they were both carriers, and with a little investigation of the yet-to-be-born babies, these cases might have been spared their misery.
We just read a recent publication by Shankar et al that uses Avadis NGS to analyze RNA-Seq data. They study cells involved in rat placenta formation. The placenta is an organ that connects the foetus to the wall of the uterus, thus allowing for nutrients/oxygen to flow from mother to foetus, and waste to flow in the opposite direction. The rat placentation site is apparently made up of 3 very distinct functional areas. The goal of this study is to understand gene expression profiles in each of these 3 areas, and use these profiles to characterize the functions of each area. This paper shows that GO analysis on RNA-Seq data yields functions for each zone consistent with what is known (i.e., one zone specializes in transport of oxygen and nutrients, another drives the maternal hormonal response, and the third drives immune response). The paper then goes further to identify sets of transcription factors that are expressed differentially in each zone, and novel transcription factor networks (shown below, and created using the Avadis Protein-Protein Interactions Database) that may be at the heart of this differential behaviour.
We released a patch update for Avadis NGS to address some of the feedback that we received, as well as fix some minor bugs in version 1.3. The updates are documented in the release notes. Below is a quick summary of the key features:
* A new analysis step Find Significant Regions to compare generic regions across groups of conditions
* Ability to trim the poorly aligned portion of a read at the 3′ end
* Creation of Multi-Sample Reports from the import of MiSeq results
* Improvements to the novel small RNA type detection
How to Update
For current users: Use the menu options Help > Update Product > From Strand Server and choose to update your installation. If you are behind a firewall or proxy and cannot access the Update Server, please contact support.
For new users: Download the latest installer for your operating system and follow the on-screen instructions.
We are hosting an online seminar series on the alignment and analysis of genomics data from “benchtop” sequencers, i.e. MiSeq and Ion Torrent. Our webinar panelists will give a tour of various bioinformatics functions in Avadis NGS that will enable researchers and clinicians to derive biological insights from their benchtop sequencing data. Webinar attendees are welcome to pose questions to our panelists and get real-time feedback and answers!
Seminar #1: MiSeq Data Analysis
Avadis NGS 1.3 provides special support for analyzing data generated by MiSeq™ sequencers. In this webinar, we will describe how the data in a MiSeq generated “run folder” is automatically loaded into the Avadis NGS software during small RNA alignment and DNA variant analysis. This is especially helpful in processing the large number of files generated when the TruSeq™ Amplicon Kits are used. We will describe how to use the Quality Control steps in Avadis NGS to check if the amplicons have sufficient coverage in all the samples. Regions with unexpected coverages can easily be identified using the new region list clustering feature. Webinar attendees will learn how to use the “Find Significant SNPs” feature to quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.
22nd February, 2012 – 10:00 AM Pacific Standard Time (Americas)
23rd February, 2012 – 11:00 AM Central European Time (Europe/Asia)
Seminar #2: Ion Torrent Data Analysis
Avadis NGS 1.3 includes a new aligner – COBWeb – that is fully capable of aligning the long, variable-length reads generated by Ion Torrent sequencers. In this webinar, we will show the pre-alignment QC plots and illustrate how they can be used to set appropriate alignment parameters for aligning Ion Torrent reads. For users who choose to import the BAM format files generated by the Ion Torrent Server, we will describe the steps needed for importing amplicon sequencing data into Avadis NGS. Users of the Ion AmpliSeq™ Cancer Panel will learn how to easily import the targeted mutation list and verify the genotype call at the mutation sites. We will also show the new “Find Significant SNPs” feature which helps quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.
29th February, 2012 – 10:00 AM Pacific Standard Time (Americas)
1st March, 2012 – 11:00 AM Central European Time (Europe/Asia)
Note: All webinars will be hosted on WebEx – please be sure to log on a few minutes before the meeting starts, in case you need to download any software updates in order to join the online meeting.