The year 2014 was a great year with exciting features and enhancements updated in Strand NGS. We take this opportunity to thank our clientele and well-wishers for their support and feedback. We wish you all a happy and prosperous new year and look forward to a more fruitful engagement in 2015.
Here are two recent publications that analyse ChIP-Seq and DNA-Seq data using Strand NGS.
1. A Comprehensive Profile of ChIP-Seq-Based PU.1/Spi1 Target Genes in Microglia by Satoh et al.
2. Localised Dominant Dystrophic Epidermolysis Bullosa with a Novel de Novo Mutation in COL7A1 Diagnosed by Next-generation sequencing by Nagai et al.
The paper by Satoh et al. investigates the biological role of the transcription factor PU.1 in regulation of microglial functions. Though PU.1 plays vital role in microgliogenesis, the comprehensive profile of PU.1/Spi1 target genes in microglia is unknown. In this paper, Strand NGS was used to analyse SRP036026 ChIP-Seq data set and identify the role of PU.1/ Spi1 in microglial gene regulation. Using Strand NGS, around 5,264 ChIP-Seq-based Spi1 target protein coding genes (Spi1, Irf8, Runx1, Csf1r, Csf1, Il34, Aif1 (Iba1), Cx3cr1, Trem2, and Tyrobp) were identified in BV2 mouse microglial cells. Motif analysis by GADEM revealed the PU-box consensus sequences (5’-GAGGAA-3’) were located on 80.3% of the peaks detected by MACS. By downstream pathway analysis, the ChIP-Seq-based Spi1 target genes were found to show significant relationship with diverse pathways essential for normal function of monocytes/ macrophages (like endocytosis, phagocytosis, lysosomal degradation). Hence PU.1/Spi1 was found to have an important role in microglial gene regulation and any aberrant regulation of these target genes would contribute to neurodegenerative diseases by activated microglia accumulation.
The second paper by Nagai et al. is a case study of a 10-month-old female infant with local Dominant Dystrophic Epidermolysis Bullosa. A targeted next generation sequencing was performed with the proband’s peripheral blood for 16 genes associated with Dystrophic Epidermolysis Bullosa. The sequenced data was aligned and analysed using the DNA variant analysis workflow in Strand NGS (formerly Avadis NGS). A heterozygous single nucleotide variation on chr.3: g.48616827C>T (negative strand) that corresponds to a missense mutation of p.Gly1761Asp in the triple helix domain of COL7A1 was detected. This de novo high confidence mutation was also confirmed by Sanger sequencing. This mutation was detected only in the proband and was not found in the parents, in 100 healthy Japanese alleles as well as in dbSNP.
Webinar on ‘Integrated Multi-Omics Analysis with Strand NGS 2.1 and Agilent GeneSpring 13 – Case Study’ on 19 and 20 November
Presented by ‘Agilent Technologies’ and ‘Strand Life Sciences’
Integrating Next Generation Sequencing data with other omics- studies is now possible with release of GeneSpring 13 and Strand NGS 2.1, opening up newer avenues for analysis and interpretation of NGS experiments. In this webinar, we will demonstrate the new integrated analysis workflow using high throughput microarray and next generation sequencing data.
Using a case study the following functionality of the multi-omics approach would be highlighted:
• Export of relevant information (reads, region lists, entity lists) from Strand NGS 2.1, for import into GeneSpring 13.0.
• Create an experiment in GeneSpring using the Strand NGS data.
• Perform correlation study and pathway analysis in a multi-omics context.
Dr. Pramila Tata, Director – Applied Science, Strand Life Sciences
Dr. Carolina Livi, Segment Marketing Manager, Agilent Technologies
Session 1 for SAPK/ APFO: November 18, 2014; 8:00 PM PST ( that is 19 November, 9:30 AM IST)
Session 2 for EMEA and India: November 19, 6:00 AM PST (that is 19 November, 7:30 PM IST)
Session 3 for AFO: November 20, 8:00 AM PST (that is 20 November, 9:30 PM IST)
Register on or before November 18, 2014 at http://www.strand-ngs.com/webinar_registration
Dr. Pramila Tata, Director – Applied Science, Strand Life Sciences, has over 15 years experience in cancer research, software support, product development and training technical support teams and field application scientists. Pramila, has earned her Ph.D in Molecular biology from Indian Institute of Science, Bangalore. Prior to joining Strand, she was with Fred Hutchinson Cancer Research Center at Seattle working in cancer biology using budding yeast as a model system. At Strand, Pramila leads the application science team.
Dr. Carolina Livi, Bioinformatics Segment Manager, Agilent Technologies, has over 8 years experience in the field of bioinformatics, regulatory biology and research in cancer and aging. Carolina Livi, has a Ph.D in Molecular developmental biology, from California Institute of Technology. Prior to joining Agilent, Dr Livi has worked with University of Texas Health Science at San Anotonio (UTHSCSA) as a Research Assistant Professor, in the department of Molecular medicine. At Agilent, Dr Livi is Bioinformatics Segment Manager for Life Sciences Research in Academia and Government within the Segment Marketing group.
For more information, please write to email@example.com
Strand is excited to be an exhibitor at the 64th Annual Meeting of the American Society of Human Genetics, San Diego from 18 – 22 October 2014, one of the world’s largest human genetics meetings. At ASHG, Strand will demonstrate the latest version of its state-of-the-art NGS data analysis tool ‘Strand NGS’ and interpretation and reporting platform ‘StrandOmics‘.
At ASHG 2013, Strand had demoed Avadis NGSv1.5 and presented posters on ‘Aneuploidy and Normal Cell Contamination Aware Approach to Detect Copy Number Variations in Cancer Using Next Generation Sequencing Data’ and ‘Shortening the Diagnostic Odyssey: Integrating Genomic, Structural, and Phenotypic Information to Reduce Time of Rare Disease Diagnosis’. This year we are thrilled to present Avadis NGS with its new brand name ‘Strand NGS’ and ‘StrandOmics‘ tool for the first time at ASHG. Loads of new features and utilities with respect to visualizations and interpretations will be highlighted by our representatives at the booth # 338 at ASHG 2014.
This year, we are also presenting four new posters in the sessions on ‘Bioinformatics and Genomic Technology’ and ‘Clinical Genetic Testing’. The posters highlight benchmarking studies we conducted to compare our variant calling and alignment algorithms with some of the other options available to scientists, present case studies from our clinical genetic testing practice in India, and illustrate our variant interpretation and reporting platform, StrandOmics. Detailed information about each of these scientific posters and agenda is mentioned below. For more information visit our ASHG webpage
Come and meet our experts at ASHG 2014, booth #338, to learn more about Strand’s solutions.
To schedule a meeting or demo request of Strand NGS, please write to:
Vinay Paramasivan at firstname.lastname@example.org
About Strand NGS
Strand NGS, an integrated desktop software that enables biomedical researchers to manage, analyze, and visualize data from next-generation sequencing (NGS) experiments. The software is designed to enable biologists to make sense of NGS data by providing a rich, visual environment for QC, analysis, and interpretation of ChIP-Seq, RNA-Seq, small-RNA-Seq, DNA-Seq and Methyl-Seq data. The enterprise version of Strand NGS (server edition) supports multi-member teams to collaborate, share data, and speed up analysis, while the easy backup-restore option allows safe and secure data transfer. For more information visit http://strand-ngs.com/
StrandOmics, is a variant calling and interpretation tool designed to support sequencing lab workflows. The tool reduces variant interpretation and reporting time from days to a few hours. StrandOmics, is based on first-hand experience interpreting variants from hundreds of genomes. To know more visit http://strandomics.com/
Presenter: Dr Veena Hedatale, Senior Application Scientist, Strand Life Sciences
Abstract: Strand NGS (formerly Avadis NGS) supports functional analysis of entities from diverse experiment types to understand their role in a biological process. This webinar will illustrate various ways of integrating next generation sequencing data from different experiments. With focus on visualization of biological data, analysis steps showing the use of an entity list to find statistically significant pathways will be discussed. The pathways can either be derived from literature (like NLP, MeSH) or curated pathways (like Wikipathways or BioCyc). The webinar will also provide more insights into how one can overlay data from single or multiple sequencing experiments onto the same pathways simultaneously.
Session 1: August 27; Europe +Asia; 11 AM Central European Time (2:30 PM IST)
Session 2: August 27; North + South America; 9 AM Pacific Standard Time (9:30 PM IST)
About presenter: Dr. Veena Hedatale, has a PhD in Plant Genetics from The Radboud University, Netherlands focused on meiosis and recombination. Her prior academic experience at Cornell University was on genetic mapping and gene transformation in Rice. She has worked with Monsanto, and contributed to data mining, database development as well as gene/promoter/pathway discovery for traits related to yield and stress in crop species. At Strand, Veena has worked on Pharmacogenomic analysis of targets and Gene family analysis projects. Currently, she is part of the Strand NGS Application Science team and is involved in the analysis of next generation sequencing data.
For more information, please write to email@example.com
Webinar of the Month Series: Configuring SNP detection pipelines in Avadis NGS for accurate analysis of clinical samples
Presenter: Dr Vamsi Veeramachaneni, Vice President, Strand Life Sciences
Abstract: Running a SNP detection pipeline and identifying high quality variant calls quickly is challenging. This is especially true in the case of clinical labs where multiple panels are used and kit-specific biases can result in false positive SNP predictions.
In this webinar, Dr Vamsi will show, how one can use the powerful visualization features of Avadis NGS to quickly detect false positive SNP predictions, identify the cause of the errors, and fine-tune the detection pipeline for accurate analysis
Session 1: Feb 26; Europe +Asia; 10 AM Central European
Session 2: Feb 26; North + South America; 9 AM Pacific Standard Time
Confirm your participation on or before 24th Feb 2014 by Registering online for free
Transcript annotations are extensively used in NGS data analysis. In RNA-Seq, they are used at every step of the pipeline – to map spliced reads against the genome, perform quantification, detect novel exons etc. In DNA-Seq, they are used to predict the effect of variants detected in the sample. Clearly accurate transcript annotations are vital for NGS work.
Many researchers prefer to work with RefSeq transcripts because they are manually curated. But there is a problem. The RefSeq transcript project provides the transcript sequence and the location of exons on the transcript sequence but does not provide the genomic coordinates for the exons. So one common strategy is to obtain the genomic coordinates from UCSC. The folks at UCSC routinely align the RefSeq transcript sequences against the genome using BLAT and make the results available as a “refFlat” files in their download site.
Unfortunately, these BLAT alignment are sometimes wrong.
Shown below is the transcript track for TNNI3 which is a gene on the negative strand of chromosome 19. Note that the coding region of the first exon in the “RefSeq genes” picture occupies 22bp while the USCC track at the top shows only 11bp.
The RefSeq transcript that was used by UCSC for alignment can be obtained by clicking on the TNNI3 word in the RefSeq gene track and it is NM_000363.4. A portion of the transcript entry is shown below.
The RefSeq entry clearly indicates that only 11 bases (144-154) at the end of the first exon represent coding bases. Moreover, the transcript has a CCDS entry indicating that there is a genomic alignment which translates to the protein sequence shown.
To get a better understanding of the problem, we looked at the UCSC and the RefSeq transcripts in more detail in the Elastic Genome Browser. The introns have been compressed so that exonic and essential splice site sequences can be seen in more detail.
Some of the observations from the above picture are:
- the alignment for the RefSeq transcript leads to a premature stop-codon very early on,
- the essential splice site signals are correct in the UCSC transcript but wrong in the RefSeq transcript alignment
These are sanity checks that any researcher using the UCSC alignments of RefSeq transcripts should incorporate before carrying out analysis.
And, finally, the picture also suggests why this error happened. The incorrect extension to exon 1 in the RefSeq transcript alignment (GCATCACTCAC) is very similar to the sequence of the small exon 2 present in the UCSC transcript (GCATCGCTGCTC). It is possible that the BLAT alignment is not well suited for detecting small intermediate exons especially if there is an alternate alignment which is very similar.
We’ve been using Avadis NGS to analyze a number of clinical samples and ran into an interesting case for InDel detection that could lead to false interpretation unless handled properly. The case at hand involves a child who suffered from Pulmonary Hypertension, Pulmonary Infections, and a few other abnormalities, all at birth or in the first year of life, and subsequently died in the second year. We had DNA from this child as well as from his (consanguinious) parents. All three were subjected to Exome sequencing.
We used Avadis NGS to align the data and then performed variant calling. Next, we used the Find Significant Variants functionality in Avadis NGS to identify variants which were heterozygous in the parents but homozygous in the child, and then restricted these variants to those with allele frequencies below 1% using the Find Damaging Variants operation. This yielded a hundred or so gene candidates. We then picked out genes known to cause rare childhood diseases.
One such gene was ALMS1, which is involved in Alstrom Syndrome, a disease that is known to be associated with the occurrence of recurrent pulmonary infections. The mutation in ALMS1 was a CTC insertion, heterozygous in both parents and homozygous in the child. This mutation causes the insertion of an extra amino acid P (=CCT). It also appeared to be a novel mutation not reported in dbSNP, warranting further investigation.
By habit, we also look at dbSNP variants in the vicinity of our mutation of interest just so we can explain the presence of common polymorphisms, if any, close to our potentially deleterious mutation. In this case, there was indeed a dbSNP insertion close-by, as shown in the picture below.
This dbSNP variant had the exact same insertion sequence, i.e., CTC, but appeared 3 bases away from our mutation. And it so happened that the 3 intervening characters were CTC as well! In other words, our mutation and the dbSNP variant both were different ways of just calling the same variant; CTC was repeated twice and this could be called as an insertion to the left (our mutation) or an insertion to the right (dbSNP). Further, it so happens that the dbSNP variant is a very common one. Therefore our mutation, far from being novel, was quite common and not a candidate for the case at hand!!
This tells us that in addition to local realignment of an InDel across multiple overlapping reads, we also need to realign to dbSNP to identify the true allele frequency of our mutation, an alteration that we will definitely seek to add to Avadis NGS in the near future.
This may be a piece of trivia more than anything else, but anyway, let’s begin with the surprising part. Take any RNASeq sample and assign to each gene a count corresponding to the number of reads aligning to that gene (what we call Quantification in Avadis NGS). For the purposes of this post, ignore all genes with 0 counts; this is roughly half the genes usually. For each of the remaining genes, consider the most significant digit in its count (i.e., if the count is 35, the most significant digit is 3). Over all genes, there are 9 choices for this significant digit; how are these 9 choices distributed? Are all choices equally likely? Or are some digits more likely than others? The distribution computed on actual data is shown in blue below (we’ve tried several datasets and the distribution doesn’t change much from one to another).
Surprisingly (though this won’t be surprising any more by the time you reach the end of this post), the digits are not all equally frequent. The digit 1 appears approximately 30% of the time and is more frequent than the other digits, which have successively reduced frequencies!
This is not a new phenomenon; it’s been known for a while as Benford’s Law and was first discovered by the rather alert and curious observation that pages in a book of log-tables corresponding to numbers beginning with 1 appeared to show more signs of wear and tear as compared with those dealing with other digits. The theoretical predictions of this law are shown in red in the picture above; clearly, a very close match. Benford’s Law appears in many contexts, including heights of structures and populations of countries. Why does it appear in RNASeq gene counts though?
Let us take a look at the distribution of gene counts. For simplicity, consider only frequencies of counts 1 through 99, i.e., how many genes get a count of 1, how many get a count of 2, etc through to 99. The picture is as follows.
The profile of red dots is the actual data. The profile of green dots is the curve y=1/[ 2x ln(10) ] , so essentially a hyperbola that fits the data quite well. Given this fit, how frequently does the most significant digit take on value d. We simply need to integrate as below. For instance, for d=3, we need to integrate from 30 to 40, and from 3 to 4.
And the expression on the right is exactly the frequency prescribed by Benford’s Law. For d=1, this is log10(2), which is about 0.3! As simple as that!!
This post is the first of a series where we describe some interesting algorithms used in our product Avadis NGS. The first post in this series is about string matching using FM index, a technique that forms the core of our aligner COBWeb. We have intentionally kept the usage of bioinformatics jargon minimum to benefit a wider audience.
This post describes an elegant and fast algorithm to perform exact string match. Why another string matching algorithm? To answer the question, let’s first understand the problem we are trying to solve.
In short, the problem is to match billions of short strings (about 50-100 characters long) to a text which is 3 billion characters long. The 3 billion character string (also called reference) is known ahead and is fixed (at least for a species). The shorter strings (also called reads) are generated as a result of an experiment. The problem arises due to the way the sequencing technology works, which in its current form, breaks the DNA into small fragments and ‘reads’ them. The information about where the fragments came from is lost and hence the need to ‘map’ them back to the reference sequence.
We need an algorithm that allows repeatedly searching on a text as fast as possible. We are allowed to perform some preprocessing on the text once if that will help us achieve this goal. FM index is one such algorithm. It requires a one-time preprocessing of the reference to build an index, after which the query time is of the order of the length of the query (instead of the reference).
Burrows Wheeler transform is a reversible string transformation that has been widely used in data compression. However the application of BWT to perform string matching was discovered fairly recently in these papers – http://dl.acm.org/citation.cfm?doid=1082036.1082039, http://www.computer.org/csdl/proceedings/focs/2000/0850/00/08500390-abs.html, by Paolo Ferragina and Giovanni Manzini. This technique is the topic of this post. Before we get to the searching application, a little background on how BWT is constructed and some properties of BWT.
BWT for a given text T is constructed as follows:
- Append an end of string character (say ‘$’) to the text. This character should not be present in the text and is treated as the lexicographically largest character.
- Collect all the cyclic permutations of the text and sort them.
- Arrange all the sorted strings one below the other. The last column forms the BWT.
The image below shows the BWT transform for banana$.
A few things to note from the image
- The red column, as indicated, is the BWT of BANANA$’
- The yellow column, as indicated, is the Suffix Array. For every row, the element in the suffix array is the index into the original text of the suffix starting in that row. Note that if you have the suffix array, constructing BWT is straightforward. The above algorithm of performing a full sort on all suffixes to find the suffix array and BWT is very inefficient. Efficiently finding the suffix array will be the topic for our next post in this series.
- The green column is just the sorted characters in the text.
Before proceeding we need to understand one interesting property of the BWT. The rank of a character in the BWT column is the same as that of the corresponding character in the first column. This is illustrated in the image above using colors. The 2nd N in the BWT column correponds to the 2nd N in the first column, the 3rd A in the BWT column corresponds to the 3rd A in the first column and so on. To see why this is the case, remember that all the rows in the grid are cyclic permutations of a single string. Let AX1, AX2, AX3 be three consecutive rows in the grid where A is the first character and X1, X2, X3 denote the rest of the row. As the rows are sorted lexicographically, X1, X2, X3 are sorted too. On cyclic shifts, we get X1A, X2A, X3A which are again sorted, hence in the same relative order in the grid. Note that now the A’s are entries in the BWT in the same relative order. This is illustrated in the figure below.
Search using FM Index
Our index data structure has the following
- BWT – Indexed access. We will denote this by BWT(i).
- Suffix Array – Indexed access. We will denote this by SA(i).
- First column – Queries : Start index, end index of an alphabet and index given alphabet and a rank. For example, find the index of letter ‘G’ with rank 3. This can be achieved by just storing the frequency of each alphabet in order. For example, for BANANA$, its enough to store 3A1B2N1$. This query can be answered by a simple walk over this compressed structure. We will denote the queries by Start(alphabet), End(alphabet) and IndexOf(alphabet, rank).
- Ranks in BWT – We need fast access to the following query : Get number of occurences of a particular character above a particular row in the BWT. For example, get the number of ‘A’s above row 5. One way to achieve this is to store at each row the frequency of each alphabet till that row. This can be achieved by a single pass over the BWT. In practice though, we don’t store this at every row, but store at regular intervals. We will assume here that it is stored at every row for simplicity. We will denote this query by RankOf(index, alphabet)
Let T be the text searched on and Q be the query string. The algorithm is as follows
index = len(Q) # Walk the query backwords nextChar = Q[index--] (start, end) = (Start(nextChar), End(nextChar)) while index >= 0: nextChar = Q[index--] (startrank, endrank) = (RankOf(start, nextChar), RankOf(end, nextChar)) (start, end) = (IndexOf(nextChar, startrank), IndexOf(nextChar, endrank)) for i in range(start, end): print SA(i)
The output of this algorithm are the indices into the original text where the query is present. Note that error condition handling has been skipped in the pseudocode above for brevity. It involves checking if the range found at every iteration of the loop is a valid range.
The visualization below explains the algorithm. Few points about the visualization
- Yellow text at the top indicates the text to search on and the yellow text at the bottom, the query
- Red column indicates BWT, orange the suffix array
- The grey part of the grid is not stored as part of the index. It is shown here just for clarity
- Rows highlighted in blue indicates the range where search till that point is successful. This appears once the search animation starts.
Let’s step through the visualization for text ABRACADABRA and query DAB. Follow the points below by hitting on ‘Step’ button after each point.
- Remember that the query is walked from the last to first. The first character is ‘B’, which results in the range shown
- Next character is ‘A’. Get the rank of ‘A’ on the BWT for the first and last rows on the current range. The ranks as shown in the tooltips are 0 and 2. The new range is determined by the rows where ‘A’ has rank 0 and rank 2 on the first column. This is shown by the new range
- Now the final character ‘D’. Performing the the same steps as above, we get the final range as the row ’9′. The corresponding element on the suffix array is ’6′ which is the index in the original text where ‘DAB’ starts
Play around with the visualization with different texts and queries to understand the algorithm better.
How does this search work? Let’s first understand what the loop invariant here is. Let ‘i’ be the index into the query string at some point of time when the range in the grid is (s,e). Remember that we are walking the search query backwards. The loop invariant is that Q[i:] is the prefix of all strings in the range (s,e). To prove that that is indeed the case, let’s move one more character on the query. Consider the characters on the BWT for the indices between s and e. All these characters precede various occurences of the prefix Q[i:] in the text. The rank data structure enables us to quickly check if any of those characters are the character we want to match. If yes we get the start and end rank of the character. Note that if the character is not present in that range, the start and end rank of the character would be same, ending our search. Then we use the property of BWT described above to jump to the corresponding character on the first column and continue our search. Thus when we exhaust all the characters in the query, we are left with the range on the grid where the prefix of every row is the query.
On the next post in this series, we will describe how to find the BWT index of a very long string efficiently.
In this month’s webinar, we will show you how to use the Find Significant SNPs analysis workflow in Avadis NGS 1.3.
SNP verification and prioritization is one of the most time consuming tasks in variant analysis. The workflows and algorithms built-into Avadis NGS 1.3 make end-to-end SNP analysis easy, intuitive, and accurate. These workflows support multiple experimental setups and can be used for quickly identifying population-specific variants, somatic mutations, and tumor-specific markers with the aid of a user-friendly graphical interface.
Session 1: 18th April – 10 AM Pacific Standard Time (San Francisco)
Session 2: 19th April – 11 AM Central European Time (Berlin)
Register online for free!