Elegant exact string match using BWT

This post is the first of a series where we describe some interesting algorithms used in our product Avadis NGS. The first post in this series is about string matching using BWT, a technique that forms the core of our aligner COBWeb. We have intentionally kept the usage of bioinformatics jargon minimum to benefit a wider audience.

This post describes an elegant and fast algorithm to perform exact string match. Why another string matching algorithm? To answer the question, let’s first understand the problem we are trying to solve.

In short, the problem is to match billions of short strings (about 50-100 characters long) to a text which is 3 billion characters long. The 3 billion character string (also called reference) is known ahead and is fixed (at least for a species). The shorter strings (also called reads) are generated as a result of an experiment. The problem arises due to the way the sequencing technology works, which in its current form, breaks the DNA into small fragments and ‘reads’ them. The information about where the fragments came from is lost and hence the need to ‘map’ them back to the reference sequence.

We need an algorithm that allows repeatedly searching on a text as fast as possible. We are allowed to perform some preprocessing on the text once if that will help us achieve this goal. BWT search is one such algorithm. It requires a one-time preprocessing of the reference to build an index, after which the query time is of the order of the length of the query (instead of the reference).

Burrows Wheeler transform is a reversible string transformation that has been widely used in data compression. However the application of BWT to perform string matching was discovered fairly recently in this paper. This technique is the topic of this post. Before we get to the searching application, a little background on how BWT is constructed and some properties of BWT.

BWT for a given text T is constructed as follows:

  • Append an end of string character (say ‘$’) to the text. This character should not be present in the text and is treated as the lexicographically largest character.
  • Collect all the cyclic permutations of the text and sort them.
  • Arrange all the sorted strings one below the other. The last column forms the BWT.

The image below shows the BWT transform for banana$.

A few things to note from the image

  • The red column, as indicated, is the BWT of BANANA$’
  • The yellow column, as indicated, is the Suffix Array. For every row, the element in the suffix array is the index into the original text of the suffix starting in that row. Note that if you have the suffix array, constructing BWT is straightforward. The above algorithm of performing a full sort on all suffixes to find the suffix array and BWT is very inefficient. Efficiently finding the suffix array will be the topic for our next post in this series.
  • The green column is just the sorted characters in the text.

Before proceeding we need to understand one interesting property of the BWT. The rank of a character in the BWT column is the same as that of the corresponding character in the first column. This is illustrated in the image above using colors. The 2nd N in the BWT column correponds to the 2nd N in the first column, the 3rd A in the BWT column corresponds to the 3rd A in the first column and so on. To see why this is the case, remember that all the rows in the grid are cyclic permutations of a single string. Let AX1, AX2, AX3 be three consecutive rows in the grid where A is the first character and X1, X2, X3 denote the rest of the row. As the rows are sorted lexicographically, X1, X2, X3 are sorted too. On cyclic shifts, we get X1A, X2A, X3A which are again sorted, hence in the same relative order in the grid. Note that now the A’s are entries in the BWT in the same relative order. This is illustrated in the figure below.

Search using BWT

Our index data structure has the following

  1. BWT – Indexed access. We will denote this by BWT(i).
  2. Suffix Array – Indexed access. We will denote this by SA(i).
  3. First column – Queries : Start index, end index of an alphabet and index given alphabet and a rank. For example, find the index of letter ‘G’ with rank 3. This can be achieved by just storing the frequency of each alphabet in order. For example, for BANANA$, its enough to store 3A1B2N1$. This query can be answered by a simple walk over this compressed structure. We will denote the queries by Start(alphabet), End(alphabet) and IndexOf(alphabet, rank).
  4. Ranks in BWT – We need fast access to the following query : Get number of occurences of a particular character above a particular row in the BWT. For example, get the number of ‘A’s above row 5. One way to achieve this is to store at each row the frequency of each alphabet till that row. This can be achieved by a single pass over the BWT. In practice though, we don’t store this at every row, but store at regular intervals. We will assume here that it is stored at every row for simplicity. We will denote this query by RankOf(index, alphabet)

Let T be the text searched on and Q be the query string. The algorithm is as follows

          index = len(Q) # Walk the query backwords
          nextChar = Q[index--]
          (start, end) = (Start(nextChar), End(nextChar))
          while index >= 0:
            nextChar = Q[index--]
            (startrank, endrank) = (RankOf(start, nextChar), RankOf(end, nextChar))
            (start, end) = (IndexOf(nextChar, startrank), IndexOf(nextChar, endrank))
          for i in range(start, end):
            print SA(i)

The output of this algorithm are the indices into the original text where the query is present. Note that error condition handling has been skipped in the pseudocode above for brevity. It involves checking if the range found at every iteration of the loop is a valid range.

The visualization below explains the algorithm. Few points about the visualization

  • Yellow text at the top indicates the text to search on and the yellow text at the bottom, the query
  • Red column indicates BWT, orange the suffix array
  • The grey part of the grid is not stored as part of the index. It is shown here just for clarity
  • Rows highlighted in blue indicates the range where search till that point is successful. This appears once the search animation starts.

Let’s step through the visualization for text ABRACADABRA and query DAB. Follow the points below by hitting on ‘Step’ button after each point.

  • Remember that the query is walked from the last to first. The first character is ‘B’, which results in the range shown
  • Next character is ‘A’. Get the rank of ‘A’ on the BWT for the first and last rows on the current range. The ranks as shown in the tooltips are 0 and 2. The new range is determined by the rows where ‘A’ has rank 0 and rank 2 on the first column. This is shown by the new range
  • Now the final character ‘D’. Performing the the same steps as above, we get the final range as the row ’9′. The corresponding element on the suffix array is ’6′ which is the index in the original text where ‘DAB’ starts

Play around with the visualization with different texts and queries to understand the algorithm better.



How does this search work? Let’s first understand what the loop invariant here is. Let ‘i’ be the index into the query string at some point of time when the range in the grid is (s,e). Remember that we are walking the search query backwards. The loop invariant is that Q[i:] is the prefix of all strings in the range (s,e). To prove that that is indeed the case, let’s move one more character on the query. Consider the characters on the BWT for the indices between s and e. All these characters precede various occurences of the prefix Q[i:] in the text. The rank data structure enables us to quickly check if any of those characters are the character we want to match. If yes we get the start and end rank of the character. Note that if the character is not present in that range, the start and end rank of the character would be same, ending our search. Then we use the property of BWT described above to jump to the corresponding character on the first column and continue our search. Thus when we exhaust all the characters in the query, we are left with the range on the grid where the prefix of every row is the query.

On the next post in this series, we will describe how to find the BWT index of a very long string efficiently.

Webinar: Find Significant SNPs in your Samples

In this month’s webinar, we will show you how to use the Find Significant SNPs analysis workflow in Avadis NGS 1.3.

SNP verification and prioritization is one of the most time consuming tasks in variant analysis. The workflows and algorithms built-into Avadis NGS 1.3 make end-to-end SNP analysis easy, intuitive, and accurate. These workflows support multiple experimental setups and can be used for quickly identifying population-specific variants, somatic mutations, and tumor-specific markers with the aid of a user-friendly graphical interface.

Session 1: 18th April – 10 AM Pacific Standard Time (San Francisco)
Session 2: 19th April – 11 AM Central European Time (Berlin)

Register online for free!

Mutations Causing CHIME Disease

Here is publication from Sanford-Burnham Medical Research Institute that uses Avadis NGS: http://www.sciencedirect.com/science/article/pii/S000292971200095X

The goal in this paper is to find the underlying genomic cause of CHIME syndrome, a disease so rare that there are only 8 reported cases in the world. CHIME is an abbreviation derived from the main symptoms: Colobomas (hole in the eye), congenital Heart defects, Ichthyosiform dermatosis (scaling of skin), Mental retardation, and either Ear anomalies or Epilepsy.

They performed Exome sequencing (using the Nimblegen target enrichment kit) on 5 of the 8 reported cases; they also ran CGH arrays on a 6th reported case to identify gross copy number changes, and found a 1MB deletion on Chr 17. They focused on the 17 genes in this region for SNP analysis on the 5 Exome sequencing samples.

They found that only the PIGL gene in this region carried compound heterozygous mutations in all 5 samples. They also sequenced parents of these 5 cases and found that both parents were carriers in all cases (each parent had a mutation in one chromosomal copy).

This method provides an interesting glimpse into the future when genome sequencing is routine and the database of disease causing variants is comprehensive: the parents of these 8 cases would have realized that they are both carriers, and with a little investigation on the yet to be born babies, these cases may have been spared of their misery.

RNA-Seq Analysis of Functional Compartments in the Rat Placentation Site

We just read a recent publication by Shankar et al that uses Avadis NGS to analyze RNA-Seq data. They study cells involved in rat placenta formation. The placenta is an organ that connects the foetus to the wall of the uterus, thus allowing for nutrients/oxygen to flow from mother to foetus, and waste to flow in the opposite direction. The rat placentation site is apparently made up of 3 very distinct functional areas. The goal of this study is to understand gene expression profiles in each of these 3 areas, and use these profiles to characterize the functions of each area. This paper shows that GO analysis on RNA-Seq data yields functions for each zone consistent with what is known (i.e., one zone specializes in transport of oxygen and nutrients, another drives the maternal hormonal response, and the third drives immune response). The paper then goes further to identify sets of transcription factors that are expressed differentially in each zone, and novel transcription factor networks (shown below, and created using the Avadis Protein-Protein Interactions Database) that may be at the heart of this differential behaviour.

1.3.1 Patch Update

We released a patch update for Avadis NGS to address some of the feedback that we received, as well as fix some minor bugs in version 1.3. The updates are documented in the release notes. Below is a quick summary of the key features:


* A new analysis step Find Significant Regions to compare generic regions across groups of conditions
* Ability to trim poorly aligned portion of a read at the 3′ end
* Creation of Multi-Sample Reports from the import of MiSeq results
* Improvements to the novel small RNA type detection




How to Update
For current users: Use the menu options Help > Update Product > From Strand Server and choose to update your installation. If you are behind a firewall or proxy and cannot access the Update Server, please contact support.

For new users: Download the latest installer for your operating system and follow the on-screen instructions.



Webinar Series: Benchtop Sequencing Data Analysis

We are hosting an online seminar series on the alignment and analysis of genomics data from “benchtop” sequencers, i.e. MiSeq and Ion Torrent. Our webinar panelists will give a tour of various bioinformatics functions in Avadis NGS that will enable researchers and clinicians to derive biological insights from their benchtop sequencing data. Webinar attendees are welcome to pose questions to our panelists and get real-time feedback and answers!


Seminar #1: MiSeq Data Analysis

MiSeq Data Region Clustering Using Avadis NGS

Clustering amplicon regions based on average coverage

Avadis NGS 1.3 provides special support for analyzing data generated by MiSeq™ sequencers. In this webinar, we will describe how the data in a MiSeq generated “run folder” is automatically loaded into the Avadis NGS software during small RNA alignment and DNA variant analysis. This is especially helpful in processing the large number of files generated when the TruSeq™ Amplicon Kits are used. We will describe how to use the Quality Control steps in Avadis NGS to check if the amplicons have sufficient coverage in all the samples. Regions with unexpected coverages can easily be identified using the new region list clustering feature. Webinar attendees will learn how to use the “Find Significant SNPs” feature to quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.

22nd February, 2012 – 10:00 AM Pacific Standard Time (Americas)
23rd February, 2012 – 11:00 AM Central European Time (Europe/Asia)
Register Here


Seminar #2: Ion Torrent Data Analysis

Ion Torrent Data Base Quality Distribution Plot Using Avadis NGS

Base quality distribution by position in read

Avadis NGS 1.3 includes a new aligner – COBWeb – that is fully capable of aligning the long, variable-length reads generated by Ion Torrent sequencers. In this webinar, we will show the pre-alignment QC plots and illustrate how they can be used to set appropriate alignment parameters for aligning Ion Torrent reads. For users who choose to import the BAM format files generated by the Ion Torrent Server, we will describe the steps needed for importing amplicon sequencing data into Avadis NGS. Users of the Ion AmpliSeq™ Cancer Panel will learn how to easily import the targeted mutation list and verify the genotype call at the mutation sites. We will also show the new “Find Significant SNPs” feature which helps quickly identify high-confidence SNPs present in a majority of the samples, rare variants, etc.

29th February, 2012 – 10:00 AM Pacific Standard Time (Americas)
1st March, 2012 – 11:00 AM Central European Time (Europe/Asia)
Register Here


Note: All webinars will be hosted on WebEx – please be sure to logon a few minutes before the meeting starts, in case you need to download any software updates in order to join the online meeting.

Diagnosing a Neuro-muscular Disorder

We just read a recent publication that used Avadis NGS in the process of diagnosing the case of a Japanese boy suffering from a neuromuscular disorder. Here is a link to the paper. The case is summarized below:

The boy was born in a family with no history of neuromuscular disease. He experienced transient breath-holding episodes shortly after birth. At 11 months of age, 30-second-long episodes of suspended breathing with increase in muscle tension appeared. These episodes were  severe enough to suspect epileptic seizures, but EEG recordings did not support this hypothesis. At the age of two, the boy started having myotonic attacks (slow relaxation of the muscles after voluntary contraction, so you have trouble releasing your grip after gripping something, or rising from a sitting position etc) that persisted for several minutes, hours, or even days, and created difficulties in standing, walking and mobility. After 7 years and 8 months of age, paralytic episodes appeared that occurred several times a year thereafter.

Based on these and various other physical characteristics, he was suspected to have something called Schwartz-Jampel Syndrome; this is a recessive disorder and caused when both copies of the HSPG2 gene go bad. The authors sequenced the exonic regions of this gene (first enriching using SureSelect, then sequencing on SOLiD, then using Avadis to call SNPs) to check for mutations, and found only 3 homozygous SNPs, all of which were registered in dbSNP 134 with no knowledge of clinical significance and fairly high allele frequencies. On this basis, they ruled out Schwartz-Jampel Syndrome conclusively (it is unclear whether or not the authors checked for compound heterozygosity). Here is a snapshot of the Avadis NGS Genome Browser showing the HSPG2 gene.

They then investigated other genes known to be involved in neuromuscular signaling, and found a Non-Synonymous SNP in the SCN4A gene; neither of the parents had this SNP, so it was a de-novo mutation. Based on this a diagnosis of a type of Sodium Channelopathy was made and some appropriate therapies were applied with some improvement.

Release Announcement – Version 1.3

We are really happy to announce that version 1.3 of Avadis NGS is now available. Special thanks to everyone who participated in the Beta program and helped us by providing valuable feedback!

What’s New
Avadis NGS 1.3 includes several new features and enhancements, which are documented in the release notes. Below is a quick summary of the key features:

* Comprehensive small RNA analysis workflow
* Integrated COBWeb alignment algorithm
* Enhanced SNP Detection and Differential SNP analysis
* Seamless support for benchtop sequencers, like Ion Torrent and MiSeq
* Streamlined import of custom annotations
* Improved performance, better usability, new visualizations, and a lot more…

How to Update
For current users: The upgrade to v1.3 should be seamless. You do NOT need to uninstall your installation to get v1.3. When you launch Avadis NGS, you should see a message giving you the option to update. If not, select the following menu options Help > Update Product > From Strand Server and choose to update your installation. If you are behind a firewall or proxy and cannot access the Update Server, please contact the support team, who will guide you with the update.

For new users: Download the latest installer for your platform and follow the on-screen instructions.

Genes That Smell

The Lab for Neurogenetics and Behavior at Rockefeller University has an ongoing Smell Study that states:

Given almost any odor, some people will find it pleasant, others unpleasant. Ripe cheese or garlic may smell delicious to some, but repulsive to others. The scientific basis of this variation has not been well-studied. Despite the clear evidence for culture-based preferences for food and aromas, the nature-versus-nurture debate for smell remains unresolved. We believe there may be a genetic basis for our unique senses of smell.

Inspired by the above, we took several Exome Sequencing and Whole Genome Sequencing datasets obtained from Human subjects, imported the associated reads  into Avadis NGS, filtered these reads to retain only reads with very good Mapping Quality, called SNPs, and ran SNP Effect Analysis on these SNPs to identify Non-Synonymous SNPs and their associated genes. Running GO Analysis on these genes identified either Olfactory Receptor Activity or Sensory Perception of Chemical Stimulus/Smell as the most significantly enriched category for every single subject without fail. So genes involved in the perception of smell and their associated proteins vary substantially from person to person, possibly laying the ground for the above variation.

Which genes show the maximum variation? The family of Olfactory Receptors is among the largest of the gene families comprising more than 700 receptors.  About 300 of these have at least one Non-Synonymous variant locus in at least one of our samples.  The following graph shows those receptors which have  6 or more Non-Synonymous variant loci in our samples; there are 29 such receptors.

Overall, the mutation rates in the Olfactory Receptor family are only slightly higher than the expected genome-wide rate. What is surprising though is the relatively high rate of Non-Synonymous SNPs here. Could it be the myriad combinatorial possibilities generated by these variant loci that yield many different phenotypes with different perception profiles?


Also, a quick preview to the new Heatmap and Clustering functions for SNPs in Avadis NGS 1.3; the above heatmap shows the percentage of Variant Reads at various loci in the OR4C3 receptor (columns) for 7 Human samples (rows).

How Identical are Identical Twins?

We’re looking at exome sequencing data on whole peripheral blood DNA of monozygotic twins (this data was generated by our collaborators, Jan Dumanski and his group at Uppsala University in Sweden). Monozygotic twins were earlier thought to be genetically identical; now we know that isn’t completely true. How does one identify small mutations (SNPs and small InDels)  that are present in one of the twins but not in the other? Or in general, how does one compare two different samples, for instance, to find somatic mutations that are present in a tumor sample but not present in the paired normal sample.

The  obvious approach is to call SNPs in each sample first, and then identify loci where an allele is called in exactly one of the two samples, e.g., allele A is called in one twin but not in the other. The number of variant alleles called in each of the two twins is about 150,000. The number of  differential allele calls between the two twins is about 27,000.  Seems like too many.  A quick look at the results reveals many instances with the following pattern.



Here, A is the reference allele. The fraction of reads supporting variant allele G barely increases from 34.48% in twin 1 to 36.36% in twin 2; yet, that is enough to tip the SNP call from AA in twin 1 to AG in twin 2. This is clearly a false positive, resulting from the cut-offs used by the SNP calling procedure. This example suggests that instead of comparing SNP calls, one should look for a substantial difference in the fraction of reads supporting the variant allele between the two samples. Let DiffSuppReads denote this difference.

Suppose, we insist on DiffSuppReads>50%; then the number of differential allele calls reduces to  just about 450, down from 27,000! One could ask, why 50%? A rough reason: look at the distribution of the number of supporting reads over all alleles called for a particular twin; this appears to be a mixture of   three separate distributions, with 25% and 75% being the transition points between successive distributions.  So DiffSuppReads>50% ensures that the two samples are unlikely to come from the same distribution.


Can we define a cut-off for DiffSuppReads more rigorously?  What is the probability that DiffSuppReads is large, given that the genotype call is the same in both twins ? The heterozygous genotype will challenge this probability the most, so let us assume that both twins are heterozygous. The distribution of DiffSuppReads is given by the following formula (here, n1 and n2 are the number of reads in the two samples, respectively, at this locus of interest, and DiffSuppReads is a fraction between 0 and 1).

We use this to calculate the critical value of DiffSuppReads at which the above probability becomes  less than, say 10e-5, so we can offset multiple testing effects for the approximately 100,000 variant alleles being tested. This critical value of DiffSuppReads varies from locus to locus, because of the dependence of n1 and n2. So we compute this value for each variant allele and see which variant alleles pass. This yields about 400 differential allele calls. Another interesting phenomenon strikes though, as illustrated in the example below.

The difference in supporting reads fractions is just 3.3%, but the critical value for DiffSuppReads is even smaller, because of very large read counts (9317 and 8943) at this locus. For very large read counts, an underlying heterozygous genotype would yield very close to 50% of the reads on each allele, and even slight deviations from this become highly improbable. This suggests that even a 3% value of DiffSuppReads is very unlikely to arise if both twins had the same genotype, and therefore one can conclude that the twins have different genotypes. This conclusion is however not very believable, and likely suggests that the assumption of reads being drawn independently is at risk of violation in areas with high coverage.

To summarise, just comparing genotype calls yields many false positives where the call is different due to cut-off effects, and using a statistical approach to generate a cut-off for DiffSuppReads yields false positives in high coverage regions. Can one combine these approaches? Here is table with various combinations.

Which of these is the right answer? The 1000 Genomes Project estimated that a child has only around 50 new mutations relative to its parents. Monozygotic twins ought to be closer than that. And we are observing only the exomes (and some neighborhood) of these twins. So the real answer probably lies close to the bottom of the above table. However, as Jan Dumanski points out, much of the 1000 Genomes effort involved sequencing of oligoclonal/monoclonal lymphoblastoid cell lines, not quite directly comparable with whole peripheral blood.

We’ve experimented with some of the above methods for differential SNP calling in Avadis NGS. Avadis NGS 1.0 used an approach based on A) above. Avadis NGS 1.2 used an approach based on C). For the above reasons, Avadis NGS 1.3 is moving towards approach B). Hoepfully, this one will last.

← Older posts