A Case for Long CIGARs: Achieving 50% Compression of BAM Files

I am studying NA06984.454.MOSAIK.SRP000033.2009_11.bam. Not to find SNPs, structural variations, or derive any biological insights into the CEU population – but simply from an IT storage perspective. How much disk space is really needed to store the entire information of the 1.23 million matches present in this 151 Mb file? If I can get to 50% of the BAM file size (painlessly), I’ll be a happy man.

BAM internally uses a gzip like compression, but the format is optimized for indexed retrievals of the form “Find all matches to chr2:8000-9000.” So my first attempt was to convert the binary BAM file into its text equivalent (SAM) and try the various compression tools.Compressed file sizesZPAQ (run with the maximum compression configuration) performs astonishingly well and almost takes us below the 50% mark. But it took more than 10 hours to accomplish that on my laptop so I wouldn’t call the process painless.

We have done all we can by treating the file as a black box. So let us now open up the 940Mb SAM file and take a look at the space occupied by the individual columns.

Column Sizes

Because of the long 454 reads, 88% of the file is taken up by the read sequence and base qualities columns which each occupy 412 Mb. So compressing the information in these columns is likely to yield the highest benefits.

Compressing the base qualities:

The first read in the file has the quality string as: "99:::::::::::999999::::::::::::5:::::::999::::::999::::::::::9999:::::::::::: ::::::::::::::::::::99999:997:::7999::999999::::::::;;;;;;;;;:::;;;;;:8488:::; ;9:;;;;;;;;;;;;;;888888888:99;;;;;;;;;;;;;;;;;;;::;:9999::;;;;;;:;;;;;;;;;;;;; ;;;;;;;;;;;;;;;;;;;;;;;;;9:::;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;::;;;;;;;; ;:::;;;;;;;;;;;:;;:::::::;:::::;;;;;;;;;;;;"

Each character represents the quality of one base (see the FASTQ format specification for a description of how to convert between Phred quality values and ASCII characters). Since quality changes gradually, we often see long stretches of identical quality. This makes the quality sequence an ideal candidate for run-length encoding where we could replace a stretch of 11 consecutive “:“s with, say, the string "11X:". Encoding the quality string shown above in this manner would result in: "2X911X:6X912X:1X57X:3X96X:3X910X:4X932X:5X91X:2X91X73X:1X73X92X:6X98X:9X;3X: 5X;1X:1X81X42X"83X:2X;1X91X:14X;9X81X:2X919X;2X:1X;1X:4X92X:6X;1X:38X;1X93X:39X;2X:9X;3X: 11X;1X:2X;7X:1X;5X:12X;”

Encoding all the quality strings in the file, reduces the size of the quality string column by 50% to 205 Mb.

Compressing the Read Sequence:

The standard method for depicting a pairwise alignment shows the location of the sequences, and the positions of indels as:

chr10 62751 GATATTATTA--GGCCTTGGCCAGCCGCAAATGAATAATCTCTATCATCAAC 62800
            ||||||||||  ||||||||||||||||||||| |||||||||||||| |||
read1     1 GAAAATATTATTGGCCTTGGCCAGCCGCAAATG-ATAATCTCTATCAT-CAC    50

The information regarding such an alignment is spread across multiple columns in the SAM file as:

  • chr: chr10
  • POS: 62751
  • CIGAR: "10M 2I 21M 1D 14M 1D 3M"(spaces added for clarity only)
  • SEQ: GAAAATATTATTGGCCTTGGCCAGCGCGCAAATGTAATCTCTATCATCAC
  • NM tag: NM:i:3 (number of mismatches)

If the error rate is low, then we expect the CIGAR to have large "M" events and the portion of the read sequence corresponding to these events will have high similarity with the reference sequence. In the extreme case, where the CIGAR is "50M" and there are 0 mismatches, the read sequence exactly corresponds to the reference sequence and its presence in the file is redundant. Based on this insight, we can extend the notion of the CIGAR so that it also serves as a “directed delta” i.e. it encodes a series of string operations which when applied on the reference sequence result in the read sequence. One example of such an extended CIGAR that encodes the mismatch and insertion bases explicitly is: "2M 1XA 1M 1XA 5M 2ITT 21M 1D 14M 1D 1XC 2M" (spaces added for clarity only). Part of a picture showing the delta operations is shown below. Note that the original CIGAR can be derived from the extended CIGAR by simply ignoring the explicitly listed bases and treating the X events as M events.

Delta transformation

After replacing the CIGAR column with the extended CIGAR we can safely remove the SEQ and NM columns from the SAM file resulting in an overall savings of 405 Mb. The final file is approximately 330Mb in size, and can be further compressed using bzip2 or zpaq. This time, we safely make it below the 50% mark using bzip2!

Compresed file sizes