Food is the primary source of essential nutrients for the whole world. Ask people about their favorite food and you will hear a wide variety of preferences from cuisines around the world. Most of these recipes have wheat as a common ingredient. Wheat is a grass cultivated for its edible grains and is widely used as a dietary staple. Though it is cultivated in large amounts, there is growing demand to raise productivity because the available land is shrinking and the population is increasing. Agronomic challenges, climate change, pests, microbial infections, and natural calamities are some of the forces that limit wheat crop yields.
Enhancing yields of wheat with superior qualitative and quantitative traits and with resistance to biotic and abiotic stresses is of utmost importance for improving food security. Genomic studies have improved the understanding of the wheat genome, and certain loci have been identified as important for survival and growth. A better understanding of the wheat genome is also required for scientific interventions aimed at increasing yields. A change of one nucleotide (base) at a given position of the genome is called a Single Nucleotide Polymorphism (SNP), and some SNPs help the crop adapt better to biotic or abiotic stresses. Identifying the SNPs associated with the enhanced adaptability of certain wheat cultivars to varied climatic conditions has been widely studied as a way to improve crop yields.
Powdery mildew is a fungal disease that affects wheat crops in cold and humid conditions. Pm4b is a gene in the wheat genome that confers resistance to this disease. Only certain wheat varieties carry Pm4b. Using molecular markers to identify the presence of Pm4b in a wheat plant can help in the selective breeding and cultivation of powdery mildew resistant crop varieties. Molecular markers are DNA sequences or SNPs in the desired gene (Pm4b) and its surrounding sequences that can be used to identify the presence of the gene of interest. Most molecular markers of the Pm4b gene were not well known [1]. A study [1] identified these molecular markers of Pm4b by identifying SNPs in sequenced samples from wheat crops with and without the Pm4b gene. Four samples from this study [1] were taken, and SNP analysis was carried out in Strand NGS to identify the SNPs present in these samples. This SNP analysis is broadly divided into Quality Control (QC), alignment with a reference genome, and SNP identification.
Sequencing is the process of determining the order of bases in the genome. A sequencer reads a small piece of the genome at a time, and each such piece is called a read. Read sizes can vary from 50 to 1000 bases depending on the sequencing machine being used. Sequencers generate a large number of reads, which are saved as FASTQ files. A FASTQ file is a text-based file that stores the bases of each read along with their corresponding base qualities (encoded as ASCII characters). Base quality is a score that reflects the probability of a base having been called correctly: the higher the score, the lower the probability of error. Along with the sequenced data, FASTQ files can contain errors introduced by the instrument. It is important to check the quality of these files, as such errors can lead to erroneous results in the analysis. Quality Control plots aid in identifying these errors.
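To make the FASTQ layout concrete, here is a minimal Python sketch of how the quality line can be decoded. It assumes the common Phred+33 encoding (quality = ASCII code minus 33) and the standard Phred relation between score and error probability. The function names and the example record are illustrative and are not part of Strand NGS.

```python
from typing import Iterator, List, Tuple

# Assumption: Phred+33 encoding, as used by Sanger and recent Illumina FASTQ files.
PHRED_OFFSET = 33


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, bases, qualities) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, _plus, quals = (ln.rstrip("\n") for ln in lines[i:i + 4])
        # Each quality character encodes one Phred score.
        scores = [ord(ch) - PHRED_OFFSET for ch in quals]
        yield header[1:], bases, scores


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical record: header, bases, separator, qualities.
record = ["@read1", "ACGT", "+", "IIK5"]
for read_id, bases, scores in parse_fastq(record):
    print(read_id, bases, scores)
    print([error_probability(q) for q in scores])
```

Under this encoding the character `I` decodes to a score of 40 and `K` to 42, the value most bases show in the samples discussed here.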
The base quality by position plot displays a box-and-whisker plot of the base quality scores at each position across all the reads. The blue line illustrates the mean quality at each position. The base quality histogram shows the distribution of base quality scores of all bases in all reads. In Figure 1, no whiskers are seen for most positions, while long whiskers are seen at the end of the read. This implies a higher probability of sequencing error at the end of the read, as the sequencer might not have removed the previous base or an appropriate base did not get attached for reading. A base quality score of 30 or above corresponds to a very low probability of error (1 in 1000 or less). All positions in the reads have a mean base quality score above 30, and most bases have a base quality of 42. So, it can be concluded that the samples have few sequencing errors.
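The blue mean line in such a plot can be approximated with a few lines of Python. This is a sketch only: it assumes the quality scores have already been decoded into lists of integers, and the function names and the threshold of 30 used for flagging are illustrative choices, not Strand NGS settings.

```python
from statistics import mean
from typing import List, Sequence


def mean_quality_by_position(reads: Sequence[Sequence[int]]) -> List[float]:
    """Mean base quality at each read position across all reads."""
    longest = max((len(r) for r in reads), default=0)
    means = []
    for pos in range(longest):
        # Reads shorter than this position simply do not contribute.
        column = [r[pos] for r in reads if len(r) > pos]
        means.append(mean(column))
    return means


def low_quality_positions(means: Sequence[float], threshold: float = 30.0) -> List[int]:
    """1-based positions whose mean quality falls below the threshold."""
    return [i + 1 for i, m in enumerate(means) if m < threshold]


# Hypothetical decoded qualities for three reads; quality drops at the read end.
qualities = [[42, 42, 40, 28], [42, 40, 38, 20], [42, 42, 42]]
means = mean_quality_by_position(qualities)
print(means)
print(low_quality_positions(means))
```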
The base composition by position plot shows the frequency of each base at every position across all reads. If every base were equally likely at each position, each of the 4 bases would appear at a frequency of 25%. In practice the percentages vary slightly, as the bases are not distributed evenly in the genome. Large deviations between the bases, however, suggest a bias in the preparation of the reads. In Figure 2, all 4 bases lie between 20% and 30%. This shows there is no bias in the samples.
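The per-position base frequencies can be computed as in the Python sketch below. The function names are illustrative, and the 20% to 30% window is taken from the Figure 2 observation above rather than from any Strand NGS default.

```python
from collections import Counter
from typing import Dict, List, Sequence


def base_composition_by_position(reads: Sequence[str]) -> List[Dict[str, float]]:
    """Percentage of A, C, G and T at each read position."""
    longest = max((len(r) for r in reads), default=0)
    table = []
    for pos in range(longest):
        counts = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(counts.values())  # includes any N calls at this position
        table.append({b: 100.0 * counts.get(b, 0) / total for b in "ACGT"})
    return table


def biased_positions(table: Sequence[Dict[str, float]],
                     low: float = 20.0, high: float = 30.0) -> List[int]:
    """1-based positions where any base falls outside the [low, high] window."""
    return [i + 1 for i, row in enumerate(table)
            if any(not (low <= pct <= high) for pct in row.values())]


# Hypothetical reads; a real run would use millions of reads per sample.
reads = ["ACGT", "AGGT", "CCGA", "TCTA"]
table = base_composition_by_position(reads)
print(table[0])
print(biased_positions(table))
```

With only four reads every position looks biased, which is expected: the 25% expectation only holds when many reads are averaged.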
Adapter content and Overrepresented sequences:
Adapter sequences are short nucleotide sequences that are attached to the reads. These adapters attach to the flowcell of the sequencer and enable sequencing of the read. It is important to trim these sequences from the reads during analysis, as they are not part of the sample genome and can be misaligned with the reference. In Figure 3, the commonly used adapter sequences (those of Illumina, Nextera and SOLiD sequencers) are not seen, and hence the samples are devoid of adapter sequences. Sequence contamination can occur due to sample contamination or the presence of some oligos during sequencing. The Overrepresented sequences QC showed no contaminants in the samples.
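Both checks can be illustrated with a small Python sketch. The adapter used here is the common prefix of Illumina TruSeq adapters, the 0.1% reporting threshold is an assumed value chosen for illustration, and the function names are hypothetical. This is not how Strand NGS implements these QC steps.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Common prefix of Illumina TruSeq adapters, used here only as an example.
ILLUMINA_ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(reads: Sequence[str],
                     adapter: str = ILLUMINA_ADAPTER_PREFIX) -> float:
    """Fraction of reads that contain the adapter sequence anywhere."""
    if not reads:
        return 0.0
    hits = sum(1 for r in reads if adapter in r)
    return hits / len(reads)


def overrepresented(reads: Sequence[str],
                    min_fraction: float = 0.001) -> List[Tuple[str, float]]:
    """Sequences making up more than min_fraction of all reads (assumed cutoff)."""
    counts = Counter(reads)
    n = len(reads)
    return [(seq, c / n) for seq, c in counts.most_common() if c / n > min_fraction]


# Hypothetical reads: one carries adapter read-through, one sequence is duplicated.
reads = ["ACGTAGATCGGAAGAGCTT", "ACGTACGT", "ACGTACGT", "TTTT"]
print(adapter_fraction(reads))
print(overrepresented(reads, min_fraction=0.4))
```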
The Pre-Alignment QC analysis showed that the samples had good base quality scores and no contaminating sequences. Once quality control is done, the samples are aligned to a reference. Alignment is the process of placing the sequenced reads at their genomic positions with the help of a reference genome. This alignment helps in identifying the genes or genomic position(s) to which the reads belong. A base in the read that differs from the reference genome at a single position is a candidate SNP, while bases inserted into or deleted from the read relative to the reference are insertions and deletions (indels).
Strand NGS uses its own algorithm, COBWeb, for alignment. This algorithm can handle both long and short reads and allows for varying numbers of gaps and mismatches. These samples were aligned against the IWGSC wheat reference sequence from Ensembl [3] using settings of 90% identity and 5% gaps.
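To show what identity and gap thresholds mean for a single read, here is a Python sketch that scores one gapped pairwise alignment. It is not COBWeb. It assumes the 90% setting is a minimum identity and the 5% setting a maximum gap percentage, and it assumes both are measured over alignment columns, which is only one of several possible definitions. The function names are hypothetical.

```python
from typing import Tuple


def alignment_stats(read_aln: str, ref_aln: str) -> Tuple[float, float]:
    """Return (percent identity, percent gaps) over the alignment columns.

    Both strings are rows of one pairwise alignment, with '-' marking a gap.
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have equal length")
    columns = len(read_aln)
    pairs = list(zip(read_aln, ref_aln))
    gaps = sum(1 for a, b in pairs if a == "-" or b == "-")
    matches = sum(1 for a, b in pairs if a == b and a != "-")
    return 100.0 * matches / columns, 100.0 * gaps / columns


def passes_filter(read_aln: str, ref_aln: str,
                  min_identity: float = 90.0, max_gaps: float = 5.0) -> bool:
    """Assumed rule: keep alignments with enough identity and few enough gaps."""
    identity, gap_pct = alignment_stats(read_aln, ref_aln)
    return identity >= min_identity and gap_pct <= max_gaps


# Hypothetical 20-column alignment with one mismatch and one deleted base.
ref = "ACGTACGTACGTACGTACGT"
read = "ACGAACGTAC-TACGTACGT"
print(alignment_stats(read, ref))
print(passes_filter(read, ref))
```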
Once the alignment was done, the aligned reads were available as “Aligned Reads List” in the Experiment Navigator. The alignment percentage of each sample can be seen in the Alignment Report.
Nearly 97% of the reads were mapped to the reference genome in all the samples. The report also shows how many reads aligned to one unique region of the reference genome, how many matched multiple regions, and how many were only partially aligned or split. It also lists the reads that were not aligned to the reference genome because they had no match or too many matches, as well as the distribution of aligned reads across chromosomes.
The quality of the alignment can be checked with the Post-Alignment QC plots. The alignment score plot illustrates the distribution of reads by alignment score. The alignment score is the percent identity between the read and the reference. In Figure 5, most reads have an alignment score of 100. The mapping quality plot is a histogram of mapping quality for all reads. Mapping quality is a score that reflects the probability of a read being mapped to the wrong position: the higher the score, the lower that probability. A mapping quality of 40 or more indicates a mapping error probability of 0.01% or less. In Figure 5, some reads have a mapping quality of 0 and most reads have a mapping quality >40.
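The 0.01% figure follows if mapping quality is Phred-scaled, that is, MAPQ = -10 log10(P(mapping is wrong)). That scale is an assumption here, although it is consistent with the numbers quoted above. The short Python sketch below, with hypothetical function names, converts in both directions.

```python
import math


def mapq_to_error_probability(mapq: float) -> float:
    """Probability that the mapping is wrong, assuming a Phred-scaled MAPQ."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p: float) -> int:
    """Inverse conversion, rounded to the nearest integer score."""
    return round(-10 * math.log10(p))


for mapq in (0, 20, 40):
    pct = 100 * mapq_to_error_probability(mapq)
    print(f"MAPQ {mapq}: {pct:.4g}% chance the mapping is wrong")
```

On this scale a mapping quality of 0 corresponds to an error probability of 1, which is typical of reads that match several regions equally well.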
This Aligned Read List could now be used to identify variants!
The identification of SNPs using Aligned Read List is continued in Part 2.
- Wu, P., Xie, J., Hu, J., Qiu, D., Liu, Z., Li, J., Li, M., Zhang, H., Yang, L., Liu, H. and Zhou, Y., 2018. Development of molecular markers linked to powdery mildew resistance gene Pm4b by combining SNP discovery from transcriptome sequencing data with bulked segregant analysis (BSR-Seq) in wheat. Frontiers in Plant Science, 9, p. 95.