Molecular SNP markers are positions in the DNA that carry variations, and they help in selecting desired traits for plant breeding. Pm4b is a gene of interest that confers resistance to powdery mildew, a fungal disease, in certain varieties of wheat. Samples with and without the Pm4b gene were taken to identify SNP markers that might assist in selective breeding. Part 1 explained the quality control of the samples and their alignment to the reference wheat genome using Strand NGS.
The SNPs are identified from the Aligned Reads list. SNP detection in Strand NGS uses a Bayesian SNP calling algorithm [1]. A SNP is a position in the genome where the base differs from the reference base. In the Aligned Reads list, reads can show multiple base changes or mismatches. These mismatches could be true variants or alignment errors. Misalignments caused by sequencing errors or by consecutive variations within a read are called mapping errors. The Bayesian algorithm detects SNPs using base quality scores, and base quality can be under- or overestimated. Both kinds of error can lead to the detection of erroneous SNPs. These artefacts can be removed by performing Local Realignment and Base Quality Recalibration using the SNP Pre-processing steps in the Strand NGS workflow.
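To make the role of base quality concrete, here is a minimal sketch of the standard Phred convention, in which a quality score Q corresponds to an error probability of 10^(-Q/10). The function names are illustrative and are not part of Strand NGS.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

Under this convention, a base reported at Q30 is expected to be wrong about once in 1,000 calls. If its true error rate is closer to once in 100 (Q20), the quality is overestimated and a mismatch at that base looks more convincing than it should.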
In Figure 2, some reads in All Aligned Reads show base G as an insertion (purple) and a mismatch, while other reads show both Gs as mismatches. If the first mismatched G is locally realigned as an insertion, all reads align consistently within the same region. Local realignment of reads thus helps in finding the right variants. The locally realigned reads are saved as Local Realigned Reads, a node under All Aligned Reads.
Strand NGS identifies regions in which multiple variations are located close together. These regions are identified based on parameters such as the minimum number of nearby variations and whether these variations can be merged. Our software then identifies the types of variation within these regions and scores each variant based on the supporting reads, the probability of variation, etc. Each score reflects the probability of a variation relative to the other variations, taking into account the error rate and the offset in alignment. The highest-scoring variation is inferred as the most likely variation, and all the reads are then realigned according to it.
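The first step, finding regions with closely spaced variations, can be sketched as a simple clustering of variant positions. This is only an illustration of the idea, not the Strand NGS implementation. The function name, the `max_gap` and `min_variations` parameters and their default values are assumptions, and the scoring and realignment steps are not covered.

```python
def find_candidate_regions(positions, max_gap=10, min_variations=2):
    """Group variant positions into candidate regions for local realignment.

    positions: genomic coordinates (on one chromosome) where reads show a
        mismatch or indel.
    max_gap: largest distance between neighbouring variations that still
        lets them be merged into one region (hypothetical default).
    min_variations: minimum number of variations a region must contain
        (hypothetical default).
    Returns a list of (start, end) tuples.
    """
    regions = []
    cluster = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if cluster and pos - cluster[-1] > max_gap:
            if len(cluster) >= min_variations:
                regions.append((cluster[0], cluster[-1]))
            cluster = []
        cluster.append(pos)
    # Close the final cluster.
    if len(cluster) >= min_variations:
        regions.append((cluster[0], cluster[-1]))
    return regions
```

For example, calling it with positions 100, 104, 109, 500, 900 and 905 and the default parameters keeps the two clusters around 100 to 109 and 900 to 905, and drops the isolated variation at 500.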
Base Quality Recalibration:
When a certain base repeats in a DNA sequence, the sequencer can call the next base(s) erroneously because of this repetition. Thus, the chance of a base being wrong depends on the adjacent base. Different sequencers (Illumina, PacBio, etc.) also use different types of sequencing cycle to sequence reads, and this contributes to the base quality.
Strand NGS recalibrates the base quality based on these factors to prevent base quality errors. It identifies all bases that have a mismatch and separates the bases into bins based on the above factors. The base scores are then recalibrated from the number of mismatches and the total number of bases in each bin.
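The binning idea can be sketched as follows. This is a simplified illustration, not the Strand NGS implementation: the choice of bin key (reported quality, preceding base, sequencing cycle), the function name, and the small pseudocount used to avoid dividing by zero are all assumptions.

```python
import math
from collections import defaultdict


def recalibrate(observations):
    """Estimate an empirical quality score for each bin of bases.

    observations: iterable of (reported_q, prev_base, cycle, is_mismatch)
        tuples, one per aligned base.
    Returns a dict mapping (reported_q, prev_base, cycle) to an empirical
    Phred quality derived from the mismatch rate in that bin.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, prev_base, cycle, is_mismatch in observations:
        key = (reported_q, prev_base, cycle)
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Pseudocounts (+1, +2) keep the estimate finite for bins
        # with zero mismatches.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = round(-10 * math.log10(error_rate))
    return table
```

In this sketch, a bin of 998 bases reported at Q30 that contains 9 mismatches has an estimated error rate of 10/1000, so its bases would be recalibrated to Q20.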
In Figure 3, the yellow bins represent the recalibrated base quality scores. They lie close to the identity line (x-axis value = y-axis value) because the reported (x-axis) and empirical (y-axis) qualities tend to be the same after recalibration.
The SNPs are identified by the SNP Detection workflow under Sequence Analysis. It calculates the probability that a variant occurs given the observed data. This probability is calculated from the number of supporting reads, base quality, error rates, etc. These parameters can be customized in the SNP Detection wizard. The output of SNP Detection is the SNP Multi Sample Report under the read list. The Recalibrated Local Realigned Reads of the wheat samples had 115,810 SNPs.
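As a rough illustration of "the probability of a variant given the observed data", the sketch below applies Bayes' rule to a deliberately simplified model with only two hypotheses: the site carries the reference base or the alternate base. It is not the Strand NGS algorithm. The function name, the uniform split of errors across the other three bases, and the default prior are assumptions made for the example.

```python
import math


def variant_posterior(observations, ref, alt, prior_variant=0.001):
    """Posterior probability that a site carries `alt` rather than `ref`.

    observations: list of (base, phred_quality) pairs from the reads
        covering the site.
    prior_variant: assumed prior probability of a variant at any site
        (hypothetical value).
    """
    log_like_ref = 0.0
    log_like_alt = 0.0
    for base, quality in observations:
        # Error probability from the Phred score, capped at 0.75
        # (a completely random base call).
        err = min(10 ** (-quality / 10), 0.75)
        # A base matching the hypothesis is a correct call; any other
        # base is an error, assumed equally likely to be each of the
        # three remaining bases.
        log_like_ref += math.log(1 - err if base == ref else err / 3)
        log_like_alt += math.log(1 - err if base == alt else err / 3)

    log_post_ref = log_like_ref + math.log(1 - prior_variant)
    log_post_alt = log_like_alt + math.log(prior_variant)

    # Normalise in log space to avoid underflow at high coverage.
    top = max(log_post_ref, log_post_alt)
    weight_ref = math.exp(log_post_ref - top)
    weight_alt = math.exp(log_post_alt - top)
    return weight_alt / (weight_ref + weight_alt)
```

In this toy model, a single low-quality mismatch among many high-quality reference bases barely moves the posterior, whereas several high-quality reads supporting the alternate base push it close to 1. This is why both the number of supporting reads and accurate base qualities matter.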
SNP markers should be present only in the resistant samples, which carry Pm4b, and should have high confidence (supporting reads) and coverage. Using Find Significant SNPs, the SNP Multi Sample Report was filtered for high confidence, coverage, and presence in the group of Resistant samples. This yielded 3,425 significant variants.
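The filtering logic described above can be sketched in a few lines. The record layout, the function name and the threshold defaults are hypothetical and do not reflect the SNP Multi Sample Report format or the Find Significant SNPs settings. The sketch also assumes that susceptible samples must be well covered at the site, so that the absence of the variant is credible.

```python
def find_resistant_only_snps(snps, resistant, susceptible,
                             min_support=10, min_coverage=20):
    """Keep SNPs seen confidently in every resistant sample and in no
    susceptible sample.

    snps: list of dicts with a "calls" entry that maps a sample name to
        (supporting_reads, coverage) at the SNP position.
    resistant, susceptible: lists of sample names.
    min_support, min_coverage: hypothetical thresholds.
    """
    selected = []
    for snp in snps:
        calls = snp["calls"]
        # Missing samples are treated as having no reads at the site.
        present_in_resistant = all(
            calls.get(s, (0, 0))[0] >= min_support
            and calls.get(s, (0, 0))[1] >= min_coverage
            for s in resistant
        )
        absent_in_susceptible = all(
            calls.get(s, (0, 0))[0] == 0
            and calls.get(s, (0, 0))[1] >= min_coverage
            for s in susceptible
        )
        if present_in_resistant and absent_in_susceptible:
            selected.append(snp)
    return selected
```

A SNP with strong support in both resistant samples and no supporting reads in well-covered susceptible samples passes the filter. A SNP that also appears in a susceptible sample, or that has too few supporting reads in a resistant sample, is dropped.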
The Pm4b gene is located in the distal region of chromosome 2A of wheat [2], and chromosome 2A has 242 SNPs, as observed in Figure 5. The SNPs in chromosome 2A were examined using the genome browser. The Genome view in the genome browser was used to mark the distal region of chromosome 2A.
Figure 6 shows the coverage of all 4 samples, the significant SNPs found in chromosome 2A, and the transcripts. In Figure 7, the Genome view shows the region currently displayed in the genome browser. The distal region was selected in the Genome view to identify the SNPs in that region, and 55 SNPs were present there. These SNPs could be considered potential markers for selective breeding.
These SNPs should now be experimentally validated to determine whether they can serve as biomarkers for Pm4b selection. In this way, potential biomarkers can be identified easily and can help mitigate agronomic challenges.
- Wu, Peipei, et al. “Development of molecular markers linked to powdery mildew resistance gene Pm4b by combining SNP discovery from transcriptome sequencing data with bulked segregant analysis (BSR-Seq) in wheat.” Frontiers in Plant Science 9 (2018): 95.