An algorithm for fast and accurate haplotyping using next-generation sequencing data
Abstract
Haplotypes provide a better picture of the genome than genotypes because they indicate which chromosome carries a particular allele. Haplotyping is a challenging task because standard genotyping methods only provide information about an individual's alleles at a specific genetic locus, not across the full length of a chromosome. In addition, haplotyping has traditionally required analyzing polymorphic markers in pedigree members for humans, or performing experiments to study the segregation of alleles in other organisms. With advances in sequencing technologies, a plethora of draft genomes has become available in recent years, requiring a better approach to haplotyping. Most currently available haplotyping tools that use high-throughput sequencing data are either slow or unable to deal with noise in the data. In this study, we have developed a novel algorithm for fast and accurate haplotyping using next-generation sequencing data. Implemented in Python, it assigns haplotypes across the full length of a chromosome by systematically analyzing the heterozygous variant positions along the sequencing reads and constructing a haplotype graph. We use our algorithm to assign haplotypes in Mangifera indica (mango) cv. Kala Chaunsa. Results show that our algorithm outperforms widely used haplotyping tools such as HapCUT2 and WhatsHap.
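To illustrate the general idea of read-based phasing with a haplotype graph, here is a minimal sketch. It is not the algorithm developed in this study, whose details the abstract does not give. It assumes each read has already been reduced to a mapping from heterozygous variant position to observed allele (0 for reference, 1 for alternate). The function names `build_haplotype_graph` and `phase` and the majority-vote rule are hypothetical choices made for illustration.

```python
from collections import Counter


def build_haplotype_graph(reads):
    """Count read support for allele pairs at sites that are consecutive within a read.

    reads: list of dicts mapping variant position -> observed allele (0 or 1).
    Returns a Counter keyed by ((pos_a, allele_a), (pos_b, allele_b)).
    """
    edges = Counter()
    for read in reads:
        positions = sorted(read)
        for a, b in zip(positions, positions[1:]):
            edges[((a, read[a]), (b, read[b]))] += 1
    return edges


def phase(reads):
    """Assign two complementary haplotypes by majority vote between neighbouring sites.

    Simplifying assumptions: diploid sample, biallelic sites, and a single phase
    block. A pair of sites with no linking reads is arbitrarily treated as 'same
    phase'; a real tool would start a new block there.
    """
    edges = build_haplotype_graph(reads)
    positions = sorted({p for read in reads for p in read})
    if not positions:
        return {}, {}
    hap1 = {positions[0]: 0}  # arbitrary orientation for the first site
    for a, b in zip(positions, positions[1:]):
        # Reads showing the same allele at both sites support 'cis';
        # reads showing different alleles support 'trans'.
        cis = edges[((a, 0), (b, 0))] + edges[((a, 1), (b, 1))]
        trans = edges[((a, 0), (b, 1))] + edges[((a, 1), (b, 0))]
        hap1[b] = hap1[a] if cis >= trans else 1 - hap1[a]
    hap2 = {p: 1 - allele for p, allele in hap1.items()}
    return hap1, hap2


# Toy input: five reads over three heterozygous sites; the last read is noisy.
reads = [
    {100: 0, 250: 1},
    {100: 1, 250: 0},
    {250: 1, 400: 1},
    {250: 0, 400: 0},
    {100: 0, 250: 0},  # conflicts with the first two reads
]
hap1, hap2 = phase(reads)
print(hap1)  # {100: 0, 250: 1, 400: 1}
print(hap2)  # {100: 1, 250: 0, 400: 0}
```

In this toy input, the single conflicting read is outvoted by the two reads that agree. This shows, in the simplest possible form, how aggregating evidence over many reads can tolerate noise in the data.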
Evaluation Committee:
- Dr. Aziz Mithani (advisor)
- Dr. Amir Faisal