Event date:
May 26 2022 3:00 pm

An Integrated In silico Approach to Characterize Proteoforms using Sequence and Structure Data

Kanzal Iman (LUMS ID: 2017-14-0016)
Modern-day proteomics involves the integrative investigation of protein sequences, structures, functions, and interactions under normal as well as perturbed conditions such as disease or treatment. Specifically, mass spectrometry-based proteomics enables protein identification, characterization, and quantitation at both the intact protein and peptide levels. Recent advancements in proteomics protocols and instrumentation have enabled precise mass measurements of large proteins by employing soft ionization techniques coupled with high-resolution mass analyzers. This has led to the emergence of the Top-Down Proteomics (TDP) protocol, which is becoming increasingly popular for analyzing intact proteins. TDP offers enhanced sequence coverage compared with other proteomics protocols, along with improved identification and characterization of proteoforms (proteins and their variants). The sequence information retrieved using TDP data analysis tools, coupled with recent advances in molecular and structural biology, has also provided impetus to structural proteomics. Importantly, information on the three-dimensional structure of proteins is crucial in elucidating their functional role. To predict the sequence-structure-function relationship of proteins, in the first phase of our research we developed an integrated proteomics pipeline for in silico sequence and structure analysis of proteins. Towards enhancing proteoform identification and characterization in top-down proteomics, we developed SPECTRUM, an open-source and open-architecture MATLAB toolbox. This state-of-the-art platform provides significantly enhanced proteoform identification and characterization rates (91% to 177%) compared with contemporary TDP tools. Next, we developed PERCEPTRON, a GPU-accelerated, web-based TDP search engine that allows for the analysis of complex TDP spectra in large-scale whole-proteome studies.
PERCEPTRON outperforms all contemporary tools by up to 135% in terms of reported proteins and 10-fold in terms of runtime. We further enhanced the specificity of the aforementioned tools by developing a novel algorithm, COINS, for use in the proteoform search pipeline. COINS enhanced proteoform search by identifying up to 86% more proteoforms. An increase of up to 98% in proteoform spectral matches (PrSMs) and a 3.6-fold increase in peptide sequence tags (PSTs) were reported, offering high-confidence proteoform assignment. Moreover, a significant (1.5-fold) increase in the characterization of post-translational modifications (PTMs) was also observed.

In the second phase of this work, we employed sequence-based information to develop a novel structural proteomics workflow. To that end, we obtained ~200 sequences of Hepatitis C Virus (HCV) NS3/4A of Genotype 3 (G3) and modeled each mutational variant for subsequent pharmacophore-based virtual screening (PBVS) followed by covalent docking. We then targeted the predicted structures using more than 100 ligands, which were selected after screening small-molecule databases including MolPort, ChEMBL, DrugBank, ZINC, PubChem, and Mcule. Mutations at 14 positions within the HCV NS3/4A G3 ligand-binding pocket, including F43L, H57R, Q80K, R123T/S, I132L, Y134C/R/S, S139P, R155G, A156T, V158A, C159V, D168Q, C525W/Y, and Q526H/R, were evaluated. Two of these mutations (H57R and S139P) lie within the catalytic triad. We applied several in silico methods to investigate the mutational variations in the binding pocket of G3 HCV NS3/4A and evaluated ligands for its effective inhibition. cpd-217 (CHEMBL569970; PubChem45485999), which contains a chemical warhead, was identified as a potential covalent inhibitor targeting Ser139. The hit established a covalent bond (C–S) with the reactive Ser139 while forming favorable interactions with ligand-binding residues. The binding stability of cpd-217 was then confirmed by molecular dynamics simulation followed by MM/GBSA binding free energy calculation. The free energy decomposition analysis indicated that the resistant mutants alter the HCV NS3/4A-ligand interaction, resulting in an unbalanced energy distribution within the binding site that leads to drug resistance. cpd-217 was found to interact with all NS3/4A G3 variants, with significant covalent docking scores ranging from -6.5 to -4.1 kcal/mol. We concluded that cpd-217 is a potential inhibitor of HCV NS3/4A G3 variants that warrants further in vitro and in vivo studies.
The study will pave the way for the design and development of drugs targeting NS3/4A of HCV G3, a genotype that has a high prevalence in developing countries, including Pakistan.

Towards a translational application of the pipeline, a metaproteomics case study was developed for profiling microbial and heavy metal contamination of the Hudiara drain (a large wastewater channel) in Lahore, Pakistan. Profiling of the microbiota within these water samples revealed the presence of bacterial and fungal genera including Bacillus, Exiguobacterium, Aspergillus, and Penicillium. These organisms have previously been reported to be resistant to environmental stresses, including high heavy metal concentrations.

In conclusion, this thesis proposes a multi-level proteomics pipeline that brings together different computational proteomics approaches towards the sequence-structure-function analysis of proteins. The proposed pipeline paves the way for developing a next-generation integrative in silico platform for enhanced identification and characterization of proteoforms.

Final Defense Committee Members:

1. Dr. Basit Shafiq (Chair & Associate Professor, External Thesis Committee Member)

Syed Babar Ali School of Science and Engineering, Department of Computer Science, Lahore University of Management Sciences (LUMS), Pakistan.

2. Dr. Shamshad Zarina (Visiting Professor, External Thesis Committee Member)

Dr. Zafar H. Zaidi Center for Proteomics, University of Karachi, Pakistan.

3. Dr. Shaper Mirza (Associate Professor, Thesis Committee Member)

Syed Babar Ali School of Science and Engineering, Department of Life Sciences, Lahore University of Management Sciences (LUMS), Pakistan.

4. Dr. Muhammad Tariq (Associate Professor, Thesis Committee Member)

Syed Babar Ali School of Science and Engineering, Department of Life Sciences, Lahore University of Management Sciences (LUMS), Pakistan.

5. Dr. Safee Ullah Chaudhary (Associate Professor, PhD Supervisor)

Syed Babar Ali School of Science and Engineering, Department of Life Sciences, Lahore University of Management Sciences (LUMS), Pakistan.


1. Basharat, Abdul Rehman, Kanzal Iman, Muhammad Farhan Khalid, Zohra Anwar, Rashid Hussain, Humnah Gohar Kabir, Maria Tahreem et al. "SPECTRUM–A MATLAB toolbox for proteoform identification from top-down proteomics data." Scientific Reports 9, no. 1 (2019): 1-14.

2. Khalid, Muhammad Farhan, Kanzal Iman, Amna Ghafoor, Mujtaba Saboor, Ahsan Ali, Urwa Muaz, Abdul Rehman Basharat et al. "PERCEPTRON: An open-source GPU-accelerated proteoform identification pipeline for top-down proteomics." Nucleic Acids Research (2021).

3. Iman, Kanzal, Kyung-Hoon Kwon, Sung Hwan Kim, Kyu Hwan Park, Manhoi Hur, Yong Seong Cho, Hyun Sik Kim et al. "De novo-based Complementary Ion Search (COINS) Algorithm for Enhanced Proteoform Identification and Characterization in Top-Down Proteomics." Proteomics. (In Review)

4. Ashraf, Muhammad Usman, Kanzal Iman, Muhammad Farhan Khalid, Hafiz Muhammad Salman, Talha Shafi, Momal Rafi, Nida Javaid et al. "Evolution of efficacious pangenotypic hepatitis C virus therapies." Medicinal Research Reviews 39, no. 3 (2019): 1091-1136.

5. Iman, Kanzal, Muhammad Usman Mirza, Fazila Sadia, Matheus Froeyen, Safee Ullah Chaudhary. "An integrative pharmacophore-based screening, covalent docking, molecular dynamics and MM-GBSA approach reveals a covalent inhibitor for targeting drug-resistant Genotype 3 variants of Hepatitis C Viral NS3/4A serine protease." PLOS ONE. (In Review)

6. Sabir, Ambreen, Zainab Nasir, Zonaira Khalid, Hafiz Muhammad Salman, Muhammad Farhan Khalid, Muhammad Burhan Khalid, Fatima Arshad, Kanzal Iman et al. "Profiling microbial and heavy metal contamination of Hudiara drain and its adjoining areas in Lahore, Pakistan." Chemosphere. (In Review)