An Integrated In Silico Approach to Characterize Proteoforms using Sequence and Structure Data
Abstract:
Modern-day proteomics involves the integrative investigation of protein sequences, structures, functions, and interactions in normal as well as perturbed conditions such as disease or treatment. Specifically, mass spectrometry-based proteomics enables protein identification, characterization, and quantitation at both the intact protein and peptide levels. Recent advancements in proteomics protocols and instrumentation have enabled precise mass measurements of large proteins by employing soft ionization techniques coupled with high-resolution mass analyzers. This has led to the emergence of the Top-Down Proteomics (TDP) protocol, which is becoming increasingly popular for analyzing intact proteins. TDP offers enhanced sequence coverage compared to other proteomics protocols, along with improved identification and characterization of proteoforms (proteins and their variants). The sequence information retrieved using TDP data analysis tools, coupled with recent advances in molecular and structural biology, has also provided impetus to structural proteomics. Importantly, information on the three-dimensional structure of proteins is crucial in elucidating their functional role. To predict the sequence-structure-function relationship of proteins, in the first phase of our research we developed an integrated proteomics pipeline for in silico sequence and structure analysis of proteins. Towards enhancing proteoform identification and characterization in top-down proteomics, we developed SPECTRUM, an open-source and open-architecture MATLAB toolbox. This state-of-the-art top-down proteomics platform provides significantly enhanced proteoform identification and characterization rates (91% to 177%) compared to contemporary TDP tools. Next, we developed a GPU-accelerated, web-based TDP search engine, PERCEPTRON, which allows for the analysis of complex TDP spectra in large-scale whole-proteome studies.
PERCEPTRON outperforms all contemporary tools by up to 135% in terms of reported proteins and by 10-fold in terms of runtime. We further enhanced the specificity of the aforementioned tools by developing a novel algorithm, COINS, for use in the proteoform search pipeline. COINS enhanced proteoform search by identifying up to 86% more proteoforms. An increase of up to 98% in proteoform spectral matches (PrSMs) and a 3.6-fold increase in peptide sequence tags (PSTs) were reported, offering high-confidence proteoform assignment. Moreover, a significant increase (1.5-fold) in the characterization of post-translational modifications (PTMs) was also observed.
In the second phase of this work, we employed sequence-based information to develop a novel structural proteomics workflow. To this end, we obtained ~200 sequences of Hepatitis C Virus (HCV) NS3/4A of Genotype 3 (G3) and modeled each mutational variant for subsequent pharmacophore-based virtual screening (PBVS) followed by covalent docking. We then targeted the predicted structures using >100 ligands, which were selected after screening small molecule databases including MolPort, ChEMBL, DrugBank, ZINC, PubChem, and Mcule. Mutations at 14 positions within the HCV NS3/4A G3 ligand-binding pocket, including F43L, H57R, Q80K, R123T/S, I132L, Y134C/R/S, S139P, R155G, A156T, V158A, C159V, D168Q, C525W/Y, and Q526H/R, were evaluated. Two of these mutations (H57R and S139P) were located within the catalytic triad. We applied several in silico methods to investigate the mutational variations in the binding pocket of G3 HCV NS3/4A and evaluated ligands for its effective inhibition. The compound cpd-217 (CHEMBL569970; PubChem45485999), which contains a chemical warhead, was identified as a potential covalent inhibitor targeting Ser139. The hit established a covalent bond (C–S) with the reactive Ser139 and formed favorable interactions with ligand-binding residues. The binding stability of cpd-217 was then confirmed by molecular dynamics simulation followed by MM/GBSA binding free energy calculation. The free energy decomposition analysis indicated that the resistant mutants alter the HCV NS3/4A-ligand interaction, resulting in an unbalanced energy distribution within the binding site that leads to drug resistance. The compound cpd-217 was found to interact with all NS3/4A G3 variants, with significant covalent docking scores ranging from -6.5 to -4.1 kcal/mol. We concluded that cpd-217 is a potential inhibitor of HCV NS3/4A G3 variants that warrants further in vitro and in vivo studies.
The study will pave the way for drug design and development against HCV G3 NS3/4A, a genotype that has a high prevalence in developing countries, including Pakistan.
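The mutation shorthand used above (for example, R123T/S denotes two separate substitutions at position 123) can be expanded into individual single-residue variants before modeling. The following is a minimal illustrative sketch, not the workflow's actual code: the names MUTATIONS, PATTERN, and expand are hypothetical, and the only inputs are the mutation list and the two catalytic-triad positions (57 and 139) stated in this abstract.

```python
import re

# Mutation shorthand copied from the abstract: wild-type residue, position,
# then one or more substituted residues separated by "/".
MUTATIONS = (
    "F43L, H57R, Q80K, R123T/S, I132L, Y134C/R/S, S139P, R155G, "
    "A156T, V158A, C159V, D168Q, C525W/Y, Q526H/R"
)

# One letter, an integer position, then substitutions such as "T" or "C/R/S".
PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z](?:/[A-Z])*)$")


def expand(shorthand):
    """Return (wild_type, position, substitution) tuples, one per single variant."""
    variants = []
    for token in shorthand.split(","):
        match = PATTERN.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognized mutation token: {token!r}")
        wild_type, position, substitutions = match.groups()
        for substitution in substitutions.split("/"):
            variants.append((wild_type, int(position), substitution))
    return variants


variants = expand(MUTATIONS)

# Distinct residue positions covered by the list.
positions = sorted({position for _, position, _ in variants})

# Variants at the two catalytic-triad positions named in the abstract.
catalytic_triad_hits = [v for v in variants if v[1] in (57, 139)]
```

Expanding the shorthand this way makes it straightforward to check that the list covers the stated number of positions and to single out the catalytic-triad substitutions for separate treatment.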
Towards a translational application of the pipeline, a metaproteomics case study was developed for profiling microbial and heavy metal contamination of the Hudiara drain (a large wastewater channel) in Lahore, Pakistan. Profiling of the microbiota within these water samples revealed the presence of bacterial and fungal genera including Bacillus, Exiguobacterium, Aspergillus, and Penicillium. These organisms have previously been reported to be resistant to environmental stresses, including high heavy metal concentrations.
In conclusion, this thesis proposes a multi-level proteomics pipeline that brings together different computational proteomics approaches for the sequence-structure-function analysis of proteins. The proposed pipeline paves the way for developing a next-generation integrative in silico platform for enhanced identification and characterization of proteoforms.