Mass spectrometry (MS) is an analytical technique for determining the composition of a sample. Conventionally, peptide mass fingerprinting (PMF) is widely used to identify proteins from MS datasets. Here, the authors developed a novel network-based inference software termed NBPMF. By analyzing the peptide-protein bipartite network, they designed new peptide-protein matching score functions. They present two methods: the static one, ProbS, is based on an independent probability framework, and the dynamic one, HeatS, treats the input data as dependent peptides. Moreover, they use linear regression to adjust the matching score according to the masses of proteins. In addition, they consider the order of retention times to further correct the score function. For post-processing, they design two algorithms: assignment of peaks and protein filtration. Finally, they propose two strategies to estimate the false discovery rate. Experiments on simulated, authentic, and simulated authentic datasets demonstrate that the NBPMF approaches lead to significantly improved performance compared to several state-of-the-art methods.
Introduction
Proteins are a class of organic compounds that play many critical roles in all living organisms. They are made up of hundreds or thousands of amino acids linked by peptide bonds. There are 20 different types of amino acids that are common in humans and animals. A proteome is the set of proteins produced in an organism, system, or biological context, and proteomics is the large-scale study of proteomes.
Mass spectrometry (MS) is one of the most informative techniques for determining the composition of a sample. Recently it has become a primary tool for protein identification, quantification, and post-translational modification (PTM) characterization in proteomics research. There are usually two different MS approaches to identifying proteins: top-down and bottom-up. In top-down proteomics, intact protein ions are generated by electrospray ionization, then introduced into a mass analyzer and subjected to gas-phase fragmentation. Top-down MS can sequence intact proteins and is especially suited to the analysis of PTMs (Lanucara & Eyers, 2013). In the conventional bottom-up method, by contrast, protein identification is based on mass spectrometric analysis of peptides derived from proteolytic digestion, usually with trypsin.
In the traditional bottom-up approach, the proteins may first be purified by gel electrophoresis, resulting in one or a few proteins in each proteolytic digest. Alternatively, in shotgun proteomics, the crude protein extract is digested directly, followed by one or more dimensions of peptide separation by liquid chromatography (LC) coupled to MS. There are usually two acquisition modes for bottom-up approaches. The most widespread one is data-dependent acquisition (DDA), where selected peptide precursors following chromatographic separation are fragmented by MS/MS (Link et al., 1999). The other mode is data-independent acquisition (DIA), where all ions within a selected m/z range are fragmented and analyzed by tandem MS. DIA is thus an alternative to DDA, in which only a fixed number of precursor ions are selected and analyzed by tandem MS.
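The difference between the two acquisition modes can be illustrated with a minimal Python sketch. It is not taken from NBPMF or from any instrument software: the function names, the toy survey scan, and the rule of ranking precursors by intensity in DDA are all assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); hypothetical toy representation


def dda_select(precursors: List[Peak], top_n: int) -> List[Peak]:
    """DDA sketch: keep only the top_n most intense precursors of one survey scan.

    Ranking by intensity is an assumption; real instruments apply further
    criteria such as charge state and dynamic exclusion.
    """
    return sorted(precursors, key=lambda p: p[1], reverse=True)[:top_n]


def dia_select(precursors: List[Peak], mz_low: float, mz_high: float) -> List[Peak]:
    """DIA sketch: keep every precursor inside the isolation window [mz_low, mz_high)."""
    return [p for p in precursors if mz_low <= p[0] < mz_high]


# Toy survey scan with five precursor ions (made-up values).
survey_scan = [
    (400.2, 1500.0),
    (455.7, 90.0),
    (512.3, 8200.0),
    (530.9, 300.0),
    (610.4, 4100.0),
]

dda_picked = dda_select(survey_scan, top_n=2)
dia_picked = dia_select(survey_scan, 450.0, 550.0)

print("DDA fragments:", dda_picked)
print("DIA fragments:", dia_picked)
```

In this toy scan, DDA fragments only the two strongest ions, whereas DIA co-fragments everything in the 450 to 550 m/z window, including the weak ion at 455.7 that DDA would skip.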
In wet-lab procedures for protein identification based on the most commonly used DDA mode, a sample first undergoes enzymatic digestion. Liquid chromatography and tandem mass spectrometry (LC-MS/MS) are then used to analyze the resultant peptides. This bottom-up approach attempts to reconstruct the original protein sample from the identified peptides, since they can serve as surrogates for their parent proteins. To analyze the dataset, we need a protein sequence database that contains all target proteins. Each MS/MS scan is used to identify a peptide-spectrum match from this database; finally, the matched peptides are searched against the database to identify the proteins.
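The enzymatic digestion step can be mimicked in silico, which is how candidate peptides are typically generated from a sequence database. The following Python sketch assumes the commonly used simplified trypsin rule (cleave after K or R unless the next residue is P); the function name `tryptic_digest`, the `missed_cleavages` parameter, and the example sequence are hypothetical and not part of NBPMF.

```python
from typing import List


def tryptic_digest(sequence: str, missed_cleavages: int = 0) -> List[str]:
    """Return tryptic peptides of `sequence`, allowing up to `missed_cleavages`.

    Assumed simplified rule: cut after K or R, except when followed by P.
    """
    if not sequence:
        return []

    # Positions at which the chain is cut, including both termini.
    cut_points = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            cut_points.append(i + 1)
    cut_points.append(len(sequence))

    peptides = []
    for start in range(len(cut_points) - 1):
        # Join 1 .. missed_cleavages + 1 consecutive fragments.
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end >= len(cut_points):
                break
            peptides.append(sequence[cut_points[start]:cut_points[end]])
    return peptides


# Made-up sequence: the R at position 4 is followed by P, so it is not cleaved.
protein = "MAKRPLKGR"
fully_cleaved = tryptic_digest(protein)
with_one_missed = tryptic_digest(protein, missed_cleavages=1)

print(fully_cleaved)
print(with_one_missed)
```

Allowing missed cleavages enlarges the candidate peptide list, which is one reason the search space grows quickly for complex samples.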
For tandem mass spectra in DDA mode, there are roughly four ways to interpret the dataset and identify proteins from their fragmentation spectra: sequence database searching (Cottrell & London, 1999; Zhang et al., 2014), spectral library searching (Yates et al., 1998; Lam et al., 2007), the database-independent approach (de novo sequencing; Ma et al., 2003; Liu et al., 2017; Tran et al., 2017), and hybrid interpretation algorithms (Mann & Wilm, 1994; Yan & Zhang, 2016). Identification from tandem mass spectra in this way is also called peptide fragment fingerprinting (PFF).
Certain challenges arise when the above enzymatic digestion and LC-MS/MS workflow is applied to complex protein samples, such as plasma or a whole-cell lysate. For example, after digestion a protein sample can produce a multitude of peptides, including the expected peptides, peptides with missed cleavages, and peptides carrying PTMs. This leads to a peptide under-sampling problem. Even with thorough sample preparation and chromatographic separation, peptides enter the mass spectrometer faster than they can be isolated and fragmented. Therefore, the majority of peptides in the sample are often left unanalyzed. Even with alternative acquisition strategies, such as selected reaction monitoring (SRM; Anderson & Hunter, 2006) or accurate mass and time tags (AMT; Smith, 2002), under-sampling is unlikely to be eliminated completely.