1. Introduction
Mass spectrometry (MS) is one of the most informative techniques for determining the composition of a sample. It has recently become a primary tool in proteomics research for protein identification, quantification, and characterization of post-translational modifications (PTMs). MS-based protein identification usually follows one of two approaches: top-down or bottom-up. In top-down proteomics, intact protein ions are generated by electrospray ionization (ESI), introduced into a mass analyzer, and subjected to gas-phase fragmentation. Top-down MS can sequence intact proteins and is especially useful for the analysis of PTMs (Lanucara & Eyers, 2013). In the conventional bottom-up method, by contrast, protein identification is based on mass spectrometric analysis of peptides derived from proteolytic digestion, usually with trypsin.
Bottom-up approaches usually run in one of two modes. The most widespread is data-dependent acquisition (DDA), in which a fixed number of peptide precursor ions are selected after chromatographic separation and fragmented by MS/MS (Link et al., 1999). The alternative is data-independent acquisition (DIA), in which all ions within a selected m/z range are fragmented and analyzed by tandem MS.
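The difference between the two modes can be illustrated with a minimal Python sketch. It assumes that DDA picks the most intense precursors from a survey scan (a common top-N rule, simplified here by ignoring dynamic exclusion) and that DIA uses fixed-width isolation windows. The function names, the survey-scan values, and the window width are hypothetical and chosen only for illustration.

```python
def dda_select(precursors, top_n):
    """DDA sketch: keep only the top_n most intense (m/z, intensity) precursors."""
    ranked = sorted(precursors, key=lambda p: p[1], reverse=True)
    return sorted(ranked[:top_n])  # report the selected ions in m/z order


def dia_windows(mz_min, mz_max, width):
    """DIA sketch: tile the m/z range with consecutive isolation windows."""
    windows = []
    lo = mz_min
    while lo < mz_max:
        windows.append((lo, min(lo + width, mz_max)))
        lo += width
    return windows


def dia_assign(precursors, windows):
    """Assign every precursor to the window [lo, hi) that contains its m/z."""
    assigned = {w: [] for w in windows}
    for mz, intensity in precursors:
        for lo, hi in windows:
            if lo <= mz < hi:
                assigned[(lo, hi)].append((mz, intensity))
                break
    return assigned


# Hypothetical survey scan: (m/z, intensity) pairs.
survey = [(412.7, 9e5), (455.2, 2e4), (498.3, 4e5),
          (523.9, 1e4), (561.1, 7e5), (587.4, 3e4)]

dda_picked = dda_select(survey, top_n=3)
dia_picked = dia_assign(survey, dia_windows(400.0, 600.0, 50.0))

print("DDA fragments:", [mz for mz, _ in dda_picked])
print("DIA fragments per window:", {w: len(p) for w, p in dia_picked.items()})
```

In this toy scan, DDA fragments three of the six precursors, whereas DIA co-fragments every precursor that falls inside one of its windows.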
In wet-lab procedures for protein identification based on the widely used DDA mode, a sample first undergoes enzymatic digestion. The resulting peptides are then analyzed by liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS). This bottom-up approach attempts to reconstruct the original protein sample from the identified peptides, which serve as surrogates for their parent proteins. Analyzing the dataset requires a protein sequence database that contains all target proteins. Each MS/MS scan is used to obtain a peptide-spectrum match against this database; finally, the matched peptides are mapped back to the database sequences to identify the proteins.
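The final step, mapping identified peptides back to proteins, can be sketched as follows. This is a deliberately simplified illustration: it uses plain substring lookup, ignores cleavage rules and scoring, and treats a protein as supported only if at least one peptide maps to it alone. The protein names, sequences, and peptide list are hypothetical.

```python
def map_peptides_to_proteins(peptides, database):
    """For each peptide, list the database proteins whose sequence contains it."""
    return {
        pep: sorted(name for name, seq in database.items() if pep in seq)
        for pep in peptides
    }


def supported_proteins(mapping):
    """Proteins supported by at least one peptide that maps to them uniquely."""
    return sorted({names[0] for names in mapping.values() if len(names) == 1})


# Hypothetical two-protein database and identified peptides.
database = {
    "PROT_A": "AGKTPRLMKPEVRWQK",
    "PROT_B": "MSTPRGGKLLEK",
}
identified = ["TPR", "LMKPEVR", "LLEK"]

mapping = map_peptides_to_proteins(identified, database)
proteins = supported_proteins(mapping)

for pep, names in mapping.items():
    print(pep, "->", names)
print("Supported proteins:", proteins)
```

Here the peptide TPR occurs in both proteins and so cannot discriminate between them, while LMKPEVR and LLEK each point to a single parent protein.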
For tandem mass spectra acquired in DDA mode, there are roughly four ways to interpret the dataset and identify peptides from their fragmentation: sequence database searching (Cottrell & London, 1999), spectral library searching (Yates et al., 1998), database-independent interpretation (de novo sequencing; Ma et al., 2003), and hybrid interpretation algorithms (Mann & Wilm, 1994). Identification from MS/MS fragment spectra is also called peptide fragment fingerprinting (PFF).
Certain challenges arise when the enzymatic digestion and LC-MS/MS workflow described above is applied to complex protein samples, such as plasma or a whole-cell lysate. For example, digestion of a protein sample can produce a multitude of peptides, including the expected fully cleaved peptides, peptides with missed cleavages, and peptides carrying PTMs. This leads to a peptide under-sampling problem: even with thorough sample preparation and chromatographic separation, peptides enter the mass spectrometer faster than they can be isolated and fragmented. The majority of peptides in the sample are therefore often left unanalyzed. Even with alternatives to DDA, such as selected reaction monitoring (SRM; Anderson & Hunter, 2006) or accurate mass and time (AMT) tags (Smith, 2002), under-sampling is unlikely to be eliminated completely.
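How missed cleavages inflate the peptide population can be seen in a small in silico digestion sketch. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P) and does not model PTMs. The sequence and the function names are hypothetical.

```python
import re


def tryptic_sites(seq):
    """Positions immediately after K or R, except when followed by P."""
    return [m.end() for m in re.finditer(r"[KR](?!P)", seq)]


def digest(seq, missed_cleavages=0):
    """Enumerate tryptic peptides allowing up to `missed_cleavages` skipped sites."""
    cuts = [0] + [s for s in tryptic_sites(seq) if s < len(seq)] + [len(seq)]
    peptides = []
    for i in range(len(cuts) - 1):
        # Join up to missed_cleavages + 1 consecutive fully cleaved fragments.
        last = min(i + 2 + missed_cleavages, len(cuts))
        for j in range(i + 1, last):
            peptides.append(seq[cuts[i]:cuts[j]])
    return peptides


# Hypothetical sequence; the K before P (in "KPE") is not a cleavage site.
sequence = "AGKTPRLMKPEVRWQK"

for mc in (0, 1, 2):
    print(mc, "missed cleavages:", len(digest(sequence, mc)), "peptides")
```

For this 16-residue toy sequence the peptide count grows from 4 to 7 to 9 as up to 0, 1, and 2 missed cleavages are allowed; in a whole-cell lysate with thousands of proteins the same effect multiplies the number of species competing for MS/MS time.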
To avoid these problems, advances in LC and MS technologies can be exploited to identify peptides solely from their MS masses and retention times (RT), without MS/MS. This requires instrumentation capable of high-accuracy mass measurements, LC systems with sufficient RT precision, and precise prediction algorithms for relative RT (Krokhin et al., 2004). The technique is analogous to traditional peptide mass fingerprinting (PMF), which has long been used to identify proteins separated by gel electrophoresis (Shevchenko et al., 1996). However, because low-accuracy mass data lack specificity for peptide identification, PMF has been limited to low-complexity samples. The limitation arises because each mass used for fingerprinting can typically be assigned to several peptides from different proteins. In a complex sample, it therefore becomes impossible to infer which proteins are present.
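The effect of mass accuracy on this ambiguity can be illustrated with a short sketch that matches an observed neutral peptide mass against candidate peptides within a parts-per-million (ppm) tolerance. The residue masses are standard monoisotopic values; the candidate peptides, protein names, and the two tolerances (200 ppm and 10 ppm) are hypothetical choices made only to show the contrast.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the peptide termini


def peptide_mass(seq):
    """Monoisotopic neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def match_mass(observed, candidates, tol_ppm):
    """Return (protein, peptide) pairs whose mass lies within tol_ppm of observed."""
    hits = []
    for protein, peptides in candidates.items():
        for pep in peptides:
            theoretical = peptide_mass(pep)
            if abs(observed - theoretical) / theoretical * 1e6 <= tol_ppm:
                hits.append((protein, pep))
    return hits


# Hypothetical candidates: AQR and AKR differ only by Q vs K (about 0.036 Da).
candidates = {"PROT_A": ["AQR", "GGR"], "PROT_B": ["AKR"]}
observed = peptide_mass("AQR")

low_accuracy_hits = match_mass(observed, candidates, tol_ppm=200)
high_accuracy_hits = match_mass(observed, candidates, tol_ppm=10)

print("200 ppm:", low_accuracy_hits)
print(" 10 ppm:", high_accuracy_hits)
```

At the loose tolerance the observed mass is compatible with peptides from two different proteins, while the tight tolerance leaves a single candidate. Peptides with identical elemental composition, such as those differing only by L versus I, share one mass at any accuracy, so mass alone cannot separate them.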