The chemosensitivity of tumours to specific drugs can be predicted from molecular quantities such as gene expression, miRNA expression, and protein concentrations. This finding is important for improving drug efficacy and personalizing drug use. In this paper, the authors present an analysis strategy that, compared to prior work, retains more of the information in the data and may lead to improved chemosensitivity prediction. The authors apply improved methods for estimating the GI50 value of a drug (an indicator of the response to the drug), regression methods for constructing predictive models of the GI50 value, advanced variable selection techniques such as MMPC, and a multi-task variable selection technique for identifying a small signature that is simultaneously predictive for several drugs and cell lines. The methods are applied to gene expression, miRNA expression, and proteomics data from 53 tumour cell lines after treatment with 120 drugs, obtained from the National Cancer Institute databases. A biological interpretation and discussion of the results are presented for the most clinically important subset of 14 drugs.
Introduction
Prior work shows that the sensitivity of a tumour to a drug can be predicted better than chance from the gene expression of the tumour (Potti et al., 2006; Augustine et al., 2009). This finding paves the way to personalized therapy models. In addition, identifying the molecular quantities that are predictive may lead to a better understanding of the biological mechanisms a drug employs to attack the tumour. In this paper, we develop an analysis strategy to produce predictive models, estimate their performance, and identify the smallest, most predictive set of molecular quantities required for prediction. The strategy is first applied to and evaluated on the prediction of the response to a set of 120 chemotherapeutic agents, based on the responses measured on 53 solid-tumour cell lines; the data were obtained from the National Cancer Institute databases and contain pre-treatment gene expression, miRNA expression, and protein concentration profiles of the cell lines. Subsequently, we focus on a subset of 14 drugs that are the most interesting in clinical practice and provide a detailed presentation and biological interpretation of the results. In addition, we apply a method for multi-task feature selection, which selects molecular quantities that, combined, are simultaneously predictive for an array of drugs. Such algorithms are important for selecting the optimal therapy, because they make it possible to predict the response to several drugs at once by measuring only a small set of molecular quantities.
Our proposed strategy differs from prior work in several ways. The machine learning and statistical analyses employed in the literature process the data in ways that reduce the available information, with potentially detrimental effects on both the models' predictive performance and the identification of molecular signatures (Potti et al., 2006; Augustine et al., 2009; Staunton et al., 2001). First, the estimation of the response to a drug in prior work may be suboptimal (Potti et al., 2006; Augustine et al., 2009). The response of a tumour depends, of course, on the dosage. The National Cancer Institute (NCI) has treated a panel of 60 cancer cell lines with several thousand drugs and has created a dosage-response profile for each combination of drug and tumour. Often, this profile is summarized with a single value such as the log10GI50. GI50 stands for “growth inhibition 50%”: the concentration of a given test drug that causes 50% growth inhibition at 48 hours, corrected for the cell count at time zero. In the majority of cases, NCI estimates the log10GI50 values by piecewise linear interpolation, and these estimates are then employed by all prior work (e.g., Potti et al., 2006; Ma et al., 2009; Staunton et al., 2001). In this paper, we show that estimating the log10GI50 values by fitting a sigmoid to the dosage-response profile preserves more information about the effects of the drug, which leads to statistically significantly improved predictive performance.
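To make the contrast between the two estimators concrete, the sketch below computes a log10GI50 estimate from a single dosage-response profile in both ways. It is an illustration under stated assumptions, not the NCI procedure and not the fitting method used in this paper: the function names, the five synthetic data points, the two-parameter logistic falling from 100% to 0% growth, and the grid-search fit are all hypothetical choices made to keep the example short and dependency-free.

```python
import math

def gi50_linear(log_conc, growth, level=50.0):
    """Piecewise-linear interpolation: first point where growth crosses `level`."""
    for i in range(len(log_conc) - 1):
        x0, x1 = log_conc[i], log_conc[i + 1]
        y0, y1 = growth[i], growth[i + 1]
        if (y0 - level) * (y1 - level) <= 0 and y0 != y1:
            return x0 + (level - y0) * (x1 - x0) / (y1 - y0)
    return None  # growth never crosses the level in the tested range

def sigmoid(x, midpoint, slope):
    """Assumed curve shape: logistic falling from 100% to 0% growth.
    With these fixed asymptotes, `midpoint` is the log10 GI50."""
    return 100.0 / (1.0 + math.exp(slope * (x - midpoint)))

def gi50_sigmoid(log_conc, growth):
    """Least-squares fit of the logistic by a coarse grid search
    (an illustrative stand-in for a proper nonlinear optimizer)."""
    lo, hi = min(log_conc) - 2.0, max(log_conc) + 2.0
    best = None
    for i in range(401):                      # candidate midpoints
        midpoint = lo + (hi - lo) * i / 400
        for j in range(1, 101):               # candidate slopes 0.1 .. 10.0
            slope = 0.1 * j
            sse = sum((sigmoid(x, midpoint, slope) - y) ** 2
                      for x, y in zip(log_conc, growth))
            if best is None or sse < best[0]:
                best = (sse, midpoint, slope)
    return best[1], best[2]

# Synthetic, noise-free profile generated from a known curve (hypothetical data).
log_conc = [-8.0, -7.0, -6.0, -5.0, -4.0]
growth = [sigmoid(x, -6.3, 2.0) for x in log_conc]

linear_estimate = gi50_linear(log_conc, growth)
sigmoid_estimate, fitted_slope = gi50_sigmoid(log_conc, growth)
print(linear_estimate, sigmoid_estimate, fitted_slope)
```

The interpolation uses only the two measurements that bracket the 50% level, whereas the fit uses every point of the profile, including the steepness of the curve. Because the synthetic points lie exactly on the assumed curve, the example shows the mechanics only and is not evidence about real data.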
Second, prior work typically quantizes the log10GI50 values to create classes of tumours: Potti et al. (2006) and Augustine et al. (2009) categorize tumours as sensitive or resistant, while Staunton et al. (2001) and Ma et al. (2009) categorize them as sensitive, intermediate, or resistant. This type of quantization allows the application of machine learning classification techniques, variable selection methods for classification tasks, and statistical hypothesis testing techniques for discrete outcomes. However, our computational experiments demonstrate that maintaining the exact log10GI50 values and employing regression analysis instead of classification is often preferable, as it improves chemosensitivity prediction in approximately half of the cases.
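The information lost by quantization can be seen in a small sketch. The cut-offs used here (the panel median for two classes, the mean plus or minus half a standard deviation for three) and the six log10GI50 values are assumptions chosen for illustration; they are not claimed to be the thresholds or data used in the cited studies.

```python
import statistics

def quantize(log_gi50, n_classes=2):
    """Turn continuous log10 GI50 values into class labels.
    A lower GI50 means less drug is needed, i.e. a more sensitive tumour.
    The cut-offs are illustrative assumptions, not those of the cited work."""
    if n_classes == 2:
        cut = statistics.median(log_gi50)
        return ["sensitive" if v < cut else "resistant" for v in log_gi50]
    mean = statistics.mean(log_gi50)
    half_sd = 0.5 * statistics.stdev(log_gi50)
    labels = []
    for v in log_gi50:
        if v < mean - half_sd:
            labels.append("sensitive")
        elif v > mean + half_sd:
            labels.append("resistant")
        else:
            labels.append("intermediate")
    return labels

# Hypothetical log10 GI50 values for six cell lines and one drug.
values = [-7.9, -7.2, -6.1, -6.0, -5.9, -4.8]
two_class = quantize(values, n_classes=2)
three_class = quantize(values, n_classes=3)
print(two_class)
print(three_class)
```

With two classes, the cell lines at -6.1 and -6.0 receive different labels although they differ by only 0.1, while those at -6.0 and -4.8 share a label although they differ by more than an order of magnitude in concentration. A regression model trained on the exact values sees both distinctions.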
Third, prior work often employs simple methods for identifying molecular signatures, such as selecting the top k genes that are most differentially expressed between different classes of tumours. We show that more sophisticated methods, such as the Max Min Parents and Children (MMPC) algorithm for multivariate feature selection (Tsamardinos, Brown, & Aliferis, 2006), often select more predictive signatures for the same parameter k and are preferable.
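For reference, the simple baseline described above can be sketched as a univariate ranking. The example ranks genes by the absolute Welch t-statistic between two classes; the choice of statistic, the function names, and the toy expression values are assumptions for illustration, and MMPC itself is not implemented here.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic for two samples (assumes non-zero variances)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b))

def top_k_genes(expr, labels, k):
    """Rank each gene on its own by |t| between the two classes; keep the top k.
    `expr` maps a gene name to its expression values, one per cell line."""
    scores = {}
    for gene, values in expr.items():
        sens = [v for v, lab in zip(values, labels) if lab == "sensitive"]
        res = [v for v, lab in zip(values, labels) if lab == "resistant"]
        scores[gene] = abs(welch_t(sens, res))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (hypothetical): g1_dup is nearly a copy of g1; g2 is a weaker,
# but different, source of information.
labels = ["sensitive"] * 3 + ["resistant"] * 3
expr = {
    "g1":     [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],
    "g1_dup": [1.1, 1.2, 1.0, 3.0, 3.2, 2.9],
    "g2":     [2.0, 2.5, 1.5, 2.6, 3.0, 2.2],
}
selected = top_k_genes(expr, labels, k=2)
print(selected)
```

Because each gene is scored in isolation, the two near-duplicate genes fill both slots of the signature even though the second adds little beyond the first. Multivariate methods such as MMPC instead assess a candidate in the context of other variables, which is one plausible reason they can yield more predictive signatures for the same k.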