Medical Survival Analysis Through Transduction of Semi-Supervised Regression Targets

Faisal M. Khan, Qiuhua Liu
Copyright: © 2011 | Pages: 14
DOI: 10.4018/jkdb.2011070104


A crucial challenge in predictive modeling for survival analysis applications such as medical prognosis is accounting for censored observations in the data. While these time-to-event predictions inherently represent a regression problem, traditional regression approaches are challenged by the censored characteristics of the data. In such problems the true target times of a majority of instances are unknown; what is known is a censored target representing some indeterminate time before the true target time. While censored samples can be considered as semi-supervised targets, the current limited efforts in semi-supervised regression do not take into account the partial nature of unsupervised information; samples are treated as either fully labeled or unlabeled. This paper presents a novel semi-supervised learning approach where the true target times are approximated from the censored times through transduction. The method can be employed to transform traditional regression methods for survival analysis, or to enhance existing state-of-the-art survival analysis methods for improved predictive performance. The proposed approach represents one of the first applications of semi-supervised regression to survival analysis and yields a significant improvement in performance over the state-of-the-art in prostate and breast cancer prognosis applications.
Article Preview


Predictive time-to-event modeling in medical survival analysis usually serves one of two broad purposes. The first is prognostic: developing models for how a certain disease will progress. The purpose of such models includes understanding disease processes and predicting how new patients will behave in the context of existing data. Examples include predicting which prostate cancer patients will recur so that therapy can be initiated early (Donovan et al., 2009) or identifying which group of patients will benefit more from a certain therapy. The second purpose is factor analysis: analyzing disease processes and exploring interaction effects between disease factors. An example is determining whether a potentially significant gene remains relevant when combined with other factors in a multivariate setting (Donovan et al., 2009), in order to possibly prioritize and identify candidate genes for targeted therapeutic drug development.

While time-to-event prediction is inherently a regression problem, it challenges computational modeling approaches because healthcare data in such settings is characterized by censored and non-censored (event) observations. Healthcare data used in such prognostic modeling is usually obtained from tracking patients over the course of a well-designed study, perhaps lasting years. Contrary to traditional regression problems, the information for most observations is incomplete and only known “up-to-a-point.” Patients who have experienced the endpoint of interest (cancer remission, recurrence, etc.) during their follow-up are considered non-censored, or events. Patients who did not experience the endpoint during the study or were lost to follow-up for any cause (e.g., the patient moved during a multi-year study) are considered censored. All that is known about them is that they were disease free up to a certain point, but what subsequently occurred is unknown. For a censored patient described by a d-dimensional vector $x_i \in \mathbb{R}^d$, the observed time $S_i$ is called the censoring time. For such individuals, it is only known that they survived for at least time $S_i$. The actual target $T_i$ is unknown for censored cases, thus $S_i < T_i$. An important assumption is that $T_i$ and $S_i$ are independent conditional on $x_i$, i.e., the cause for censoring is independent of the survival time. With an indicator function $\delta_i$ which is 0 if an event occurred and 1 if the observation is censored, the available training data can be summarized for $N$ patients as $D = \{T_i, x_i, \delta_i\}_{i=1}^{N}$ (Raykar et al., 2008).
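As an illustration of the data layout described above, the following is a minimal Python sketch of how censored and event observations might be stored together. The class name `Observation`, the helper `split_by_censoring`, and the toy cohort values are hypothetical and not taken from the article; the single `time` field holds the event time for an event and the censoring time for a censored case, which is an assumption about storage made only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    """One patient record (hypothetical layout for illustration)."""
    x: Sequence[float]  # d-dimensional feature vector x_i
    time: float         # event time if delta == 0, censoring time S_i if delta == 1
    delta: int          # 0 = event occurred, 1 = censored (convention used in the text)


def split_by_censoring(
    data: Sequence[Observation],
) -> Tuple[List[Observation], List[Observation]]:
    """Separate event observations from censored ones."""
    events = [o for o in data if o.delta == 0]
    censored = [o for o in data if o.delta == 1]
    return events, censored


# Toy cohort with made-up values: one event, two censored patients.
cohort = [
    Observation(x=(0.2, 1.5), time=14.0, delta=0),
    Observation(x=(0.9, 0.3), time=60.0, delta=1),
    Observation(x=(0.4, 2.1), time=33.0, delta=1),
]

events, censored = split_by_censoring(cohort)
```

For the censored records, `time` is only a lower bound on the true target, so a regression method cannot use it directly as a label.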

Censored observations contribute incomplete information, as the event of interest may occur after they were lost to follow-up. Both simply omitting the censored observations (Burke et al., 1997; Shivaswamy, Chu, & Jansche, 2007) and treating them as non-recurring samples in a classifier (Snow, Smith, & Catalona, 1997) bias the resulting model and should be avoided.
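The bias from omitting censored observations can be seen in a small simulation. The sketch below is not from the article: it assumes exponentially distributed event and censoring times that are independent of each other, and the function name `simulate` and all parameter values are hypothetical. It compares the mean of the true event times with the mean computed only from patients whose event was observed before censoring.

```python
import random


def simulate(n=20000, mean_event=10.0, mean_censor=10.0, seed=0):
    """Return (true mean event time, mean over non-censored patients only)."""
    rng = random.Random(seed)
    true_times = []
    observed_event_times = []
    for _ in range(n):
        t = rng.expovariate(1.0 / mean_event)   # true event time T_i
        s = rng.expovariate(1.0 / mean_censor)  # censoring time S_i
        true_times.append(t)
        if t <= s:
            # Event happened before censoring, so T_i is observed.
            observed_event_times.append(t)
        # Otherwise the patient is censored and is dropped by the naive approach.
    true_mean = sum(true_times) / len(true_times)
    naive_mean = sum(observed_event_times) / len(observed_event_times)
    return true_mean, naive_mean


true_mean, naive_mean = simulate()
print(f"true mean: {true_mean:.2f}, mean after dropping censored: {naive_mean:.2f}")
```

Under these assumptions the retained patients are disproportionately those with short event times, so the naive mean falls well below the true mean. Long survivors are more likely to be censored and are therefore discarded.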
