Stage M2: Machine learning for survival data prediction

Host institution

Université de Paris, 45 Rue des Saints-Pères
75006 Paris

Contacts: Olivier Bouaziz, Vittorio Perduca

The goal of survival analysis is to analyse data where the outcome of interest is the time until an event occurs. In medical applications, for example, one frequently studies the time to occurrence of a disease (such as cancer), the time to relapse, the time to hospitalisation or the time to death. Such data are hard to analyse in practice because they are usually incompletely observed: right censoring, left truncation or interval censoring often occur due to end of follow-up or dropout, delayed entry or intermittent follow-up, respectively. As a result, censored observations require dedicated tools for inference. Classic methods include the Kaplan-Meier estimator, which estimates the survival function non-parametrically, and the Cox model, which models the hazard function given a set of covariates. See for instance Andersen et al. (1993) and Fleming & Harrington (2011) for a thorough presentation of the subject.
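To make the role of censoring concrete, the following minimal sketch computes the Kaplan-Meier estimate on right-censored data. It is written in plain Python purely for illustration; the function name `kaplan_meier` and the toy data are assumptions introduced here, not part of the internship material.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  : observed follow-up times (event time or censoring time)
    events : 1 if the event was observed, 0 if the subject was right-censored
    """
    n_events = Counter(t for t, e in zip(times, events) if e == 1)
    n_leaving = Counter(times)  # events and censorings both leave the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(n_leaving):
        d = n_events.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the conditional survival at t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving[t]
    return curve


# Hypothetical toy sample: six subjects, two of them censored (event = 0)
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"S({t}) = {s:.3f}")
```

Censored subjects never trigger a drop in the curve, but they do count in the risk set until their censoring time, which is how the estimator uses incomplete observations.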
In some applications, the goal is not to explain the effect of covariates on the risk of the event but to learn how to predict the time to event from covariates as accurately as possible. In this perspective, it is tempting to turn to machine learning tools. However, such methods cannot be applied directly to survival data, because part of the observations in the learning sample is typically missing as a consequence of the aforementioned censoring. In recent years, a few machine learning methods have been adapted to deal with censored data, including survival trees and random forests (Ishwaran et al., 2008), deep survival models (Katzman et al., 2018), the super learner for survival data (Golmakani & Polley, 2020) and others (Wang et al., 2019).
Machine learning prediction algorithms such as those based on random forests and deep learning are often described as black boxes, because the internal mechanisms that produce the final prediction are opaque to users and the output is not easy to interpret (Murdoch et al., 2019). On the one hand, such algorithms often give very accurate predictions, outperforming more interpretable models such as linear models or, in the survival setting, Cox models. On the other hand, their lack of interpretability might undermine the trust of users and therefore their applicability in real situations. Several methods have recently been introduced to explain how black-box models use covariates to predict the outcome, including partial dependence plots (PDP) and their extensions, local surrogate models (LIME) and Shapley values (see for instance Molnar, 2020, for a friendly introduction to these and other methods). Some of these methods have also been adapted to explain the importance of variables in machine learning methods for survival data, for instance partial plots for survival random forests (Ishwaran & Kogalur, 2007).
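As an illustration of the idea behind a partial dependence plot, the sketch below averages the predictions of a black-box model while one covariate is forced to each value of a grid. It is a plain-Python sketch under stated assumptions: `partial_dependence`, `toy_model` and the covariates `age` and `smoker` are hypothetical names chosen for the example, and the toy model stands in for any fitted predictor.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the sample when `feature` is set to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original sample is untouched
            modified[feature] = value  # force the covariate of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    """Hypothetical black-box risk score (stands in for a fitted model)."""
    return 0.02 * row["age"] + (0.5 if row["smoker"] else 0.0)


# Hypothetical sample of two subjects
rows = [{"age": 40, "smoker": True}, {"age": 60, "smoker": False}]
pdp = partial_dependence(toy_model, rows, "age", [30, 50, 70])
for value, mean_prediction in pdp:
    print(value, round(mean_prediction, 3))
```

Plotting the returned pairs gives the partial dependence curve for the chosen covariate. The other covariates keep their observed values, so the curve reflects an average effect over the sample.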

Internship goals
During the internship, the intern will be asked to:
• review the state of the art of machine learning methods for survival data prediction,
• implement and test several machine learning methods on simulated data (random forests, deep learning, super learner, etc.),
• use prediction assessment tools, such as the C-index (see Harrell et al., 1982; Harrell Jr et al., 1984) or the Brier score (see Gerds & Schumacher, 2006), to compare the performance of the different estimators,
• understand and implement interpretability methods used for survival data prediction (such as PDP, LIME or partial plots).
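To give an idea of how a prediction assessment tool handles censoring, here is a minimal sketch of Harrell's C-index. It is written in plain Python for compactness and only as an illustration. The function name `concordance_index` and the toy data are assumptions, and pairs with tied observed times are simply skipped, which is a simplification.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by the risk score.

    A pair (i, j) is comparable when subject i has an observed event
    and its time is strictly smaller than the observed time of subject j.
    A higher risk score is expected to go with a shorter time to event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count for one half
    if comparable == 0:
        raise ValueError("no comparable pairs in the sample")
    return concordant / comparable


# Hypothetical toy sample: six subjects, two censored, with predicted risk scores
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
risk_scores = [0.9, 0.7, 0.8, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk_scores)
print(c_index)  # 9 of the 10 comparable pairs are concordant
```

A value of 0.5 corresponds to random ranking and 1 to perfect ranking, so the index can be used to compare estimators on the same sample.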
All implementations will be carried out in R. The internship may lead to a PhD in which, depending on the background and interests of the intern, different extensions might be considered. In particular, the methods reviewed during the internship will be applied to a real dataset on primary immunodeficiencies (PIDs), which are inherited diseases associated with a considerable increase in susceptibility to infections. This dataset comes from the CEREDIH research group, based at Hôpital Necker. Since PIDs predispose to several diseases (such as cancer and immune diseases, including allergy, autoimmunity and inflammation), a major interest of the methods studied during the internship is to predict the risk of future diseases for patients with PIDs. The development of new interpretability methods for survival data is also of interest. Theoretical extensions might also be considered, such as the development of theoretical results for super learner methods in survival analysis.

Required profile
The successful candidate will have proven knowledge of applied statistics and machine learning. A basic understanding of survival analysis will be considered a plus. We will also consider candidates with a strong background in life sciences, medicine or public health, provided they show high motivation and commitment to the subject. Proficiency in English and proven writing skills will be considered an asset.

Practical details
The internship will last at most 6 months and should begin in the first quarter of 2021. It will take place at the MAP5 Laboratory, located on the Saint-Germain campus of the Université de Paris (45 rue des Saints-Pères, Paris). Depending on the public health situation, all or part of it might take place remotely through regular online meetings.

References
Andersen, P. K., Borgan, Ø., Gill, R. D. & Keiding, N. (1993). Statistical models based on counting processes. Springer Series in Statistics. New York: Springer-Verlag.
Fleming, T. R. & Harrington, D. P. (2011). Counting processes and survival analysis, vol. 169. John Wiley & Sons.
Gerds, T. A. & Schumacher, M. (2006). Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biometrical Journal 48, 1029–1040.
Golmakani, M. K. & Polley, E. C. (2020). Super learner for survival data prediction. The International Journal of Biostatistics 1.
Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. (1982). Evaluating the yield of medical tests. JAMA 247, 2543–2546.
Harrell Jr, F. E., Lee, K. L., Califf, R. M., Pryor, D. B. & Rosati, R. A. (1984). Regression modelling strategies for improved prognostic prediction. Statistics in Medicine 3, 143–152.
Ishwaran, H. & Kogalur, U. B. (2007). Random survival forests for R. R News 7, 25–31.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer, M. S. et al. (2008). Random survival forests. The Annals of Applied Statistics 2, 841–860.
Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T. & Kluger, Y. (2018). DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 18, 24.
Molnar, C. (2020). Interpretable Machine Learning.
Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences 116, 22071–22080.
Wang, P., Li, Y. & Reddy, C. K. (2019). Machine learning for survival analysis: A survey. ACM Computing Surveys (CSUR) 51, 1–36.
