M2 internship in machine learning for multi-omics data integration

Internship · M2 internship · 6 months · Bac+5 / Master · Centre National de Recherche en Génomique Humaine (CNRGH), CEA · Évry-Courcouronnes (France)

Start date: 1 March 2026

Keywords

Machine Learning · Deep Learning · Statistics · Genomics · Multi-omics

Description

Context.

In cancer, aberrations accumulating at multiple molecular levels are the source of the many differences observed between somatic tumor and normal cells. Abnormalities in DNA may include an increased number of mutations, differentially methylated sites (epigenetic markers), or copy number variations (different numbers of copies of a chromosome segment in a cell). Such modifications have an impact on gene expression, which in turn affects proteins. Studying these molecular data (namely omics) separately is often not enough to understand the underlying dysregulation. This led to the establishment of multi-omics studies, in the hope that looking jointly at all molecular layers would reveal the big picture. From a statistical point of view, this should increase power: combining multiple small effects that are spread across several omic modalities but jointly explain the same phenomenon increases the signal-to-noise ratio. However, to achieve this, the high dimensionality of such data (more than 20,000 coding genes) has to be handled to avoid estimating spurious associations. A large number of multi-omics analysis methods have therefore been developed.
Among the tasks addressed with multi-omics data, survival analysis consists of estimating the duration between a patient's initial diagnosis and their death. Such analysis can identify groups of patients with different prognoses, distinguished by a molecular (omic) signature. Clinicians can further investigate such signatures to develop new treatments or to better adapt therapies to the molecular specificities of a given cancer. This is one way of performing precision medicine.
Despite the promise of multi-omics data, their benefit in the field of cancer survival analysis remains limited. In an insightful study, 12 survival analysis methods were compared on 18 cancer datasets analyzed separately. The results aggregated across all cancers showed that only two methods using both clinical and molecular data performed better (though not significantly) than a reference model using only clinical data. Adding Deep Learning methods in a follow-up study did not change the conclusions. In ongoing work, we added joint Dimension Reduction (jDR) methods to the comparison. These methods estimate a reduced space that represents well the commonalities between omic layers.

We hypothesized that estimating such a joint reduced space prior to survival analysis would improve prediction by better handling the high dimensionality of the data. After aggregating the results across all cancers, preliminary results identified two jDR methods, using both clinical and omics data, that statistically outperformed the reference model using clinical data only. Further improvement in the performance of these methodologies may be expected.
However, we are still far from identifying robust candidate multi-omics biomarkers to be further investigated in clinical trials. This could mean that the dimensionality of the data is simply too high to construct good prediction models. We identified two major ways to better handle this curse of dimensionality. First, all the studies mentioned above deal with complete data only, i.e. a subject with at least one missing omic modality is excluded from the analysis. This strategy is known to be suboptimal, and because it shrinks the sample size it can further exacerbate the curse of dimensionality. Second, the issue can also be alleviated by injecting information into the targeted dataset, either (i) by making use of prior knowledge or (ii) through Transfer Learning (TL). The limitation of (i) is that the model must integrate reliable, robust prior knowledge, which is not always available, especially for rare diseases, which are typically poorly characterized and supported by very limited sample sizes. Transfer Learning, on the other hand, aims at extracting this prior knowledge from a Source dataset and transferring it to the dataset of interest, called the Target, in order to learn a new task on it faster (i.e. with fewer observations). A common practice is to train a model on the Source and then fine-tune it on the Target. However, for this transfer to work, the two datasets must be related.
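
As an illustration of the "train on the Source, then fine-tune on the Target" practice, here is a minimal sketch using a linear Cox proportional hazards model fitted by gradient descent. Everything in it is an assumption made for the example: the simulated data, the function names (cox_loss_and_grad, fit_cox, simulate), the learning rates and the number of steps. It is not the methodology of the studies cited above, and it ignores tied event times.

```python
import numpy as np

def cox_loss_and_grad(beta, X, time, event):
    """Negative Cox partial log-likelihood (no tie correction) and its gradient."""
    order = np.argsort(-time)            # descending time: risk set of row i is rows 0..i
    X, event = X[order], event[order]
    eta = X @ beta
    shift = eta.max()                    # shift for numerical stability
    w = np.exp(eta - shift)
    cum_w = np.cumsum(w)
    cum_wx = np.cumsum(w[:, None] * X, axis=0)
    n_events = max(event.sum(), 1)
    loss = -np.sum(event * (eta - np.log(cum_w) - shift)) / n_events
    grad = -(event[:, None] * (X - cum_wx / cum_w[:, None])).sum(axis=0) / n_events
    return loss, grad

def fit_cox(X, time, event, beta_init=None, lr=0.5, n_steps=200):
    """Plain gradient descent; beta_init allows starting from Source weights."""
    beta = np.zeros(X.shape[1]) if beta_init is None else beta_init.copy()
    for _ in range(n_steps):
        _, grad = cox_loss_and_grad(beta, X, time, event)
        beta -= lr * grad
    return beta

def simulate(n, beta_true, rng):
    """Toy survival data: hazard exp(X @ beta_true), independent censoring."""
    X = rng.normal(size=(n, beta_true.size))
    t = rng.exponential(scale=np.exp(-X @ beta_true))
    c = rng.exponential(scale=2.0, size=n)
    return X, np.minimum(t, c), (t <= c).astype(float)

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
X_src, t_src, e_src = simulate(2000, beta_true, rng)        # large Source
X_tgt, t_tgt, e_tgt = simulate(40, beta_true + 0.1, rng)    # small, related Target

beta_src = fit_cox(X_src, t_src, e_src)                     # pre-training on the Source
beta_ft = fit_cox(X_tgt, t_tgt, e_tgt, beta_init=beta_src,  # short fine-tuning on the Target
                  lr=0.05, n_steps=20)
print("Source weights:    ", np.round(beta_src, 2))
print("Fine-tuned weights:", np.round(beta_ft, 2))
```

The fine-tuning step only differs from training from scratch by its starting point and its small number of low-learning-rate updates, which is what lets the small Target sample adjust the Source solution rather than re-estimate it.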

General Goal.

The objective of this internship is to study models able both to deal with missing data and to perform Transfer Learning, in order to tackle the curse of dimensionality in multi-omics cancer survival studies. These methodologies will be evaluated in particular in the context of rare cancers (fewer than 6 to 15 new cases per 100,000 people per year, yet 22-27% of diagnosed cancers and 25% of cancer mortality), which could benefit the most from these approaches.

Tasks.

To achieve this goal, the first task of this internship will be to perform a benchmark study on the largest public multi-omic cancer dataset, The Cancer Genome Atlas (TCGA), which gathers 33 cancer types for more than 11,000 patients across 8 modalities. Following previous TL studies working with complete data, all cancer types but one will compose the Source dataset and the remaining one will act as the Target rare cancer. This setting builds on preliminary studies showing that information is shared across cancers through multiple omics data. This would make it possible to learn a general "cancer knowledge" transferable to a targeted cancer. This setting will be repeated for several Target cancers to draw robust conclusions. Furthermore, different missing data situations will be manually generated in the Source, the Target or both. Although jDR methods have already been shown to outperform the others when a Target cancer is analyzed alone, in this study the Source dataset will contain enough observations for classical Machine Learning and Deep Learning methods to be expected to perform comparably. Hence, both jDR methods and Variational Auto-Encoders, their equivalent within a Deep Learning framework, will be evaluated in this benchmark.
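
The leave-one-cancer-out setting and the manually generated missingness described above could be set up roughly as follows. This is only a sketch under stated assumptions: the function names (source_target_split, drop_modalities), the cancer labels, the modality names and the missing-completely-at-random mechanism are all hypothetical choices for the example, not the protocol of the benchmark.

```python
import numpy as np

def source_target_split(cancer_type, target):
    """Boolean masks: Target = one cancer type, Source = all the others."""
    cancer_type = np.asarray(cancer_type)
    is_target = cancer_type == target
    return ~is_target, is_target

def drop_modalities(n_samples, modalities, missing_rate, rng):
    """Availability mask (samples x modalities), modalities removed at random.

    Each sample is guaranteed to keep at least one modality.
    """
    available = rng.random((n_samples, len(modalities))) >= missing_rate
    empty = ~available.any(axis=1)
    # Give back one randomly chosen modality to samples that lost everything.
    available[empty, rng.integers(len(modalities), size=empty.sum())] = True
    return available

rng = np.random.default_rng(0)
cancer_type = rng.choice(["A", "B", "C", "D"], size=200)      # hypothetical labels
modalities = ["expression", "methylation", "copy_number"]     # hypothetical names

for target in np.unique(cancer_type):
    src, tgt = source_target_split(cancer_type, target)
    avail_tgt = drop_modalities(tgt.sum(), modalities, missing_rate=0.3, rng=rng)
    n_complete = avail_tgt.all(axis=1).sum()   # what a complete-case analysis would keep
    print(f"Target {target}: {src.sum()} Source samples, "
          f"{tgt.sum()} Target samples, {n_complete} complete Target samples")
```

Comparing the number of complete Target samples with the total Target size shows how many subjects a complete-case analysis would discard for a given missing rate.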

In a second step, this analysis will be applied to an adult rare cancer multi-omic cohort provided by Dr. Agusti ALENTORN. This cohort on Primary Central Nervous System Lymphoma gathers 147 clinical, 123 transcriptomic, 115 Whole Exome Sequencing and only 64 methylation profiles.

Required Profile.

  • M2 or final year of engineering school, with a specialty in or knowledge of Computer Science / Statistics / Machine Learning / Deep Learning / Biostatistics.
  • Working knowledge of programming (R / Python, …).
  • Previous experience with applications to genomics would be a plus.

Supervisors.

Mary SAVINO, Arnaud GLOAGUEN and Arthur TENENHAUS.

Date.

March 2026 (flexible).

Funding.

This internship will be partly funded by DataIA.

References.

Sun, W. et al. The association between copy number aberration, DNA methylation and gene expression in tumor samples. Nucleic Acids Research 46, 3009–3018 (2018).
Cantini, L. et al. Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nat Commun 12, 124 (2021).
Herrmann, M. et al. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 22, bbaa167 (2021).
Wissel, D. et al. Systematic comparison of multi-omics survival models reveals a widespread lack of noise resistance. Cell Rep Methods 3, 100461 (2023).
Goff, V. et al. Impact of joint Dimension Reduction methods for survival prediction - Extension of a multi-omics benchmark study. in (2024).
Flores, J. E. et al. Missing data in multi-omics integration: Recent advances through artificial intelligence. Front Artif Intell 6, 1098308 (2023).
Boyd, N. et al. Rare cancers: a sea of opportunity. Lancet Oncol 17, e52–e61 (2016).
Chai, H. et al. Predicting bladder cancer prognosis by integrating multi-omics data through a transfer learning-based Cox proportional hazards network. CCF Trans. HPC 3, 311–319 (2021).
Li, Y. et al. Transfer Learning for Survival Analysis via Efficient L2,1-Norm Regularized Cox Regression. in 2016 IEEE 16th International Conference on Data Mining (ICDM) 231–240.
Sato, G. et al. Pan-cancer and cross-population genome-wide association studies dissect shared genetic backgrounds underlying carcinogenesis. Nat Commun 14, 3671 (2023).
Li, Y. et al. Pan-cancer proteogenomics connects oncogenic drivers to functional states. Cell (2023).
Hanczar, B. et al. Assessment of deep learning and transfer learning for cancer prediction based on gene expression data. BMC Bioinformatics 23, 262 (2022).
Hirst, D. P. et al. MOTL: enhancing multi-omics matrix factorization with transfer learning. Genome Biology 26, 224 (2025).
Benkirane, H. et al. Multimodal CustOmics: A unified and interpretable multi-task deep learning framework for multimodal integrative data analysis in oncology. PLOS Computational Biology (2025).
Ranjbari, S. et al. Integration of incomplete multi-omics data using Knowledge Distillation and Supervised Variational Autoencoders for disease progression prediction. Journal of Biomedical Informatics 147, 104512 (2023).
Hernández-Verdin, I. et al. Molecular and clinical diversity in primary central nervous system lymphoma. Ann Oncol 34, 186–199 (2023).

Application

Procedure: Send your résumé and a cover letter to Mary Savino and Arnaud Gloaguen.

Deadline: 30 June 2026

Contacts

 Mary SAVINO
 msNOSPAMavino@cnrgh.fr

 Arnaud GLOAGUEN
 agNOSPAMloague@cng.fr

Offer published on 20 December 2025, displayed until 1 March 2026