Data Science - Multi-omic clustering and data integration

Host institution

CEA Saclay

Stéphane GAZUT, PhD


Multi-omics approaches analyze biological samples using several kinds of ‘omics’ measurements, such as genomics, proteomics, transcriptomics, and metabolomics. One of their main objectives is to identify more efficient and robust biological signatures by simultaneously considering several sources of data obtained from the same samples. Combining these heterogeneous data should improve patient stratification and patient care. Indeed, results from international consortia have recently shown that, for instance, integrating genomic and proteomic data enabled the identification of new breast cancer subtypes that could not be found using genomic analysis alone. Many methods have been developed to analyze multiple omics datasets, and they often rely on multiblock analysis [1], [2]. Reference [3] lists the main methods and tools for multi-omics datasets.


The objective of the project is to develop statistical methods for multi-omics data integration based on multivariate and multi-block analysis. The developments will also take into account additional biological information from systems biology, either to define the most relevant blocks or to integrate the structure between variables into the sparsity constraints used for feature selection [4, 5, 6]. We will use datasets from the TCGA and CPTAC consortia to evaluate the developments. A preliminary task of the project will be dedicated to multi-omics clustering (unsupervised approaches) in order to identify subsets of data that are relevant for the signature identification task [7].


Applicants should have a PhD in data science or applied statistics (data analysis, machine learning, feature selection…) and be interested in multidisciplinary projects (data science and biology). Knowledge of biology would be highly appreciated.


[1] Bouveresse D. et al., Identification of significant factors by an extension of ANOVA-PCA based on multi-block analysis, Chemometrics and Intelligent Laboratory Systems, 2011

[2] Meng C. et al., moCluster: Identifying Joint Patterns Across Multiple Omics Data Sets, Journal of Proteome Research, 2016

[3] Bersanelli M. et al., Methods for the integration of multi-omics data: mathematical aspects, BMC Bioinformatics, 2016

[4] Jenatton R. et al., Structured Variable Selection with Sparsity-Inducing Norms, Journal of Machine Learning Research, 2011

[5] Safo S. et al., Integrative analysis of transcriptomic and metabolomics data via sparse canonical correlation analysis with incorporation of biological information, Biometrics, 2018

[6] Löfstedt T. et al., A general multiblock method for structured variable selection, arXiv, 2016

[7] Rappoport N. et al., Multi-omic and multi-view clustering algorithms: review and cancer benchmark, Nucleic Acids Research, 2018

Team with SFBI institutional membership
Non-member team