PhD in Machine Learning / Statistics applied to Genomics data analysis

 Fixed-term contract (CDD) · PhD thesis · 36 months · Bac+5 / Master's degree · CEA / Centre National de Recherche en Génomique Humaine (CNRGH) · Evry (France) · Yes

 Start date: 2 October 2023

Keywords

Machine Learning, Statistics, Genomics, Multi-omics

Description

Integration of a priori biological knowledge in multi-omics analysis methods to develop more explainable and interpretable models

 


     Theme: Machine Learning/Statistics/Genomics/Multi-omics.
     Duration: 36 months (3 years).
     Location: National Center for Human Genome Research (CNRGH) in Evry (91000).
     Contact: Arnaud GLOAGUEN (agloague@cng.fr) and Edith LE FLOCH (lefloch@cng.fr).
     Start date: October 2023.

Context

To fully grasp the complexity of a disease, biologists have access to a wide variety of measurements on the human genome, each shedding light on a particular aspect of the underlying molecular mechanisms. Studied separately, these measurements are often not enough to understand the dysregulation at work. This has led to multi-omics studies, which merge the information coming from different modalities (omics) observed on the same set of genomes in the hope of capturing a meaningful signal.

With the rise of high-throughput sequencing, multi-omics studies, and the expectation that they will finally decipher the molecular mechanisms underlying complex diseases, have steadily moved from a mere concept 20 years ago to a standard design in today's experiments. Yet analysing such data is hard: each omics suffers from high dimensionality (the human genome is composed of approximately 20,000 genes) and potentially missing observations, which, given the low number of samples, call for a better strategy than discarding samples with missing values. At the single-omics level, analysing such data already requires state-of-the-art Machine Learning techniques to overcome these issues. Analysing several omics jointly, to unravel commonalities or interactions that would better explain a diagnosis, is even harder and requires new techniques that push the boundaries of Artificial Intelligence. This has been the focus of the last decade, during which a large number of multi-omics methods have appeared in the literature (Hesami et al., 2022). Very recently, benchmarks have been published to assess the capabilities of all these tools (Cantini et al., 2021; Herrmann et al., 2021; Meng et al., 2016; Pierre-Jean et al., 2020; Rappoport and Shamir, 2018).

General Goal

Starting from there, the goal of this thesis is to explore an avenue that we believe has been overlooked: including more prior biological knowledge in existing tools, which are usually generic and can be used in other application fields.

Objectives

Several avenues can be investigated to include biological knowledge in existing tools. The first is to include group knowledge at the variable level. This would both lead to more interpretable results and allow the model to select groups of variables, which is an interesting way of dealing with high dimensionality. For example, genes could be aggregated into groups corresponding to biological pathways according to classical databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016) and the Human Metabolome DataBase (HMDB) (Wishart et al., 2013). This prior group knowledge can be included in several multi-omics methods through different strategies: either by dividing each omics into as many matrices as there are pathways, similarly to Garali et al. (2018), or by encoding the group structure in an appropriate penalty term in the model, as done in Du et al. (2018), Guillemot et al. (2021) and Löfstedt et al. (2016).
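As an illustration of the penalty-based strategy, the sketch below implements one common choice of group penalty, the group lasso, together with its proximal operator (block soft-thresholding) for non-overlapping groups. It is a minimal sketch under stated assumptions, not the penalty used in the cited works: the pathway names, the variable indices and the value of `lam` are hypothetical, and real pathway databases such as KEGG define overlapping groups, which this simple version does not handle.

```python
import math

def group_lasso_penalty(weights, groups, lam):
    """Return lam * sum over groups of sqrt(group size) * L2 norm of the group."""
    total = 0.0
    for idx in groups.values():
        norm = math.sqrt(sum(weights[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total

def group_soft_threshold(weights, groups, lam):
    """Proximal operator of the penalty above, for non-overlapping groups.

    Each group is shrunk towards zero as a whole; a group whose norm is
    below lam * sqrt(group size) is set exactly to zero, which is what
    lets a model discard entire pathways.
    """
    out = list(weights)
    for idx in groups.values():
        norm = math.sqrt(sum(weights[j] ** 2 for j in idx))
        threshold = lam * math.sqrt(len(idx))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * weights[j]
    return out

# Hypothetical example: five gene-level weights grouped into two pathways.
pathways = {"pathway_A": [0, 1], "pathway_B": [2, 3, 4]}
w = [3.0, 4.0, 0.1, -0.1, 0.05]

penalty = group_lasso_penalty(w, pathways, lam=1.0)
shrunk = group_soft_threshold(w, pathways, lam=1.0)
print(penalty)
print(shrunk)  # pathway_A is shrunk but kept, pathway_B is zeroed out
```

In this toy example the weak pathway is removed as a block while the strong one is only shrunk, which is the group-level variable selection described above.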

The second is to enlarge the list of methods in current benchmarks, especially in a subcategory of models called joint Dimension Reduction (jDR) models. jDR aims at estimating a lower-dimensional space describing the joint information between omics. However, in almost all benchmarks, the jDR methods compared require this information to be shared by all omics. This assumption helps gain statistical power by looking for phenomena common to all modalities; however, when mechanisms are shared by only a few omics, it may be too constraining and leave these methods unable to recover them. Smilde et al. (2022) distinguish between jDR methods that estimate a lower-dimensional space based only on Common (C) information across omics, on Common and Distinct (CD) information, or on Common, Distinct and Local (CDL) information, where Local means local to a subset of omics. This last category is a recent and active field, with methods such as those of Lock et al. (2022), Park and Lock (2020), Samorodnitsky et al. (2022) and Yi et al. (2022), which we wish to include in current benchmark studies. In fact, combining the two approaches, i.e. inserting a group structure and extracting CDL information, is even more interesting as they are complementary. It would make it possible, for example, to extract subgroups of omics describing a specific interaction based on a small number of biological pathways.
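To make the Common / Distinct / Local vocabulary concrete, the following minimal sketch labels a latent component according to the omics blocks on which it carries non-negligible loadings. The block names, the loading norms and the tolerance `tol` are hypothetical, and the function only illustrates the terminology; it is not one of the cited methods.

```python
def classify_component(block_norms, tol=1e-8):
    """Label a component from the norm of its loadings in each omics block.

    block_norms maps an omics name to the norm of the component's loadings
    in that block. With a single block, "common" takes precedence.
    """
    active = sorted(name for name, norm in block_norms.items() if norm > tol)
    if not active:
        return "empty", active
    if len(active) == len(block_norms):
        return "common", active      # carried by all omics
    if len(active) == 1:
        return "distinct", active    # specific to one omics
    return "local", active           # shared by a strict subset of omics

# Hypothetical loading norms of three components over three omics blocks.
components = {
    "comp_1": {"transcriptome": 1.2, "methylome": 0.8, "proteome": 0.5},
    "comp_2": {"transcriptome": 0.9, "methylome": 0.7, "proteome": 0.0},
    "comp_3": {"transcriptome": 0.0, "methylome": 0.0, "proteome": 1.1},
}
labels = {name: classify_component(norms)[0] for name, norms in components.items()}
print(labels)
```

Under this reading, a C method can only represent components like `comp_1`, a CD method also those like `comp_3`, and a CDL method all three.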

Finally, the last objective is to propose new multi-omics methods that make use of more biological information. A major avenue to be explored consists in adding constraints to the models by specifying that some variables, even though measured in different omic modalities, are located in the same genomic region, which is almost never taken into account in multi-omics methods. A first way to integrate this knowledge is to define a common scale for all variables, for example the gene scale, and to aggregate all omics to this same scale. Each omics would then be represented by two common dimensions, the sample dimension and the gene dimension, making it possible to work with thoroughly studied mathematical objects called tensors (Acar and Yener, 2009; Kolda and Bader, 2009).
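A minimal sketch of this aggregation step is given below. It assumes that each omics is a samples x features matrix whose columns are annotated with a gene, that features are aggregated by a simple mean (one choice among many), and that every gene is covered by at least one feature in every omics. Gene names, omics and values are hypothetical.

```python
from statistics import mean

def aggregate_to_genes(matrix, feature_genes, genes):
    """Average the columns of a samples x features matrix gene by gene.

    matrix        : list of rows, one row per sample
    feature_genes : gene annotation of each column
    genes         : ordered list of genes defining the common scale
    """
    columns = {g: [j for j, fg in enumerate(feature_genes) if fg == g] for g in genes}
    return [[mean(row[j] for j in columns[g]) for g in genes] for row in matrix]

def stack_omics(gene_matrices):
    """Stack samples x genes matrices into a samples x genes x omics tensor."""
    n_samples = len(gene_matrices[0])
    n_genes = len(gene_matrices[0][0])
    return [
        [[m[i][g] for m in gene_matrices] for g in range(n_genes)]
        for i in range(n_samples)
    ]

genes = ["GENE1", "GENE2"]

# Hypothetical expression data: already one column per gene (2 samples).
expression = [[5.0, 1.0], [3.0, 2.0]]
# Hypothetical methylation data: two probes on GENE1, one on GENE2.
methylation = [[0.25, 0.75, 1.0], [0.5, 1.0, 0.0]]
probe_genes = ["GENE1", "GENE1", "GENE2"]

tensor = stack_omics([
    aggregate_to_genes(expression, genes, genes),
    aggregate_to_genes(methylation, probe_genes, genes),
])
print(tensor)  # indexed as tensor[sample][gene][omics]
```

Once all omics share the sample and gene dimensions, the resulting three-way array can be handed to tensor decomposition methods such as those surveyed in the cited references.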

Deliverables

All these developments would build on current state-of-the-art benchmarks in order to systematically evaluate the effect of integrating a new biological prior into existing models. We believe that extracting information shared by a subset of omics modalities, combined with prior knowledge on gene pathways or genomic locations, would lead to more robust and more interpretable models. Each step will be published in high-impact journals, and comparison results will be made reproducible for the community, for instance in the form of an R package, so that others can easily use them to test their own developments.

Ultimately, these developments would be evaluated in the context of exploratory analysis on open datasets such as TCGA or on various collaborative projects of the CNRGH, such as the France Genomic Medicine Plan (Sanlaville, 2022) for personalized medicine in the field of cancer, or the PROPSY project, recently selected in the 2022 call for proposals for Prioritized Exploratory Research Projects, Programs and Equipment, which aims at identifying new biomarkers for four major mental disorders: Autism, Schizophrenia, Major Depressive Disorder and Bipolar Disorder.

Required Profile

     - M2 (Master's degree) or engineering school, with a specialty in or knowledge of Computer Science / Statistics / Machine Learning / Biostatistics.
     - Working knowledge of programming (R, Python, ...).
     - Previous experience with applications to genomics will be a plus.

References

Acar, E. and Yener, B. (2009). Unsupervised Multiway Data Analysis: A Literature Survey. IEEE Transactions on Knowledge and Data Engineering, 21(1):6–20.

Cantini, L., Zakeri, P., Hernandez, C., Naldi, A., Thieffry, D., Remy, E., and Baudot, A. (2021). Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer. Nature Communications, 12(1):124.

Du, L., Liu, K., Zhang, T., Yao, X., Yan, J., Risacher, S. L., Han, J., Guo, L., Saykin, A. J., Shen, L., and Alzheimer’s Disease Neuroimaging Initiative (2018). A novel SCCA approach via truncated 1-norm and truncated group lasso for brain imaging genetics. Bioinformatics (Oxford, England), 34(2):278–285.

Garali, I., Adanyeguh, I. M., Ichou, F., Perlbarg, V., Seyer, A., Colsch, B., Moszer, I., Guillemot, V., Durr, A., Mochel, F., and Tenenhaus, A. (2018). A strategy for multimodal data integration: application to biomarkers identification in spinocerebellar ataxia. Briefings in Bioinformatics, 19(6):1356–1369.

Guillemot, V., Gloaguen, A., Tenenhaus, A., Philippe, C., and Abdi, H. (2021). Introducing group-sparsity and orthogonality constraints in RGCCA. page 6.

Herrmann, M., Probst, P., Hornung, R., Jurinovic, V., and Boulesteix, A.-L. (2021). Large-scale benchmark study of survival prediction methods using multi-omics data. Briefings in Bioinformatics, 22(3):bbaa167.

Hesami, M., Alizadeh, M., Jones, A. M. P., and Torkamaneh, D. (2022). Machine learning: its challenges and opportunities in plant system biology. Applied Microbiology and Biotechnology, 106(9):3507–3530.

Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2016). KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research, 44(D1):D457–D462.

Kolda, T. G. and Bader, B. W. (2009). Tensor Decompositions and Applications. SIAM Review, 51(3):455–500.

Lock, E. F., Park, J. Y., and Hoadley, K. A. (2022). Bidimensional linked matrix factorization for pan-omics pan-cancer analysis. The Annals of Applied Statistics, 16(1).

Löfstedt, T., Hadj-Selem, F., Guillemot, V., Philippe, C., Raymond, N., Duchesney, E., Frouin, V., and Tenenhaus, A. (2016). A general multiblock method for structured variable selection. arXiv:1610.09490 [stat].

Meng, C., Zeleznik, O. A., Thallinger, G. G., Kuster, B., Gholami, A. M., and Culhane, A. C. (2016). Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings in Bioinformatics, 17(4):628–641.

Park, J. Y. and Lock, E. F. (2020). Integrative factorization of bidimensionally linked matrices. Biometrics, 76(1):61–74. https://onlinelibrary.wiley.com/doi/pdf/10.1111/biom.13141.

Pierre-Jean, M., Deleuze, J.-F., Le Floch, E., and Mauger, F. (2020). Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Briefings in Bioinformatics, 21(6):2011–2030.

Rappoport, N. and Shamir, R. (2018). Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Research, 46(20):10546–10562.

Samorodnitsky, S., Hoadley, K. A., and Lock, E. F. (2022). A hierarchical spike-and-slab model for pan-cancer survival using pan-omic data. BMC Bioinformatics, 23:235.

Sanlaville, D. (2022). Plan France Médecine Génomique 2025 – Plateformes AURAGEN et SeqOïa. Morphologie, 106(354, Supplement):S2.

Smilde, A. K., Næs, T., and Liland, K. H. (2022). Multiblock data fusion in statistics and machine learning: applications in the natural and life sciences. Wiley, Hoboken, NJ.

Wishart, D. S., Jewison, T., Guo, A. C., Wilson, M., Knox, C., Liu, Y., Djoumbou, Y., Mandal, R., Aziat, F., Dong, E., Bouatra, S., Sinelnikov, I., Arndt, D., Xia, J., Liu, P., Yallou, F., Bjorndahl, T., Perez-Pineiro, R., Eisner, R., Allen, F., Neveu, V., Greiner, R., and Scalbert, A. (2013). HMDB 3.0–The Human Metabolome Database in 2013. Nucleic Acids Research, 41(Database issue):D801–807.

Yi, S., Wong, R. K. W., and Gaynanova, I. (2022). Hierarchical nuclear norm penalization for multi-view data. arXiv:2206.12891 [stat].

Application

Procedure:

Deadline: 17 March 2023

Contacts

Arnaud GLOAGUEN

 agNOSPAMloague@cng.fr

Offer published on 20 February 2023, displayed until 17 March 2023