Keywords
machine learning
neural networks
bioinformatics
genomics
genome interpretation
Description
Title
DNA large language models for end-to-end genome interpretation
Motivation
Interpreting the genome means modeling the relationship between genotype and phenotype, which is a fundamental goal of biology. Progress on this problem could transform genetics, medicine, and agro-tech. In clinical genetics, it could lead to treatments tailored to each patient’s genome, enabling precision medicine.
Objectives
This project combines quantitative genetics, bioinformatics, and Artificial Intelligence, using recent Deep Learning Large Language Models (LLMs) for DNA to improve the prediction of phenotypes from observed genotypes. It builds on Dr. Raimondi's previous work on Genome Interpretation (GI) for the prediction of clinically relevant phenotypes in humans.
Dr. Raimondi's previous work focused on encoding Whole-Exome/Whole-Genome Sequencing (WES/WGS) data into compact, machine-readable features while reducing the overfitting caused by high-dimensional genomic data. However, to limit complexity, it had to rely on coarse gene-level summaries, such as mutational burden per gene, which sacrificed fine-grained genetic detail.
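To make the gene-level encoding concrete, here is a minimal sketch of a mutational-burden feature vector. It is an illustration only, not the team's actual pipeline: the function name `gene_burden`, the `(gene, variant_id)` input format, and the gene names are all hypothetical.

```python
from collections import Counter

def gene_burden(variants, genes):
    """Count variants per gene for ONE individual.

    variants: iterable of (gene_name, variant_id) pairs
              (hypothetical, already-annotated input format).
    genes:    ordered list of gene names defining the feature columns.
    Returns a list of counts aligned with `genes`.
    """
    counts = Counter(gene for gene, _ in variants)
    return [counts.get(g, 0) for g in genes]

# Hypothetical example: three genes, one individual with three variants.
genes = ["GENE_A", "GENE_B", "GENE_C"]
sample = [("GENE_A", "var1"), ("GENE_A", "var2"), ("GENE_C", "var3")]
features = gene_burden(sample, genes)
print(features)
```

The two variants in `GENE_A` collapse into a single count, so their positions and alleles are lost. This is the fine-grained detail that a gene-level summary gives up.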
The goal of this project is to move beyond these gene-centric encodings by building neural network architectures that operate directly at the nucleotide level. To achieve this, we propose integrating pre-trained DNA LLMs as unsupervised feature extractors within GI models. These models, trained using self-supervised learning on entire genomes, capture rich patterns of DNA dependencies and can produce information-dense latent representations.
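The feature-extractor idea can be sketched in PyTorch as a frozen encoder followed by a small trainable prediction head. This is a toy sketch under stated assumptions: `ToyDNAEncoder` is a hypothetical stand-in for a real pre-trained DNA LLM, the per-nucleotide tokenization and mean pooling are simplifications, and the posting does not specify the actual architecture.

```python
import torch
from torch import nn

NUCLEOTIDES = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def tokenize(seq):
    """Map a DNA string to a (1, length) tensor of integer tokens."""
    return torch.tensor([[NUCLEOTIDES.get(b, 4) for b in seq.upper()]])

class ToyDNAEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained DNA LLM."""
    def __init__(self, vocab_size=5, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, tokens):       # tokens: (batch, length)
        return self.embed(tokens)    # (batch, length, dim)

class PhenotypePredictor(nn.Module):
    """Frozen encoder + trainable linear head on mean-pooled embeddings."""
    def __init__(self, encoder, dim=16):
        super().__init__()
        self.encoder = encoder.requires_grad_(False)  # freeze encoder weights
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):
        with torch.no_grad():                # no gradients through the encoder
            hidden = self.encoder(tokens)
        pooled = hidden.mean(dim=1)          # (batch, dim) sequence summary
        return self.head(pooled).squeeze(-1) # (batch,) phenotype logit

model = PhenotypePredictor(ToyDNAEncoder())
logit = model(tokenize("ACGTNACGT"))
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(logit.shape, trainable)
```

Only the head's parameters remain trainable, so the number of weights fitted on the phenotype data stays small even when the encoder is large.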
DNA LLMs have shown strong performance in various functional genomics tasks, such as identifying regulatory elements and predicting variant effects. This project will evaluate whether their latent representations can improve phenotype prediction. The new DNA LLM methods will be prototyped on Arabidopsis thaliana (A. thaliana), a well-known model organism. Later developments will be translated to the prediction of disease risk for Inflammatory Bowel Disease (IBD). Unlike most existing DNA LLM research, this work applies LLMs to the interpretation of individual-level WES/WGS data for disease risk, marking a novel use of these models in human genetic prediction.
Candidate profile
We are looking for a motivated and curious candidate with a strong passion for science and for scientific discovery through the use and creation of new data science and Machine Learning methods. Bioinformatics and Genome Interpretation are multi-disciplinary and rapidly evolving fields. Therefore, the candidate is expected to 1) be eager to continuously learn new skills, methods and concepts, and 2) enjoy finding solutions to new and unforeseen difficulties.
The ideal candidate has very good 1) Python programming skills, 2) understanding of the mathematical foundations and principles of Machine Learning and Linear Algebra (vector and matrix operations, optimization), with a particular focus on Neural Networks, 3) problem-solving skills, 4) familiarity with the GNU/Linux environment and 5) ability to multi-task across different projects. A good understanding of the basic concepts of Bioinformatics is welcome but not necessary. The project will be based on the development of unorthodox Neural Network models with PyTorch.
B2 level of English is required.
The offer provides an initial 6-month contract, with the possibility of renewal for up to 2 years. The project can be extended to 3 years and offers the opportunity to obtain a PhD.
Research environment
The recruited person will join the “AI for Genome Interpretation” team led by Dr. Daniele Raimondi at IGMM. The work will be conducted in an international (English-speaking) and interdisciplinary environment.
The Institute of Molecular Genetics of Montpellier (IGMM) is a joint research unit affiliated with the CNRS and the University of Montpellier. It comprises around 200 members, organized into 18 research groups, 9 shared support services and 9 technological and scientific platforms.
IGMM is a multidisciplinary institute whose research has both fundamental and translational impact in molecular and cellular biology at the international level.
Qualifications
In addition to the skills listed in the candidate profile, the ideal candidate is familiar with scientific computing and with libraries such as NumPy, scikit-learn, SciPy and PyTorch.
TO APPLY: https://umontpellier.nous-recrutons.fr/poste/7sqwj9n4vg-ingenieur-en-calcul-scientifique-ou-ingenieur-en-ingenierie-logicielle-fh/