Keywords
pangenome
graph
machine learning
Description
Context and Objectives:
Prokaryotes (i.e. bacteria and archaea) are remarkably diverse and ubiquitous organisms. Their impact on the biosphere is immense: they influence human and animal health, soil and ocean biogeochemistry, and much more. Large-scale exploration of microbial genomes has helped uncover the molecular mechanisms underlying this diversity, particularly the role of Mobile Genetic Elements (MGEs).
In recent years, with the explosion of sequencing projects, several bioinformatics approaches have been developed based on the pangenome concept, offering solutions for efficiently managing and exploiting large quantities of data [1]. Pangenomics examines genetic variability across all available genomes of a given group, usually a species, rather than relying on a single reference genome or making pairwise comparisons. In terms of gene content, a distinction is made between the core genome, i.e. the genes present in all individuals, and the accessory (or variable) genes that are more or less conserved in the genomes, and therefore likely to explain phenotypic particularities. The development of pangenomic methods is thus a response to the challenge of massive data in biology, helping to understand the evolution of microorganisms in relation to epidemiological or environmental data.
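As a minimal illustration of the core/accessory distinction described above, the sketch below partitions gene families using the strict definition (core = present in every genome). The function name, genome names, and gene family labels are hypothetical toy values, and this is not the statistical partitioning performed by PPanGGOLiN.

```python
from collections import Counter


def partition_gene_families(genomes):
    """Split gene families into core (present in all genomes) and accessory.

    `genomes` maps a genome name to the set of gene families it contains.
    Hypothetical helper for illustration only.
    """
    # Count in how many genomes each gene family occurs.
    counts = Counter(fam for families in genomes.values() for fam in set(families))
    n_genomes = len(genomes)
    core = {fam for fam, count in counts.items() if count == n_genomes}
    accessory = set(counts) - core
    return core, accessory


# Toy input: three genomes sharing three families, two carrying an extra one.
genomes = {
    "genome_A": {"dnaA", "gyrB", "recA", "tetM"},
    "genome_B": {"dnaA", "gyrB", "recA"},
    "genome_C": {"dnaA", "gyrB", "recA", "intI1"},
}
core, accessory = partition_gene_families(genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

In practice, a strict "all genomes" threshold is sensitive to incomplete assemblies, which is one reason more tolerant partitioning schemes are used on real data.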
For several years now, the LABGeM and LaMME teams have been working on a model to represent genomic data as a pangenome graph at the gene family level, enabling the compression of information from thousands of genomes while preserving the chromosomal organization of genes. The PPanGGOLiN software suite [2] (awarded an Open Science Research Prize by the French Ministry of Research in 2023; >220 citations since 2020) has been developed to reconstruct and analyze pangenome graphs. It includes methods such as the identification of regions of genomic plasticity (panRGP method) [3] and their fine-grained description as conserved modules (panModule method) [4], which have proven useful for identifying genomic islands and their MGEs. LABGeM is also developing PanGBank, a database of pangenomes reconstructed from public genomes in the GenBank and RefSeq databases using the GTDB classification. It currently gathers pangenomes for >4300 prokaryotic species.
The PanGAIMiX project aims to revolutionize microbial genome analysis by integrating pangenome graph models with advanced machine learning techniques. Within this project, Work Package 2 (WP2) focuses on developing methods to detect conserved genomic context modules across multiple pangenomes using a MultiGraph Neural Network approach. The goal is to overcome the scalability limitations of traditional graph-based algorithms and enable the detection of evolutionary patterns across hundreds of species.
Tasks:
Build a cross-species layered pangenome multigraph from the PanGBank resource, where edges encode either gene co-localization within genomes or homology relationships across species.
Identify conserved modules across pangenomes by applying deep learning architectures, such as U-Net [5], adapted for graph-based data segmentation.
Benchmark the method against state-of-the-art approaches for detecting conserved modules (e.g., panModule, STRING-DB: https://string-db.org/).
Interpret and visualize the detected genomic modules by projecting their learned embeddings into low-dimensional spaces using dimensionality reduction techniques such as UMAP. This will facilitate the exploration of phylogenomic relationships by revealing clusters, gradients, and evolutionary trajectories among gene families across species.
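To make the first task more concrete, here is a minimal sketch of a layered multigraph with two edge types, written with the Python standard library only. The data layout, function name, and toy identifiers are assumptions made for illustration; they do not reflect the PanGBank format or the project's actual design, and chromosome circularity is ignored.

```python
from collections import defaultdict


def build_layered_multigraph(genomes_by_species, homologs):
    """Return a dict mapping (node_u, node_v, edge_type) -> weight.

    genomes_by_species: {species: [ordered list of gene families per genome]}
    homologs: iterable of ((species, family), (species, family)) pairs
    Nodes are (species, family) tuples, so each species forms one layer.
    Hypothetical helper for illustration only.
    """
    edges = defaultdict(int)
    # Intra-layer edges: families that are adjacent in at least one genome,
    # weighted by the number of genomes supporting the adjacency.
    for species, genomes in genomes_by_species.items():
        for gene_order in genomes:
            for left, right in zip(gene_order, gene_order[1:]):
                u, v = sorted([(species, left), (species, right)])
                edges[(u, v, "colocalization")] += 1
    # Inter-layer edges: homology links between families of different species.
    for a, b in homologs:
        if a[0] != b[0]:
            u, v = sorted([a, b])
            edges[(u, v, "homology")] = 1
    return dict(edges)


# Toy input: two species, one homology link between them.
genomes_by_species = {
    "sp1": [["f1", "f2", "f3"], ["f1", "f2"]],
    "sp2": [["g1", "g2"]],
}
homologs = [(("sp1", "f1"), ("sp2", "g1"))]
graph = build_layered_multigraph(genomes_by_species, homologs)
for (u, v, edge_type), weight in sorted(graph.items()):
    print(u, v, edge_type, weight)
```

Keeping the edge type in the key allows two families to be linked by both kinds of edge at once, which is what distinguishes a multigraph from a simple graph. A GNN library would typically consume the same information as typed edge lists.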
Environment:
This internship topic is part of the ANR PanGAIMiX project. The internship will be conducted in collaboration with leading researchers from LABGeM (David Vallenet, Alexandra Calteau), MalAGE (Guillaume Gautreau), and LaMME (Christophe Ambroise and Marie Szafranski). The intern will have access to high-performance computing resources and will work in a multidisciplinary environment that combines expertise in microbial genomics, bioinformatics, and machine learning.
Requirements:
Background in bioinformatics, computer science, or mathematics.
Machine learning experience with some exposure to graph-based methods and neural networks.
Python proficiency with hands-on use of ML libraries and models (e.g., PyTorch, UMAP, U-Net [5]); experience with graph neural network (GNN) libraries is a strong plus.
Strong analytical and problem-solving skills.
Interest in microbial genomics and evolutionary biology.
Contact Information:
For more information and to apply, please send your resume and cover letter to Guillaume Gautreau (guillaume.gautreau@inrae.fr), Christophe Ambroise (christophe.ambroise@univ-evry.fr), Marie Szafranski (marie.szafranski@math.cnrs.fr) and David Vallenet (vallenet@genoscope.cns.fr).
Location: This internship will take place at the University of Evry, in LaMME (Laboratory of Mathematics and Modeling of Evry Val d’Essonne).
Duration: 6 months, starting between January and March 2026, funded by the University of Evry. Possibility of continuing with a fully funded PhD.
References:
[1] Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2016. doi:10.1093/bib/bbw089
[2] Gautreau G, et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. 2020;16: e1007732. doi:10.1371/journal.pcbi.1007732
[3] Bazin A, et al. panRGP: a pangenome-based method to predict genomic islands and explore their diversity. Bioinformatics. 2020;36: i651–i658. doi:10.1093/bioinformatics/btaa792
[4] Bazin A, et al. panModule: detecting conserved modules in the variable regions of a pangenome graph. bioRxiv. 2021. p. 2021.12.06.471380. doi:10.1101/2021.12.06.471380
[5] Ronneberger O, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. Springer International Publishing; 2015. pp. 234-241. doi:10.48550/arXiv.1505.04597