Origins of the diversity of protein folds and identification of de novo folds (M2 internship + ANR-funded PhD thesis)

 Fixed-term contract · M2 internship · 42 months    Master's degree (Bac+5)   I2BC · Gif-sur-Yvette (France)  internship stipend + PhD fellowship

 Start date: 13 February 2023


microbial ecology and evolution, de novo folds, structural diversity, bioinformatics, deep learning, structural bioinformatics, genomics


Context: Characterizing how environmental pressures have shaped the diversity of protein folds and functions is essential to understanding the structural bases of molecular innovation and, ultimately, biodiversity. Proteins can be seen as molecular Lego assemblies of protein bricks whose combination dictates the resulting fold and function. Until recently, it was thought that most folds had already been inventoried in public databases. Nevertheless, metagenomics-based analyses and genome sequencing of species living in extreme environments suggest that there are still novel structural spaces to be discovered. These studies report large proportions of Orphan genes (genes without any homolog in other species) or Taxonomically Restricted Genes (TRGs) that cannot be related to any known protein fold. Notably, archaeal genomes usually contain about 30% TRGs, including many Orphans. Some of them have been shown to confer new traits and play significant roles in species adaptation to the environment. Are these Orphans and TRGs made of already known protein bricks, or do they include new ones? Do these TRGs fold into novel protein folds? How do environmental constraints shape the protein structural space? Finally, what are the main sources of protein diversity, and is structural innovation more frequent in specific environments? These are fundamental questions for understanding the structural bases of molecular innovation and species adaptation.


Objectives: We aim to investigate the protein diversity of archaea from a wide variety of habitats, including extreme ones. Because they encode many Orphan genes and remain poorly studied, archaea are expected to reveal new protein folds. Until recently, characterizing the fold diversity of these proteomes was unthinkable, given the time and resources required to solve even a single protein structure. However, DeepMind's recent program AlphaFold2 has revolutionized structural biology with an artificial intelligence network able to predict 3D protein structures from amino-acid sequences with unrivaled atomic accuracy. This offers an unprecedented opportunity to study the relationship between environmental constraints and the 3D repertoire of protein bricks, folds and functions. We will address these questions by exploring a large dataset of 500 archaeal genomes and metagenome-assembled genomes (MAGs), selected to cover the archaeal tree of life and ecosystem types, and by annotating and studying their protein folds.


Work: The ANR-funded project is a long-term project intended to continue as a doctoral thesis. The first step of the Master 2 internship will consist of the structural annotation of the 500 archaeal proteomes, together with the proteomes already released by DeepMind. The second step will consist of the analysis of the resulting protein structures, in particular the identification of novel folds and a careful assessment of the AlphaFold2 models with respect to their scores. Overall, the data and analyses produced in this project will provide a comprehensive view of the archaeal protein universe, with the largest 3D repertoire to date at different structural levels (protein folds and bricks) and at different phylogenetic and ecological levels (universal, specific to archaea, clade-specific, and environment-specific). Finally, multivariate statistics on protein folds, phylogenetic taxa and ecosystems will shed light on the molecular innovations that species have developed to adapt to their environment, improving our understanding and, therefore, our anticipation of the impact of environmental changes on biodiversity.


Environment: The internship/thesis will take place in the new buildings of the Institute of Integrative Biology of the Cell (I2BC), located south of Paris on the green, historic CNRS campus. I2BC is a recent and highly dynamic institute that gathers about 600 people, including about 150 PhD students and postdoctoral fellows. The project will benefit from the broad range of expertise and technologies already available at I2BC, and from the strong network of collaborations we have developed with experimentalists outside I2BC. Notably, we will take advantage of a recent expedition to Chile by our ANR partner, who has sequenced novel metagenomes from environments dominated by archaea, including saline ones (with various chemistries), high-temperature ones, and combinations of both, where we expect to find many new archaeal lineages.


Technical skills required: programming and basic knowledge of structural bioinformatics (mandatory). Skills in statistics, AI and/or genomics are welcome.


Application procedure:

Deadline: 30 November 2022


Anne Lopes

Offer published on 3 October 2022, displayed until 30 November 2022