M2 - Identification and characterization, in the non-coding genome, of hotspots involved in the creation of novel genes with integrative bioinformatics and machine learning

Position type
Minimum level of education
Dates
Duration of the position
Renewable contract
Non-renewable contract
Start date
Expiry date of the announcement
Location
Address

Bât 400 - route de Chevreuse - Université Paris-Saclay
91400 Orsay
France

Contacts
Anne Lopes
Contact email(s)
anne.lopes@i2bc.paris-saclay.fr
Description

Summary:
--------------
The emergence of de novo genes (i.e. new genes arising from non-coding ORFs) is a much more frequent event than previously thought. However, the process of novel gene creation remains poorly understood and has prompted many studies over the last decade. Mass spectrometry studies provide evidence of de novo peptides encoded by presumed non-coding regions. Notably, it has been shown that these de novo peptides may have neuronal activity or be involved in cancers or other human pathologies. These results attribute a central role to the non-coding genome in the emergence of new genes and the development of pathologies, which makes it all the more important to characterize the sequence and structural properties of intergenic ORFs along with their translational activity. We recently highlighted the factors determining the emergence of a de novo peptide, using Ribosome Profiling experiments and comparative genomics approaches. This project lies at the interface between OMICS data analyses (e.g. transcriptomics, Ribosome Profiling and mass spectrometry), genomics, and structural bioinformatics, and aims to predict (i) hotspots for de novo gene birth in non-coding regions and (ii) the toxicity of the translated intergenic ORFs identified by Ribosome Profiling. This work will improve our understanding of the mechanisms underlying the emergence of new genes from the non-coding genome and of the role of these de novo peptides in human diseases.

Context:
------------
The emergence of de novo genes (i.e. new genes arising from non-coding ORFs) is a much more frequent event than previously thought and plays an important role in the evolution of genomes and the emergence of new functions [1]. Many cases have been, and continue to be, reported in organisms such as S. cerevisiae, C. elegans, D. melanogaster, M. musculus, and H. sapiens [2]. The activities of these novel genes are generally associated with stress response or, in mammals, cognitive functions. In addition, several recent mass spectrometry studies report the existence of de novo peptides (derived from non-coding ORFs) that may have neuronal activity or be involved in cancers or other human pathologies [3-5]. These results attribute a central role to the non-coding genome in the emergence of new genes and the development of pathologies. However, the mechanisms governing these processes are still unknown, although they are essential to understanding (i) the evolutionary forces governing the emergence of these new proteins and the evolution of genomes and (ii) the role of these peptides in some cancers or other human pathologies.
Recently, we revealed the existence of thousands of small non-coding ORFs (IGORFs, for InterGenic ORFs) in the genome of S. cerevisiae [6]. Using Ribosome Profiling experiments and ancestral IGORF reconstruction approaches, we have highlighted the factors determining the emergence of a de novo peptide (or protogene). The latter can subsequently provide the raw material for de novo gene birth. In particular, we have shown that these de novo peptides are encoded by IGORFs that exhibit specific sequence and structural properties. These results allowed us to establish a model of novel gene creation from non-coding regions: hotspots of IGORFs with specific sequence and structural properties encode small peptides that can be further selected to give rise to new genes.
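
As a purely illustrative aid for candidates, the first step of such an analysis (listing candidate ORFs in an intergenic sequence) could be sketched in Python as follows. This is a minimal sketch and not the method of [6]: it assumes that an ORF is any stretch bounded by two in-frame stop codons of the standard genetic code, and the function names and the `min_codons` threshold are hypothetical choices made for the example.

```python
from typing import List, Tuple

# Stop codons of the standard genetic code (assumption: nuclear genome).
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq: str, min_codons: int = 20) -> List[Tuple[int, int, str]]:
    """List stop-free stretches bounded by in-frame stop codons on both sides.

    Returns (frame, start, orf_sequence) tuples, where start is the 0-based
    position of the first codon after the upstream stop on the scanned strand.
    min_codons is a hypothetical length cutoff, not a value from the posting.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # stays None until a first stop codon is seen in this frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if start is not None and (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, seq[start:pos]))
                start = pos + 3
    return orfs


if __name__ == "__main__":
    toy = "TAA" + "GCT" * 5 + "TAG" + "GCT" * 2 + "TGA"
    # Scan the forward strand, then the reverse strand, of a toy sequence.
    for strand, s in (("+", toy), ("-", reverse_complement(toy))):
        for frame, start, orf in stop_to_stop_orfs(s, min_codons=3):
            print(strand, frame, start, orf)
```

Stretches that are not closed by a stop codon on both sides are ignored here, which is a simplifying choice; a real analysis would also need genome annotations to restrict the scan to intergenic regions.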

Objectives:
---------------
This integrative biology project lies at the interface between OMICS data analyses (IGORF expression and translation profiles, mass spectrometry for the identification of de novo peptides), genomics, and structural bioinformatics. The project relies on bioinformatics, statistics, and machine learning approaches, with a strong component of heterogeneous data integration. The long-term objective is to develop a method that will make it possible to (i) automatically scan and annotate the non-coding genome of any species of interest (with an application to H. sapiens) according to the descriptors that we have already identified, (ii) identify the non-coding regions with a high propensity for de novo gene birth (i.e. hotspots) and (iii) anticipate and avoid the production of harmful peptides in the cell. This work will enable us (i) to improve our understanding of the mechanisms underlying the emergence of new genes from the non-coding genome and (ii) to better understand (and therefore to prevent) the role of de novo peptides, resulting from the translation of non-coding regions, in specific human pathologies. This internship will take place at the Institute of Integrative Biology of the Cell (I2BC), which gathers all the expertise necessary to carry out this project. The project will be conducted in collaboration with the Genomics, Structure, and Translation team (I2BC), which has already performed the RNA-seq and Ribosome Profiling experiments (mass spectrometry data to come).

Required work:
--------------------
The project is vast. Depending on the motivations and skills of the candidate, the internship (short-term objective) will consist of either (i) developing a predictor of the structural and toxicity properties of the translated IGORFs (identified by Ribosome Profiling) or (ii) characterizing, by OMICS approaches, the translational activity of non-coding regions and the production of the associated peptides in different species of interest and in the same species under different conditions. This will enable us to better understand (i) the molecular mechanisms governing the production of these peptides (impact of conditions and conservation of translation activity between related species) and (ii) the role of these peptides in some phenotypes, which may, for example, be associated with pathologies.

Technical skills required:
---------------------------------
Programming (scripting language). Skills in statistics and/or genomics are welcome. Depending on the candidate's profile and motivations, the internship can be oriented mostly towards OMICS analyses or towards machine learning approaches.

References:
------------------
[1] Tautz D, Domazet-Lošo T. 2011, The evolutionary origin of orphan genes. Nat Rev Genet. 12:692-702
[2] Carvunis AR, et al. 2012, Proto-genes and de novo gene birth. Nature. 487:370-4
[3] Prabakaran S, Hemberg M, Chauhan R, Winter D, Tweedie-Cullen RY, Dittrich C, Hong E, Gunawardena J, Steen H, Kreiman G, Steen JA. 2014, Quantitative profiling of peptides from RNAs classified as noncoding. Nat Commun. 5:5429
[4] Barbosa C, Peixeiro I, Romao L. 2013, Gene expression regulation by upstream open reading frames and human disease. PLoS Genet. 9:e1003529
[5] Yin X, Jing Y, Xu H. 2019, Mining for missed sORF-encoded peptides. Expert Rev Proteomics. 16(3):257-266
[6] Papadopoulos C, Callebaut I, Gelly JC, Namy O, Renard M, Lespinet O, Lopes A (manuscript in preparation)

**** This project constitutes the first steps of a PhD thesis ****

Team with SFBI institutional membership
Non-member team