Post-doc ANR 3 years M/F Machine Learning & Regulatory Genomics

 Fixed-term contract (CDD) · Postdoc  · 26 months    Bac+8 / PhD, Grandes Écoles   IGMM & LIRMM · Montpellier (France)  between €2,833 and €3,347 gross per month, depending on experience

 Start date: 1 February 2023

Keywords

machine learning HMM genomic regulations genetics transcription

Description

Finding how regulatory DNA sequence operates to control genome expression, and deciphering the DNA cis-regulatory code, is key to clinically interpreting the myriad genetic variations observed in individual genomes and to fostering genomic medicine [Zeitlinger, 2020]. The wealth of genomics data at exquisitely high resolution offers a unique opportunity to decode this genome-wide syntax, and bioinformatics and machine learning methods have proven instrumental in this task. Numerous approaches have already been developed [Libbrecht & Noble, 2015; Eraslan et al., 2019], not only to identify transcription factor (TF) motifs [Bailey & Elkan, 1994; Bussemaker et al., 2001] but also to find motifs predictive of histone modifications, chromatin opening or RNA expression directly, often with high accuracy [Zhou et al., 2015; Agarwal et al., 2020; Kelley et al., 2018; Avsec et al., 2021]. While the high accuracies achieved by these methods confirm the existence of sequence-level instructions for genome regulation and RNA expression, most of them focus on single nucleotides and motifs (typically TF binding sites) and do not take into account the fact that the nucleotide distribution along the genome is not uniform [Bernardi et al., 1985; Bessière et al., 2018]. Yet this non-uniform distribution creates large and relatively homogeneous regions with low complexity (hereafter called Low Complexity Regions, LCRs), which can arise from a strong imbalance in nucleotide content (biased content), from the presence of tandem repeats (e.g. microsatellites), or from a combination of these features [Bernardi et al., 1985; Orlov & Potapov, 2004]. LCRs can play key roles in various forms of genomic regulation, making them key elements of the DNA cis-regulatory code.
While several approaches have long been proposed to model their textual complexity [Orlov & Potapov, 2004], the DExTER method developed by our team remains the sole approach designed to specifically and automatically characterize LCRs associated with their now widely accepted regulatory functions [Menichelli et al., 2021]. In the present project, we propose to continue our efforts and to develop statistical and machine learning methods, inspired by Hidden Markov Models (HMMs) and Convolutional Neural Networks (CNNs), aimed at specifically characterizing LCRs implicated in two fundamental biological processes: RNA transcription and TF binding. These models will be trained on data collected in different cell lines and in two species (human and Plasmodium falciparum). Both biological processes will be studied with these methods, changing only the predicted variable (regression on a continuous variable for RNA transcription; binary classification for TF binding) and the learning algorithms. LCRs identified by these analyses as new regulatory elements will then be validated experimentally by the experimentalists of the consortium (S. Spicuglia lab, Marseille, for human cells, and J.J. Lopez-Rubio lab, Montpellier, for P. falciparum).
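
For readers less familiar with the notion of sequence complexity, here is a minimal sketch of how biased nucleotide content can be flagged with a sliding-window Shannon entropy. It is an illustration only: it is neither the DExTER method nor one of the HMM/CNN models planned in the project. The function names, window size and entropy threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the nucleotide frequencies in `window`."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq: str, size: int = 20, max_entropy: float = 1.0):
    """Return (start, entropy) for each window whose entropy is <= max_entropy.

    `size` and `max_entropy` are illustrative defaults, not values from the project.
    """
    hits = []
    for start in range(len(seq) - size + 1):
        h = shannon_entropy(seq[start:start + size])
        if h <= max_entropy:
            hits.append((start, h))
    return hits


# Toy sequence: a mixed 20-nt stretch followed by an A/T-only stretch.
toy = "ACGTACGTACGTACGTACGT" + "ATATATATATATATATATAT"
hits = low_complexity_windows(toy)
```

Single-nucleotide entropy of this kind only captures biased content. Tandem repeats that use all four nucleotides evenly would need a different measure, for example entropy over k-mers.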

From a fundamental perspective, our project will identify new regulatory regions and their evolution/conservation in several species. These results should open new avenues of research that explain, at least in part, the heterogeneity of the nucleotide composition of genomes. The project will also provide new insights into gene regulation in an important human pathogen, P. falciparum, which is still poorly understood. It also has promising and innovative applications in public health, in particular in genomic medicine. Our methods are of prime importance for delineating new regulatory regions and characterizing their nucleotide composition, thereby allowing a better interpretation of the thousands of variations located within these regions, some of which are linked to specific traits by statistical analyses (e.g., clinical traits in GWAS or gene expression in eQTLs) without any molecular mechanism to explain these associations. In fact, most of the genetic variations observed in individuals are located in unannotated non-coding regions of the genome, which prevents their biological and clinical interpretation. Our project should start filling this gap.

Candidate profile: The candidate will work in a multidisciplinary team (combining biology, computer science and statistics), in collaboration with experimentalists (Montpellier and Marseille), and in a very active international environment (with collaborators from CRG, Barcelona, Spain; UBC, Vancouver, Canada; RIKEN Yokohama, Japan). She/he will be specifically involved in the development of the HMM and CNN models and will help disseminate the models to the scientific community by making the code freely available. The candidate should hold a PhD in bioinformatics, computer science, statistics or a related field, with experience in machine learning. Familiarity with genetics, genomics and/or gene expression is an advantage, but this knowledge can be acquired in the course of the project through interactions with members of the consortium and/or through dedicated theoretical courses and workshops. Individual qualities such as adaptability, perseverance, creativity and teamwork are expected.

Application

Procedure: Apply via the linked website

Deadline: 23 December 2022

Contacts

Charles LECELLIER

 chNOSPAMarles.lecellier@igmm.cnrs.fr

 https://emploi.cnrs.fr/Gestion/Offre/Default.aspx?Ref=UMR5535-CHALEC-002

Posted on 5 December 2022, displayed until 23 December 2022