ANR-funded postdoc (3 years, M/F): Machine Learning & Regulatory Genomics
Fixed-term contract (CDD) · Postdoc · 36 months · PhD (Bac+8 / doctorate, Grandes Écoles) · IGMM & LIRMM · Montpellier (France) · 2280 net
Start date: 1 January 2023
machine learning · genomic regulation · transcription
Understanding how regulatory DNA sequences operate to control genome expression, and deciphering the DNA cis-regulatory code, is key to the clinical interpretation of the myriad genetic variations observed in individual genomes and to fostering genomic medicine [Zeitlinger, 2020]. The wealth of high-resolution genomics data offers a unique opportunity to decode this genome-wide syntax, and bioinformatics and machine learning methods have proved instrumental in this task. Numerous approaches have already been developed [Libbrecht & Noble, 2015 ; Eraslan et al., 2019], not only to identify transcription factor (TF) motifs [Bailey & Elkan, 1994 ; Bussemaker et al., 2001] but also to find motifs predictive of histone modifications, chromatin opening or RNA expression directly, often with high accuracy [Zhou et al., 2015 ; Agarwal et al., 2020 ; Kelley et al., 2018 ; Avsec et al., 2021]. While the high accuracies achieved by these methods confirm the existence of sequence-level instructions for genome regulation and RNA expression, most of them focus on single nucleotides and motifs (typically TF binding sites) and do not take into account the fact that the nucleotide distribution along the genome is not uniform [Bernardi et al., 1985 ; Bessière et al., 2018]. Yet this non-uniform distribution creates large and relatively homogeneous regions of low complexity (hereafter called Low Complexity Regions, LCRs), which can result from a strong imbalance in nucleotide content (biased content), from the presence of tandem repeats (e.g. microsatellites), or from a combination of these features [Bernardi et al., 1985 ; Orlov & Potapov, 2004]. LCRs can play key roles in various forms of genomic regulation, making them important elements of the DNA cis-regulatory code.
While several approaches have long been proposed to model their textual complexity [Orlov & Potapov, 2004], the DExTER method developed by our team remains the sole approach designed to specifically and automatically characterize LCRs associated with their now widely accepted regulatory functions [Menichelli et al., 2021]. In the present project, we propose to continue our efforts and to develop statistical and machine learning methods, inspired by Hidden Markov Models (HMMs) and Convolutional Neural Networks (CNNs), aimed at specifically characterizing LCRs implicated in two fundamental biological processes: RNA transcription and TF binding. These models will be trained with data collected in different cell lines and in different species (from human to Plasmodium falciparum). Both biological processes will be studied with these methods, changing only the predicted variable (regression on a continuous variable for RNA transcription; binary classification for TF binding) and the learning algorithms. LCRs identified by these analyses as new regulatory elements will then be validated experimentally by the experimentalists of the consortium (S. Spicuglia lab, Marseille, for human cells, and J.J. Lopez-Rubio lab, Montpellier, for P. falciparum).
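As a rough illustration of what "low complexity" means at the sequence level, the sketch below flags sliding windows whose nucleotide composition has low Shannon entropy. It is not the DExTER method or any approach planned in the project; the function names, window size, step and entropy threshold are all hypothetical choices made for the example. A single-nucleotide entropy only captures biased content and very short repeats; longer tandem repeats (e.g. a repeated ACGT unit) would need a k-mer or repeat-aware measure.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the symbol composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=20, step=5, max_entropy=1.0):
    """Return (start, end, entropy) for each window with entropy <= max_entropy.

    window, step and max_entropy are illustrative defaults, not tuned values.
    """
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        h = shannon_entropy(seq[start:start + window])
        if h <= max_entropy:
            hits.append((start, start + window, h))
    return hits


if __name__ == "__main__":
    # Toy sequence: a poly-A stretch followed by a balanced ACGT region.
    toy = "A" * 20 + "ACGT" * 5
    for start, end, h in low_complexity_windows(toy, window=20, step=20):
        print(f"low-complexity window {start}-{end}, entropy {h:.2f} bits")
```

On the toy sequence, only the poly-A window falls under the threshold; the balanced region reaches the 2-bit maximum for a four-letter alphabet.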
From a fundamental perspective, our project will identify new regulatory regions and trace their evolution and conservation across several species. These results should open new avenues of research that explain, at least in part, the heterogeneity of the nucleotide composition of genomes. They will also provide new insights into gene regulation in an important human pathogen, P. falciparum, which is still poorly understood. Our project also has promising and innovative applications in public health, in particular in genomic medicine. Our methods are of prime importance for delineating new regulatory regions and characterizing their nucleotide compositions, thereby allowing a better interpretation of the thousands of variations located within these regions, some of which are linked to specific traits by statistical analyses (e.g., clinical traits in GWAS or gene expression in eQTLs) without any known molecular mechanism to support these associations. In fact, most of the genetic variations observed in individuals are located in unannotated non-coding regions of the genome, which prevents their biological and clinical interpretation. Our project should start filling this gap.
How to apply: send an email with CV and references to email@example.com, or apply via https://emploi.cnrs.fr/Offres/CDD/UMR5535-CHALEC-001/Default.aspx?lang=EN
Deadline: 31 October 2022
Offer posted on 22 September 2022, displayed until 31 October 2022