Keywords
Deep Learning
Enhancer
Regulatory regions
ChIP-seq
RNA-seq
Machine Learning
Data-mining
Description
Summary of the project:
Our recent work suggests that a substantial fraction of coding exons displays hallmark features of enhancer activity and can act as regulatory elements in functional assays, defining a class of elements referred to as exonic enhancers (EEs). This dual function raises several scientific and methodological challenges.
First, the diversity of EEs remains to be clarified: do they form a homogeneous class, or do they comprise multiple subtypes driven by distinct mechanisms (TF binding density, motif grammar, chromatin context, base composition, non-B DNA structures such as G-quadruplexes, tissue specificity)? Second, EE identification often relies on dense experimental datasets (e.g., ChIP-seq peak collections, chromatin accessibility profiles), which are unevenly available across tissues and species. It is therefore crucial to assess to what extent sequence alone, or models that learn regulatory “grammar”, can predict EE activity without systematically depending on rich epigenomic profiling.
Beyond identification and prediction, EEs have direct implications for variant interpretation. A coding variant, including synonymous changes, may disrupt a TF binding site (TFBS) embedded within an EE and thereby alter gene expression, in addition to its potential effects on protein sequence or RNA processing. This raises an important question in functional genomics and human genetics: what fraction of “classical” coding variants (synonymous, missense) exerts regulatory effects through EEs, and how can such variants be prioritized for functional follow-up or clinical interpretation? This motivates the development of a reproducible framework, starting from VCF files, to annotate variants overlapping EEs, estimate their impact on motifs or TFBS, and provide actionable prioritization scores.
Finally, the existence of EEs implies a distinctive evolutionary constraint: the same sequence must satisfy both protein-level constraints (codons, amino acids, structure) and regulatory constraints (TF motifs, sequence composition, regulatory architecture). Understanding how evolution resolves this trade-off, and whether some EE subclasses are more conserved than others, provides a mechanistic view of dual function and enables testing the generalizability of sequence-based models across species. Overall, this PhD project aims to build an integrated framework, spanning quantitative definition of EE subtypes, sequence-based prediction, variant interpretation, and an evolutionary perspective on the constraints shaping these elements.
This project combines regulatory genomics and large-scale bioinformatics, balancing mechanistic analyses, sequence-based modelling, and reproducible tool development. It offers a unique opportunity to work on an emerging biological concept with direct implications for coding-variant interpretation.
Objectives:
This thesis project follows a logical storyline: define (what is an EE and which subfamilies exist), predict (can EEs be detected without extensive experimental data), interpret (what are the consequences of variants), and understand (generalization and evolutionary constraints).
From our recent work, several key questions arise:
- How common are exonic enhancers (EEs), and do they form distinct mechanistic subclasses rather than a single homogeneous category?
- Can EE activity be predicted from DNA sequence alone, enabling discovery in tissues and species where epigenomic data are limited?
- Do coding variants, including synonymous changes, disrupt transcription factor binding sites embedded in EEs and thereby alter gene regulation?
- How does evolution accommodate the coexistence of protein-coding constraints and regulatory constraints within the same exon, and which EE subclasses are most conserved?
In a first step, the student will develop “classical” computational genomics analyses to define and stratify exonic enhancers based on quantitative signatures (TF binding support, motif content, sequence features, predicted G4 propensity, tissue specificity proxies, and conservation). Dimensionality reduction and clustering will be used to identify robust EE subtypes, and these groups will be validated using external datasets and functional readouts when available.
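As an illustration of the stratification step above, here is a minimal sketch of how a table of per-exon features could be standardized and clustered. It is not the project's method: k-means is used only as a simple stand-in for the clustering approach, the two features (a TF peak count and a GC fraction) and all values are hypothetical, and the function names are illustrative.

```python
import math

def zscore_columns(rows):
    """Standardize each feature column to mean 0 and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def kmeans(points, k, n_iter=100):
    """Plain k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the previous center if a cluster ends up empty.
            new_centers.append(
                [sum(c) / len(members) for c in zip(*members)] if members else centers[j]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers)

# Hypothetical feature table: one row per exon, [tf_peak_count, gc_fraction].
features = [[10, 0.90], [12, 0.80], [11, 0.85], [1, 0.10], [2, 0.20], [1.5, 0.15]]
labels = kmeans(zscore_columns(features), k=2)
```

Standardizing first keeps a feature with a large numeric range (such as peak counts) from dominating the distance; in practice a library such as scikit-learn would replace these hand-written helpers.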
In a second step, the student will develop sequence-based predictive models (machine learning and deep learning) to distinguish EEs from matched non-EE exons and to predict regulatory activity. Model interpretability (attribution, in silico mutagenesis, codon-aware perturbations) will be used to connect predictions to biologically meaningful sequence determinants, such as motifs and TFBS organization.
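One way to picture the codon-aware perturbations mentioned above is to enumerate every single-nucleotide change that leaves the encoded protein unchanged, so that any shift in a model's prediction can be attributed to regulatory sequence rather than to the amino acid. The sketch below assumes an in-frame, upper-case coding sequence and the standard genetic code; the function name is illustrative, and scoring each variant with a sequence model is left out.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def synonymous_snvs(cds):
    """List (position, ref, alt) for all synonymous single-base substitutions.

    Positions are 0-based offsets into the in-frame coding sequence.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    variants = []
    for codon_start in range(0, len(cds), 3):
        codon = cds[codon_start:codon_start + 3]
        for offset in range(3):
            for alt in BASES:
                if alt == codon[offset]:
                    continue
                mutated = codon[:offset] + alt + codon[offset + 1:]
                # Keep only substitutions that preserve the amino acid.
                if CODON_TABLE[mutated] == CODON_TABLE[codon]:
                    variants.append((codon_start + offset, codon[offset], alt))
    return variants

# Alanine (GCT) tolerates any base at its third position.
ala_variants = synonymous_snvs("GCT")
```

Each returned variant can then be applied to the exon sequence and passed to a trained model; the difference between the mutated and reference predictions is the in silico mutagenesis effect for that position.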
In a third step, the student will implement a reproducible variant interpretation framework that takes VCF files as input, intersects variants with EEs, and estimates TFBS disruption or creation (gain/loss and changes in motif strength). Outputs will provide prioritization scores combining motif impact, EE subtype information, and sequence-model predictions, enabling the selection of a small set of candidate variants for targeted functional validation (for example luciferase assays or targeted MPRA).
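The TFBS gain/loss estimate described above can be sketched as the difference in motif score between the reference and alternate alleles. The example below is a simplified illustration, not the framework itself: it assumes a single-nucleotide substitution, scans the forward strand only, requires the sequence to be at least as long as the motif, and uses a hypothetical 4-bp count matrix. Reading variants from a VCF file and intersecting them with EEs are omitted.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, background=0.25, pseudocount=0.5):
    """Convert per-position base counts (A, C, G, T) into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column) + 4 * pseudocount
        pwm.append(
            [math.log2((c + pseudocount) / total / background) for c in column]
        )
    return pwm

def best_overlapping_score(seq, pos, pwm):
    """Best motif score among forward-strand windows that cover position pos."""
    width = len(pwm)
    first = max(0, pos - width + 1)
    last = min(pos, len(seq) - width)
    return max(
        sum(pwm[i][BASES.index(seq[start + i])] for i in range(width))
        for start in range(first, last + 1)
    )

def motif_delta(ref_seq, pos, alt, pwm):
    """Score change (alt minus ref) for a substitution at 0-based pos.

    Negative values suggest a weakened site, positive values a strengthened one.
    """
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return best_overlapping_score(alt_seq, pos, pwm) - best_overlapping_score(
        ref_seq, pos, pwm
    )

# Hypothetical motif that strongly prefers the sequence ACGT.
counts = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]]
pwm = log_odds_pwm(counts)
loss = motif_delta("TTACGTTT", 3, "T", pwm)  # C>T inside the site
gain = motif_delta("TTATGTTT", 3, "C", pwm)  # T>C restores the site
```

A prioritization score could combine such deltas across many motifs with EE subtype and sequence-model predictions; how these terms are weighted is a design choice that this sketch does not address.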
In a fourth step, the student will perform a comparative and evolutionary analysis across species (human and additional model species where data are available) to quantify EE conservation, motif turnover, and cross-species transferability of sequence-based models. This will help identify where the trade-off between protein-coding and regulatory constraints is resolved (putative “safe harbor” regions) and how EE subclasses relate to evolutionary constraint and regulatory innovation.
Profile and skills required:
Applicants should hold a Master’s degree in bioinformatics, computational biology, or genomics, and demonstrate a solid understanding of genome regulation and molecular biology (genes, promoters/enhancers, transcription factors, chromatin, variants).
For non-French degrees, applicants should have a First or upper Second class Honours degree; for French Master’s degrees, candidates should be in the top tier of their cohort (Mention AB minimum).
The ideal candidate will have strong programming skills in Python and R, and be comfortable handling large genomic datasets (BED, GTF/GFF, FASTA, VCF), genome annotations, and standard bioinformatics tooling (e.g., bedtools, samtools/bcftools). Experience with genomic interval analyses, motif or TF binding analyses, and integration of multi-omics datasets (e.g., ChIP-seq, ATAC-seq, STARR/MPRA, RNA-seq) is highly desirable.
Familiarity with statistical learning methods (e.g., dimensionality reduction, clustering, classification) is expected. Experience with machine learning is a plus, but the emphasis is on using these approaches in a biologically informed way, including careful experimental design, appropriate controls, and interpretable results. Knowledge of relevant libraries (e.g., scikit-learn; optionally PyTorch or TensorFlow/Keras) is welcome.
Experience working in HPC environments (ideally SLURM) and good practices in reproducible research (Git, structured code, documentation, testing) are necessary. Experience with workflow management systems (Snakemake/Nextflow) is a plus but not required.
Context:
This PhD project will be funded through a French Ministry fellowship and will start on 1 October 2026. It will be carried out at the TAGC laboratory (Inserm U1090) on the Luminy campus in Marseille, France. The selected candidate will be required to apply to the ED 658 doctoral school competition (written application and interview); as this competition is highly selective, funding is conditional upon successfully passing this selection.
Supervision:
The PhD will be supervised by Benoit Ballester (TAGC, ORCID, LinkedIn) in a small, friendly team. PhD students have direct access to the supervisor and are encouraged to interact and work together.
How to apply:
Candidates should first apply by email (benoit.ballester@inserm.fr) with the following:
- CV
- Cover Letter
- Marks (Master 1 and Master 2, as well as current rank)
- Reports from previous academic placements (e.g., Master 1 placement)
Only one applicant will be presented at the Doctoral School competition.
Deadline for receipt of applications: 11 May 2026
ED competition dates: 1-3 July 2026
ED competition details: 10-minute oral presentation, followed by 15 minutes of questions from the jury (10-14 members).