Keywords
statistical genetics
genome-wide association studies
population genetics
complex traits and diseases
genetic variant classification
Description
Context
Genome-wide association studies (GWASs) scan the genome to identify specific regions, namely genetic variants, that are significantly associated with complex traits and diseases (e.g. standing height or type 2 diabetes). Over the past two decades, a very large number of associations have been identified in the human genome. According to the GWAS Catalog, 7,714 publications and over 1 million unique genetic variant-trait associations have been reported as of June 2026. However, for complex traits and diseases, the biological mechanisms underlying GWAS associations are challenging to establish, and the links between genes and diseases remain mostly unelucidated. As a concrete example, a systematic review reported only 309 experimentally validated non-coding GWAS variants (Alsheikh et al. 2022).
Still, the utility of GWAS has been demonstrated by the development of polygenic risk scores (PRS), which integrate the effects of multiple genetic variants identified through GWAS to quantify an individual’s susceptibility to a specific disease or trait. PRS consolidate these variant associations into a single score that can predict disease risk, enabling clinicians to identify high-risk individuals for preventive interventions and tailored treatment plans. However, PRS have known limitations when training and test samples differ, notably that they do not transfer well across ancestries or sociodemographic groups (Jayasinghe et al. 2024; Cudic et al. 2026).
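To make the idea of a single consolidated score concrete, the sketch below computes a PRS in its simplest textbook form: a weighted sum of allele dosages, with per-variant GWAS effect sizes as weights. It is only an illustration. The function name `polygenic_score` and the toy genotypes and effect sizes are hypothetical, and practical PRS methods add steps (variant selection, shrinkage of effect sizes, handling of linkage disequilibrium) that are not shown here.

```python
import numpy as np


def polygenic_score(genotypes, effect_sizes):
    """Return one score per individual as a weighted sum of allele dosages.

    genotypes: array of shape (n_individuals, n_variants), dosages in {0, 1, 2}.
    effect_sizes: array of shape (n_variants,), per-variant GWAS effect estimates.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if genotypes.ndim != 2 or genotypes.shape[1] != effect_sizes.shape[0]:
        raise ValueError("genotypes must be (n_individuals, n_variants)")
    return genotypes @ effect_sizes


# Hypothetical toy data: 3 individuals, 3 variants (not real GWAS results).
toy_genotypes = [[0, 1, 2],
                 [2, 0, 1],
                 [1, 1, 0]]
toy_effects = [0.10, -0.05, 0.20]

toy_scores = polygenic_score(toy_genotypes, toy_effects)
print(toy_scores)
```

Because the score depends entirely on effect sizes estimated in a training sample, any mismatch between that sample and the individuals being scored carries over directly into the result.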
Objectives
Therefore, we propose a systematic evaluation of GWAS results to assess whether common phenomena of population genetics can explain their current state, which could help to further understand and interpret these associations. On the one hand, we will run extensive simulations of genome evolution through time to illustrate that, in a GWAS, the lead variant and its mapped gene can change across generations simply because of recombination and linkage disequilibrium. We have already implemented a preliminary forward-time population genetic simulation in SLiM (Haller et al. 2026), modelling a human population with mutation, recombination, genetic drift, and natural selection, and obtained highly encouraging preliminary results. On the other hand, we propose to study the genetic variants and their associations identified for a large number of complex traits from the perspective of genomic composition. Variants can be categorized based on their location in the human genome: exons, introns, splice sites, promoters, 5’ UTR or 3’ UTR regions, enhancers, silencers, and intergenic regions. Thus, for a given trait, this classification allows us to define a “genomic composition” of the identified associations. The objective of this internship will be to analyze and compare the genomic compositions of different complex traits and diseases using compositional data analysis tools, and to compare these results with a theoretically expected genomic composition.
By completing these objectives, the intern will contribute to the development of computational tools that support the validation and interpretation of genetic associations.
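As an illustration of the genomic composition idea described in the objectives, the sketch below turns per-category variant counts into a composition and compares two compositions with standard compositional data analysis tools (centered log-ratio transform and Aitchison distance). This is a minimal sketch under stated assumptions, not the project's actual pipeline: the category labels, the counts, the "expected" composition, and the pseudocount used to avoid zeros are all hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical category labels, following the annotation classes listed above.
CATEGORIES = ["exon", "intron", "splice_site", "promoter", "utr5",
              "utr3", "enhancer", "silencer", "intergenic"]


def closure(counts, pseudocount=0.5):
    """Convert counts to proportions summing to 1.

    The pseudocount is an assumed, simple way to handle zero counts,
    since log-ratio transforms are undefined for zeros.
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    return counts / counts.sum()


def clr(composition):
    """Centered log-ratio transform: log parts minus their mean log."""
    log_parts = np.log(np.asarray(composition, dtype=float))
    return log_parts - log_parts.mean()


def aitchison_distance(comp_a, comp_b):
    """Euclidean distance between two compositions in clr space."""
    return float(np.linalg.norm(clr(comp_a) - clr(comp_b)))


# Hypothetical counts of associated variants per category (not real data).
toy_counts_trait = [40, 350, 5, 30, 10, 25, 60, 2, 478]
toy_counts_expected = [15, 400, 2, 20, 5, 15, 40, 3, 500]

trait_comp = closure(toy_counts_trait)
expected_comp = closure(toy_counts_expected)

for name, share in zip(CATEGORIES, trait_comp):
    print(f"{name:12s} {share:.3f}")
print("Aitchison distance to expected:",
      aitchison_distance(trait_comp, expected_comp))
```

Working in log-ratio space rather than on raw proportions respects the constraint that the shares sum to one, so an increase in one category is not mistaken for an independent decrease in the others.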
The successful candidate
- will be a Master 2 student in data science, with a focus on statistics, artificial intelligence, or computational biology; candidates with a more theoretical background who show a strong interest in life science applications are also welcome;
- will be enthusiastic about transdisciplinary research and open science at the interface between data science and genetics;
- will show a clear interest in using applied science methodology to benefit biological understanding;
- will have good programming skills (R and/or Python);
- may have a background in biology or genetics;
- should be open-minded and willing to work in a team with other lab members.
Scientific environment
Starting date: February 2027
The 6-month Master’s internship will be supervised by Dr Marie Verbanck, Professor of statistical genetics at Institut Curie.
This internship could lead to a PhD thesis building on this project.
References related to the internship
Alsheikh, Ammar J., Sabrina Wollenhaupt, Emily A. King, et al. 2022. “The Landscape of GWAS Validation; Systematic Review Identifying 309 Validated Non-Coding Variants Across 130 Human Diseases.” BMC Medical Genomics 15 (1): 74. https://doi.org/10.1186/s12920-022-01216-w.
Cudic, Mihael, Justin D. Tubbs, Tian Ge, and Jordan W. Smoller. 2026. “Putting Polygenic Scores in Context: How Intersectional Factors Affect Relative and Absolute Genetic Risk.” The American Journal of Human Genetics 113 (5): 966–77. https://doi.org/10.1016/j.ajhg.2026.03.013.
Haller, Benjamin C., Peter L. Ralph, and Philipp W. Messer. 2026. “SLiM 5: Eco-evolutionary Simulations Across Multiple Chromosomes and Full Genomes.” Molecular Biology and Evolution 43 (1): msaf313. https://doi.org/10.1093/molbev/msaf313.
Jayasinghe, Dovini, Setegn Eshetie, Kerri Beckmann, Beben Benyamin, and S. Hong Lee. 2024. “Advancements and Limitations in Polygenic Risk Score Methods for Genomic Prediction: A Scoping Review.” Human Genetics 143 (12): 1401–31. https://doi.org/10.1007/s00439-024-02716-8.