Keywords
Knowledge engineering
hypergraph data model
real-world evidence
knowledge base
ontologies
epidemiology
LLM-based agents
automated curation
Description
Real-world evidence studies leverage routinely collected healthcare data and modern epidemiological designs to estimate associations and causal effects between exposures (e.g., drugs), outcomes (e.g., disease onset, progression, adverse events), and patient-level factors across a broad spectrum of conditions. These results are increasingly produced with causal inference workflows and are critical to accelerating evidence synthesis for applications such as drug repurposing and clinical decision support.
A large Real-World Evidence Knowledge Base (RWE-KB) has recently been developed to structure and connect this evidence at scale. It compiles findings from epidemiological studies (results, populations, exposures, comparators, outcomes, covariates, bias/limitations, metadata), links treatments to targets, and is enriched with ontological and mechanistic knowledge. The resource is naturally represented as a hypergraph, enabling n-ary relations, hierarchical node/edge types, contextualized assertions, explicit evidence levels, and end-to-end provenance. However, the current hypergraph is still sparse and heterogeneous. Scaling it to a level that supports downstream tasks, such as AI development and clinician-facing products, requires stronger validation, higher data quality, more complete provenance, and robust ingestion and curation workflows.
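To make the hypergraph representation concrete, here is a minimal Python sketch of how a single n-ary finding could be stored as one hyperedge with roles, an evidence level, and a provenance pointer. It is an illustration only, not the RWE-KB schema: the class names (`Hyperedge`, `EvidenceHypergraph`), the role names, the evidence-level label, and all node identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Hyperedge:
    """One n-ary assertion connecting any number of typed nodes."""
    edge_type: str                       # hypothetical label, e.g. "association"
    members: FrozenSet[Tuple[str, str]]  # (role, node_id) pairs
    evidence_level: str                  # hypothetical label, e.g. "cohort_study"
    provenance: str                      # pointer to the source of the assertion


class EvidenceHypergraph:
    """Toy in-memory hypergraph with a node-to-edge incidence index."""

    def __init__(self) -> None:
        self.node_types: Dict[str, str] = {}
        self.edges: List[Hyperedge] = []
        self.incidence: Dict[str, List[int]] = {}

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def add_edge(self, edge_type: str, roles: Dict[str, str],
                 evidence_level: str, provenance: str) -> Hyperedge:
        # Basic validation rule: every member must be a declared node.
        for node_id in roles.values():
            if node_id not in self.node_types:
                raise KeyError(f"unknown node: {node_id}")
        edge = Hyperedge(edge_type, frozenset(roles.items()),
                         evidence_level, provenance)
        self.edges.append(edge)
        index = len(self.edges) - 1
        for node_id in set(roles.values()):
            self.incidence.setdefault(node_id, []).append(index)
        return edge

    def edges_of(self, node_id: str) -> List[Hyperedge]:
        """Return every hyperedge in which the node participates."""
        return [self.edges[i] for i in self.incidence.get(node_id, [])]


# Placeholder identifiers, not real KB content.
kb = EvidenceHypergraph()
kb.add_node("drug:A", "exposure")
kb.add_node("drug:B", "comparator")
kb.add_node("outcome:X", "outcome")
kb.add_node("population:P", "population")
kb.add_edge(
    "association",
    {"exposure": "drug:A", "comparator": "drug:B",
     "outcome": "outcome:X", "population": "population:P"},
    evidence_level="cohort_study",
    provenance="study:1",
)
```

A single hyperedge keeps exposure, comparator, outcome, and population together, so the context of a finding is not lost as it would be if the finding were split into pairwise edges.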
The core objective of this internship is to grow the existing RWE-KB into a large-scale, high-trust evidence hypergraph, with explicit provenance, quality signals, and conflict-aware aggregation. The intern will drive the expansion of the hypergraph by integrating new epidemiological evidence end-to-end, from normalization to representation, while strengthening the metamodel and validation rules that keep the KB consistent. Building on the current tooling, they will harden ingestion and curation workflows to improve key performance indicators and optimize LLM-based curation agents that reconcile inconsistent sources, handle deduplication, and reduce manual burden while keeping an auditable review loop. The outcome is a substantially larger, cleaner, and more reliable knowledge base designed to power downstream AI pipelines and clinician-facing applications.
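As a rough illustration of deduplication and conflict-aware aggregation, the Python sketch below removes exact duplicate findings, groups the rest by exposure-outcome pair, and flags pairs whose sources report opposite effect directions. The `Finding` fields, the sign-based conflict rule, and the identifiers are assumptions chosen for brevity; they do not describe the project's actual curation logic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Finding:
    exposure: str
    outcome: str
    log_effect: float  # assumed log effect size: > 0 harmful, < 0 protective
    source: str


def aggregate(findings: Iterable[Finding]) -> Dict[Tuple[str, str], dict]:
    """Deduplicate findings, then summarize each exposure-outcome pair."""
    # dict.fromkeys drops exact duplicates while preserving input order.
    unique = list(dict.fromkeys(findings))

    grouped: Dict[Tuple[str, str], List[Finding]] = defaultdict(list)
    for finding in unique:
        grouped[(finding.exposure, finding.outcome)].append(finding)

    summary: Dict[Tuple[str, str], dict] = {}
    for key, group in grouped.items():
        # Effect direction per finding: +1, -1, or 0 for a null effect.
        directions = {(f.log_effect > 0) - (f.log_effect < 0) for f in group}
        directions.discard(0)
        summary[key] = {
            "n_findings": len(group),
            "sources": sorted({f.source for f in group}),
            # Keep conflicting evidence visible for review instead of averaging it away.
            "conflict": len(directions) > 1,
        }
    return summary
```

Keeping the list of sources next to the conflict flag means a reviewer can trace each flagged pair back to the findings that disagree, which supports the auditable review loop.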
The full offer description is available at https://clreda.github.io/assets/offers/RWE_hypergraph_internship_proposal.pdf
How to apply
Interested candidates should apply in either English or French by sending a detailed CV and a motivation letter to reda@bio.ens.psl.eu and lamiae.grimaldi@aphp.fr.