Metagenomic data anonymization based on realistic simulations

 Internship · M2 internship  · 6 months    Bac+5 / Master   Biofortis · Saint-Herblain (France)  Standard monthly internship stipend applicable in France

 Start date: 1 January 2022


Metagenomics Anonymization Deep-learning


Internship proposal (reference ST2022_datasc)
Company activity: CRO specialized in clinical trials and bioanalysis, with strong experience in the
microbiome field.
Team: Bioinformatics & Data Science

TOPIC: Metagenomic data anonymization based on realistic simulations

BIOFORTIS, a CRO specializing in clinical trials and bioanalysis, has been supporting innovation for its clients in nutrition, agri-food, biotech, pharma, academia and cosmetics for 20 years. With a staff of 90, and backed by Institut Mérieux, the team has more than 500 projects to its credit and 250 clinical trials managed full-service in France, Europe and internationally. As part of our metagenomic data analysis activities, we are looking for an intern for a period of 6 months.

Reusing metagenomic data acquired in clinical trials makes it possible to better understand the variability of key microbiome indicators, which in turn helps in designing new experiments. However, such reuse is currently not fully exploited because of legal concerns about data anonymization. A potential surrogate is simulated, covariate-dependent realistic data that statistically matches the real microbiome. First-generation simulators were typically based on parametric models and could not mimic the actual complexity of microbiome data, such as its correlation structure or sparsity. More recently, deep-learning-based data simulation using generative adversarial networks (GANs) has been proposed to better reproduce the salient features of real microbiome data. The main goal of this internship is to assess the feasibility and suitability of such tools for producing statistically equivalent synthetic datasets that make it possible to anonymize available data and to support the design of clinical trials that include microbiome-related endpoints.

Main tasks
● Adapt/extend available procedures for GAN-based simulation of microbiome data to anonymize data from typical scenarios in clinical research.
● Test the ability of the GAN-generated data to reproduce the results of real clinical trials.
● Check the anonymization capability while retaining trustworthy statistics.
● Derive procedures for sample size estimation based on the simulated data.
● Derive procedures for controlling false discovery rate based on knockoff filters.

Expected Background
● Master 2 student in bioinformatics or data science.
● Experience with TensorFlow / PyTorch, Git and SVN, and the Linux environment.
● Background in machine learning concepts.
● Fluency in English.

Period: 6 months, starting by February 2022.
Workload: 35 hours/week

Key references
● Ruichen Rong, Shuang Jiang, Lin Xu, Guanghua Xiao, Yang Xie, Dajiang J Liu, Qiwei Li, Xiaowei Zhan. MB-GAN: Microbiome Simulation via Generative Adversarial Network. GigaScience, Volume 10, Issue 2, February 2021, giab005.
● Patuzzi, I., Baruzzo, G., Losasso, C. et al. metaSPARSim: a 16S rRNA gene sequencing count data simulator. BMC Bioinformatics 20, 416 (2019).
● Clément Feutry, Pablo Piantanida, Yoshua Bengio, Pierre Duhamel. Learning Anonymized Representations with Adversarial Neural Networks. 2018 ⟨hal-01742447⟩


How to apply: Send your CV and motivation letter to: Diego Tomassi, Senior Data Scientist

Application deadline: 1 March 2022


Contact: Diego Tomassi

Offer published on 10 December 2021, displayed until 15 March 2022