Keywords: Big Data, Data Lake, Omics Data, Bioinformatics, Machine Learning, Python

Description

Offer: Master 2 internship - Data Engineering of Human Genomic Data Lake (CEA Grenoble)
Title: Data Lake for Human Genomic data
 
Background: Over the past 10 years, technological advances in DNA sequencers have enabled the generation of large amounts of genomic data, now stored in public data repositories. Research projects seeking to use this data to investigate human health questions must systematically go through a stage of collecting and integrating large-scale structured and unstructured data sets. The classic approach used by data scientists involves manipulating many flat files and transforming them into dataframes using command-line tools. This way of transforming and organizing data makes it difficult to: (i) reproduce transformations and ensure data versioning, (ii) configure, finely and quickly, secure access to the different data transformation stages according to their level of sensitivity, and (iii) work collaboratively on the data to deploy data analytics and ML approaches.
 
Objective: This internship will contribute to the development of a Data Lake for human genomic data in collaboration with researchers and developers from the European KATY project on personalized medicine (https://katy-project.eu/).
 
Workplan: Since the genomic data sets have already been identified and collected, the internship will begin with the definition of a Data Lake design adapted to the use cases that will be selected. The intern will then work on the data ingestion, cleanup, and preparation stages in order to generate different data marts suitable for downstream analysis in use cases involving data analytics and machine learning approaches.
 
Host laboratory: The intern will be hosted in the “Genetics and Chemogenomics” team of the Interdisciplinary Research Institute of Grenoble (IRIG) at CEA Grenoble. They will be supervised by Christophe Battail, an expert in computational analysis and modeling of genomic data, and will work in a multidisciplinary research environment composed of bioinformaticians and biologists. The intern will also interact closely with developers from the European KATY project.
 
Candidate profile:
Knowledge and skills in computer science: Big Data architecture (Data Lake), Unix command line, and Python programming.
Professional aptitude: curiosity and a desire to improve their scientific and technological skills, rigor and organization, and the ability to work in a team and interact with other students, engineers, and researchers.
 
Job contract:
6-month master's internship starting in February 2023.
Please send a CV, a cover letter and the name of a referee to christophe.battail@cea.fr.