23 Boulevard de France
Context: Predicting the 3D structure of RNA is computationally expensive when approached with thermodynamic simulations (molecular dynamics). The AROBAS team, whose research activities include RNA structure prediction, now proposes a deep learning approach to the RNA folding problem. However, no rigorous RNA dataset gathering sequence, evolutionary and structural information is ready yet for machine-learning applications. Indeed, biological sequences such as RNA are related by evolutionary processes, so they cannot be considered independent and identically distributed. Therefore, one cannot simply split a dataset at random into training, validation and test sets: closely related sequences would end up on both sides of the split, so a non-random splitting strategy is needed.
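To illustrate what a non-random split could look like, here is a minimal Python sketch. It assumes each sequence has already been assigned to a cluster of related sequences, and it assigns whole clusters to a single set so that related sequences never straddle two sets. The function name `split_by_cluster`, the identifiers and the 80/10/10 fractions are hypothetical choices made for this example, not part of the project.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/validation/test sets.

    cluster_of: dict mapping a sequence id to its cluster id
                (hypothetical input, e.g. the output of a clustering step).
    fractions:  approximate share of sequences wanted in each set.
    """
    # Group sequence ids by cluster.
    clusters = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(seq_id)

    # Shuffle the clusters (not the sequences) reproducibly.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    names = ("train", "validation", "test")
    targets = [f * len(cluster_of) for f in fractions]
    splits = {name: [] for name in names}

    # Fill each set in turn; move to the next set once its target is reached.
    current = 0
    for cluster_id in cluster_ids:
        while current < len(names) - 1 and len(splits[names[current]]) >= targets[current]:
            current += 1
        splits[names[current]].extend(clusters[cluster_id])
    return splits


# Toy example with made-up identifiers.
toy = {"seq1": "famA", "seq2": "famA", "seq3": "famB",
       "seq4": "famC", "seq5": "famC", "seq6": "famD"}
print(split_by_cluster(toy))
```

Because clusters are kept whole, the set sizes only approximate the requested fractions, especially when a few clusters are very large.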
Internship: We wish to design and release a reference dataset for RNA applications (as ImageNet is for image processing or, more recently, ProteinNet for structural bioinformatics). An algorithm has been proposed to cluster proteins by similarity, based not only on their sequences but on multiple sequence alignments (MSAs), which increases sensitivity.
The intern will develop a pipeline of bioinformatics tools to apply the same clustering algorithm to all known reference non-coding RNA sequences (several million), on a high-performance computing infrastructure made available for this purpose. The intern will then devise a strategy to standardize the 3D data, and will produce as output a complete dataset combining sequence and evolutionary information (the features) with 3D information (the labels).
Required knowledge: Linux environment, a scripting language (Python 3 or bash), basics of machine learning
Optional: clustering and sequence alignment algorithms, PDB and Rfam databases.
Useful references:
ProteinNet: a standardized data set for machine learning of protein structure, AlQuraishi, BMC Bioinformatics, 2018
Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families, Kalvari et al., Nucleic Acids Research, 2017