Standardisation of RNA structural data and MSA-based clustering of sequences for deep-learning prediction of RNA structures

Position type
Minimum level of education
Duration of the position
Renewable contract
Non-renewable contract
Renewal details
Negotiations on the dates are welcome.
Start date
Expiry date of the posting

23 Boulevard de France
91000 Evry

Fariza Tahi
Contact email(s)

Context: Predicting RNA 3D structure demands substantial computational resources when approached with thermodynamic simulations (molecular dynamics). The AROBAS team, whose research activities include RNA structure prediction, now proposes a deep learning approach to the RNA folding problem. However, no rigorous RNA dataset combining sequence, evolutionary and structural information is yet ready for machine-learning applications. Indeed, biological sequences such as RNA are related by evolutionary processes, so they cannot be considered independent and identically distributed. Therefore, one cannot simply split a dataset at random into training, validation and test sets; the data must be split in a non-random way that accounts for these relationships.
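As an illustration of the non-random split described above, here is a minimal sketch in Python. It assumes that each sequence has already been assigned to a cluster of related sequences, and it assigns whole clusters to the training, validation or test set so that related sequences never end up on both sides of a split. The function name, the greedy balancing rule, the identifiers and the cluster labels are all hypothetical and are not taken from the posting or from [1].

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/validation/test so no cluster is shared.

    cluster_of maps a sequence identifier to its cluster label (assumed to be
    precomputed by some clustering step, which is not shown here).
    """
    # Group sequence identifiers by their cluster label.
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    # Shuffle the clusters (not the individual sequences) reproducibly.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    total = len(cluster_of)
    splits = [[], [], []]
    for cluster_id in cluster_ids:
        # Place the cluster in the split that is furthest below its target size.
        deficits = [frac * total - len(part) for frac, part in zip(fractions, splits)]
        target = deficits.index(max(deficits))
        splits[target].extend(members[cluster_id])
    return tuple(splits)


# Toy example with hypothetical identifiers and cluster labels.
toy_clusters = {
    "rna_01": "c1", "rna_02": "c1", "rna_03": "c1",
    "rna_04": "c2", "rna_05": "c2",
    "rna_06": "c3", "rna_07": "c4", "rna_08": "c5",
}
train, valid, test = split_by_cluster(toy_clusters, fractions=(0.6, 0.2, 0.2))
```

Because clusters are indivisible, the resulting set sizes only approximate the requested fractions; the guarantee is that no cluster contributes to more than one set.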

Internship: We wish to design and release a reference dataset for RNA applications (as ImageNet is for image processing or, more recently, ProteinNet for structural bioinformatics). An algorithm has been proposed [1] to cluster proteins by similarity, based not only on their sequences but also on multiple sequence alignments (MSA), which increases sensitivity.
The intern will develop a pipeline of bioinformatics tools to apply the same clustering algorithm to all known reference non-coding RNA sequences (several million), on a high-performance computing infrastructure made available for this purpose. The intern will then devise a strategy to standardize the 3D data, and will produce a complete dataset combining sequence and evolutionary information (the features) with 3D information (the labels).
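To make the idea of features and labels concrete, one entry of such a dataset could look like the sketch below. This is only an assumption about a possible layout: the class name, the field names, the restriction to the alphabet ACGUN and the choice of one representative coordinate per nucleotide are all hypothetical, and the actual standardization strategy is left to the intern.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One (x, y, z) position per nucleotide; this representation is an assumption.
Coordinate = Tuple[float, float, float]


@dataclass
class RnaRecord:
    """Hypothetical dataset entry: sequence and MSA as features, 3D data as labels."""
    identifier: str
    sequence: str                            # feature: RNA sequence
    msa: List[str]                           # feature: aligned rows, gaps as "-"
    coordinates: List[Optional[Coordinate]]  # label: None where unresolved

    def validate(self) -> None:
        """Raise ValueError if the fields are mutually inconsistent."""
        unexpected = set(self.sequence) - set("ACGUN")
        if unexpected:
            raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
        if len(self.coordinates) != len(self.sequence):
            raise ValueError("expected one coordinate entry per nucleotide")
        if any(len(row) != len(self.msa[0]) for row in self.msa):
            raise ValueError("MSA rows must all have the same length")


# Toy record with made-up values, for illustration only.
record = RnaRecord(
    identifier="toy_rna",
    sequence="GGCA",
    msa=["GGCA", "GG-A"],
    coordinates=[(0.0, 0.0, 0.0), None, (1.0, 2.0, 3.0), None],
)
record.validate()
```

Keeping unresolved positions as explicit None entries, rather than dropping them, keeps the labels aligned with the sequence positions.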

Required knowledge: Linux environment, a scripting language (Python 3 or bash), basics of machine learning
Optional: clustering and sequence alignment algorithms, PDB and Rfam databases.

Useful references:
[1] ProteinNet: a standardized data set for machine learning of protein structure, Mohammed AlQuraishi, BMC Bioinformatics, 2019
[2] Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families, Kalvari et al., Nucleic Acids Research, 2017

Team with SFBI institutional membership
Non-member team