Keywords
machine learning, genomics, secretion systems, deconvolution approaches
Description
Subject
Secretion systems are crucial for bacteria to interact with their environment: they are used to acquire nutrients, set up biotic defenses, and deliver virulence factors. Twelve bacterial secretion systems are currently known, varying in size (from 1 to 15 proteins involved). Some organisms carry several copies of the same secretion system, and a single protein or group of proteins is sometimes involved in several secretion systems. Detecting which protein belongs to which secretion system is therefore a difficult task.
Assuming that we know all homologs (proteins belonging to a family with a common ancestor) involved in all secretion systems, can we determine which homolog is involved in which secretion system? We propose to formulate this task as a deconvolution problem (e.g., non-negative matrix factorization, NMF): given a matrix X where each row corresponds to an organism, each column to a homolog, and each entry to the number of times that homolog is found in that organism, factorize X into two matrices: V, which encodes which homologs are found in which type of secretion system, and U, which encodes the number of secretion systems of each type in each organism.
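As an illustration of the factorization described above, here is a minimal sketch using NumPy and the classical multiplicative updates for the squared Frobenius loss. It is not the method intended for the internship; the function name `nmf`, the toy matrices `U_true` and `V_true`, and the choice of two secretion systems are all assumptions made for the example.

```python
import numpy as np

def nmf(X, n_systems, n_iter=1000, eps=1e-9, seed=0):
    """Approximate X (organisms x homologs) as U @ V with U, V >= 0.

    U: organisms x systems, V: systems x homologs.
    Uses multiplicative updates for the squared Frobenius loss.
    """
    rng = np.random.default_rng(seed)
    n_org, n_hom = X.shape
    # Strictly positive random initialization (hypothetical choice).
    U = rng.random((n_org, n_systems)) + eps
    V = rng.random((n_systems, n_hom)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V

# Toy data (made up): 2 secretion systems over 4 homologs.
# Homolog 1 (second column) is shared by both systems.
V_true = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0, 1.0]])
# Number of copies of each system in 5 organisms.
U_true = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 1.0],
                   [1.0, 1.0],
                   [0.0, 2.0]])
X = U_true @ V_true  # homolog counts per organism

U, V = nmf(X, n_systems=2)
rel_error = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
print("relative reconstruction error:", rel_error)
```

Note that the recovered U and V are only defined up to a permutation and rescaling of the systems, and their entries are real-valued rather than integer counts, so comparing them with ground truth requires an alignment step.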
This internship aims to address several challenges of applying NMF approaches to this problem: tuning the hyper-parameters of the model, including prior knowledge through penalization approaches, …
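To make these two challenges concrete, one possible penalized objective is sketched below. It is only an assumption about how prior knowledge could enter the model: the sparsity penalty on U, the binary mask M (marking homolog/system pairs that prior annotation rules out), and the weights \lambda_U and \lambda_V are all hypothetical.

```latex
\min_{U \ge 0,\; V \ge 0} \;
  \lVert X - UV \rVert_F^2
  + \lambda_U \lVert U \rVert_1
  + \lambda_V \lVert M \odot V \rVert_F^2
```

In such a formulation, \lambda_U, \lambda_V and the number of columns of U (the number of secretion system types) would be the hyper-parameters to tune.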
Data
We have access to a large database of bacterial and archaeal genomes, in which each protein has been annotated as potentially belonging to specific secretion systems (through sequence similarity). We also have ground-truth information on the presence or absence of secretion systems, which makes this an ideal dataset for developing and testing deconvolution approaches on real-world problems of major importance.
Applications
We are looking for a first- or second-year master's student (or equivalent) for a 4- to 6-month internship.