Evaluation and development of methodologies for unsupervised clustering of metagenomics data
Context
Metagenomics is a novel approach to DNA sequencing in which the “genome” of a complete microbial community (e.g. the gut flora, ocean water samples) can be determined. Because of rapid advances in DNA sequencing technologies, this approach is yielding millions of genes, which need to be analysed to understand how these ecosystems function or how disturbances in our flora are linked to disease. One important step in this process is the functional annotation of genes and the identification of completely novel gene families and proteins of biological, biotechnological or medical interest. To do this, all-against-all sequence comparisons are performed, yielding a long list of pairwise similarity scores which can be seen as a sparse graph of interrelationships. Using clustering techniques such as the Markov Cluster algorithm (MCL), this graph can be subdivided into clusters of genes (gene families). However, due to the size of current datasets, traditional clustering approaches are becoming limiting in terms of memory usage and computational efficiency, or no longer converge at all.
In this project, the student will investigate the feasibility of clustering large datasets (more than 15 million nodes) using various existing approaches and/or modifications thereof. The goal is to arrive at a workflow that is memory-efficient and parallelizable, and that at the same time produces biologically relevant results.
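To make the idea of clustering a sparse similarity graph concrete, the sketch below groups genes by thresholded single-linkage, i.e. connected components over edges whose score reaches a cutoff, using a union-find structure. This is a deliberately simple stand-in and not MCL or the workflow the project aims for; the function name `cluster_by_threshold`, the `min_score` parameter and the gene identifiers are hypothetical and chosen only for illustration. It does show one memory-friendly property: edges are consumed as a stream and only one entry per node is kept in memory.

```python
from typing import Dict, Iterable, List, Tuple

# One pairwise comparison: (gene_a, gene_b, similarity_score).
Edge = Tuple[str, str, float]


def cluster_by_threshold(edges: Iterable[Edge], min_score: float) -> List[List[str]]:
    """Return connected components of the graph restricted to edges
    with score >= min_score (thresholded single-linkage clustering)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Unseen nodes start as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b, score in edges:
        # Both genes are registered even if the edge is too weak to merge them.
        root_a, root_b = find(a), find(b)
        if score >= min_score and root_a != root_b:
            parent[root_a] = root_b

    clusters: Dict[str, List[str]] = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical similarity scores for five genes.
example_edges = [
    ("geneA", "geneB", 0.9),
    ("geneB", "geneC", 0.8),
    ("geneC", "geneD", 0.2),  # below the cutoff, so it does not link the groups
    ("geneD", "geneE", 0.7),
]
families = cluster_by_threshold(example_edges, min_score=0.5)
print(families)
```

With a cutoff of 0.5 the weak geneC/geneD edge is ignored, so the five genes fall into two candidate families. Single-linkage tends to chain unrelated families together through a few spurious hits, which is one reason flow-based methods such as MCL are usually preferred for gene family detection.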
Contact
Interested candidates can apply to Prof. Jeroen Raes (jeroen.raes@vib-vub.be), VUB/VIB, Campus Etterbeek, http://systemsbiology.vub.ac.be
