Research projects

Personalised Risk Stratification for Prevention and Early Detection of Breast Cancer (collaboration)

Each year, over 22,000 Canadian women are diagnosed with breast cancer, a disease that will claim the lives of 5,000 of them. The routine screening program currently in place is more accessible to women over the age of 50. However, one in five women diagnosed with breast cancer are under the age of 50.
The project aims to develop a decision-making support tool that will help extend the benefits of the current screening program to those women most at risk for breast cancer.

Through involvement with the largest international consortium on the study of breast cancer, the project will help broaden existing knowledge in order to provide better risk stratification tools, fine-tune intervention strategies and offer the population more effective tools.

Development of proteomics tools

Our team aims to bring the power and flexibility of the R/Bioconductor statistical platform to mass spectrometry-based proteomics. The Bioconductor platform is a repository of software, data and annotation packages based on the R statistical language. This platform allows users to quickly build new analytical pipelines by seamlessly connecting various tools for data manipulation, statistical analysis, annotation or visualisation. The Bioconductor platform also facilitates the deployment of a pipeline on HPC servers or on cloud computing services.

Two packages on this platform are currently being developed: rTANDEM and shinyTANDEM. rTANDEM is the first protein identification algorithm implemented in R. It includes the tandem algorithm as well as many associated scoring functions such as the k-score, hrk-score and PTMTreeSearch-score. The package also provides converter functions allowing quick conversions between R objects and XML files.


In the past decade, the main strategy for genome-wide mapping of chromatin modifications, histone marks and interactions between DNA and proteins has been ChIP followed by microarray analysis (ChIP-chip). Recent improvements in the efficiency, quality, and cost of genome-wide sequencing prompted biologists to abandon microarrays in favor of next-generation sequencing, a method referred to as ChIP-Seq. Functional annotation of the noncoding sequences, which account for more than 95% of the genome, remains difficult, however, because of the lack of statistical and computational biology methods and tools available to agnostically interrogate epigenomic changes in humans.

The main goal of our research program is to build new computational tools to comprehensively characterize and functionally annotate the human epigenome. This research program builds on the power of next-generation sequencing (NGS) coupled with chromatin immunoprecipitation (ChIP), an approach called ChIP-Seq, to detect epigenetic variations at an unprecedented level of resolution.

Statistical Computing

Data resulting from novel high-throughput technologies have led to new statistical problems and challenges. As a consequence, it is essential that analytical tools and statistical methods evolve in parallel with the assay technologies and datasets. Hence, the development of efficient statistical algorithms is a priority in our projects.

The project (link) is an online biohub portal that combines Elasticsearch, a fast search engine designed to manage very large amounts of data, with Siren, a web visualization plugin that creates relational links between biological databases. This solution enables biologists to extract meaningful information from available biological research data repositories. It also removes boundaries by solving compatibility issues between resources (i.e. different data types, separation into specialized repositories) and performs complex searches on many resources simultaneously.
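
As an illustration of searching many resources at once, the sketch below builds a single Elasticsearch search request that spans several indices. It only constructs the request path and JSON body and does not contact a server. The index names (genes, proteins, pathways), the field names and the helper build_search_request are hypothetical and are not taken from the portal itself.

```python
import json


def build_search_request(term, indices, fields):
    """Return (path, body) for one search spanning several indices.

    Hypothetical helper: the index and field names passed in are
    illustrative assumptions, not the portal's real schema.
    """
    # Elasticsearch accepts a comma-separated list of index names in the path.
    path = "/" + ",".join(indices) + "/_search"
    # A multi_match query looks for the same term in several fields.
    body = {
        "query": {
            "multi_match": {
                "query": term,
                "fields": fields,
            }
        },
        "size": 10,  # number of hits to return
    }
    return path, body


path, body = build_search_request(
    "BRCA1",
    ["genes", "proteins", "pathways"],  # assumed index names
    ["name", "description"],            # assumed field names
)
print(path)
print(json.dumps(body, indent=2))
```

The body would be sent as JSON to the search endpoint of a running cluster; how the portal actually structures its indices and queries may differ.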

Machine learning

We use machine learning approaches in different projects to analyze omics data. We also develop machine learning tools to help identify biomarker signatures from disease-derived omics datasets. Omics datasets are generally highly unbalanced: features largely outnumber samples, and patients are unequally distributed among measured outcomes. The data are also often heterogeneous (e.g. cancer data), of diverse types (e.g. categorical, numerical), and sparse. Thus, specific machine learning strategies have to be developed to adapt to the special characteristics of omics data.
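
One common way to handle patients being unequally distributed among outcomes is to weight each class inversely to its frequency during training. The sketch below is a minimal illustration of that general heuristic, not the strategy used in our tools; the function name balanced_class_weights and the example labels are assumptions made for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare outcomes receive larger weights, so that errors on them
    count more when a model is trained with these weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical cohort: 90 controls and 10 cases.
labels = ["control"] * 90 + ["case"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight to the training loss, whatever its size. This is the same formula scikit-learn applies when class_weight="balanced" is requested.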


The Bio2RDF project (software link) aspires to transform silos of life science data into a globally distributed network of linked data for biological knowledge discovery. Bio2RDF creates and provides machine-understandable descriptions of biological entities using the RDF/RDFS/OWL Semantic Web languages. Using both syntactic and semantic data integration techniques, Bio2RDF seamlessly integrates diverse biological data and enables powerful new SPARQL-based services across its globally distributed knowledge bases.
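
To give a feel for what linked data makes possible, the toy sketch below stores facts as subject-predicate-object triples, the data model behind RDF, and follows links from a gene to a pathway. It is a plain Python illustration and not Bio2RDF code or a SPARQL engine; every identifier in it (the ex: names, match, pathways_for_gene) is hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Hypothetical facts that could come from two separate databases.
triples = [
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
    ("ex:proteinA", "ex:participatesIn", "ex:pathwayX"),
    ("ex:geneB", "ex:encodes", "ex:proteinB"),
]


def pathways_for_gene(triples, gene):
    """Follow gene -> protein -> pathway links across the triples."""
    pathways = []
    for _, _, protein in match(triples, subject=gene, predicate="ex:encodes"):
        for _, _, pathway in match(
            triples, subject=protein, predicate="ex:participatesIn"
        ):
            pathways.append(pathway)
    return pathways


print(pathways_for_gene(triples, "ex:geneA"))
```

Because both sources use the same identifier for the protein, the two facts join without any format conversion. A SPARQL query expresses the same kind of pattern matching declaratively over real RDF stores.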