Research projects

Machine learning for biomarker discovery

We use machine learning approaches in several projects to analyze omics data. We also develop machine learning tools to help identify biomarker signatures from disease-derived omics datasets. Omics datasets are generally highly unbalanced: features largely outnumber samples, and patients are unequally distributed among measured outcomes. The data are also often heterogeneous (e.g. cancer data), of diverse types (e.g. categorical, numerical), and sparse. Specific machine learning strategies therefore have to be developed to adapt to the special characteristics of omics data.
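As an illustration of the unequal distribution of patients among outcomes, one common generic strategy is to weight each class by the inverse of its frequency, so that rare outcomes are not ignored during training. The sketch below is only an example of that idea, not a description of our tools; the function name and the outcome labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class so that rare classes count more.

    Uses weight = n_samples / (n_classes * class_count), a common
    inverse-frequency heuristic (an assumption, not our method).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical cohort: 8 controls and 2 cases
labels = ["control"] * 8 + ["case"] * 2
weights = inverse_frequency_weights(labels)
```

With these weights, each class contributes the same total weight to a weighted loss, whatever its size.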


Development of proteomics tools

Our team aims to bring the power and flexibility of the R/Bioconductor statistical platform to mass spectrometry-based proteomics. The Bioconductor platform is a repository of software, data and annotation packages based on the R statistical language. This platform allows users to quickly build new analytical pipelines by seamlessly connecting various tools for data manipulation, statistical analysis, annotation or visualisation. The Bioconductor platform also facilitates the deployment of a pipeline on HPC servers or on cloud computing services.

Two packages are currently being developed on this platform: rTANDEM and shinyTANDEM. rTANDEM is the first protein identification algorithm implemented in R. It includes the tandem algorithm as well as many associated scoring functions such as the k-score, hrk-score and PTMTreeSearch-score. The package also provides converter functions allowing quick conversions between R objects and XML files.

Development of metagenomic pipelines

Metagenomic analysis aims to understand the microbial ecology of various environments, from the human microbiome (stool, skin, mouth, saliva) to animal microbiomes (poultry, cow, mouse), by extracting and sequencing the DNA from the environment studied.
The DNA can be sequenced using various methods: amplicon-based methods, which amplify target sequences such as 16S (bacteria) or ITS (fungi), or direct sequencing (whole-genome shotgun).
The sequencing results can be analyzed using different methodologies:
1/ Taxonomic annotation, in order to obtain taxonomic matrices from species to kingdom levels
2/ Functional annotation, to obtain the gene content and the interpretation of the prevalence of biochemical pathways in samples
3/ De novo genomic reconstruction, to reconstruct new genomes from the samples and understand their gene content and behavior in the environment studied
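
To make the idea of taxonomic matrices "from species to kingdom levels" concrete, the sketch below rolls species-level read counts up to any higher rank by summing the taxa that share the same name at that rank. It is a minimal illustration under stated assumptions: the function name, the sample identifiers and the read counts are hypothetical, and a real pipeline would take lineages from a reference taxonomy.

```python
from collections import defaultdict

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def aggregate_by_rank(counts, lineages, rank):
    """Sum per-sample counts of all taxa sharing the same name at `rank`."""
    idx = RANKS.index(rank)
    table = defaultdict(lambda: defaultdict(int))
    for taxon, per_sample in counts.items():
        name = lineages[taxon][idx]
        for sample, n in per_sample.items():
            table[name][sample] += n
    # Convert nested defaultdicts to plain dicts for easier inspection
    return {name: dict(samples) for name, samples in table.items()}


# Toy lineages, ordered as in RANKS
lineages = {
    "Escherichia coli": ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                         "Enterobacterales", "Enterobacteriaceae",
                         "Escherichia", "Escherichia coli"],
    "Salmonella enterica": ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                            "Enterobacterales", "Enterobacteriaceae",
                            "Salmonella", "Salmonella enterica"],
    "Lactobacillus acidophilus": ["Bacteria", "Firmicutes", "Bacilli",
                                  "Lactobacillales", "Lactobacillaceae",
                                  "Lactobacillus", "Lactobacillus acidophilus"],
}

# Hypothetical species-level read counts for two samples
counts = {
    "Escherichia coli": {"S1": 10, "S2": 0},
    "Salmonella enterica": {"S1": 5, "S2": 3},
    "Lactobacillus acidophilus": {"S1": 2, "S2": 7},
}

family_table = aggregate_by_rank(counts, lineages, "family")
kingdom_table = aggregate_by_rank(counts, lineages, "kingdom")
```

Calling the same function once per rank yields one matrix per taxonomic level.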

The resulting matrices can then be analyzed using differential analysis (as in the DESeq2 or limma packages) to determine whether species, genes or functions are distributed differently between the groups compared. A biomarker discovery analysis can also be carried out using machine learning models to define a set of genes, species or functions specific to the environment.

Finally, we aim to advance the area of metagenomic analysis by developing new analysis strategies based on direct k-mer analysis of samples.
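
For readers unfamiliar with k-mers: a k-mer profile simply counts every substring of length k in a sequence, which lets samples be compared without aligning reads to a reference. The sketch below shows only this basic counting step; it is an assumption about what a direct k-mer analysis starts from, not the strategy under development, and the function name and example sequence are hypothetical.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def count_kmers(sequence, k):
    """Count overlapping k-mers, skipping any that contain ambiguous bases."""
    seq = sequence.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= DNA_ALPHABET)


# Toy read; real analyses would stream over millions of reads
profile = count_kmers("ACGTACGT", 3)
```

Here the read yields six overlapping 3-mers, with "ACG" and "CGT" each seen twice.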

Predicting prostate cancer trajectory

Predicting the clinical trajectory of individual patients with cancer is complex, as multiple biological and physical parameters need to be integrated to make an accurate medical assessment. The decision process is critical for selecting when and how to act and for ensuring maximal recovery. Currently, decision making relies on clinicians manually integrating various sources of data to make the best judgment call. To overcome this limitation, artificial intelligence tools provide the ability to quickly integrate data from multiple patients and provide predictions to support physicians in their decisions.

With this project, we are developing software able to extract thousands of data points collected during the clinical trajectory of each patient. We will apply this software to a problem in prostate cancer management: determining which tumors will progress toward an aggressive stage. This is important to tailor interventions to patients and avoid overtreatment. Taken together, our work will represent a first step toward providing tools to support clinicians in their decision making for cancer management.

Identifying genetic markers in neurodevelopmental disabilities

Neurodevelopmental disabilities (NDD) affect 13% of the Canadian population. The morbidity of NDD leads to significant costs for families and for society as a whole. An explosion in our understanding of the genetic basis of such developmental differences has paved the way for several preclinical studies and, more recently, clinical trials. The complexity arising from genetic heterogeneity and clinical diversity has proven challenging for traditional approaches to treatment. The general objective of this project is to use machine learning (ML) approaches to combine clinical diagnosis and personalized medicine, in order to better treat patients with NDD and improve their health outcomes. We will demonstrate a path for using modern ML approaches to analyze modifier genes and identify potential targets that can be used to improve outcomes in neurodevelopmental disabilities.

Characterization of immune infiltration

Immune checkpoint blockades (ICBs), molecules that restore the immune system’s ability to recognize and eliminate tumor cells, have revolutionized the treatment of some adult cancers. To be effective, ICBs need the tumor to be already infiltrated with immune cells, especially activated “inflamed” T-cells. Only a small percentage of pediatric tumors are sensitive to ICBs. We believe that ICB sensitivity is highly dependent on tumor immune composition and can be anticipated. For this study, large-scale genomics datasets from hundreds of childhood cancers (solid tumors and leukemias) will be analysed by high-throughput bioinformatics and machine learning approaches. Gene expression and mutational landscape analysis will be used to identify subgroups of tumors that harbor a rich immune infiltrate, to explore the B- and T-cell repertoire and to investigate the genes and pathways that drive tumor immune recruitment or desertion.

Prostate cancer and biological networks

Prostate cancer is the second most common form of cancer in men, and its incidence is growing worldwide. The goal of this project is to find new therapeutic molecules for its treatment using in silico approaches for drug repurposing and drug discovery. New deep learning and network-based methods will be explored for multi-omics datasets, in order to understand and map tissue-specific interactions between omics layers, predict the effect of drugs on tumor cells and propose new solutions to help fight the disease.

Predicting drug toxicity using multi-layer biological networks

Despite the importance of knowing a drug’s mechanism of action (MOA) for its success in clinical trials and for understanding its potential side effects, it is not a requirement for Food and Drug Administration (FDA) approval. As a result, many drugs on the market are administered without knowing their precise mechanism of action. The current challenge is to predict whether a drug presents a toxicity risk and how it interacts with its environment, and thereby to better characterize its mechanisms through multi-layer omics network analysis approaches.


In the past decade, the main strategy for genome-wide mapping of chromatin modifications, histone marks and interactions between DNA and proteins has been ChIP followed by microarray analysis (ChIP-chip). Recent improvements in the efficiency, quality, and cost of genome-wide sequencing prompted biologists to abandon microarrays in favor of next-generation sequencing, a method referred to as ChIP-Seq. Functional annotation of the noncoding sequences, which account for more than 95% of the genome, remains difficult, however, owing to the lack of statistical and computational biology methods and tools available to agnostically interrogate epigenomic changes in humans.

The main goal of our research program is to build new computational tools to comprehensively characterize and functionally annotate the human epigenome. This research program builds on the power of next-generation sequencing (NGS) coupled with chromatin immunoprecipitation (ChIP), an approach called ChIP-Seq, to detect epigenetic variations at an unprecedented level of resolution.


Personalised Risk Stratification for Prevention and Early Detection of Breast Cancer

Each year, over 22,000 Canadian women are diagnosed with breast cancer, a disease that will claim the lives of 5,000 of them. The routine screening program currently in place is more accessible to women over the age of 50. However, one in five women diagnosed with breast cancer are under the age of 50.
The project aims to develop a decision-making support tool that will help extend the benefits of the current screening program to those women most at risk for breast cancer.

Through involvement with the largest international consortium on the study of breast cancer, the project will help broaden existing knowledge in order to provide better risk stratification tools, fine-tune intervention strategies and offer the population more effective tools.


The Bio2RDF project aims to transform silos of life science data into a globally distributed network of linked data for biological knowledge discovery. Bio2RDF creates and provides machine-understandable descriptions of biological entities using the RDF/RDFS/OWL Semantic Web languages. Using both syntactic and semantic data integration techniques, Bio2RDF seamlessly integrates diverse biological data and enables powerful new SPARQL-based services across its globally distributed knowledge bases.
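
To give a feel for how linked data supports this kind of integration, the sketch below stores a few subject-predicate-object triples and answers a question by chaining two triple patterns, which is the basic operation behind a SPARQL query. It is a toy illustration only: it does not use Bio2RDF data or any real SPARQL engine, and every "ex:" identifier and function name is hypothetical.

```python
# Toy triples (subject, predicate, object). All "ex:" identifiers are
# hypothetical and do not come from Bio2RDF.
TRIPLES = [
    ("ex:gene1", "rdf:type", "ex:Gene"),
    ("ex:gene2", "rdf:type", "ex:Gene"),
    ("ex:gene1", "ex:encodes", "ex:protein1"),
    ("ex:protein1", "ex:participatesIn", "ex:pathway1"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def pathways_for_gene(triples, gene):
    """Follow gene -> protein -> pathway links, i.e. join two patterns."""
    pathways = []
    for _, _, protein in match(triples, s=gene, p="ex:encodes"):
        for _, _, pathway in match(triples, s=protein, p="ex:participatesIn"):
            pathways.append(pathway)
    return pathways


genes = [s for s, _, _ in match(TRIPLES, p="rdf:type", o="ex:Gene")]
gene1_pathways = pathways_for_gene(TRIPLES, "ex:gene1")
```

Because the gene and the pathway could come from different source databases, the shared identifiers are what make the join possible.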

The project is an online biohub portal that combines Elasticsearch, a fast search engine designed to manage very large amounts of data, and Siren, a web visualization plugin that creates relational links between biological databases. This solution enables biologists to extract meaningful information from available biological research data repositories. It also removes boundaries by solving compatibility issues between resources (i.e. different data types, separation into specialized repositories) and performs complex searches on many resources simultaneously.

Statistics in bioinformatics

Data resulting from novel high-throughput technologies have led to new statistical problems and challenges. As a consequence, it is essential that analytical tools and statistical methods evolve in parallel with the assay technologies and datasets. The development of efficient statistical algorithms is therefore a priority in our projects.