Machine learning for biomarker discovery
We use machine learning approaches in several projects to analyze omics data, and we develop machine learning tools to help identify biomarker signatures from disease-derived omics datasets. Omics datasets are generally highly unbalanced: features largely outnumber samples, and patients are unequally distributed among measured outcomes. The data are also often heterogeneous (e.g. cancer data), of diverse types (e.g. categorical, numerical), and sparse. Specific machine learning strategies therefore have to be developed to accommodate these characteristics of omics data.
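As an illustration of one such strategy, the sketch below handles unequal outcome groups in two common ways: it weights each class inversely to its frequency, and it assigns samples to cross-validation folds so that every fold keeps the rare class. This is a minimal example under our own assumptions, not the group's actual tooling; the function names and the "case"/"control" labels are hypothetical.

```python
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare outcomes receive weights above 1, frequent outcomes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds, spreading each class round-robin.

    Returns a list of n_folds lists of sample indices.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical cohort: 2 cases and 8 controls.
labels = ["case"] * 2 + ["control"] * 8
weights = balanced_class_weights(labels)
folds = stratified_folds(labels, n_folds=2)
```

The weights can be passed to any learner that accepts per-class or per-sample weights, and the stratified folds keep performance estimates from being computed on folds that contain no patients from the rare outcome.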
Development of proteomics tools
Our team aims to bring the power and flexibility of the R/Bioconductor statistical platform to mass spectrometry-based proteomics. Bioconductor is a repository of software, data and annotation packages based on the R statistical language. The platform makes it possible to quickly build new analytical pipelines by seamlessly connecting various tools for data manipulation, statistical analysis, annotation or visualisation. Bioconductor also facilitates the deployment of a pipeline on HPC servers or on cloud computing services.
We are currently developing two packages on this platform: rTANDEM and shinyTANDEM. rTANDEM is the first protein identification algorithm implemented in R. It includes the tandem algorithm as well as many associated scoring functions such as the k-score, hrk-score and PTMTreeSearch-score. The package also provides converter functions for quick conversion between R objects and XML files.
Development of Metagenomic pipelines
DNA can be sequenced using various methods: amplicon-based methods, which amplify target sequences such as the 16S (bacteria) or ITS (fungi) regions, or direct sequencing of the whole sample (whole-genome shotgun).
The sequencing results can be analyzed using different methodologies:
1/ Taxonomic annotation, in order to obtain taxonomic matrices from species to kingdom levels
2/ Functional annotation, to obtain the gene content and the interpretation of the prevalence of biochemical pathways in samples
3/ De novo genomic reconstruction, to reconstruct new genomes from the samples and understand their gene content and behavior in the studied environment
The resulting matrices can then be analyzed by differential analysis (as in the DESeq2 or limma packages) to determine whether species, genes or functions are distributed differently between the compared groups. Biomarker discovery can also be carried out with machine learning models to define a set of genes, species or functions specific to the environment.
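To make the idea of comparing groups concrete, the sketch below computes a log2 fold change of mean relative abundance for each feature between two groups of samples. It is a deliberately simplified illustration: it performs no normalization beyond relative abundance and no statistical testing, so it is not a substitute for DESeq2 or limma. The function names and the pseudocount value are assumptions made for this example.

```python
import math


def relative_abundance(counts):
    """Convert one sample's raw counts to proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


def mean_profile(group):
    """Average the relative abundance of each feature over a group."""
    profiles = [relative_abundance(sample) for sample in group]
    n_features = len(profiles[0])
    return [sum(profile[j] for profile in profiles) / len(profiles)
            for j in range(n_features)]


def log2_fold_changes(group_a, group_b, pseudocount=1e-6):
    """log2(mean abundance in B / mean abundance in A) per feature.

    Each group is a list of samples; each sample is a list of counts,
    one per feature (species, gene or function). The pseudocount
    avoids division by zero for features absent from one group.
    """
    mean_a = mean_profile(group_a)
    mean_b = mean_profile(group_b)
    return [math.log2((b + pseudocount) / (a + pseudocount))
            for a, b in zip(mean_a, mean_b)]


# Hypothetical two-feature example: one sample per group.
example_lfc = log2_fold_changes([[50, 50]], [[80, 20]], pseudocount=0.0)
```

A positive value indicates a feature that is relatively more abundant in the second group, a negative value one that is depleted.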
Finally, we aim to advance the area of metagenomic analysis by developing new analysis strategies based on direct k-mer analysis of samples.
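As a rough illustration of what direct k-mer analysis can involve, the sketch below counts overlapping k-mers in the reads of each sample and assembles a sample-by-k-mer count matrix, which could then feed the same differential or machine learning analyses as a taxonomic matrix. It is a minimal example and not the strategy under development; the function names and the choice to skip k-mers containing ambiguous bases are assumptions.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(sequence, k):
    """Count overlapping k-mers in one sequence.

    k-mers containing anything other than A, C, G or T are skipped.
    """
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


def kmer_profile(reads, k):
    """Sum k-mer counts over all reads of one sample."""
    profile = Counter()
    for read in reads:
        profile.update(count_kmers(read, k))
    return profile


def kmer_matrix(samples, k):
    """Build a sample-by-k-mer count matrix.

    samples maps a sample name to its list of reads. Returns the
    sorted k-mer list (column order) and a dict mapping each sample
    name to its row of counts.
    """
    profiles = {name: kmer_profile(reads, k)
                for name, reads in samples.items()}
    kmers = sorted(set().union(*profiles.values()))
    matrix = {name: [profile[kmer] for kmer in kmers]
              for name, profile in profiles.items()}
    return kmers, matrix


# Hypothetical toy input: two samples with one short read each.
kmers, matrix = kmer_matrix({"a": ["ACG"], "b": ["CGT"]}, k=2)
```

Real datasets produce far more distinct k-mers than this dictionary-based layout can comfortably hold for large k, so a production pipeline would typically rely on a dedicated k-mer counter and sparse storage.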
Predicting prostate cancer trajectory
Predicting the clinical trajectory of individual patients with cancer is complex, as multiple biological and physical parameters need to be integrated to make an accurate medical assessment. The decision process is critical for choosing when and how to act and for ensuring maximal recovery. Currently, decision making relies on clinicians manually integrating various sources of data to make the best judgment call. To ease this burden, artificial intelligence tools can quickly integrate data from multiple patients and provide predictions to support the physician in their decision.
With this project, we are developing software able to extract thousands of data points collected during the clinical trajectory of each patient. We will apply this software to a problem in prostate cancer management: determining which tumors will progress toward an aggressive stage. This is important for tailoring interventions to patients and avoiding overtreatment. Taken together, our work will represent a first step toward providing tools to support clinicians in their decision making for cancer management.
Identify genetic markers in neurodevelopmental disabilities
Characterization of immune infiltration
Immune checkpoint blockades (ICBs), molecules that restore the immune system’s ability to recognize and eliminate tumor cells, have revolutionized the treatment of some adult cancers. To be effective, ICBs need the tumor to be already infiltrated with immune cells, especially activated “inflamed” T-cells. Only a small percentage of pediatric tumors are sensitive to ICBs. We believe that ICB sensitivity is highly dependent on tumor immune composition and can be anticipated. For this study, large-scale genomics datasets from hundreds of childhood cancers (solid tumors and leukemias) will be analysed with high-throughput bioinformatics and machine learning approaches. Gene expression and mutational landscape analysis will be used to identify subgroups of tumors that harbor a rich immune infiltrate, to explore the B- and T-cell repertoire, and to investigate the genes and pathways that drive tumor immune recruitment or desertion.
Prostate cancer and biological networks
Predicting drug toxicity using multi-layer biological networks
In the past decade, the main strategy for genome-wide mapping of chromatin modifications, histone marks and interactions between DNA and proteins has been ChIP followed by microarray analysis (ChIP-chip). Recent improvements in the efficiency, quality, and cost of genome-wide sequencing prompted biologists to abandon microarrays in favor of next-generation sequencing, a method referred to as ChIP-Seq. However, functional annotation of the noncoding sequences, which account for more than 95% of the genome, remains difficult because few statistical and computational biology methods and tools are available to agnostically interrogate epigenomic changes in humans.
The main goal of our research program is to build new computational tools to comprehensively characterize and functionally annotate the human epigenome. This research program builds on the power of next-generation sequencing (NGS) coupled with chromatin immunoprecipitation (ChIP), an approach called ChIP-Seq, to detect epigenetic variations at an unprecedented level of resolution.
Personalised Risk Stratification for Prevention and Early Detection of Breast Cancer
Each year, over 22,000 Canadian women are diagnosed with breast cancer, a disease that will claim the lives of 5,000 of them. The routine screening program currently in place is more accessible to women over the age of 50. However, one in five women diagnosed with breast cancer are under the age of 50.
The project aims to develop a decision-making support tool that will help extend the benefits of the current screening program to those women most at risk for breast cancer.
Through involvement with the largest international consortium on the study of breast cancer, the project will help broaden existing knowledge in order to provide better risk stratification tools, fine tune intervention strategies and offer the population more effective tools.
The Kibio.science project is an online biohub portal that combines Elasticsearch, a fast search engine designed to manage very large amounts of data, and Siren, a web visualization plugin that creates relational links between biological databases. This solution enables biologists to extract meaningful information from available biological research data repositories. It also removes boundaries by solving compatibility issues between resources (i.e. different data types, separation into specialized repositories) and performs complex searches on many resources simultaneously.
Statistics in bioinformatics
Data resulting from novel high throughput technologies have led to novel statistical problems and challenges. As a consequence, it is essential that analytical tools and statistical methods evolve in parallel with the assay technologies and datasets. Hence, the development of efficient statistical algorithms is a priority in our projects.