Welcome to the Arnaud Droit Laboratory


Where genomics meets computer science.

Welcome to the Arnaud Droit Lab, the computational biology platform of the CHU de Québec-Université Laval research center.

The projects in this lab aim to better understand how inter-individual variability is shaped by epigenetic modifications that regulate cellular processes such as transcription and differentiation. We develop cutting-edge bioinformatics and statistical methods for high-throughput biological analyses. Most projects in the lab are multidisciplinary and combine genomics (ChIP-Seq, RNA-Seq, exome sequencing), proteomics (mass spectrometry identifications) and the semantic web (Bio2RDF).

Our expertise


We perform large-scale studies of the structure and function of proteins

Cancer research

We identify causes and develop strategies for diagnosis and treatment of breast and uro-oncological cancers

Big data

We handle terabytes of biological data of many types, including genomics and proteomics

R package development

We develop various R packages to help process biological data

Semantic web

We host Bio2RDF, which provides the largest network of Linked Data for the Life Sciences


We can exploit data from various sequencing experiments (RNAseq, miRseq, ExomeSeq…)

Machine learning

We create programs using machine learning algorithms to classify biological data

Current projects



The Kibio.science project is an online biohub portal that combines Elasticsearch, a fast
search engine designed to manage very large amounts of data, and Siren, a web visualization plugin that
creates relational links between biological databases. This solution enables biologists to extract
meaningful information from available biological research data repositories. It also removes boundaries
by solving compatibility issues between resources (e.g. different data types, separation into specialized
repositories) and performs complex searches on many resources simultaneously.

Biomarker discovery by machine learning

Machine learning

We use machine learning approaches in different projects to analyze omics data. We also develop machine learning tools to help identify biomarker signatures from disease-derived omics datasets. Omics datasets are generally highly unbalanced: features largely outnumber samples, and patients are unequally distributed among measured outcomes. The data are also often heterogeneous (e.g. cancer data), of diverse types (e.g. categorical, numerical), and sparse. Thus, specific machine learning strategies have to be developed to adapt to the special characteristics of omics data.
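As an illustration of one common way to handle unequal outcome distributions, the sketch below computes "balanced" class weights, where each class is weighted by n_samples / (n_classes * class_count) so that rare outcomes count more during training. This is a generic heuristic and not necessarily the strategy used in the lab's tools; the function name and the patient labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical outcome labels for 10 patients: 8 controls and 2 cases.
labels = ["control"] * 8 + ["case"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Such weights can then be passed to any classifier that accepts per-class or per-sample weights, so that errors on the minority outcome are penalized more heavily.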

Personalised Risk Stratification for Prevention and Early Detection of Breast Cancer (collaboration)

Each year, over 22,000 Canadian women are diagnosed with breast cancer, a disease that will claim the lives of 5,000 of them. The routine screening program currently in place is more accessible to women over the age of 50. However, one in five women diagnosed with breast cancer are under the age of 50.
The project aims to develop a decision-making support tool that will help extend the benefits of the current screening program to those women most at risk for breast cancer.

Through involvement with the largest international consortium on the study of breast cancer, the project will help broaden existing knowledge in order to provide better risk stratification tools, fine-tune intervention strategies and offer the population more effective tools.


Development of proteomics tools

Our team aims to bring the power and flexibility of the R/Bioconductor statistical platform to mass spectrometry-based proteomics. The Bioconductor platform is a repository of software, data and annotation packages based on the R statistical language. This platform allows users to quickly build new analytical pipelines by seamlessly connecting various tools for data manipulation, statistical analysis, annotation or visualisation. The Bioconductor platform also facilitates the deployment of a pipeline on HPC servers or on cloud computing services.

We are currently developing two packages on this platform: rTANDEM and shinyTANDEM. rTANDEM is the first protein identification algorithm implemented in R. It includes the tandem algorithm as well as many associated scoring functions such as the k-score, hrk-score and PTMTreeSearch-score. The package also provides converter functions allowing quick conversions between R objects and XML files.



In the past decade, the main strategy for genome-wide mapping of chromatin modifications, histone marks and interactions between DNA and proteins has been ChIP followed by microarray analysis (ChIP-chip). Recent improvements in the efficiency, quality, and cost of genome-wide sequencing prompted biologists to abandon microarrays in favor of next-generation sequencing, a method referred to as ChIP-Seq. However, functional annotation of noncoding sequences, which account for more than 95% of the genome, remains difficult because of the lack of statistical and computational biology methods and tools available to agnostically interrogate epigenomic changes in humans.

The main goal of our research program is to build new computational tools to comprehensively characterize and functionally annotate the human epigenome. This research program builds on the power of next-generation sequencing (NGS) coupled with chromatin immunoprecipitation (ChIP), an approach called ChIP-Seq, to detect epigenetic variations at an unprecedented level of resolution.

Laboratory affiliations and partners


We're hiring!

We are always interested in recruiting talented new students