Computational genomics

Applications available through the MUGQIC pipeline (McGill University and Génome Québec Innovation Centre)

The team of Dr. Guillaume Bourque, director of bioinformatics at the Génome Québec Innovation Centre, has developed a pipeline that automates, in an HPC environment, three types of analysis based on next-generation sequencing data:

  • RNASeq
  • ChIPSeq
  • DNASeq

On Guillimin and Mammouth

The pipeline is available and maintained on Calcul Québec's Guillimin and Mammouth servers. See: MUGQIC Pipeline Home.

On Colosse

The MUGQIC pipeline and the applications it uses are now available on Colosse. Dr. Arnaud Droit's team, which uses the pipeline regularly, is responsible for maintaining it.

Load the main module

First, make sure the GCC compiler is loaded:

[nom@serveur $] module swap compilers/intel/14.0 compilers/gcc

The MUGQIC module then becomes visible and can be loaded:

[nom@serveur $] module load apps/mugqic_pipeline
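In a script, the two commands above can be wrapped in a small function so that a failed step is reported instead of being silently ignored. This is a minimal sketch: `load_mugqic` is a hypothetical helper name, and it assumes the `module` command is available in the shell that runs the script.

```shell
# Sketch: load the MUGQIC pipeline module from a script.
# load_mugqic is a hypothetical helper name; the module names come from the steps above.
load_mugqic() {
    # Swap the Intel compiler for GCC, then load the pipeline module.
    module swap compilers/intel/14.0 compilers/gcc || return 1
    module load apps/mugqic_pipeline || return 1
}

# Only attempt the load where the `module` command exists.
if command -v module >/dev/null 2>&1; then
    load_mugqic || echo "Could not load apps/mugqic_pipeline" >&2
else
    echo "The module command is not available in this shell" >&2
fi
```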

Access the applications

Once the main module is loaded, the applications associated with the pipeline become visible and can be used.

To list the available applications:

[nom@serveur $] module avail mugqic
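To use this list in a script, it can help to reduce the output to one module name per line. The sketch below assumes that `module avail` may write its listing to standard error, as common module implementations do, so both streams are merged before filtering. `list_mugqic_modules` is a hypothetical helper name.

```shell
# Sketch: print one MUGQIC module name per line.
# list_mugqic_modules is a hypothetical helper name.
list_mugqic_modules() {
    # Assumption: the listing may arrive on stderr, so merge it into stdout,
    # keep only tokens starting with "mugqic/", and remove duplicates.
    module avail mugqic 2>&1 | grep -oE 'mugqic/[^[:space:]]+' | sort -u
}
```

For example, `list_mugqic_modules | grep -i samtools` would show only the matching entries.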

List of applications

As of November 25, 2015, the available applications are:

Module Name Version Description
BAMTOOLS mugqic/bamtools 2.4.0 BamTools provides both a programmer's API and an end-user's toolkit for handling BAM files.
bcl2fastq mugqic/bcl2fastq 1.8.4 Convert BCL files from MiSeq and HiSeq sequencing systems running RTA versions earlier than 1.8.
BEAGLE mugqic/beagle 09Nov15.d2a Java tool to phase genomes.
Bedtools mugqic/bedtools 2.25.0 Tool that can intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats.
Bismark mugqic/bismark 0.14.5 A tool to map bisulfite converted sequence reads and determine cytosine methylation states.

blast+ mugqic/blast 2.2.31+ Basic Local Alignment Search Tool to find regions of similarity between biological sequences.
blat mugqic/blat 36 Pairwise sequence alignment algorithm.
bowtie mugqic/bowtie 1.1.2 An ultrafast memory-efficient short read aligner.
bowtie2 mugqic/bowtie2 2.2.6 An ultrafast and memory-efficient tool for aligning sequencing reads, about 50 up to 100s or 1,000s of characters, to long reference sequences.
breakdancer mugqic/breakdancer 1.1.2 Genome-wide detection of structural variants from next generation paired-end sequencing reads.
BVATools mugqic/bvatools 1.6 Bam and Variant Analysis Tools.
BWA mugqic/bwa 0.7.12 Software package for mapping low-divergent sequences against a large reference genome.
CD-HIT mugqic/cd-hit 4.6.1-2012-08-27 CD-HIT stands for Cluster Database at High Identity with Tolerance. The program takes a fasta format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output.
cufflinks mugqic/cufflinks 2.2.1 Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
DNACLUST mugqic/dnaclust 3 DNACLUST is a tool for clustering millions of short DNA sequences.
EMBOSS mugqic/emboss 6.6.0 EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. Also, as extensive libraries are provided with the package, it is a platform to allow other scientists to develop and release software in true open source spirit.
exonerate mugqic/exonerate 2.2.0 Various forms of alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference.
FastQC mugqic/fastqc 0.11.3 FastQC is an application which reads raw sequence data from high throughput sequencers and runs a set of quality checks to produce a report which allows you to quickly assess the overall quality of your run, and to spot any potential problems or biases.
FLASH mugqic/FLASH FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments. FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads. The resulting longer reads can significantly improve genome assemblies.
FreeType mugqic/freetype 2.6.1 FreeType is a software font engine that is designed to be small, efficient, highly customizable, and portable while capable of producing high-quality output (glyph images). It can be used in graphics libraries, display servers, font conversion tools, text image generation tools, and many other products as well.
GenomeAnalysisTK mugqic/GenomeAnalysisTK 3.5 The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance.
Ghostscript mugqic/ghostscript 9.18 Ghostscript is an interpreter for the PostScript language and for PDF.
Homer mugqic/homer 4.7 Software for motif discovery and next generation sequencing analysis. Currently installed databases are: human, mouse, rat, yeast, arabidopsis and rice.
igvtools mugqic/igvtools 2.3.66 The igvtools utility provides a set of tools for pre-processing data files. Converts a sorted data input file to a binary tiled data (.tdf) file. Computes average alignment or feature density over a specified window size across the genome. Creates an index file for an ASCII alignment or feature file. Sorts the input file by start position.
java mugqic/java 1.8.0_40 Java is a computer programming language that is concurrent, class-based, object-oriented. It is used by some bioinformatics software.
JELLYFISH mugqic/jellyfish 2.2.0 JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA.
MACS mugqic/MACS We present a novel algorithm, named Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and MACS improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation. MACS can be easily used for ChIP-Seq data alone, or with a control sample to increase specificity.
MOSAIK mugqic/mosaik 2.2.30 MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT.
MUSCLE mugqic/MUSCLE 3.8.31 MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW. MUSCLE can align hundreds of sequences in seconds. Most users learn everything they need to know about MUSCLE in a few minutes—only a handful of command-line options are needed to perform common alignment tasks.
MuTect mugqic/mutect 1.1.7 MuTect is a method developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next generation sequencing data of cancer genomes.
GNU Parallel mugqic/parallel 20150322 GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
Perl mugqic/perl 5.18.2 Perl 5 is a highly capable, feature-rich programming language with over 26 years of development. Perl 5 is suitable for both rapid prototyping and large scale development projects.
Picard mugqic/picard 1.130 Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.
Python mugqic/python 2.7.3 Python is a widely used general-purpose, high-level programming language. Extra modules, some related to bioinformatics and some not, are installed: CYTHON, NUMPY, BIOPYTHON, MATPLOTLIB, HTSEQ, BEDTOOLS-PYTHON, VCF, PYVCF, DATEUTIL, PYPARSING, RSeQC.
R mugqic/R 3.2.0 R is a free software environment for statistical computing and graphics.
RNA-SeQC mugqic/rnaseqc 1.1.8 RNA-SeQC is a java program which computes a series of quality control metrics for RNA-seq data. The input can be one or more BAM files. The output consists of HTML reports and tab delimited files of metrics data. This program can be valuable for comparing sequencing quality across different samples or experiments to evaluate different experimental parameters. It can also be run on individual samples as a means of quality control before continuing with downstream analysis.
SAMtools mugqic/samtools 1.2 SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SnpEff mugqic/snpEff 4.1d Genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).
STAR mugqic/STAR 2.4.0k Spliced Transcripts Alignment to a Reference.
tabix mugqic/tabix 0.2.6 Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.
MUGQIC Tools mugqic/tools 2.1.1 Perl, Python, R, awk and sh scripts used in several bioinformatics pipelines of the MUGQIC pipeline.
TopHat mugqic/tophat 2.0.14 TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
Trimmomatic mugqic/trimmomatic 0.33 Trimmomatic performs a variety of useful trimming tasks for illumina paired-end and single ended data. The selection of trimming steps and their associated parameters are supplied on the command line.
UCSC mugqic/ucsc 20150414 Genome Browser and Blat application binaries built for standalone command-line use on various supported Linux and UNIX platforms.

VarScan mugqic/varscan 2.3.7 VarScan is a platform-independent software tool developed at the Genome Institute at Washington University to detect variants in NGS data. It can be used to detect different types of variation: germline variants (SNPs and indels) in individual samples or pools of samples; multi-sample variants (shared or private) in multi-sample datasets (with mpileup); somatic mutations, LOH events, and germline variants in tumor-normal pairs; and somatic copy number alterations (CNAs) in tumor-normal exome data.
VCFtools mugqic/vcftools 0.1.12b VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.
ViennaRNA mugqic/ViennaRNA 2.1.9 The ViennaRNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.
WebLogo mugqic/weblogo 2.8.2 WebLogo is an application designed to make the generation of sequence logos as easy and painless as possible.
Yaggo mugqic/yaggo 1.5.4 Yaggo is a tool to generate command line parsers for C++. Yaggo stands for "Yet Another GenGetOpt" and is inspired by GNU Gengetopt. It reads a configuration file describing the switches and arguments for a C++ program and generates one header file that parses the command line using getopt_long(3).

Additional information on certain applications


BEAGLE

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $BEAGLE_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $BEAGLE_JAR


BVATools

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $BVATOOLS_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $BVATOOLS_JAR

Genome Analysis Toolkit - GATK

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $GATK_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $GATK_JAR --help

Integrative Genomics Viewer Tools - igvtools

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $IGVTOOLS_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $IGVTOOLS_JAR


MuTect

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $MUTECT_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $MUTECT_JAR --help

For the list of available tools, see: MuTect

Picard version 1.124 and later

An environment variable makes the Picard tools easier to use once the module is loaded:

[nom@serveur $] java -jar $PICARD_JAR [desired tool]

For example:

[nom@serveur $] java -jar $PICARD_JAR SortVcf

To list the available tools:

[nom@serveur $] java -jar $PICARD_JAR

Additional information: Picard

Picard version 1.123 and earlier

An environment variable makes the Picard tools easier to use once the module is loaded. In these versions each tool is a separate JAR file:

[nom@serveur $] java -jar $PICARD_HOME/[desired tool]

For example:

[nom@serveur $] java -jar $PICARD_HOME/FastqToSam.jar

For the list of available tools, see: Picard
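Because the two Picard layouts are invoked differently, a script that must work with either version can pick the form from whichever variable is set. The following is a minimal sketch, not part of the MUGQIC modules: `run_picard` is a hypothetical helper name, and it assumes that `PICARD_JAR` is defined only by the 1.124 and later modules and `PICARD_HOME` by the older ones.

```shell
# Sketch: call a Picard tool with whichever layout the loaded module provides.
# run_picard is a hypothetical helper name.
run_picard() {
    if [ "$#" -lt 1 ]; then
        echo "usage: run_picard TOOL [arguments]" >&2
        return 2
    fi
    tool=$1
    shift
    if [ -n "${PICARD_JAR:-}" ]; then
        # Picard 1.124 and later: a single JAR, tool name as first argument.
        java -jar "$PICARD_JAR" "$tool" "$@"
    elif [ -n "${PICARD_HOME:-}" ]; then
        # Picard 1.123 and earlier: one JAR file per tool.
        java -jar "$PICARD_HOME/$tool.jar" "$@"
    else
        echo "Neither PICARD_JAR nor PICARD_HOME is set; load the Picard module first." >&2
        return 1
    fi
}
```

With this helper, `run_picard SortVcf` and `run_picard FastqToSam` expand to the two example commands shown above, depending on the module version that is loaded.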


RNA-SeQC

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $RNASEQC_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $RNASEQC_JAR


SnpEff

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $SNPEFF_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $SNPEFF_JAR


Trimmomatic

An environment variable makes the software easier to use once the module is loaded:

[nom@serveur $] java -jar $TRIMMOMATIC_JAR [options]

To list the available commands:

[nom@serveur $] java -jar $TRIMMOMATIC_JAR
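All of the Java tools in this section follow the same pattern: a `*_JAR` variable that exists only after the corresponding module is loaded. Before a long job starts, a script can check which of these variables are defined. The sketch below uses only the variable names that appear on this page; `check_jar_vars` is a hypothetical helper name.

```shell
# Sketch: report which of the JAR variables named on this page are set.
# check_jar_vars is a hypothetical helper name.
check_jar_vars() {
    missing=0
    for name in BEAGLE_JAR BVATOOLS_JAR GATK_JAR IGVTOOLS_JAR MUTECT_JAR \
                PICARD_JAR RNASEQC_JAR SNPEFF_JAR TRIMMOMATIC_JAR; do
        # Indirect variable lookup that works in any POSIX shell.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set (module not loaded?)"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every variable was set.
    [ "$missing" -eq 0 ]
}
```

A job script could call `check_jar_vars || exit 1` after its `module load` lines, trimming the list to the tools it actually uses.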

Access the genomes

When the MUGQIC pipeline module is loaded, an environment variable pointing to the database directories is added to your environment. It simplifies access to the databases and can be used in your scripts.

[nom@serveur $] echo $MUGQIC_GENOMES_PATH 

[nom@serveur $] ls -1 $MUGQIC_GENOMES_PATH 
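In a script, it is worth checking that the variable is actually defined before relying on it, since an unset variable would make `ls` list the current directory instead. A minimal sketch, where `list_genomes` is a hypothetical helper name:

```shell
# Sketch: list the genome directories, failing clearly if the module is not loaded.
# list_genomes is a hypothetical helper name.
list_genomes() {
    if [ -z "${MUGQIC_GENOMES_PATH:-}" ]; then
        echo "MUGQIC_GENOMES_PATH is not set; load apps/mugqic_pipeline first." >&2
        return 1
    fi
    # One entry per line, as in the interactive command above.
    ls -1 "$MUGQIC_GENOMES_PATH"
}
```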


Presentation given as part of the Calcul Québec lunchtime talk series on December 8, 2014:


How to cite MUGQIC Colosse in your publications

Here is the text we recommend using when you cite the MUGQIC pipelines on Colosse in one of your publications:

The MUGQIC pipelines, developed by the bioinformatics team of the McGill University and Génome Québec Innovation Centre, Montréal, Canada, were used for the [type of analysis: RNASeq, ChIPSeq, DNASeq, etc.] analyses. These pipelines are installed on the Colosse supercomputer at Université Laval. Installation and maintenance are the shared responsibility of the teams of Dr. Arnaud Droit, Québec, Canada, and Calcul Québec.

The text citing Colosse must also be included; see: Citing Colosse


Feel free to contact the bioinformatics team at the CHUL in Québec City:


