Computational Genomics

From the Calcul Québec Wiki


Applications Available Through the MUGQIC (McGill University and Génome Québec Innovation Centre) Pipeline

The team of Dr. Guillaume Bourque, director of bioinformatics at the Génome Québec Innovation Centre, has developed a pipeline that automates, in an HPC environment, three types of analysis based on next-generation sequencing data:

  • RNASeq
  • ChIPSeq
  • DNASeq

On Guillimin and Mammouth

The pipeline is available and maintained on the Calcul Québec servers Guillimin and Mammouth. For details, see the MUGQIC Pipeline Home page.

On Colosse

The MUGQIC pipeline and the applications that it uses are now available on Colosse. The team of Dr. Arnaud Droit, which regularly uses this software, is responsible for its maintenance.

Load the Main Module

First, make sure that the GCC compiler is loaded by swapping it in for the Intel compiler:

[name@server $] module swap compilers/intel/14.0 compilers/gcc

The MUGQIC module is now visible and can thus be loaded:

[name@server $] module load apps/mugqic_pipeline
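In a batch script, the two steps above can be combined in a small function. The following is a minimal sketch, not an official Calcul Québec script: the function name load_mugqic_pipeline is hypothetical, and it assumes the script runs in a shell where the module command is defined.

```shell
# Hypothetical helper: swap in GCC, then load the main MUGQIC module.
# Assumes the `module` command is available in the current shell.
load_mugqic_pipeline() {
    if ! command -v module >/dev/null 2>&1; then
        echo "The 'module' command is not available in this shell" >&2
        return 1
    fi
    # Same two commands as in the text; stop if the swap fails.
    module swap compilers/intel/14.0 compilers/gcc &&
    module load apps/mugqic_pipeline
}
```

Calling load_mugqic_pipeline at the top of a job script stops early with a clear message if the environment is not set up as expected.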

Accessing the Applications

Once the main module has been loaded, the applications linked to the pipeline will be visible and can be used.

To obtain the list of available applications:

[name@server $] module avail mugqic
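To reuse this list in a script, the module names can be extracted from the listing. The sketch below is illustrative only: the function name extract_mugqic_modules is hypothetical, and it assumes that module names appear in the listing as whitespace-separated tokens beginning with mugqic/.

```shell
# Hypothetical helper: read a `module avail` listing on standard input and
# print each distinct token that starts with "mugqic/", one per line.
extract_mugqic_modules() {
    grep -o 'mugqic/[^[:space:]]*' | sort -u
}

# Possible usage. Many installations print the listing to standard error,
# hence the 2>&1 redirection (an assumption to verify on your server):
#   module avail mugqic 2>&1 | extract_mugqic_modules
```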

List of Applications

As of December 3, 2014, the available applications are:

Application | Module | Version | Description
BEAGLE | mugqic/beagle | 4.r1274 | Java tool to phase genomes.
Bedtools | mugqic/bedtools | 2.21.0 | Tool that can intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats.
blast+ | mugqic/blast | 2.2.29+ | Basic Local Alignment Search Tool to find regions of similarity between biological sequences.
blat | mugqic/blat | 35 | Pairwise sequence alignment algorithm.
bowtie | mugqic/bowtie | 1.0.1 | An ultrafast, memory-efficient short read aligner.
bowtie2 | mugqic/bowtie2 | 2.2.3 | An ultrafast and memory-efficient tool for aligning sequencing reads of about 50 up to 100s or 1,000s of characters to long reference sequences.
breakdancer | mugqic/breakdancer | 1.1.2 | Genome-wide detection of structural variants from next-generation paired-end sequencing reads.
BVATools | mugqic/bvatools | 1.3 | Bam and Variant Analysis Tools.
BWA | mugqic/bwa | 0.7.10 | Software package for mapping low-divergent sequences against a large reference genome.
CD-HIT | mugqic/cd-hit | 4.6.1-2012-08-27 | CD-HIT stands for Cluster Database at High Identity with Tolerance. The program takes a FASTA format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output.
cufflinks | mugqic/cufflinks | 2.2.1 | Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
DNACLUST | mugqic/dnaclust | 3 | DNACLUST is a tool for clustering millions of short DNA sequences.
exonerate | mugqic/exonerate | 2.2.0 | Various forms of alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference.
FastQC | mugqic/fastqc | 0.11.2 | FastQC is an application that reads raw sequence data from high-throughput sequencers and runs a set of quality checks to produce a report that lets you quickly assess the overall quality of your run and spot any potential problems or biases.
FLASH | mugqic/FLASH | | FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments. FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads. The resulting longer reads can significantly improve genome assemblies.
FreeType | mugqic/freetype | 2.5.3 | FreeType is a software font engine that is designed to be small, efficient, highly customizable, and portable while capable of producing high-quality output (glyph images). It can be used in graphics libraries, display servers, font conversion tools, text image generation tools, and many other products as well.
GenomeAnalysisTK | mugqic/GenomeAnalysisTK | 3.3-0 | The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as a strong emphasis on data quality assurance.
Ghostscript | mugqic/ghostscript | 9.15 | Ghostscript is an interpreter for the PostScript language and for PDF.
Homer | mugqic/homer | 4.7 | Software for motif discovery and next-generation sequencing analysis. Currently installed databases are: human, mouse, rat, yeast, arabidopsis and rice.
igvtools | mugqic/igvtools | 2.3.32 | The igvtools utility provides a set of tools for pre-processing data files. Converts a sorted data input file to a binary tiled data (.tdf) file. Computes average alignment or feature density over a specified window size across the genome. Creates an index file for an ASCII alignment or feature file. Sorts the input file by start position.
java | mugqic/java | 1.7.0_60 | Java is a concurrent, class-based, object-oriented programming language. It is used by some bioinformatics software.
JELLYFISH | mugqic/jellyfish | 2.1.3 | JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA.
MACS | mugqic/MACS | | Model-based Analysis of ChIP-Seq (MACS) is an algorithm for identifying transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and improves the spatial resolution of binding sites by combining the information of both sequencing tag position and orientation. MACS can be used for ChIP-Seq data alone, or with a control sample to increase specificity.
MOSAIK | mugqic/mosaik | 2.2.0 | MOSAIK is a stable, sensitive and open-source program for mapping second- and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT.
MUSCLE | mugqic/MUSCLE | 3.8.31 | MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW. MUSCLE can align hundreds of sequences in seconds. Most users learn everything they need to know about MUSCLE in a few minutes; only a handful of command-line options are needed to perform common alignment tasks.
MuTect | mugqic/mutect | 1.1.5 | MuTect is a method developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next-generation sequencing data of cancer genomes.
GNU Parallel | mugqic/parallel | 20140922 | GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
Perl | mugqic/perl | 5.18.2 | Perl 5 is a highly capable, feature-rich programming language with over 26 years of development. Perl 5 is suitable for both rapid prototyping and large-scale development projects.
Picard | mugqic/picard | 1.126 | Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.
Python | mugqic/python | 2.7.3 | Python is a widely used general-purpose, high-level programming language. Additional modules, some of them related to bioinformatics, are installed: CYTHON, NUMPY, BIOPYTHON, MATPLOTLIB, HTSEQ, BEDTOOLS-PYTHON, VCF, PYVCF, DATEUTIL, PYPARSING, RSeQC.
R | mugqic/R | 3.0.2 | R is a free software environment for statistical computing and graphics.
RNA-SeQC | mugqic/rnaseqc | 1.1.8 | RNA-SeQC is a Java program that computes a series of quality control metrics for RNA-seq data. The input can be one or more BAM files. The output consists of HTML reports and tab-delimited files of metrics data. This program can be valuable for comparing sequencing quality across different samples or experiments to evaluate different experimental parameters. It can also be run on individual samples as a means of quality control before continuing with downstream analysis.
SAMtools | mugqic/samtools | 1.1 | SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SnpEff | mugqic/snpEff | 4.0 | Genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes).
tabix | mugqic/tabix | 0.2.6 | Tabix indexes a TAB-delimited genome position file and creates an index file when region is absent from the command line. The input data file must be position-sorted and compressed by bgzip, which has a gzip(1)-like interface.
MUGQIC Tools | mugqic/tools | 1.9 | Perl, Python, R, awk and sh scripts used in several bioinformatics pipelines of the MUGQIC pipeline.
TopHat | mugqic/tophat | 2.0.13 | TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
Trimmomatic | mugqic/trimmomatic | 0.32 | Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters are supplied on the command line.
UCSC | mugqic/ucsc | 20141112 | Genome Browser and Blat application binaries built for standalone command-line use on various supported Linux and UNIX platforms.
VarScan | mugqic/varscan | 2.3.7 | VarScan is a platform-independent software tool developed at the Genome Institute at Washington University to detect variants in NGS data. It can be used to detect different types of variation: germline variants (SNPs and indels) in individual samples or pools of samples, multi-sample variants (shared or private) in multi-sample datasets (with mpileup), somatic mutations, LOH events, and germline variants in tumor-normal pairs, and somatic copy number alterations (CNAs) in tumor-normal exome data.
VCFtools | mugqic/vcftools | 0.1.12b | VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.
WebLogo | mugqic/weblogo | 2.8.2 | WebLogo is an application designed to make the generation of sequence logos as easy and painless as possible.
Yaggo | mugqic/yaggo | 1.5.4 | Yaggo is a tool to generate command-line parsers for C++. Yaggo stands for "Yet Another GenGetOpt" and is inspired by GNU Gengetopt. It reads a configuration file describing the switches and arguments for a C++ program and generates one header file that parses the command line using getopt_long(3).

Supplementary Information for Certain Applications


BEAGLE

An environment variable simplifies the use of the software, once the module has been loaded:

[name@server $] java -jar $BEAGLE_JAR [options]

The list of available commands:

[name@server $] java -jar $BEAGLE_JAR


BVATools

An environment variable simplifies the use of the software, once the module has been loaded:

[name@server $] java -jar $BVATOOLS_JAR [options]

The list of available commands:

[name@server $] java -jar $BVATOOLS_JAR

Genome Analysis Toolkit - GATK

An environment variable simplifies the use of the software, once the module has been loaded:

[name@server $] java -jar $GATK_JAR [options]

The list of available commands:

[name@server $] java -jar $GATK_JAR --help

Integrative Genomics Viewer Tools - igvtools

An environment variable simplifies the use of the software, once the module has been loaded:

[name@server $] java -jar $IGVTOOLS_JAR [options]

The list of available commands:

[name@server $] java -jar $IGVTOOLS_JAR


MuTect

An environment variable simplifies the use of the software, once the module has been loaded:

[name@server $] java -jar $MUTECT_JAR [options]

The list of available commands:

[name@server $] java -jar $MUTECT_JAR --help

For the list of available tools, see the MuTect website.

Picard version 1.124 and higher

An environment variable simplifies the use of the Picard tools, once the module has been loaded:

[name@server $] java -jar $PICARD_JAR [tool]

For example:

[name@server $] java -jar $PICARD_JAR SortVcf

To get the list of available tools:

[name@server $] java -jar $PICARD_JAR

Additional information: Picard

Picard version 1.123 and lower

An environment variable simplifies the use of the Picard tools, once the module has been loaded:

[name@server $] java -jar $PICARD_HOME/[tool]

For example:

[name@server $] java -jar $PICARD_HOME/FastqToSam.jar

For a list of the available tools, see the Picard website.
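A script that must work with either Picard layout can choose the invocation based on which variable is defined. This is only a sketch: the function name picard_command is hypothetical, and it assumes that a module for version 1.124 or higher defines PICARD_JAR while an older module defines only PICARD_HOME, which you should verify on your server.

```shell
# Hypothetical helper: print the java command line for a given Picard tool.
# Assumption: PICARD_JAR is set by modules for version 1.124 and higher,
# and only PICARD_HOME is set by modules for version 1.123 and lower.
picard_command() {
    tool=$1
    if [ -n "${PICARD_JAR:-}" ]; then
        # Single-jar layout: the tool name is passed as an argument.
        printf 'java -jar %s %s\n' "$PICARD_JAR" "$tool"
    elif [ -n "${PICARD_HOME:-}" ]; then
        # One jar per tool, located under PICARD_HOME.
        printf 'java -jar %s/%s.jar\n' "$PICARD_HOME" "$tool"
    else
        echo "Neither PICARD_JAR nor PICARD_HOME is set" >&2
        return 1
    fi
}
```

For example, picard_command SortVcf prints the command without running it, which makes it easy to check before adding the tool's own options.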


RNA-SeQC

An environment variable simplifies the use of the software, once the module has been loaded:

[name@server $] java -jar $RNASEQC_JAR [options]

To get a list of the available commands:

[name@server $] java -jar $RNASEQC_JAR


Trimmomatic

An environment variable simplifies the use of the software, once the module has been loaded:

[name@server $] java -jar $TRIMMOMATIC_JAR [options]

To get the list of available commands:

[name@server $] java -jar $TRIMMOMATIC_JAR
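Because all of the JAR variables above exist only after the corresponding module has been loaded, a script may want to check a variable before calling java. A minimal sketch follows; the function name check_jar_var is hypothetical, and it relies on the variable being exported to the environment, as module-defined variables normally are.

```shell
# Hypothetical helper: print the path stored in the named environment
# variable if it is set and points to a readable file; fail otherwise.
check_jar_var() {
    var_name=$1
    # printenv exits with a non-zero status when the variable is not set.
    jar_path=$(printenv "$var_name") || {
        echo "$var_name is not set; load the corresponding module first" >&2
        return 1
    }
    if [ ! -r "$jar_path" ]; then
        echo "$var_name points to '$jar_path', which is not readable" >&2
        return 1
    fi
    printf '%s\n' "$jar_path"
}

# Possible usage:
#   jar=$(check_jar_var TRIMMOMATIC_JAR) && java -jar "$jar" [options]
```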

Database Access

When the MUGQIC pipeline module is loaded, an environment variable that points to the database directories is added to your environment. This simplifies access to the databases and can be used in your scripts.

[name@server $] echo $MUGQIC_GENOMES_PATH 

[name@server $] ls -1 $MUGQIC_GENOMES_PATH 
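In a script, the same variable can be checked before use so that a missing module produces a clear error. The sketch below is illustrative: the function name list_genome_dirs is hypothetical, and it makes no assumption about the directory layout beyond listing the top-level subdirectories.

```shell
# Hypothetical helper: print the name of each top-level directory found
# under MUGQIC_GENOMES_PATH, or fail if the variable is not set.
list_genome_dirs() {
    if [ -z "${MUGQIC_GENOMES_PATH:-}" ]; then
        echo "MUGQIC_GENOMES_PATH is not set; load apps/mugqic_pipeline first" >&2
        return 1
    fi
    for entry in "$MUGQIC_GENOMES_PATH"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$entry" ] || continue
        basename "$entry"
    done
}
```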

How to Cite MUGQIC Colosse in Your Publications

Here's the text that we recommend you use when you cite the MUGQIC Colosse pipelines in one of your publications:

The MUGQIC pipelines, developed by the bio-informatics team of the Centre d'innovation Génome Québec and McGill University, Montreal, Canada, have been used for the analysis of [name of the analysis type: RNASeq, ChIPSeq, DNASeq, etc.]. These pipelines are installed on the Colosse supercomputer at Laval University. The installation and maintenance are under the shared responsibility of the teams of Dr. Arnaud Droit, Quebec City, Canada and of Calcul Québec.

The text citing Colosse should also be included; see the page Citer Colosse.


For further information, feel free to contact the CHUL de Québec Bioinformatics team:
