# **metagWGS**

# Introduction

**metagWGS** is a Nextflow bioinformatics analysis pipeline used for Metagenomic Shotgun Sequencing data (Illumina HiSeq3000, paired, 2*150bp).
The workflow processes raw data from FastQ inputs and performs the following steps:
* controls the quality of data (FastQC and MultiQC)
* trims adapter sequences and removes low-quality reads (Cutadapt, Sickle)
* removes contaminants (BWA-MEM, samtools, bedtools)
* performs a taxonomic classification of reads (kaiju MEM and kronaTools)
* assembles cleaned reads (metaSPAdes or megahit)
* annotates contigs (prokka)
* renames contigs and genes (home-made python script)
* clusters genes at sample and global level (cd-hit and home-made python script)
* quantifies reads on genes (BWA index, BWA-MEM and featureCounts)
* makes a quantification table (home-made python script)
* makes a taxonomic affiliation of contigs (DIAMOND and home-made python script)

The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool that runs tasks across multiple compute infrastructures in a very portable manner.
It can be run with a [Singularity](https://sylabs.io/docs/) container, making installation trivial and results highly reproducible.

# Schematic representation

![](/docs/Pipeline.png)

# Prerequisites
metagWGS requires all the following tools. They must be installed and copied or moved to a directory in your $PATH:

* [Nextflow](https://www.nextflow.io/docs/latest/index.html#) v19.01.0
* [Cutadapt](https://cutadapt.readthedocs.io/en/stable/#) v1.15
* [Sickle](https://github.com/najoshi/sickle) v1.33
* [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.11.7
* [MultiQC](https://multiqc.info/) v1.5
* [BWA](http://bio-bwa.sourceforge.net/) v0.7.17
* [Python](https://www.python.org/) v3.6.3
* [Kaiju](https://github.com/bioinformatics-centre/kaiju) v1.7.0
* [SPAdes](https://github.com/ablab/spades) v3.11.1
* [Megahit](https://github.com/voutcn/megahit) v1.1.3
* [Prokka](https://github.com/tseemann/prokka) v1.13.4 - WARNING: always use the latest release
* [Cd-hit](http://weizhongli-lab.org/cd-hit/) v4.6.8
* [Samtools](http://www.htslib.org/) v1.9
* [Bedtools](https://bedtools.readthedocs.io/en/latest/) v2.27.1
* [Subread](http://subread.sourceforge.net/) v1.6.0
* Python3 BCBio (+ BCBio.GFF), Bio (+ Bio.Seq, Bio.SeqRecord, Bio.SeqFeature) and pprint libraries
* [Singularity](https://sylabs.io/docs/) v3.0.1
* [DIAMOND](https://github.com/bbuchfink/diamond) v0.9.22
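
Before launching the pipeline without the container, it can help to check that the tools listed above are reachable from `$PATH`. The following is a minimal sketch, not part of metagWGS: the helper `missing_tools` is hypothetical, and the executable names in `REQUIRED_TOOLS` are assumptions that may differ on your system (it does not check versions either).

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = [
    "nextflow", "cutadapt", "sickle", "fastqc", "multiqc", "bwa", "python3",
    "kaiju", "spades.py", "megahit", "prokka", "cd-hit", "samtools",
    "bedtools", "featureCounts", "singularity", "diamond",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found in $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found in $PATH: " + ", ".join(missing))
    else:
        print("All required tools were found in $PATH.")
```

The `which` argument only exists so that the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.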
# Installation
## Install Nextflow
Nextflow runs on most POSIX systems (Linux, macOS, etc.). It can be installed by running the following commands:

```bash
# Make sure that Java v8+ is installed:
java -version

# Install Nextflow
curl -fsSL get.nextflow.io | bash

# Add Nextflow binary to your PATH:
mv nextflow ~/bin/
# OR system-wide installation:
# sudo mv nextflow /usr/local/bin
```
## Install workflow

* **Retrieve workflow sources**
```
git clone git@forgemia.inra.fr:genotoul-bioinfo/metagwgs.git
```
* **Configure profiles with modules**

A configuration file has been developed ([nextflow.config](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/blob/dev/nextflow.config)) to run the pipeline on a local machine or on a SLURM cluster.

To use these configurations, run the pipeline with one of the following profiles:
    * `-profile local,modules` runs metagWGS on a local machine without Singularity container (with modules).
    * `-profile slurm,modules` runs metagWGS on a SLURM cluster without Singularity container (with modules).

* **Configure profiles with a Singularity container**

A [Singularity](https://sylabs.io/docs/) container is available for this pipeline.
All information about how we built it is available on [this Wiki page](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/wikis/Singularity%20container).

To use the configuration profiles with the Singularity container, run the pipeline with one of the following profiles:

    * `-profile local,singularity` runs metagWGS on a local machine with our Singularity container (in addition to the `-with-singularity singularity/metagWGS_with_dependancies.img` parameter).
    * `-profile slurm,singularity` runs metagWGS on a SLURM cluster with our Singularity container (in addition to the `-with-singularity singularity/metagWGS_with_dependancies.img` parameter).

# Usage
## Basic usage

A basic command line running the pipeline is:

```bash
./nextflow run -profile [local,modules or slurm,modules or local,singularity or slurm,singularity] main.nf --reads '*_{R1,R2}.fastq.gz' --assembly [metaspades or megahit] [-with-singularity singularity/metagWGS_with_dependancies.img]
```

`'*_{R1,R2}.fastq.gz'` runs the pipeline on all the R1.fastq.gz and R2.fastq.gz files available in your working directory.

WARNING: the user must choose between metaspades and megahit for the assembly step.

The choice can be based on CPU and memory availability: metaspades needs more CPUs and memory than megahit, but our tests showed that its assembly metrics are better than those of megahit.
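
The bracketed alternatives in the command above can be easier to read when assembled programmatically. The following sketch is only an illustration: `build_command` is a hypothetical helper, not part of metagWGS, and it assumes that the `-with-singularity` option is added exactly when a `singularity` profile is selected.

```python
import shlex

# Values copied from this README; the helper itself is hypothetical.
PROFILES = {"local,modules", "slurm,modules", "local,singularity", "slurm,singularity"}
ASSEMBLERS = {"metaspades", "megahit"}
IMAGE = "singularity/metagWGS_with_dependancies.img"


def build_command(profile, assembly, reads="*_{R1,R2}.fastq.gz"):
    """Return the nextflow command as a list of arguments."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    if assembly not in ASSEMBLERS:
        raise ValueError(f"unknown assembler: {assembly}")
    cmd = ["./nextflow", "run", "-profile", profile, "main.nf",
           "--reads", reads, "--assembly", assembly]
    # Assumption: the container image is passed only with a singularity profile.
    if profile.endswith("singularity"):
        cmd += ["-with-singularity", IMAGE]
    return cmd


if __name__ == "__main__":
    # shlex.quote keeps the glob quoted so the shell does not expand it.
    print(" ".join(shlex.quote(arg) for arg in build_command("slurm,singularity", "megahit")))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without any shell quoting issues.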
## Other parameters
Other parameters are available:
```
    Usage:

     The typical command for running the pipeline is as follows:

       nextflow run -profile standard main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades

     Mandatory arguments:
       --reads                       Path to input data (must be surrounded with quotes).
       --assembly                    Tool used for assembly: 'metaspades' or 'megahit'.
       -profile                      Configuration profile to use.
                                     Available: standard_modules (local with module load), cluster_slurm_modules (run pipeline on slurm cluster with module load),
                                     standard_singularity (local with Singularity container) and cluster_slurm_singularity (run pipeline on slurm cluster with Singularity container)

     Options:

     Mode:
       --mode:                       Paired-end ('pe') or single-end ('se') reads. Default: 'pe'.

     Trimming options:

       --adapter1                    Sequence of adapter 1. Default: Illumina TruSeq adapter.
       --adapter2                    Sequence of adapter 2. Default: Illumina TruSeq adapter.

     Quality option:
       --qualityType                 Sickle supports three types of quality values: Illumina, Solexa, and Sanger. Default: 'sanger'.

     Alignment options:
       --db_host                     Host database already indexed (bwa index).

     Taxonomic classification options:
       --kaiju_nodes                 File nodes.dmp built with kaiju-makedb.
       --kaiju_db                    File kaiju_db_refseq.fmi built with kaiju-makedb.
       --kaiju_names                 File names.dmp built with kaiju-makedb.
       --diamond_bank                NR Diamond bank
       --accession2taxid             FTP adress of file prot.accession2taxid.gz
       --taxdump                     FTP adress of file taxdump.tar.gz
     
     Clustering option:
       --percentage_identity         Sequence identity threshold. Default: 0.95.
                                     
     Other options:
       --outdir                      The output directory where the results will be saved.
       --help                        Show this message and exit.

     Softwares versions:

     Cutadapt v1.15
     Sickle v1.33
     FastQC v0.11.7
     MultiQC v1.5
     BWA 0.7.17
     Python v3.6.3
     Kaiju v1.7.0
     SPAdes v3.11.1
     megahit v1.1.3
     prokka v1.13.4
     cdhit v4.6.8
     samtools v1.9
     bedtools v2.27.1
     subread v1.6.0
     diamond v0.9.22

     Skip

     --skip_sickle                   Skip sickle process.
     --skip_kaiju_index              Skip built of kaiju database (index_db_kaiju process).
     --skip_taxonomy                 Skip taxonomy process
```
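
To illustrate which files a `--reads` pattern such as `'*_{R1,R2}.fastq.gz'` selects in paired-end mode, the sketch below groups file names by the prefix that `*` matches. This README does not describe how the pipeline pairs files internally, so the grouping rule, the `pair_reads` helper and the example file names are all assumptions.

```python
import re
from collections import defaultdict

# Matches names such as "sampleA_R1.fastq.gz"; "sample" is the part matched by "*".
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>R1|R2)\.fastq\.gz$")


def pair_reads(filenames):
    """Group file names by sample and return {sample: (R1 file, R2 file)}."""
    mates = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:  # files that do not match the pattern are ignored
            mates[match.group("sample")][match.group("mate")] = name
    incomplete = sorted(s for s, m in mates.items() if len(m) != 2)
    if incomplete:
        raise ValueError("samples missing a mate: " + ", ".join(incomplete))
    return {s: (m["R1"], m["R2"]) for s, m in sorted(mates.items())}


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = ["a_R1.fastq.gz", "a_R2.fastq.gz", "b_R1.fastq.gz", "b_R2.fastq.gz", "notes.txt"]
    for sample, (r1, r2) in pair_reads(files).items():
        print(sample, r1, r2)
```

Running a check like this on a directory listing before launching the pipeline is a quick way to spot a sample whose R1 or R2 file is missing.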
## Generated files

The pipeline will create the following files in your working directory:

```
* work            # Directory containing the nextflow working files
* results         # Directory containing result files
    ** results/01_Cleaned_raw_data: cleaned raw data files (after cutadapt or cutadapt+sickle and after removal of human reads)
    ** results/02_Quality_control: multiQC file
    ** results/03_Classification_Kaiju: index database files (if process index_db_kaiju not skipped) and kaiju files (kaiju result files, kronas, histograms, kaiju results for each node of taxonomy tree)
    ** results/04_Assembly: assembly files and assembly metrics
    ** results/05_Annotation: .gff, .ffn, .fna, .faa, etc. files after prokka annotation and .gff, .ffn, .fna, .faa files with renamed contigs and genes
    ** results/06_Clustering: cd-hit results for each sample, correspondence table of intermediate clusters and genes, cd-hit results at global level and correspondence table of global clusters and intermediate clusters (table_clstr.txt)
    ** results/07_Quantification: .bam and .bam.bai files after read alignment on contigs, .count files (featureCounts count), .summary file (featureCounts summary), .output file (featureCounts output), Correspondence_global_clstr_contigs.txt (correspondence table between global clusters and genes), Clusters_Count_table_all_samples.txt (quantification table of aligned reads for each global cluster and for each sample).
    ** results/08_Diamond: diamond results and taxonomic affiliation consensus for proteins and contigs.
* .nextflow_log   # Log file from Nextflow
*                 # Other nextflow hidden files, eg. history of pipeline runs and old logs.
```
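
The quantification table in `results/07_Quantification` gives read counts for each global cluster and each sample. The sketch below shows one way such a table can be derived from per-gene counts and a gene-to-cluster correspondence. It is not the pipeline's home-made script: the summing rule, the in-memory data structures and every identifier in the example are assumptions, and the real file formats are not described in this README.

```python
from collections import defaultdict


def cluster_count_table(gene_to_cluster, counts_per_sample):
    """Sum per-gene read counts into per-cluster counts for each sample.

    gene_to_cluster:   {gene_id: global_cluster_id}
    counts_per_sample: {sample: {gene_id: read_count}}
    Returns {cluster_id: {sample: summed_count}}.
    """
    samples = sorted(counts_per_sample)
    # Every cluster starts with a zero count for every sample.
    table = defaultdict(lambda: dict.fromkeys(samples, 0))
    for sample, gene_counts in counts_per_sample.items():
        for gene, count in gene_counts.items():
            cluster = gene_to_cluster.get(gene)
            if cluster is not None:  # genes without a cluster are ignored
                table[cluster][sample] += count
    return dict(table)


if __name__ == "__main__":
    # Hypothetical identifiers and counts, for illustration only.
    gene_to_cluster = {"s1_g1": "c1", "s2_g7": "c1", "s1_g2": "c2"}
    counts = {"s1": {"s1_g1": 10, "s1_g2": 3}, "s2": {"s2_g7": 5}}
    for cluster, row in sorted(cluster_count_table(gene_to_cluster, counts).items()):
        print(cluster, row)
```

A cluster that has no reads in a given sample still gets an explicit zero, so every row of the table has the same columns.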
## How to run the demonstration on the genologin cluster

* Test data are available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/tree/master/test)
* The BWA index of the human reference genome is available at /work/bank/bwadb/ensembl_homo_sapiens_genome
* The Kaiju database index files are available at /work/bank/kaijudb/kaijudb_Juin2019/

WARNING: on genologin, modules are named "bioinfo/Name-version". We encountered compatibility issues when loading several modules at the same time. To avoid this problem, we created, for some processes, a single module that loads the different modules required:
* bioinfo_bwa_samtools_subread
* bioinfo_bwa_samtools

You can run the pipeline as follows without the Singularity container:
```bash
./nextflow run -profile cluster_slurm_modules main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp

./nextflow run -profile slurm,modules main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp
```

You can run the pipeline as follows with our Singularity container:

```bash
module load system/singularity-3.0.1
./nextflow run -profile slurm,singularity main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /work/bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /work/bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /work/bank/kaijudb/kaijudb_Juin2019/names.dmp -with-singularity singularity/metagWGS_with_dependancies.img
```
# License
metagWGS is distributed under the GNU General Public License v3.
# Copyright

2019 INRA

# Citation
metagWGS has been presented at JOBIM 2019 and at Genotoul Biostat Bioinfo day:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.