# **metagWGS**

# Introduction

**metagWGS** is a Nextflow bioinformatics pipeline for analysing metagenomic shotgun sequencing data (Illumina HiSeq3000, paired-end, 2 × 150 bp).

The workflow processes raw data from FastQ inputs and performs the following steps:
* controls the quality of the data (FastQC and MultiQC)
* trims adapter sequences and cleans reads (Cutadapt, Sickle)
* removes contaminants (BWA mem, samtools, bedtools)
* makes a taxonomic classification of reads (kaiju MEM and kronaTools)
* assembles cleaned reads (metaSPAdes or megahit)
* annotates contigs (prokka)
* renames contigs and genes (custom Python script)
* clusters genes at sample and global level (cd-hit and custom Python script)
* quantifies reads on genes (BWA index, BWA-MEM and featureCounts)
* makes a quantification table (custom Python script)

The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool that runs tasks across multiple compute infrastructures in a very portable manner.

It will come with a [Singularity](https://sylabs.io/docs/) container making installation trivial and results highly reproducible.

# Schematic representation

![](/docs/Schema_V1.png)

# Prerequisites

metagWGS requires all of the following tools. They must be installed in (or copied or moved to) a directory listed in your `$PATH`:

* [Nextflow](https://www.nextflow.io/docs/latest/index.html#) v19.01.0
* [Cutadapt](https://cutadapt.readthedocs.io/en/stable/#) v1.15
* [Sickle](https://github.com/najoshi/sickle) v1.33
* [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.11.7
* [MultiQC](https://multiqc.info/) v1.5
* [BWA](http://bio-bwa.sourceforge.net/) v0.7.17
* [Python](https://www.python.org/) v3.6.3
* [Kaiju](https://github.com/bioinformatics-centre/kaiju) v1.7.0
* [SPAdes](https://github.com/ablab/spades) v3.11.1
* [Megahit](https://github.com/voutcn/megahit) v1.1.3
* [Prokka](https://github.com/tseemann/prokka) v1.13.4 - WARNING: always use the latest release
* [Cd-hit](http://weizhongli-lab.org/cd-hit/) v4.6.8
* [Samtools](http://www.htslib.org/) v0.1.19
* [Bedtools](https://bedtools.readthedocs.io/en/latest/) v2.27.1
* [Subread](http://subread.sourceforge.net/) v1.6.0
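
As a quick sanity check before launching the pipeline, the following minimal Python sketch reports which tools cannot be found on `$PATH`. The executable names are assumptions based on the usual command name of each tool (for example `metaspades.py` for SPAdes and `featureCounts` for Subread), and `missing_tools` is a hypothetical helper that is not part of metagWGS. The sketch does not check versions.

```python
import shutil

# Usual executable names for the tools listed above (assumed, not taken
# from the pipeline itself); adapt them to your installation.
REQUIRED_TOOLS = [
    "nextflow", "cutadapt", "sickle", "fastqc", "multiqc", "bwa",
    "python3", "kaiju", "metaspades.py", "megahit", "prokka",
    "cd-hit", "samtools", "bedtools", "featureCounts",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` argument only exists so that the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.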

# Installation
## Install Nextflow
Nextflow runs on most POSIX systems (Linux, macOS, etc.). It can be installed by running the following commands:

```bash
# Make sure that Java v8+ is installed:
java -version

# Install Nextflow
curl -fsSL get.nextflow.io | bash

# Add Nextflow binary to your PATH:
mv nextflow ~/bin/
# OR system-wide installation:
# sudo mv nextflow /usr/local/bin
```
## Install workflow

* **Retrieve workflow sources**
```
git clone git@forgemia.inra.fr:genotoul-bioinfo/metagwgs.git
```
* **Configure profiles**

  A configuration file ([nextflow.config](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/blob/dev/nextflow.config)) is provided to run the pipeline on a local machine or on a SLURM cluster.

  To use these configurations, run the pipeline with one of the following parameters:

  * `-profile standard` runs metagWGS on a local machine.
  * `-profile cluster_slurm` runs metagWGS on a SLURM cluster.

* **Reproducibility with a Singularity container**

  A [Singularity](https://sylabs.io/docs/) container will soon be available to run metagWGS.

# Usage

## Basic usage

A basic command line running the pipeline is:

```bash
./nextflow run -profile [standard or cluster_slurm] main.nf --reads '*_{R1,R2}.fastq.gz' --assembly [metaspades or megahit]
```

`--reads '*_{R1,R2}.fastq.gz'` runs the pipeline on all the `R1.fastq.gz` and `R2.fastq.gz` files available in your working directory.
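
To see which samples have both mates before launching a run, the pairing implied by the `*_{R1,R2}.fastq.gz` pattern can be sketched in Python. This is an illustration only: it assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, and `pair_reads` is a hypothetical helper, not the mechanism used inside metagWGS.

```python
from collections import defaultdict
from pathlib import Path


def pair_reads(filenames):
    """Group FastQ file names by sample prefix.

    Returns (pairs, incomplete): `pairs` maps each sample to its
    (R1, R2) file names; `incomplete` lists samples lacking a mate.
    """
    by_sample = defaultdict(dict)
    for name in filenames:
        for mate in ("R1", "R2"):
            suffix = f"_{mate}.fastq.gz"
            if name.endswith(suffix):
                # Everything before the suffix is taken as the sample name.
                by_sample[name[: -len(suffix)]][mate] = name
    pairs = {}
    incomplete = []
    for sample in sorted(by_sample):
        mates = by_sample[sample]
        if "R1" in mates and "R2" in mates:
            pairs[sample] = (mates["R1"], mates["R2"])
        else:
            incomplete.append(sample)
    return pairs, incomplete


if __name__ == "__main__":
    # List candidate files in the current working directory.
    names = [path.name for path in Path(".").glob("*_R[12].fastq.gz")]
    pairs, incomplete = pair_reads(names)
    for sample, (r1, r2) in pairs.items():
        print(f"{sample}: {r1} + {r2}")
    for sample in incomplete:
        print(f"{sample}: missing R1 or R2 file")
```

Any sample reported as incomplete is worth fixing before starting the pipeline, since the default mode expects paired-end reads.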

## Other parameters

Other parameters are available:

```
    Mode:
      --mode                        Paired-end ('pe') or single-end ('se') reads. Default: 'pe'. Single-end mode has not been developed yet.

    Trimming options:
      --adapter1                    Sequence of adapter 1. Default: Illumina TruSeq adapter.
      --adapter2                    Sequence of adapter 2. Default: Illumina TruSeq adapter.

    Quality options:
      --qualityType                 Sickle supports three types of quality values: Illumina, Solexa, and Sanger. Default: 'sanger'.

    Alignment options:
      --db_alignment                Alignment database.

    Taxonomic classification options (to avoid kaiju indexing, provide the following files):
      --kaiju_nodes                 File nodes.dmp built with kaiju-makedb.
      --kaiju_db                    File kaiju_db_refseq.fmi built with kaiju-makedb.
      --kaiju_names                 File names.dmp built with kaiju-makedb.

    Other options:
      --outdir                      The output directory where the results will be saved.
      --help                        Show the help message and exit.

    Skip options:
      --skip_sickle                 Skip sickle process.
      --skip_kaiju_index            Skip building the kaiju database (index_db_kaiju process).

```
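
When runs are launched from a script, the options above can be assembled programmatically. The following Python sketch is one possible way to do it, not something shipped with metagWGS: `build_command` is a hypothetical helper, and it assumes that skip options are passed as bare flags, as `--skip_kaiju_index` is in the demonstration command of this README. `shlex.join` requires Python 3.8 or later.

```python
import shlex


def build_command(profile, reads, assembly, **options):
    """Assemble a metagWGS command line as a list of arguments.

    Keyword options map to `--name value`; an option set to True is
    emitted as a bare flag (e.g. skip_sickle=True -> --skip_sickle).
    Options set to None or False are left out.
    """
    command = ["./nextflow", "run", "-profile", profile, "main.nf",
               "--reads", reads, "--assembly", assembly]
    for name, value in options.items():
        if value is True:
            command.append(f"--{name}")
        elif value is not None and value is not False:
            command.extend([f"--{name}", str(value)])
    return command


if __name__ == "__main__":
    command = build_command(
        "standard", "*_{R1,R2}.fastq.gz", "megahit",
        skip_sickle=True, outdir="results",
    )
    # shlex.join quotes the glob so that the shell does not expand it.
    print(shlex.join(command))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without going through a shell.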

## Generated files

The pipeline will create the following files in your working directory:

```
* work            # Directory containing the nextflow working files
* results         # Directory containing result files
    ** results/01_Cleaned_raw_data: cleaned raw data files (after cutadapt or cutadapt+sickle, and after removal of human reads)
    ** results/02_Quality_control: multiQC file
    ** results/03_Classification_Kaiju: index database files (if process index_db_kaiju not skipped) and kaiju files (kaiju result files, kronas, histograms, kaiju results for each node of taxonomy tree)
    ** results/04_Assembly: assembly files and assembly metrics
    ** results/05_Annotation: files .gff, .ffn, .fna, .faa, etc. after prokka annotation and .gff, .ffn, .fna, .faa files with renamed contigs and genes
    ** results/06_Clustering: cd-hit results for each sample, correspondence table of intermediate clusters and genes, cd-hit results at global level and correspondence table of global clusters and intermediate clusters (table_clstr.txt)
    ** results/07_Quantification: .bam and .bam.bai files after reads alignment on contigs, .count files (featureCounts count), .summary file (featureCounts summary), .output file (featureCounts output), Correspondence_global_clstr_contigs.txt (correspondence table between global clusters and genes), Clusters_Count_table_all_samples.txt (quantification table of aligned reads for each global cluster and for each sample).
    
* .nextflow.log   # Log file from Nextflow
*                 # Other nextflow hidden files, e.g. history of pipeline runs and old logs.
```
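
After a run, a short Python sketch such as the one below can confirm that every result sub-directory listed above was created. It is only an illustration: `missing_result_dirs` is a hypothetical helper, and the `results` directory name is assumed to match the listing above (it may differ if `--outdir` is set). A directory can also be legitimately absent when the corresponding step did not run.

```python
from pathlib import Path

# Sub-directories of the results directory, as listed above.
EXPECTED_DIRS = [
    "01_Cleaned_raw_data",
    "02_Quality_control",
    "03_Classification_Kaiju",
    "04_Assembly",
    "05_Annotation",
    "06_Clustering",
    "07_Quantification",
]


def missing_result_dirs(results_dir):
    """Return the expected sub-directories absent from `results_dir`."""
    root = Path(results_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_result_dirs("results")
    if missing:
        print("Missing result directories: " + ", ".join(missing))
    else:
        print("All expected result directories are present.")
```
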
# How to run the demonstration on the genologin cluster

* Test data are available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/tree/master/test)
* The BWA index of the human reference genome is available at `/bank/bwadb/ensembl_homo_sapiens_genome`
* The kaiju database index files are available at `/bank/kaijudb/kaijudb_Juin2019/`

You can run the pipeline as follows:

```bash
./nextflow run -profile cluster_slurm main.nf --reads '*_{R1,R2}.fastq.gz' --assembly metaspades --skip_kaiju_index --kaiju_nodes /bank/kaijudb/kaijudb_Juin2019/nodes.dmp --kaiju_db /bank/kaijudb/kaijudb_Juin2019/refseq/kaiju_db_refseq.fmi --kaiju_names /bank/kaijudb/kaijudb_Juin2019/names.dmp
```

# License

metagWGS is distributed under the GNU General Public License v3.

# Copyright

2019 INRA

# Citation

metagWGS will be presented at JOBIM 2019:

"Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.