**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp).
The workflow processes raw data from `.fastq` or `.fastq.gz` inputs and runs the modules shown in this figure:

metagWGS is split into different steps, each corresponding to a different part of the bioinformatics analysis:
* trims adapter sequences and removes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
* removes host contaminants ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
* performs a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [Generate_barplot_kaiju.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Generate_barplot_kaiju.py) + [merge_kaiju_results.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_kaiju_results.py))
* assembles cleaned reads (combined with `clean_qc` step) or raw reads (combined with `--skip_clean_qc` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
* assesses the quality of the assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
* deduplicates cleaned reads (combined with `clean_qc` step) or raw reads (combined with `--skip_clean_qc` parameter) ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* filters out contigs with a low CPM (counts per million) value ([Filter_contig_per_cpm.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Rename_contigs_and_genes.py))
* alignment of reads on contigs ([BWA](http://bio-bwa.sourceforge.net/) + [Samtools](http://www.htslib.org/))
* alignment of genes against protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/cd_hit_produce_table_clstr.py))
* quantification of reads on genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/Quantification_clusters.py))
* functional annotation of genes and quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_functional_annotation.py))
* `07_taxo_affi`
* taxonomic affiliation of genes ([aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py))
* taxonomic affiliation of contigs ([aln2taxaffi.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/aln2taxaffi.py))
* count of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_idxstats_percontig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/merge_idxstats_percontig_lineage.py) + [quantification_by_contig_lineage.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/quantification_by_contig_lineage.py))
* `08_binning` from [nf-core/mag 1.0.0](https://github.com/nf-core/mag/releases/tag/1.0.0)
* binning of contigs ([metabat2](https://bitbucket.org/berkeleylab/metabat/src/master/))
* assessment of bins ([BUSCO](https://busco.ezlab.org/) + [metaQUAST](http://quast.sourceforge.net/metaquast) + [summary_busco.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/summary_busco.py) and [combine_tables.py](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/blob/dev/bin/combine_tables.py) from [nf-core/mag](https://github.com/nf-core/mag))
* taxonomic affiliation of bins ([BAT](https://github.com/dutilh/CAT))
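
To make the contig filtering step listed above more concrete, here is a minimal sketch of a CPM (counts per million) filter. It is an illustration only, not the code of `Filter_contig_per_cpm.py`: it assumes that a contig's CPM is the number of reads mapped to it divided by the total number of mapped reads, multiplied by one million, and the function names, the toy read counts and the `min_cpm` cutoff are all hypothetical.

```python
def contig_cpm(read_counts):
    """Return counts per million (CPM) for each contig.

    Assumption: CPM = reads mapped to the contig / total mapped reads * 1e6.
    """
    total = sum(read_counts.values())
    if total == 0:
        # No mapped reads at all: every contig gets a CPM of 0.
        return {contig: 0.0 for contig in read_counts}
    return {
        contig: count / total * 1_000_000
        for contig, count in read_counts.items()
    }


def filter_contigs_by_cpm(read_counts, min_cpm):
    """Keep contigs whose CPM is at least min_cpm (hypothetical cutoff)."""
    cpm = contig_cpm(read_counts)
    return [contig for contig, value in cpm.items() if value >= min_cpm]


# Toy example with made-up read counts per contig.
counts = {"contig_1": 900_000, "contig_2": 99_990, "contig_3": 10}
kept = filter_contigs_by_cpm(counts, min_cpm=100.0)
print(kept)
```

In this toy data set `contig_3` has a CPM of about 10 and falls below the cutoff, so only the first two contigs are kept. In a real run the read counts would come from the alignment of reads on contigs.
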
A single HTML report file is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool that runs tasks across multiple compute infrastructures in a very portable manner.
Two [Singularity](https://sylabs.io/docs/) containers are available, making installation trivial and results highly reproducible.
metagWGS documentation is available [here](https://forgemia.inra.fr/genotoul-bioinfo/metagwgs/-/tree/dev/docs).
## License
metagWGS is distributed under the GNU General Public License v3.
## Copyright
2021 INRAE
## Citation
metagWGS has been presented at JOBIM 2020:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, C. Noirot, C. Klopp, P. Pinton, S. Combes, C. Hoede, G. Pascal.
https://www.sfbi.fr/sites/sfbi.fr/files/jobim/jobim2020/posters/compressed/jobim2020_poster_9.pdf
metagWGS has been presented at JOBIM 2019 and at Genotoul Biostat Bioinfo day:
Poster "Whole metagenome analysis with metagWGS", J. Fourquet, A. Chaubet, H. Chiapello, C. Gaspin, M. Haenni, C. Klopp, A. Lupo, J. Mainguy, C. Noirot, T. Rochegue, M. Zytnicki, T. Ferry, C. Hoede.