curation

The curation command ranks transcripts for manual curation based on their AED scores and associated penalties.

usage

$ ingenannot -v 2 curation genes.gff genes.manual.curation.gff

positional arguments:

Input

GFF File with AED tags

Output

GFF File with new curation tag

optional arguments:

-h, --help

show this help message and exit

--graphout GRAPHOUT

output filename of the graph, default=curation.png

--graphtitle GRAPHTITLE

output title of the graph, default=AED categories for manual curation
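The options listed above can be combined with the two positional arguments in a single call. The following is a minimal sketch of how such a command line could be assembled from Python. The helper name `build_curation_command` is hypothetical, and the double-hyphen spelling of `--graphout` and `--graphtitle` is assumed from the usual command-line help conventions.

```python
def build_curation_command(gff_in, gff_out, graphout=None, graphtitle=None, verbosity=2):
    """Assemble an argument list for `ingenannot curation` (hypothetical helper).

    Only the flags documented on this page are used; the option spelling with
    two hyphens is an assumption.
    """
    cmd = ["ingenannot", "-v", str(verbosity), "curation", gff_in, gff_out]
    if graphout is not None:
        cmd += ["--graphout", graphout]      # output filename of the graph
    if graphtitle is not None:
        cmd += ["--graphtitle", graphtitle]  # title of the graph
    return cmd


if __name__ == "__main__":
    cmd = build_curation_command(
        "genes.gff",
        "genes.manual.curation.gff",
        graphout="curation.png",
    )
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ingenannot is installed.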

inputs

Gff_genes: gene annotations in GFF/GTF format, with AED scores previously added by ingenannot aed.
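Because the command relies on AED tags being present, it can be useful to check the input file first. The sketch below is not part of ingenannot. It assumes a tab-delimited GFF3 file and that the relevant tags are `aed_ev_tr`, `aed_ev_pr` and `aed_ev_lg` on mRNA features, as in the example input on this page. The function names are hypothetical.

```python
# Tag names taken from the example input on this page (assumption: these are
# the tags the curation command expects on every mRNA feature).
AED_TAGS = ("aed_ev_tr", "aed_ev_pr", "aed_ev_lg")


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcripts_missing_aed(gff_lines, required=AED_TAGS):
    """Return (transcript ID, missing tags) for each mRNA lacking an AED tag."""
    missing = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level features carry the AED tags
        attrs = parse_attributes(cols[8])
        absent = [tag for tag in required if tag not in attrs]
        if absent:
            missing.append((attrs.get("ID", "?"), absent))
    return missing
```

An empty list means every mRNA feature carries the three tags. In practice, `gff_lines` would be an open file handle on the GFF produced by ingenannot aed.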

outputs

We define 7 categories of transcript confidence based on protein and transcriptomic evidence:

  • cat1: high confidence

  • cat2: good confidence

  • cat3: good confidence, supported by one evidence type

  • cat4: moderate confidence

  • cat5: high to moderate confidence, with penalty on structure

  • cat6: bad confidence

  • cat7: no support, only ab-initio prediction

We expect:

  • cat1: gene structures validated by very reliable protein and transcript support

  • cat2: gene structures validated by protein and transcript support, with a lower score for one of the two evidence types

  • cat3: gene structures validated mainly by one evidence type (protein or transcriptomic evidence). May contain false coding gene structures (transcriptomic data only) or annotation errors propagated from protein databanks (protein evidence only)

  • cat4: gene structures with weak evidence support

  • cat5: gene structures that are difficult to define, penalized for poorly supported junctions

  • cat6: gene structures with very weak evidence support

  • cat7: gene structures inferred by ab-initio methods

The outputs are 1) the GFF file annotated with the “curation” category for each transcript and 2) an AED plot: a graphical representation of the AED scores for all transcripts, colored by the associated curation category.

# input gff:
chr_1  ingenannot      gene    109690  112065  .       -       .       ID=ZtIPO323_000010;Name=ZtIPO323_000010;locus_tag=ZtIPO323_000010;
chr_1  ingenannot      mRNA    109690  112065  .       -       .       ID=ZtIPO323_000010.1;Name=ZtIPO323_000010.1;Parent=ZtIPO323_000010;ev_tr=SCA3419A90.2.1_357-2395;aed_ev_tr=0.2359;ev_tr_penalty=no;ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.1.1;aed_ev_lg=0.2734;ev_lg_penalty=no;utr_refine_evidence=PB.1.1;product=uncharacterized protein MYCGRDRAFT_88584;Dbxref=InterPro:-,MobiDBLite:mobidb-lite;locus_tag=ZtIPO323_000010;
chr_1  ingenannot      exon    109690  112065  .       -       .       ID=exon:ZtIPO323_000010.1;Parent=ZtIPO323_000010.1;locus_tag=ZtIPO323_000010;
chr_1  ingenannot      CDS     110070  111146  .       -       0       ID=cds:ZtIPO323_000010.1;Parent=ZtIPO323_000010.1;locus_tag=ZtIPO323_000010;
chr_1  ingenannot      five_prime_UTR  111147  112065  .       -       .       ID=five_prime_UTR_ZtIPO323_000010.1_001;Parent=ZtIPO323_000010.1;locus_tag=ZtIPO323_000010;
chr_1  ingenannot      three_prime_UTR 109690  110069  .       -       .       ID=three_prime_UTR_ZtIPO323_000010.1_001;Parent=ZtIPO323_000010.1;locus_tag=ZtIPO323_000010;
chr_1  ingenannot      gene    112203  116391  .       +       .       ID=ZtIPO323_000020;Name=ZtIPO323_000020;locus_tag=ZtIPO323_000020;
chr_1  ingenannot      mRNA    112203  116391  .       +       .       ID=ZtIPO323_000020.1;Name=ZtIPO323_000020.1;Parent=ZtIPO323_000020;ev_tr=SRR6215485.1.1;aed_ev_tr=0.0152;ev_tr_penalty=no;ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.2.2;aed_ev_lg=0.0266;ev_lg_penalty=no;utr_refine_evidence=PB.2.2;product=Structure-specific endonuclease subunit SLX4;Dbxref=InterPro:-,InterPro:IPR000637,InterPro:IPR017956,InterPro:IPR018574,MobiDBLite:mobidb-lite,Pfam:PF09494,ProSitePatterns:PS00354,SMART:SM00384;Ontology_term=GO:0003677,GO:0005634,GO:0006260,GO:0006281,GO:0006355,GO:0033557;locus_tag=ZtIPO323_000020;
chr_1  ingenannot      exon    112203  116391  .       +       .       ID=exon:ZtIPO323_000020.1;Parent=ZtIPO323_000020.1;locus_tag=ZtIPO323_000020;
chr_1  ingenannot      CDS     112306  116271  .       +       0       ID=cds:ZtIPO323_000020.1;Parent=ZtIPO323_000020.1;locus_tag=ZtIPO323_000020;
chr_1  ingenannot      five_prime_UTR  112203  112305  .       +       .       ID=five_prime_UTR_ZtIPO323_000020.1_001;Parent=ZtIPO323_000020.1;locus_tag=ZtIPO323_000020;
chr_1  ingenannot      three_prime_UTR 116272  116391  .       +       .       ID=three_prime_UTR_ZtIPO323_000020.1_001;Parent=ZtIPO323_000020.1;locus_tag=ZtIPO323_000020;

# output gff:
chr_1  ingenannot      gene    109690  112065  .       -       .       ID=ZtIPO323_000010;
chr_1  ingenannot      mRNA    109690  112065  .       -       .       ID=ZtIPO323_000010.1;Name=ZtIPO323_000010.1;Parent=ZtIPO323_000010;ev_tr=SCA3419A90.2.1_357-2395;aed_ev_tr=0.2359;ev_tr_penalty=no;ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.1.1;aed_ev_lg=0.2734;ev_lg_penalty=no;utr_refine_evidence=PB.1.1;product=uncharacterized protein MYCGRDRAFT_88584;Dbxref=InterPro:-,MobiDBLite:mobidb-lite;locus_tag=ZtIPO323_000010;curation=cat6;
chr_1  ingenannot      exon    109690  112065  .       -       .       ID=exon:ZtIPO323_000010.1;Parent=ZtIPO323_000010.1;
chr_1  ingenannot      CDS     110070  111146  .       -       0       ID=cds:ZtIPO323_000010.1;Parent=ZtIPO323_000010.1;
chr_1  ingenannot      five_prime_UTR  111147  112065  .       -       .       ID=five_prime_UTR_ZtIPO323_000010.1_001;Parent=ZtIPO323_000010.1;
chr_1  ingenannot      three_prime_UTR 109690  110069  .       -       .       ID=three_prime_UTR_ZtIPO323_000010.1_001;Parent=ZtIPO323_000010.1;
chr_1  ingenannot      gene    112203  116391  .       +       .       ID=ZtIPO323_000020;
chr_1  ingenannot      mRNA    112203  116391  .       +       .       ID=ZtIPO323_000020.1;Name=ZtIPO323_000020.1;Parent=ZtIPO323_000020;ev_tr=SRR6215485.1.1;aed_ev_tr=0.0152;ev_tr_penalty=no;ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.2.2;aed_ev_lg=0.0266;ev_lg_penalty=no;utr_refine_evidence=PB.2.2;product=Structure-specific endonuclease subunit SLX4;Dbxref=InterPro:-,InterPro:IPR000637,InterPro:IPR017956,InterPro:IPR018574,MobiDBLite:mobidb-lite,Pfam:PF09494,ProSitePatterns:PS00354,SMART:SM00384;Ontology_term=GO:0003677,GO:0005634,GO:0006260,GO:0006281,GO:0006355,GO:0033557;locus_tag=ZtIPO323_000020;curation=cat5;
chr_1  ingenannot      exon    112203  116391  .       +       .       ID=exon:ZtIPO323_000020.1;Parent=ZtIPO323_000020.1;
chr_1  ingenannot      CDS     112306  116271  .       +       0       ID=cds:ZtIPO323_000020.1;Parent=ZtIPO323_000020.1;
chr_1  ingenannot      five_prime_UTR  112203  112305  .       +       .       ID=five_prime_UTR_ZtIPO323_000020.1_001;Parent=ZtIPO323_000020.1;
chr_1  ingenannot      three_prime_UTR 116272  116391  .       +       .       ID=three_prime_UTR_ZtIPO323_000020.1_001;Parent=ZtIPO323_000020.1;
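To prioritize manual curation, the `curation` tag can be read back from the output GFF. The following sketch, which is not part of ingenannot, collects the category of each transcript and counts transcripts per category. It assumes a tab-delimited file with the tag on mRNA features, as in the example output above. The function names are hypothetical.

```python
from collections import Counter


def curation_categories(gff_lines):
    """Return {transcript ID: category} for mRNA features with a curation tag."""
    categories = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # the curation tag is written on transcripts only
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "curation" in attrs:
            categories[attrs.get("ID", "?")] = attrs["curation"]
    return categories


def count_by_category(categories):
    """Count how many transcripts fall into each curation category."""
    return Counter(categories.values())


if __name__ == "__main__":
    # Two shortened mRNA lines based on the example output above.
    lines = [
        "chr_1\tingenannot\tmRNA\t109690\t112065\t.\t-\t.\t"
        "ID=ZtIPO323_000010.1;Parent=ZtIPO323_000010;curation=cat6;",
        "chr_1\tingenannot\tmRNA\t112203\t116391\t.\t+\t.\t"
        "ID=ZtIPO323_000020.1;Parent=ZtIPO323_000020;curation=cat5;",
    ]
    cats = curation_categories(lines)
    for category, n in sorted(count_by_category(cats).items()):
        print(category, n)
```

Transcripts in the lower-confidence categories can then be listed first for inspection, for example by sorting the dictionary items on the category label.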

Output AED plot with curation colors:

Manual curation AED plot