Compare / Evaluate different annotation datasets
InGenAnnot offers several tools to help you evaluate and improve your gene annotations. These tools are mainly based on the Annotation Edit Distance (AED). This protocol computes and analyzes:
annotation statistics (gene length, number of introns, …)
overlaps in CDS content
AED scores and associated categories for manual curation
Sequence Ontology Classification (SO)
Workflow:

Steps:
1) Validate your annotations and output statistics
You can validate the format of your annotation file and export global statistics at the same time. The statistics let you compare the datasets on gene length, intron size, etc. You can also compare them with related organisms and look for extreme values (very large UTRs, …). For more information on the global statistics, see the validate command.
# validate the gene models of each predictor / source
ingenannot -v 2 validate src1.gff -g genome.fasta -s
ingenannot -v 2 validate src2.gff -g genome.fasta -s
ingenannot -v 2 validate src3.gff -g genome.fasta -s
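When the number of sources grows, the three calls above can be written as a loop. The following is a minimal sketch: the `run` helper and the `DRY_RUN` variable are hypothetical names introduced here for illustration and are not part of InGenAnnot. By default the loop only prints the commands so they can be checked before anything is executed.

```shell
# DRY_RUN and run() are illustrative helpers, not InGenAnnot features.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same validate call as above, once per source
for src in src1 src2 src3; do
    run ingenannot -v 2 validate "${src}.gff" -g genome.fasta -s
done
```

Setting `DRY_RUN=0` before the block runs the commands instead of printing them.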
2) Compare on CDS structure / positions
Cluster the datasets on gene or CDS coordinates to get shared CDS, source-specific loci, … You can export an UpSet plot of shared CDS, or the raw data to draw Venn diagrams. See the compare command.
# write the file of files (file.fof), one line per dataset, formatted as "path to gff<TAB>source"
src1.gff<TAB>src1
src2.gff<TAB>src2
src3.gff<TAB>src3
# then run compare with desired options
ingenannot -v 2 compare file.fof --clustranded --export_specific --graphout --export_upsetplot --export_same_cds
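In the listing above, `<TAB>` stands for a literal tab character, which is easy to lose when copying from a page or typing in an editor that expands tabs to spaces. One way to write the file safely is with `printf`, as in this sketch. It assumes the three example file names used in this protocol, and the final `awk` check is only a convenience added here, not an InGenAnnot feature.

```shell
# Build file.fof with real tab separators; printf expands \t to a tab.
: > file.fof
for src in src1 src2 src3; do
    printf '%s\t%s\n' "${src}.gff" "${src}" >> file.fof
done

# Sanity check: every line must have exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad }' file.fof \
    && echo "file.fof looks well-formed"
```

The same pattern applies to the later steps that need a file of files: only the file names on the left-hand side change.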
3) Use SO classification to detect potential problematic regions
Sequence Ontology (SO) classification (see soclassif) describes how genes and transcripts are positioned relative to each other. This allows you to identify regions with overlapping genes or problematic isoforms.
# write the file of files (file.fof), one line per dataset, formatted as "path to gff<TAB>source"
src1.gff<TAB>src1
src2.gff<TAB>src2
src3.gff<TAB>src3
# then run soclassif
ingenannot -v 2 soclassif file.fof --clustranded --clatype exon
4) Compare AED scores and get metrics for all annotations
You can compare gene sets based on their AED scores computed against transcriptomic and protein evidence. First, annotate your mRNAs with AED scores as described in the gene selection process (see select_best_gene_models). Then, use the aed_compare command to compare the annotated gene sets. This generates several plots of cumulative AED scores and a table of metrics, including the geometric median and the distance to the ideal score (0,0). For more information, refer to the documentation of aed_compare.
# annotate each source with AED (see select_best_gene_models use case for more details)
ingenannot -v 2 -p 10 aed src1.gff src1.aed.gff src1 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
ingenannot -v 2 -p 10 aed src2.gff src2.aed.gff src2 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
ingenannot -v 2 -p 10 aed src3.gff src3.aed.gff src3 transcripts.sorted.gff.gz proteins.sorted.gff.gz --longreads longreads.sorted.gff.gz --evtrstranded --longreads_source "PacBio" --penalty_overflow 0.2 --aed_tr_cds_only
# write the file of files (file.fof), one line per AED-annotated dataset, formatted as "path to gff<TAB>source"
src1.aed.gff<TAB>src1
src2.aed.gff<TAB>src2
src3.aed.gff<TAB>src3
# then run aed_compare
ingenannot -v 2 aed_compare file.fof
5) Get manual curation categories
We have defined seven categories to prioritize manual curation. The more transcripts you have in categories 1, 2, and 3, the better your gene annotations align with the provided evidence. For more information, refer to the documentation of the curation command.
# You have to launch the tool for each annotation dataset
ingenannot -v 2 curation src1.aed.gff src1.aed.curation.gff
ingenannot -v 2 curation src2.aed.gff src2.aed.curation.gff
ingenannot -v 2 curation src3.aed.gff src3.aed.curation.gff
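The per-dataset calls above can also be looped, deriving each output name from its input name. This is a sketch that assumes the AED-annotated files from step 4 sit in the current directory. The existence check is an addition made here so that a missing input is reported and skipped rather than passed to the tool.

```shell
for in_gff in src1.aed.gff src2.aed.gff src3.aed.gff; do
    # Strip the trailing ".gff" and append ".curation.gff",
    # e.g. src1.aed.gff -> src1.aed.curation.gff
    out_gff="${in_gff%.gff}.curation.gff"

    # Skip inputs that have not been produced yet
    if [ ! -f "$in_gff" ]; then
        echo "skipping ${in_gff}: file not found" >&2
        continue
    fi

    ingenannot -v 2 curation "$in_gff" "$out_gff"
done
```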