select

The select command selects the best gene models.

usage

$ ingenannot -v 2 select file.fof selected_genes.gff --noaed --clustranded --penalty_overflow 0.25 --use_ev_lg --no_cds_overlap

positional arguments:

FoF

File of files, <GFF/GTF>TAB<source>

Output

Output annotation file in GFF format

optional arguments:

-h, --help

show this help message and exit

--noaed

If set, use the precomputed aed info available in the gff file; no aed computation is performed

--clutype CLUTYPE

Feature type used for clustering: [gene, cds], default=cds

--clustranded

Same strand orientation required to cluster features, default=False

--evtr EVTR

Gff file of transcript evidence

--evpr EVPR

Gff file of protein evidence, compressed and indexed with tabix

--evtr_source EVTR_SOURCE

Source name for the transcript evidence Gff file, e.g. “stringtie”, default=undefined

--evpr_source EVPR_SOURCE

Source name for the protein evidence Gff file, e.g. “blastx, miniprot”, default=undefined

--evtrstranded

Same strand orientation required to consider a match with transcript evidence, default=False

--evprstranded

Same strand orientation required to consider a match with protein evidence, default=False

--penalty_overflow PENALTY_OVERFLOW

If a CDS is longer than the transcript evidence or violates its intron constraints, add a penalty to the computation of the aed score, range [0.0-1.0], default=0.0 (no penalty)

--nbsrc_filter NBSRC_FILTER

Number of sources required to bypass aedtr and aedpr filtering, default=max number of sources + 1

--aedtr AEDTR

Minimum aedtr required when filtering, default=1.0

--aedpr AEDPR

Minimum aedpr required when filtering, default=1.0

--aed_tr_cds_only

For transcripts (short reads and long reads), compute aed on CDS only, instead of on exon and CDS with best score selection, default=False

--use_ev_lg

Use the long-read aed instead of aed_tr if it is better, default=False

--nbsrc_absolute NBSRC_ABSOLUTE

Number of sources required to keep a gene, default=1

--min_cds_len MIN_CDS_LEN

Minimum CDS length required, default=90 (a protein of 30 AA)

--no_partial

If the best selected CDS is partial (no start or stop codon), export another CDS if possible, default=False

--genome GENOME

Genome in fasta format, required with --no_partial

--longreads LONGREADS

Gff file of long-read-based transcript evidence, compressed and indexed with tabix

--longreads_source LONGREADS_SOURCE

Source name for the long-read-based evidence Gff file, e.g. “Iso-Seq”, default=undefined

--longreads_penalty_overflow LONGREADS_PENALTY_OVERFLOW

If a CDS is longer than the long-read-based transcript evidence or violates its intron constraints, add a penalty to the computation of the aed score, range [0.0-1.0], default=0.25

--gaeval

Expect a gaeval tsv file for each annotation file; the File of files then has the form <GFF/GTF>TAB<source>TAB<gaeval>

--prefix PREFIX

prefix for gene name, default=G

--no_export

Export a separate gff file containing the best gene models that were filtered out and not exported

--no_cds_overlap

Post-processing filter that removes the worst CDS when it overlaps another CDS

inputs

File of Files (FoF) listing all files to analyze, one per line, in the form <GFF/GTF>TAB<source>. If the --gaeval option is set, a gaeval tsv file is expected for each annotation file, and each line takes the form <GFF/GTF>TAB<source>TAB<gaeval>. The --no_cds_overlap option is recommended to avoid overlapping CDS on the same or opposite strand, especially if gene prediction was performed on a specific strand.
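Because the FoF layout is strictly tab-separated, it can be worth checking the file before launching a run. The sketch below is not part of ingenannot; `parse_fof` is a hypothetical helper, the file names are placeholders, and skipping blank lines is an assumption (the tool itself may be stricter).

```python
def parse_fof(text, with_gaeval=False):
    """Parse File of Files content: <GFF/GTF>TAB<source>[TAB<gaeval>] per line.

    Hypothetical helper for pre-flight checks, not an ingenannot function.
    """
    expected = 3 if with_gaeval else 2
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != expected or not all(fields):
            raise ValueError(
                f"line {lineno}: expected {expected} non-empty "
                f"tab-separated fields, got {fields!r}"
            )
        entries.append(tuple(fields))
    return entries


# Placeholder file names; the source labels are taken from the output example.
example = "eugene.gff3\teugene\nrres.gff3\tRRES\n"
for entry in parse_fof(example):
    print(entry)
```

A line separated by spaces instead of a tab raises a ValueError, which is the most common formatting mistake in hand-written FoF files.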

outputs

The main output is the gff file with the selected genes. Many tags, such as AED scores, are added to mRNA features:

chr_1        ingenannot      gene    615870  616337  .       -       .       ID=G_00002;gene_source=gene:ZtritIPO323_04g00197;source=RRES;
chr_1        ingenannot      mRNA    615870  616337  .       -       .       ID=G_00002.1;transcript_source=mRNA:ZtritIPO323_04t00197;source=RRES;Parent=G_00002;ev_tr=None;aed_ev_tr=1.0000;ev_tr_penalty=undef;ev_pr=None;aed_ev_pr=1.0000;ev_lg=None;aed_ev_lg=1.0000;ev_lg_penalty=undef;
chr_1        ingenannot      exon    615870  616337  .       -       .       ID=exon:G_00002.1;Parent=G_00002.1;
chr_1        ingenannot      CDS     615870  616337  .       -       0       ID=cds:G_00002.1;Parent=G_00002.1;
chr_1        ingenannot      gene    617267  619263  .       -       .       ID=G_00003;gene_source=gene:gene:chr_1g0002351;source=eugene;
chr_1        ingenannot      mRNA    617267  619263  .       -       .       ID=G_00003.1;transcript_source=mRNA:mRNA:chr_1g0002351;source=eugene;Parent=G_00003;ev_tr=SCA3419A35.107.1;aed_ev_tr=0.0920;ev_tr_penalty=no;ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.103.1;aed_ev_lg=0.0000;ev_lg_penalty=no;
chr_1        ingenannot      exon    617267  617497  .       -       .       ID=exon:G_00003.1;Parent=G_00003.1;
chr_1        ingenannot      exon    617552  619263  .       -       .       ID=exon:G_00003.2;Parent=G_00003.1;
chr_1        ingenannot      CDS     617367  617497  .       -       2       ID=cds:G_00003.1;Parent=G_00003.1;
chr_1        ingenannot      CDS     617552  619004  .       -       0       ID=cds:G_00003.1;Parent=G_00003.1;

If the option --no_export is set, the best gene models that were filtered out (aed filter, nb source, cds_overlap) are exported in a separate gff file.
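The AED tags on mRNA features can be read back with a few lines of Python. This is a minimal sketch, assuming a tab-separated GFF with nine columns and `key=value;` attributes as in the example above; `parse_attributes` and `mrna_aed_scores` are hypothetical helper names, not part of ingenannot.

```python
AED_TAGS = ("aed_ev_tr", "aed_ev_pr", "aed_ev_lg")


def parse_attributes(column9):
    """Split a GFF attribute column ('key=value;key=value;') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


def mrna_aed_scores(gff_lines):
    """Yield (mRNA ID, {tag: score}) for every mRNA feature."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        scores = {tag: float(attrs[tag]) for tag in AED_TAGS if tag in attrs}
        yield attrs["ID"], scores


# One mRNA line from the example output, rebuilt with tab separators.
example = "\t".join([
    "chr_1", "ingenannot", "mRNA", "617267", "619263", ".", "-", ".",
    "ID=G_00003.1;transcript_source=mRNA:mRNA:chr_1g0002351;source=eugene;"
    "Parent=G_00003;ev_tr=SCA3419A35.107.1;aed_ev_tr=0.0920;ev_tr_penalty=no;"
    "ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.103.1;aed_ev_lg=0.0000;ev_lg_penalty=no;",
])

for mrna_id, scores in mrna_aed_scores([example]):
    print(mrna_id, scores)
```

To process a real output file, pass an open file handle instead of the one-element list, for instance `mrna_aed_scores(open("selected_genes.gff"))`.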