select
The select command selects the best gene models.
usage
$ ingenannot -v 2 select file.fof selected_genes.gff --noaed --clustranded --penalty_overflow 0.25 --use_ev_lg --no_cds_overlap
positional arguments:
FoF |
File of files, <GFF/GTF>TAB<source> |
Output |
Output annotation file in GFF format |
optional arguments:
-h, --help |
show this help message and exit |
--noaed |
If set, use the precomputed AED info available in the GFF file; no AED is computed |
--clutype CLUTYPE |
Feature type used to cluster: [gene, cds], default=cds |
--clustranded |
Same strand orientation required to cluster features, default=False |
--evtr EVTR |
GFF file of transcript evidence |
--evpr EVPR |
GFF file of protein evidence, compressed and indexed with tabix |
--evtr_source EVTR_SOURCE |
Source for the transcript evidence GFF file, e.g. "stringtie", default=undefined |
--evpr_source EVPR_SOURCE |
Source for the protein evidence GFF file, e.g. "blastx, miniprot", default=undefined |
--evtrstranded |
Same strand orientation required to consider a match with transcript evidence, default=False |
--evprstranded |
Same strand orientation required to consider a match with protein evidence, default=False |
--penalty_overflow PENALTY_OVERFLOW |
If a CDS is longer than the transcript evidence or violates its intron constraints, add a penalty to the AED score computation, range [0.0-1.0], default=0.0 (no penalty) |
--nbsrc_filter NBSRC_FILTER |
Number of sources required to bypass aedtr and aedpr filtering, default=max number of sources + 1 |
--aedtr AEDTR |
Minimum aedtr required when filtering, default=1.0 |
--aedpr AEDPR |
Minimum aedpr required when filtering, default=1.0 |
--aed_tr_cds_only |
For transcripts (short reads and long reads), compute AED on CDS only, instead of exon and CDS, with best score selection, default=False |
--use_ev_lg |
Use the long-read AED instead of aed_tr if it is better, default=False |
--nbsrc_absolute NBSRC_ABSOLUTE |
Number of sources required to keep a gene, default=1 |
--min_cds_len MIN_CDS_LEN |
Minimum CDS length required, default=90 (a 30-amino-acid protein) |
--no_partial |
If the best selected CDS is partial (no start or stop codon), export another CDS if possible, default=False |
--genome GENOME |
Genome in FASTA format, required with --no_partial |
--longreads LONGREADS |
GFF file of long-read-based transcript evidence, compressed and indexed with tabix |
--longreads_source LONGREADS_SOURCE |
Source for the long-read evidence GFF file, e.g. "Iso-Seq", default=undefined |
--longreads_penalty_overflow LONGREADS_PENALTY_OVERFLOW |
If a CDS is longer than the long-read transcript evidence or violates its intron constraints, add a penalty to the AED score computation, range [0.0-1.0], default=0.25 |
--gaeval |
Expect a gaeval TSV file for each annotation file; File of files: <GFF/GTF>TAB<source>TAB<gaeval> |
--prefix PREFIX |
Prefix for gene names, default=G |
--no_export |
Export a GFF file containing the best genes that were not exported |
--no_cds_overlap |
Post-processing filter that removes the worst CDS when it overlaps another CDS |
inputs
File of Files (FoF) listing all files to analyze, one per line, in the form <GFF/GTF>TAB<source>. If the --gaeval option is set, a gaeval TSV file is expected for each annotation file: <GFF/GTF>TAB<source>TAB<gaeval>. The --no_cds_overlap option is recommended to avoid overlapping CDS on the same or opposite strand, especially if you performed gene prediction on a specific strand.
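A FoF can be written by hand, but a short script helps when many annotation sets are involved. The following is a minimal sketch, not part of ingenannot: the helper name `format_fof` and the file names and source labels (`eugene.gff`, `braker.gff`) are hypothetical, and only the tab-separated layout is taken from the description above.

```python
def format_fof(entries):
    """Return FoF text: one tab-separated line per annotation file.

    Each entry is (annotation, source) or, for use with --gaeval,
    (annotation, source, gaeval_tsv). All entries must have the same width.
    """
    widths = {len(entry) for entry in entries}
    if len(widths) != 1 or not widths <= {2, 3}:
        raise ValueError("all entries need the same 2 or 3 fields")
    return "".join("\t".join(entry) + "\n" for entry in entries)


# Hypothetical annotation files and source labels.
entries = [
    ("eugene.gff", "eugene"),
    ("braker.gff", "braker"),
]
fof_text = format_fof(entries)
print(fof_text, end="")
```

Saving `fof_text` to a file such as `file.fof` gives the first positional argument of the command.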
outputs
The main output is the GFF file of selected genes. Several tags, such as AED scores, are added to mRNA features:
chr_1 ingenannot gene 615870 616337 . - . ID=G_00002;gene_source=gene:ZtritIPO323_04g00197;source=RRES;
chr_1 ingenannot mRNA 615870 616337 . - . ID=G_00002.1;transcript_source=mRNA:ZtritIPO323_04t00197;source=RRES;Parent=G_00002;ev_tr=None;aed_ev_tr=1.0000;ev_tr_penalty=undef;ev_pr=None;aed_ev_pr=1.0000;ev_lg=None;aed_ev_lg=1.0000;ev_lg_penalty=undef;
chr_1 ingenannot exon 615870 616337 . - . ID=exon:G_00002.1;Parent=G_00002.1;
chr_1 ingenannot CDS 615870 616337 . - 0 ID=cds:G_00002.1;Parent=G_00002.1;
chr_1 ingenannot gene 617267 619263 . - . ID=G_00003;gene_source=gene:gene:chr_1g0002351;source=eugene;
chr_1 ingenannot mRNA 617267 619263 . - . ID=G_00003.1;transcript_source=mRNA:mRNA:chr_1g0002351;source=eugene;Parent=G_00003;ev_tr=SCA3419A35.107.1;aed_ev_tr=0.0920;ev_tr_penalty=no;ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.103.1;aed_ev_lg=0.0000;ev_lg_penalty=no;
chr_1 ingenannot exon 617267 617497 . - . ID=exon:G_00003.1;Parent=G_00003.1;
chr_1 ingenannot exon 617552 619263 . - . ID=exon:G_00003.2;Parent=G_00003.1;
chr_1 ingenannot CDS 617367 617497 . - 2 ID=cds:G_00003.1;Parent=G_00003.1;
chr_1 ingenannot CDS 617552 619004 . - 0 ID=cds:G_00003.1;Parent=G_00003.1;
If the --no_export option is set, the best gene models that were filtered out (AED filter, number of sources, CDS overlap) are exported to a separate GFF file.
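The tags on the mRNA features can be read back for downstream filtering or reporting. The sketch below is an illustration rather than an ingenannot API: the function names are hypothetical, the tag names (`aed_ev_tr`, `aed_ev_pr`, `aed_ev_lg`) are taken from the sample output above, and every `aed_ev_*` value is assumed to be numeric.

```python
def parse_attributes(column9):
    """Turn 'key=value;key=value;' into a dict."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if item)
    return {key: value for key, value in pairs}


def mrna_scores(gff_lines):
    """Yield (mRNA ID, {tag: float}) for the aed_ev_* tags on mRNA features."""
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # GFF columns are tab-separated; split() on any whitespace also
        # copes with the space-separated rendering of the sample above.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "mRNA":
            continue
        attrs = parse_attributes(fields[8])
        scores = {k: float(v) for k, v in attrs.items() if k.startswith("aed_ev_")}
        yield attrs["ID"], scores


# Two lines copied from the sample output.
sample = [
    "chr_1 ingenannot gene 617267 619263 . - . ID=G_00003;gene_source=gene:gene:chr_1g0002351;source=eugene;",
    "chr_1 ingenannot mRNA 617267 619263 . - . ID=G_00003.1;transcript_source=mRNA:mRNA:chr_1g0002351;source=eugene;Parent=G_00003;ev_tr=SCA3419A35.107.1;aed_ev_tr=0.0920;ev_tr_penalty=no;ev_pr=None;aed_ev_pr=1.0000;ev_lg=PB.103.1;aed_ev_lg=0.0000;ev_lg_penalty=no;",
]
for mrna_id, scores in mrna_scores(sample):
    print(mrna_id, scores)
```

For a real file, pass an open file handle instead of `sample`, for example `mrna_scores(open("selected_genes.gff"))`.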