soclassif

The soclassif command performs Sequence Ontology (SO) classification.

usage

ingenannot -v 2 soclassif file.fof --clustranded --clatype exon

positional arguments:

fof

File of files, <GFF/GTF>TAB<source>

optional arguments:

-h, --help

show this help message and exit

--clutype CLUTYPE

Feature type used to clusterize: [gene, cds], default=cds

--clustranded

Same strand orientation required to cluster features, default=False

--clatype CLATYPE

Feature type used to classify: [gene, cds], default=cds

inputs

File of Files (FoF) listing all files to analyze, one per line, in the format <GFF/GTF>TAB<source>. To analyze only the isoforms of a single file, put a single line in the FoF.
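As an illustration of the FoF layout, the following minimal Python sketch builds the tab-separated lines. The file names and source labels (`braker.gff3`, `maker.gff3`, `braker`, `maker`) and the helper name `fof_text` are hypothetical placeholders, not part of the tool.

```python
def fof_text(entries):
    """Build FoF content: one '<GFF/GTF>TAB<source>' line per annotation file."""
    return "".join(f"{path}\t{source}\n" for path, source in entries)


if __name__ == "__main__":
    # Hypothetical annotation files and source labels, for illustration only.
    entries = [("braker.gff3", "braker"), ("maker.gff3", "maker")]
    with open("file.fof", "w") as handle:
        handle.write(fof_text(entries))
```

The resulting `file.fof` can then be passed as the positional `fof` argument shown in the usage line.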

outputs

Statistics for each category:

11661 metagenes with only one transcript, not analyzed
Classification:
N:O:O:0
N:N:O:1
N:O:N:0
N:N:N:1
O:N:O:1198
O:N:N:160
O:O:N:384
unclassified:11661
nb classified metagenes with all transcripts sharing the same CDS: 189
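For downstream processing, a report like the one above could be parsed with a short Python sketch. This assumes the statistics are available as text in exactly the layout of the sample (one `label:count` pair per line), which may differ between versions; `parse_classification` is a hypothetical helper, not a function provided by the tool.

```python
REPORT = """\
11661 metagenes with only one transcript, not analyzed
Classification:
N:O:O:0
N:N:O:1
N:O:N:0
N:N:N:1
O:N:O:1198
O:N:N:160
O:O:N:384
unclassified:11661
nb classified metagenes with all transcripts sharing the same CDS: 189
"""


def parse_classification(report):
    """Map each 'label:count' line to an integer count; other lines are skipped."""
    counts = {}
    for line in report.splitlines():
        # Split on the LAST colon, since class labels contain colons themselves.
        label, sep, value = line.strip().rpartition(":")
        if sep and value.strip().isdigit():
            counts[label] = int(value)
    return counts


counts = parse_classification(REPORT)
# Class labels are the entries made of three colon-separated fields.
classified = sum(n for label, n in counts.items() if label.count(":") == 2)
print(classified)  # 1744 for the sample report
```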

The categories are defined by the SO as follows:

Class

definition

N:0:0

No transcript-pairs share any exon sequence

../_images/N_0_0.png

N:N:0

Some transcript-pairs share sequence, but none have common exon boundaries

../_images/N_N_0.png

N:0:N

Some transcript-pairs share no sequence, others have common exon boundaries

../_images/N_0_N.png

N:N:N

Some transcript-pairs share no sequence, others have common sequence and exon boundaries

../_images/N_N_N.png

0:N:0

All transcript-pairs share sequence in common, but none share exon boundaries

../_images/0_N_0.png

0:N:N

All transcript-pairs share sequence in common and some share exon boundaries

../_images/0_N_N.png

0:0:N

All transcript-pairs share some exons in common

../_images/0_0_N.png
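The seven classes can be read as three flags, one per kind of transcript pair found in a locus. The Python sketch below illustrates that reading. It is not the tool's implementation and rests on these assumptions: a transcript is a list of inclusive `(start, end)` exon coordinates, "common exon boundaries" means at least one exon with identical start and end, and the names `pair_category` and `so_class` are hypothetical.

```python
from itertools import combinations


def pair_category(t1, t2):
    """Classify one transcript pair.

    0: no shared sequence
    1: shared sequence, but no exon with identical boundaries
    2: at least one exon with identical boundaries (assumed meaning of
       'common exon boundaries')
    """
    if set(t1) & set(t2):
        return 2
    # Inclusive coordinates: two exons overlap if each starts before the other ends.
    overlap = any(s1 <= e2 and s2 <= e1 for s1, e1 in t1 for s2, e2 in t2)
    return 1 if overlap else 0


def so_class(transcripts):
    """Return a label such as '0:N:N', or None when there is no pair to compare."""
    if len(transcripts) < 2:
        return None  # single-transcript loci are not analyzed
    seen = [False, False, False]
    for t1, t2 in combinations(transcripts, 2):
        seen[pair_category(t1, t2)] = True
    return ":".join("N" if flag else "0" for flag in seen)


a = [(1, 100), (200, 300)]
b = [(1, 100), (400, 500)]   # shares exon (1, 100) with a
c = [(50, 150)]              # overlaps a and b, no identical exon
d = [(1000, 1100)]           # no overlap with a, b or c

print(so_class([a, b]))      # 0:0:N
print(so_class([a, c]))      # 0:N:0
print(so_class([a, b, d]))   # N:0:N
```

Each position of the label is `N` when at least one pair of that kind exists in the locus and `0` otherwise, which yields exactly the seven non-empty combinations listed above.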

As described above, the SO classification was originally based on exon boundaries, which can be highly problematic for de novo annotations with poorly defined UTRs. To avoid this problem, you can perform the same classification on CDS coordinates instead, which gives less biased results. The pros and cons of each classification feature type are summarized below.

--clatype gene

pros: complete gene structure analysis

cons: too sensitive when the annotation sets diverge (e.g. UTR vs. no UTR)

--clatype cds

pros: limited to the coding sequence, which avoids background noise due to UTRs; useful when UTRs are poorly predicted

cons: structure inspection limited to the CDS

Each analyzed locus is exported, with its category, to the corresponding GFF file.