isoform_ranking

The isoform_ranking command ranks isoforms based on RNA-Seq coverage computed from one or more BAM files.

usage

ingenannot -v 2 isoform_ranking transcripts.gff -f file.fof --alt_threshold 0.1 --rescue

positional arguments:

Gff_transcripts

Gff file of transcripts

optional arguments:

-h, --help

show this help message and exit

-p PREFIX, --prefix PREFIX

Prefix for output annotation files in GFF file format, default=isoforms

-b BAM, --bam BAM

BAM file to analyze

--paired

Whether the BAM file is paired, default=False

--stranded

Whether the BAM file is stranded, default=False

-f FOF, --fof FOF

File of BAM files, with one BAM file per line in the format <bam>TAB<type>TAB<stranded>

--sj_threshold SJ_THRESHOLD

Threshold, expressed as a ratio of coverage, used to keep a junction for ranking, default=0.05

--cov_threshold COV_THRESHOLD

Threshold, relative to the median, used to exclude bases from the coverage count, default=0.05

--alt_threshold ALT_THRESHOLD

Threshold, based on junction coverage, that an isoform must reach to be kept in isoforms.alternatives.gff, default=0.1

--rescue

If set, when no transcript is selected because of unsupported junctions, keep at least one based on coverage, default=False

--sj_full

Junctions supported by only one side will be analyzed as shared junctions. If set, both sides need to overlap all transcript to be considered in ranking, default=False
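The --fof option expects three tab-separated columns per line. As a minimal sketch, the layout of such a file could be checked before launching the command. The function name parse_fof is hypothetical, and because the accepted values of the type and stranded columns are not listed here, the strings TYPE and STRANDED below are placeholders, not real values.

```python
def parse_fof(text):
    """Split a file-of-files into (bam, type, stranded) tuples.

    Only the three-column, tab-separated layout is checked; the content
    of the type and stranded columns is not validated, since the accepted
    values are not described in this page.
    """
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 tab-separated fields, got {len(fields)}"
            )
        entries.append(tuple(fields))
    return entries


# Placeholder content: TYPE and STRANDED stand in for real column values.
example = "sample1.bam\tTYPE\tSTRANDED\nsample2.bam\tTYPE\tSTRANDED\n"
print(parse_fof(example))
```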

inputs

The transcripts file (Gff_transcripts) in GFF/GTF format.

outputs

Several outputs are expected:

  • isoforms.ranking.gff: all selected transcripts with rank

  • isoforms.top.gff: top isoform for all selected transcripts

  • isoforms.alternatives.gff: top isoform with the best alternative isoforms

  • isoforms.unclassif.gff: removed isoforms (unsupported junctions, below abundance threshold)

isoform_ranking groups together isoforms that share the same structure but may differ in their UTRs. This implies that only one isoform of each structure is kept in the alternatives file. Let's have a look at the example below:

all data

We use 2 BAM files to analyze the coverage of each isoform. There are 4 isoforms, 2 of which (PB.112.2 and PB.112.3) have the same structure but different UTRs. Both correspond to the major isoform based on splicing coverage. In the isoforms.ranking.gff file, they are ranked at the top as the 2 most probable isoforms, whatever the coordinates of their UTRs. The second major structure is PB.112.4, with a smaller exon, followed by PB.112.1, with a longer second exon. The expected ranking in such a case is therefore: PB.112.2 / PB.112.3, PB.112.4, PB.112.1. To decide between ranks 1 and 2 for PB.112.2 and PB.112.3, a coverage analysis orders the isoforms by how well they fit the median. In the end we obtain:

ranking data
chr_1   ingenannot-isoform-ranking      transcript      646708  648848  .       -       .       gene_id "PB.112";transcript_id "PB.112.3";rank "1";
chr_1   ingenannot-isoform-ranking      exon    646708  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647529  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647669  648293  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    648441  648848  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      transcript      646700  648827  .       -       .       gene_id "PB.112";transcript_id "PB.112.2";rank "2";
chr_1   ingenannot-isoform-ranking      exon    646700  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.2"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.2"
chr_1   ingenannot-isoform-ranking      exon    647529  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.2"
chr_1   ingenannot-isoform-ranking      exon    647669  648293  .       -       .       gene_id "PB.112"; transcript_id "PB.112.2"
chr_1   ingenannot-isoform-ranking      exon    648441  648827  .       -       .       gene_id "PB.112"; transcript_id "PB.112.2"
chr_1   ingenannot-isoform-ranking      transcript      646716  648836  .       -       .       gene_id "PB.112";transcript_id "PB.112.4";rank "3";
chr_1   ingenannot-isoform-ranking      exon    646716  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    647541  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    647669  648293  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    648441  648836  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      transcript      646476  648836  .       -       .       gene_id "PB.112";transcript_id "PB.112.1";rank "4";
chr_1   ingenannot-isoform-ranking      exon    646476  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    647529  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    647669  648299  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    648441  648836  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
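The grouping of isoforms by structure can be illustrated with the exon coordinates listed above. The sketch below assumes that "same structure" means an identical intron chain (same internal exon boundaries, with only the outer transcript ends differing). This is an assumption made for illustration and may not match the exact criterion used by isoform_ranking. The function names are hypothetical.

```python
from collections import defaultdict

# Exon coordinates (start, end) copied from the ranking data above.
EXONS = {
    "PB.112.3": [(646708, 647301), (647389, 647472), (647529, 647573),
                 (647669, 648293), (648441, 648848)],
    "PB.112.2": [(646700, 647301), (647389, 647472), (647529, 647573),
                 (647669, 648293), (648441, 648827)],
    "PB.112.4": [(646716, 647301), (647389, 647472), (647541, 647573),
                 (647669, 648293), (648441, 648836)],
    "PB.112.1": [(646476, 647301), (647389, 647472), (647529, 647573),
                 (647669, 648299), (648441, 648836)],
}


def intron_chain(exons):
    """Return the introns as (previous exon end, next exon start) pairs."""
    ordered = sorted(exons)
    return tuple((left[1], right[0]) for left, right in zip(ordered, ordered[1:]))


def group_by_structure(transcripts):
    """Group transcript ids that share the same intron chain."""
    groups = defaultdict(list)
    for transcript_id, exons in transcripts.items():
        groups[intron_chain(exons)].append(transcript_id)
    return list(groups.values())


print(group_by_structure(EXONS))
# [['PB.112.3', 'PB.112.2'], ['PB.112.4'], ['PB.112.1']]
```

Under this assumption, PB.112.2 and PB.112.3 fall into one group because they differ only at the outer transcript ends, while PB.112.4 and PB.112.1 each differ at one internal exon boundary.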

The top isoform is PB.112.3, so isoforms.top.gff contains only this transcript.

top data
chr_1   ingenannot-isoform-ranking      transcript      646708  648848  .       -       .       gene_id "PB.112";transcript_id "PB.112.3";rank "1";
chr_1   ingenannot-isoform-ranking      exon    646708  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647529  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647669  648293  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    648441  648848  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"

The isoforms.alternatives.gff file contains one version of each selected structure, avoiding UTR isoforms, which makes it more suitable for differential expression analysis or annotation of gene isoforms. In this case, the ranking is reordered after removing the UTR isoforms.

alternatives data
chr_1   ingenannot-isoform-ranking      transcript      646708  648848  .       -       .       gene_id "PB.112";transcript_id "PB.112.3";rank "1";
chr_1   ingenannot-isoform-ranking      exon    646708  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647529  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    647669  648293  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      exon    648441  648848  .       -       .       gene_id "PB.112"; transcript_id "PB.112.3"
chr_1   ingenannot-isoform-ranking      transcript      646716  648836  .       -       .       gene_id "PB.112";transcript_id "PB.112.4";rank "2";
chr_1   ingenannot-isoform-ranking      exon    646716  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    647541  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    647669  648293  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      exon    648441  648836  .       -       .       gene_id "PB.112"; transcript_id "PB.112.4"
chr_1   ingenannot-isoform-ranking      transcript      646476  648836  .       -       .       gene_id "PB.112";transcript_id "PB.112.1";rank "3";
chr_1   ingenannot-isoform-ranking      exon    646476  647301  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    647389  647472  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    647529  647573  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    647669  648299  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"
chr_1   ingenannot-isoform-ranking      exon    648441  648836  .       -       .       gene_id "PB.112"; transcript_id "PB.112.1"

Here no isoform was filtered out, so the isoforms.unclassif.gff file is empty.
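The relation between the ranking and alternatives files can be sketched as follows: read the rank attribute of each transcript line, keep the best-ranked isoform of each structure, then renumber. This is an illustrative sketch only, not the tool's implementation. The STRUCTURE labels "A", "B" and "C" are hypothetical and simply encode the fact, stated above, that PB.112.2 and PB.112.3 share one structure.

```python
import re

# Transcript lines copied from the ranking data shown earlier.
RANKING_LINES = [
    'chr_1 ingenannot-isoform-ranking transcript 646708 648848 . - . gene_id "PB.112";transcript_id "PB.112.3";rank "1";',
    'chr_1 ingenannot-isoform-ranking transcript 646700 648827 . - . gene_id "PB.112";transcript_id "PB.112.2";rank "2";',
    'chr_1 ingenannot-isoform-ranking transcript 646716 648836 . - . gene_id "PB.112";transcript_id "PB.112.4";rank "3";',
    'chr_1 ingenannot-isoform-ranking transcript 646476 648836 . - . gene_id "PB.112";transcript_id "PB.112.1";rank "4";',
]

# Hypothetical structure labels: isoforms with the same label differ only by UTRs.
STRUCTURE = {"PB.112.3": "A", "PB.112.2": "A", "PB.112.4": "B", "PB.112.1": "C"}

ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_transcript(line):
    """Return (transcript_id, rank) from a transcript line."""
    # The first 8 columns contain no spaces, so the 9th field holds the attributes.
    fields = line.split(None, 8)
    attributes = dict(ATTRIBUTE.findall(fields[8]))
    return attributes["transcript_id"], int(attributes["rank"])


def alternatives(ranked, structure):
    """Keep the best-ranked isoform of each structure and renumber the ranks."""
    seen = set()
    kept = []
    for transcript_id, _rank in sorted(ranked, key=lambda item: item[1]):
        if structure[transcript_id] in seen:
            continue  # a better-ranked isoform with this structure is already kept
        seen.add(structure[transcript_id])
        kept.append((transcript_id, len(kept) + 1))
    return kept


ranked = [parse_transcript(line) for line in RANKING_LINES]
print(alternatives(ranked, STRUCTURE))
# [('PB.112.3', 1), ('PB.112.4', 2), ('PB.112.1', 3)]
```

The result matches the alternatives data above: PB.112.2 is dropped as a UTR variant of PB.112.3, and PB.112.4 and PB.112.1 move up to ranks 2 and 3.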