Add new gene isoforms to your annotations

Here we propose a protocole to add potential isoforms to your canonical gene models. This protocole is based on the extraction of Sqanti3 [1] results.

Workflow:

digraph iso {
   "Rank long-read transcript isoforms" -> "Annotate isoforms with Sqanti3";
   "Annotate isoforms with Sqanti3" -> "Filter type of isoforms";
   "Filter type of isoforms" -> "Add isoforms";
   "Add isoforms" -> "Explore SO classification of isoforms";
}

Steps:

The starting point is the gff file obtained from select_best_gene_models use case. If want to add isoforms on gff coming from another source, perform a validation of your gff with the validate command before.

1) Extract the most representative isoforms from long-read transcripts

If you have both long-read transcripts and short-read data, we recommend annotating isoforms using long-read transcripts based on coverage from short-reads. If short-read data is not available, you can proceed without filtering or use short-read transcripts but results will be of lower quality.

# write your file of files: bam.fof as below, such as "path to bam<tab>paired<tab>stranded"
/tmp/run1.singleton.stranded.bam    False   True
/tmp/run2.singleton.unstranded.bam  False   False
/tmp/run3.paired.stranded.bam  True  True
/tmp/run4.paired.unstranded.bam  true   False

# your longreads.gff file must look like a gtf file, with several trancript for the same gene, as below:
chr_1       ingenannot-isoform-ranking      transcript      140523  145168  .       -       .       gene_id "PB.12"; transcript_id "PB.12.1";
chr_1       ingenannot-isoform-ranking      exon    140523  145168  .       -       .       gene_id "PB.12"; transcript_id "PB.12.1";
chr_1       ingenannot-isoform-ranking      transcript      140531  145121  .       -       .       gene_id "PB.12"; transcript_id "PB.12.2";
chr_1       ingenannot-isoform-ranking      exon    140531  145121  .       -       .       gene_id "PB.12"; transcript_id "PB.12.2";
chr_1       ingenannot-isoform-ranking      transcript      140579  144883  .       -       .       gene_id "PB.12"; transcript_id "PB.12.3";
chr_1       ingenannot-isoform-ranking      exon    140579  144883  .       -       .       gene_id "PB.12"; transcript_id "PB.12.3";
chr_1       ingenannot-isoform-ranking      transcript      140708  144901  .       -       .       gene_id "PB.12"; transcript_id "PB.12.4";
chr_1       ingenannot-isoform-ranking      exon    140708  144901  .       -       .       gene_id "PB.12"; transcript_id "PB.12.4";
chr_1       ingenannot-isoform-ranking      transcript      202365  205520  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      exon    202365  205031  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      exon    205203  205520  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      transcript      202939  205422  .       -       .       gene_id "PB.22"; transcript_id "PB.22.2";
chr_1       ingenannot-isoform-ranking      exon    202939  205422  .       -       .       gene_id "PB.22"; transcript_id "PB.22.2";
chr_1       ingenannot-isoform-ranking      transcript      202939  205488  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";
chr_1       ingenannot-isoform-ranking      exon    202939  205031  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";
chr_1       ingenannot-isoform-ranking      exon    205203  205488  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";

# rank/filter your long-read transcripts based on junction support and coverage
ingenannot -v 2 isoform_ranking longreads.gff -f bam.fof --alt_threshold 0.1

2) Annotate isoforms with SQANTI3

# If your annotation file is in gff3, transform it in gtf with gffread
gffread genes.gff3 -T > genes.gtf

# run sqanti3 with only the best alternatives isoforms
sqanti3_qc.py --gtf isoforms.alternatives.gff genes.gtf genome.fasta --isoAnnotLite

3) Extract the list of isoforms to add from SQANTI3 results

# get list of transcripts to add
# limit to 3 types of annotations without non-coding with file extraction as below:

cut -f1,6,15,30,37 isoforms.alternatives_classification.txt | grep -v -P "TRUE|non_coding" | grep -P "\tgenic\t" | cut -f 1 > lIDS_genic.txt
cut -f1,6,15,30,37 isoforms.alternatives_classification.txt | grep -v -P "TRUE|non_coding" | grep -P "\tnovel_in_catalog\t" | cut -f 1 > lIDS_nic.txt
cut -f1,6,15,30,37 isoforms.alternatives_classification.txt | grep -v -P "TRUE|non_coding" | grep -P "\tnovel_not_in_catalog\t" | cut -f 1 > lIDS_nnic.txt

cat lIDS*.txt > all_IDs.txt

4) Add Isoforms to your canonical gene models

ingenannot -v 2 add_sqanti3_isoforms genes.gff3 isoforms.alternatives.gff all_IDs.txt > genes.isoforms.gff3

5) Analyze the Sequence Ontology Classification of your transcripts

SO Classification will help you to remove some possible artefacts (overlapping, …).

# write your file of files: genes.fof as below, such as "path to GFF/GTF genes<tab>source"
genes.isoforms.gff3 squanti3

# run classification
ingenannot -v 2 soclassif genes.fof --clustranded --clatype exon

References