Add UTRs to gene models

We propose a protocol to add UTRs to your gene models using long-read and RNA-Seq data. We prepare the data and then use both data types to add UTRs. We assume that long-read data define UTRs more reliably than transcript assemblies obtained from RNA-Seq data. We therefore first use the long-read transcripts, ranking the isoforms of each gene by their RNA-Seq support, and then add new UTRs from RNA-Seq transcripts to genes whose UTRs are not yet annotated. If you have only one type of data, you can limit the protocol to the available data.

Workflow:

digraph UTRs {
   "Rank long-read transcript isoforms" -> "Add / Refine UTRs";
   "Rank short-read transcript isoforms" -> "Add / Refine UTRs";
   "Clusterize transcripts" -> "Rank short-read transcript isoforms";
}

1) Add UTRs from long-read data if available

# write your file of files (bam.fof) with one BAM per line, formatted as "path to bam<tab>paired<tab>stranded"
/tmp/run1.singleton.stranded.bam    False   True
/tmp/run2.singleton.unstranded.bam  False   False
/tmp/run3.paired.stranded.bam  True  True
/tmp/run4.paired.unstranded.bam  True   False
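Since the columns must be separated by real tabs, it can help to generate and check the file of files from the shell. The sketch below is only an illustration: `check_fof` is a hypothetical helper, not part of ingenannot, and it assumes that the two flags are spelled exactly `True` or `False`.

```shell
# Hypothetical helper (not an ingenannot command): verify that every line
# has three tab-separated fields and that fields 2 and 3 are True or False.
check_fof() {
    awk -F'\t' '
        NF != 3 || ($2 != "True" && $2 != "False") || ($3 != "True" && $3 != "False") {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# printf reuses the format for each group of three arguments,
# so this writes one tab-separated line per BAM file.
printf '%s\t%s\t%s\n' \
    /tmp/run1.singleton.stranded.bam False True \
    /tmp/run3.paired.stranded.bam True True > bam.fof

check_fof bam.fof && echo "bam.fof looks well formed"
```

If a line is reported as malformed, the usual cause is spaces typed instead of tabs.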

# your longreads.gff file must look like a GTF file, with several transcripts for the same gene, as below:
chr_1       ingenannot-isoform-ranking      transcript      140523  145168  .       -       .       gene_id "PB.12"; transcript_id "PB.12.1";
chr_1       ingenannot-isoform-ranking      exon    140523  145168  .       -       .       gene_id "PB.12"; transcript_id "PB.12.1";
chr_1       ingenannot-isoform-ranking      transcript      140531  145121  .       -       .       gene_id "PB.12"; transcript_id "PB.12.2";
chr_1       ingenannot-isoform-ranking      exon    140531  145121  .       -       .       gene_id "PB.12"; transcript_id "PB.12.2";
chr_1       ingenannot-isoform-ranking      transcript      140579  144883  .       -       .       gene_id "PB.12"; transcript_id "PB.12.3";
chr_1       ingenannot-isoform-ranking      exon    140579  144883  .       -       .       gene_id "PB.12"; transcript_id "PB.12.3";
chr_1       ingenannot-isoform-ranking      transcript      140708  144901  .       -       .       gene_id "PB.12"; transcript_id "PB.12.4";
chr_1       ingenannot-isoform-ranking      exon    140708  144901  .       -       .       gene_id "PB.12"; transcript_id "PB.12.4";
chr_1       ingenannot-isoform-ranking      transcript      202365  205520  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      exon    202365  205031  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      exon    205203  205520  .       -       .       gene_id "PB.22"; transcript_id "PB.22.1";
chr_1       ingenannot-isoform-ranking      transcript      202939  205422  .       -       .       gene_id "PB.22"; transcript_id "PB.22.2";
chr_1       ingenannot-isoform-ranking      exon    202939  205422  .       -       .       gene_id "PB.22"; transcript_id "PB.22.2";
chr_1       ingenannot-isoform-ranking      transcript      202939  205488  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";
chr_1       ingenannot-isoform-ranking      exon    202939  205031  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";
chr_1       ingenannot-isoform-ranking      exon    205203  205488  .       -       .       gene_id "PB.22"; transcript_id "PB.22.3";
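Before ranking, you may want to confirm that the file really contains several isoforms per gene. The following is a minimal sketch, assuming the GTF-like attribute layout shown above (`gene_id "..."` on each `transcript` line); `count_isoforms` is a hypothetical helper name, not an ingenannot command.

```shell
# Hypothetical helper: print "gene_id<TAB>number of transcripts",
# sorted by gene_id, for a GTF-like file.
count_isoforms() {
    awk '$3 == "transcript" {
        # Locate the attribute gene_id "XXX" and keep only XXX.
        if (match($0, /gene_id "[^"]+"/)) {
            gene = substr($0, RSTART + 9, RLENGTH - 10)
            n[gene]++
        }
    }
    END { for (g in n) printf "%s\t%d\n", g, n[g] }' "$1" | sort
}

# Usage, once longreads.gff exists:
#   count_isoforms longreads.gff
```

On the example above, this would report 4 transcripts for PB.12 and 3 for PB.22.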

# rank your long-reads based on junction support and coverage
ingenannot -v 2 isoform_ranking longreads.gff -f bam.fof --alt_threshold 0.1

# add UTR with long reads using rank as preferred isoform
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --erase --utr_mode rank

2) Add UTRs from short-read assemblies

# add UTRs with short reads
# If you want to combine transcript assemblies from several runs,
# you have to merge the transcripts coming from the same gene.
# To merge your transcript files, you can use StringTie in merge mode.
# If you want to be sure to avoid transcripts that overlap multiple CDS,
# proceed as described below:

cat assembly.1.gff assembly.2.gff assembly.3.gff assembly.4.gff ... > all_transcripts.gff

# clusterize your transcripts, removing those that overlap multiple CDS
ingenannot clusterize all_transcripts.gff transcripts.gff -f genes.gff3 --keep_atts

# rank your short-reads based on junction support and coverage
ingenannot -v 2 -p 7 isoform_ranking transcripts.gff -f bam.fof

# add UTR with short reads
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --utr_mode rank

3) Add UTRs from long-read and short-read data

We proceed as described above: first add UTRs from long reads, then use short reads to add new UTRs only to genes that still lack them.

# rank your long-reads based on junction support and coverage
ingenannot -v 2 isoform_ranking longreads.gff -f bam.fof --alt_threshold 0.1

# add UTR with long reads using rank as preferred isoform
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --erase --utr_mode rank

# concatenate your RNA-Seq transcripts if necessary
cat assembly.1.gff assembly.2.gff assembly.3.gff assembly.4.gff ... > all_transcripts.gff

# clusterize your transcripts, removing those that overlap multiple CDS
ingenannot clusterize all_transcripts.gff transcripts.gff -f genes.gff3 --keep_atts

# rank your short-reads based on junction support and coverage
ingenannot -v 2 -p 7 isoform_ranking transcripts.gff -f bam.fof

# add UTRs with short reads in onlynew mode (keep the UTRs already available in genes.utrs.gff3)
ingenannot -v 2 utr_refine genes.utrs.gff3 isoforms.alternatives.gff genes.utrs.all.gff3 --utr_mode rank --onlynew
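To see what each step contributed, you can compare the number of UTR features in the intermediate and final files. This is a rough sketch, assuming the output GFF3 files encode UTRs with the standard feature types `five_prime_UTR` and `three_prime_UTR`; `count_utrs` is a hypothetical helper, not part of ingenannot.

```shell
# Hypothetical helper: count UTR features in a tab-separated GFF3 file.
# Assumes UTRs are written with the types five_prime_UTR and three_prime_UTR.
count_utrs() {
    awk -F'\t' '
        $3 == "five_prime_UTR"  { five++ }
        $3 == "three_prime_UTR" { three++ }
        END { printf "five_prime_UTR\t%d\nthree_prime_UTR\t%d\n", five, three }
    ' "$1"
}

# Usage, once the files exist:
#   count_utrs genes.utrs.gff3       # after the long-read step
#   count_utrs genes.utrs.all.gff3   # after the short-read step
```

Because the short-read step runs with --onlynew, the counts in genes.utrs.all.gff3 are expected to be equal to or higher than those in genes.utrs.gff3.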