Add UTRs to gene models
We propose a protocol to add UTRs to your gene models using long-read and RNA-Seq data. The protocol prepares the data, then uses both data types to add UTRs. We assume that long reads define UTRs more reliably than transcript assemblies obtained from RNA-Seq data. We therefore first use the long-read isoforms, ranked by their RNA-Seq support, and then add potential new UTRs from RNA-Seq transcripts to genes whose UTRs are not yet annotated. If you have only one type of data, you can limit the protocol to the available data.
Workflow:

1) Add UTRs from long-read data if available
# write your file of files (bam.fof) as below, one line per BAM file: "path to bam<tab>paired<tab>stranded"
/tmp/run1.singleton.stranded.bam False True
/tmp/run2.singleton.unstranded.bam False False
/tmp/run3.paired.stranded.bam True True
/tmp/run4.paired.unstranded.bam True False
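Because the columns must be separated by real tab characters, it can help to generate the file with `printf` rather than typing it by hand. The following is a minimal sketch, not part of ingenannot itself: the BAM paths are the placeholder paths from the example above, and the sanity check assumes the capitalised `True`/`False` spelling.

```shell
# Sketch: write a tab-separated file of files (bam.fof).
# Columns: path to BAM, paired (True/False), stranded (True/False).
# The paths are the placeholder paths from the example; replace them with yours.
fof="bam.fof"
printf '%s\t%s\t%s\n' \
  /tmp/run1.singleton.stranded.bam   False True \
  /tmp/run2.singleton.unstranded.bam False False \
  /tmp/run3.paired.stranded.bam      True  True \
  /tmp/run4.paired.unstranded.bam    True  False > "$fof"

# Count lines that do not have exactly three tab-separated fields
# with True or False in columns 2 and 3 (assumed spelling).
bad_lines=$(awk -F'\t' '
  NF != 3 || $2 !~ /^(True|False)$/ || $3 !~ /^(True|False)$/ { n++ }
  END { print n + 0 }' "$fof")
echo "malformed lines in $fof: $bad_lines"
```

A non-zero count usually points to spaces used instead of tabs or to a misspelled flag value.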
# your longreads.gff file must look like a GTF file, possibly with several transcripts for the same gene, as below:
chr_1 ingenannot-isoform-ranking transcript 140523 145168 . - . gene_id "PB.12"; transcript_id "PB.12.1";
chr_1 ingenannot-isoform-ranking exon 140523 145168 . - . gene_id "PB.12"; transcript_id "PB.12.1";
chr_1 ingenannot-isoform-ranking transcript 140531 145121 . - . gene_id "PB.12"; transcript_id "PB.12.2";
chr_1 ingenannot-isoform-ranking exon 140531 145121 . - . gene_id "PB.12"; transcript_id "PB.12.2";
chr_1 ingenannot-isoform-ranking transcript 140579 144883 . - . gene_id "PB.12"; transcript_id "PB.12.3";
chr_1 ingenannot-isoform-ranking exon 140579 144883 . - . gene_id "PB.12"; transcript_id "PB.12.3";
chr_1 ingenannot-isoform-ranking transcript 140708 144901 . - . gene_id "PB.12"; transcript_id "PB.12.4";
chr_1 ingenannot-isoform-ranking exon 140708 144901 . - . gene_id "PB.12"; transcript_id "PB.12.4";
chr_1 ingenannot-isoform-ranking transcript 202365 205520 . - . gene_id "PB.22"; transcript_id "PB.22.1";
chr_1 ingenannot-isoform-ranking exon 202365 205031 . - . gene_id "PB.22"; transcript_id "PB.22.1";
chr_1 ingenannot-isoform-ranking exon 205203 205520 . - . gene_id "PB.22"; transcript_id "PB.22.1";
chr_1 ingenannot-isoform-ranking transcript 202939 205422 . - . gene_id "PB.22"; transcript_id "PB.22.2";
chr_1 ingenannot-isoform-ranking exon 202939 205422 . - . gene_id "PB.22"; transcript_id "PB.22.2";
chr_1 ingenannot-isoform-ranking transcript 202939 205488 . - . gene_id "PB.22"; transcript_id "PB.22.3";
chr_1 ingenannot-isoform-ranking exon 202939 205031 . - . gene_id "PB.22"; transcript_id "PB.22.3";
chr_1 ingenannot-isoform-ranking exon 205203 205488 . - . gene_id "PB.22"; transcript_id "PB.22.3";
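Before ranking, you may want to confirm that the file really groups several transcripts under the same `gene_id`. The sketch below counts transcript lines per gene with `awk`. It assumes the standard tab-separated, nine-column GTF layout and builds a small sample from the lines shown above; on real data, point `awk` at your own longreads.gff instead.

```shell
# Sketch: count transcripts per gene in a GTF-like file.
# The sample is a subset of the example above, written with real tabs.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr_1 ingenannot-isoform-ranking transcript 140523 145168 . - . 'gene_id "PB.12"; transcript_id "PB.12.1";' \
  chr_1 ingenannot-isoform-ranking exon 140523 145168 . - . 'gene_id "PB.12"; transcript_id "PB.12.1";' \
  chr_1 ingenannot-isoform-ranking transcript 140531 145121 . - . 'gene_id "PB.12"; transcript_id "PB.12.2";' \
  chr_1 ingenannot-isoform-ranking exon 140531 145121 . - . 'gene_id "PB.12"; transcript_id "PB.12.2";' \
  chr_1 ingenannot-isoform-ranking transcript 202365 205520 . - . 'gene_id "PB.22"; transcript_id "PB.22.1";' \
  chr_1 ingenannot-isoform-ranking exon 202365 205031 . - . 'gene_id "PB.22"; transcript_id "PB.22.1";' \
  chr_1 ingenannot-isoform-ranking exon 205203 205520 . - . 'gene_id "PB.22"; transcript_id "PB.22.1";' > "$sample"

# For each "transcript" line, extract the gene_id value from column 9
# and increment the counter of that gene.
counts=$(awk -F'\t' '
  $3 == "transcript" && match($9, /gene_id "[^"]+"/) {
    gene = substr($9, RSTART + 9, RLENGTH - 10)
    n[gene]++
  }
  END { for (gene in n) print gene "\t" n[gene] }' "$sample" | sort)
echo "$counts"
rm -f "$sample"
```

Genes listed with a count greater than one carry alternative isoforms for the ranking step to order.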
# rank your long-read isoforms based on junction support and coverage
ingenannot -v 2 isoform_ranking longreads.gff -f bam.fof --alt_threshold 0.1
# add UTRs from long reads, using the rank to select the preferred isoform
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --erase --utr_mode rank
2) Add UTRs from short-read assemblies
# add UTRs with short-read transcript assemblies
# if you want to combine transcript assemblies from several runs,
# you have to merge the transcripts coming from the same gene.
# To merge your transcript files, you can use StringTie in merge mode.
# If you want to be sure to avoid transcripts that overlap multiple CDS,
# proceed as described below:
cat assembly.1.gff assembly.2.gff assembly.3.gff assembly.4.gff ... > all_transcripts.gff
# clusterize your transcripts, removing those that overlap multiple CDS
ingenannot clusterize all_transcripts.gff transcripts.gff -f genes.gff3 --keep_atts
# rank your short-read transcripts based on junction support and coverage
ingenannot -v 2 -p 7 isoform_ranking transcripts.gff -f bam.fof
# add UTRs with short reads
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --utr_mode rank
3) Add UTRs from long-read and short-read data
We proceed as described above: first use the long reads, then use the short reads in onlynew mode to add UTRs only to genes that still lack them.
# rank your long-read isoforms based on junction support and coverage
ingenannot -v 2 isoform_ranking longreads.gff -f bam.fof --alt_threshold 0.1
# add UTRs from long reads, using the rank to select the preferred isoform
ingenannot -v 2 utr_refine genes.gff3 isoforms.alternatives.gff genes.utrs.gff3 --erase --utr_mode rank
# concatenate your RNA-Seq transcripts if necessary
cat assembly.1.gff assembly.2.gff assembly.3.gff assembly.4.gff ... > all_transcripts.gff
# clusterize your transcripts, removing those that overlap multiple CDS
ingenannot clusterize all_transcripts.gff transcripts.gff -f genes.gff3 --keep_atts
# rank your short-read transcripts based on junction support and coverage
ingenannot -v 2 -p 7 isoform_ranking transcripts.gff -f bam.fof
# add UTRs with short reads in onlynew mode (keeps the UTRs already available in genes.utrs.gff3)
ingenannot -v 2 utr_refine genes.utrs.gff3 isoforms.alternatives.gff genes.utrs.all.gff3 --utr_mode rank --onlynew
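Once the protocol has run, a quick way to judge its effect is to count how many transcripts carry at least one UTR feature. The sketch below is an illustration only: it assumes the output GFF3 uses the standard `five_prime_UTR` and `three_prime_UTR` feature types with a `Parent` attribute pointing to an `mRNA`, and it runs on a tiny hypothetical GFF3 rather than on real ingenannot output. Replace the sample with genes.utrs.all.gff3 to use it on your data.

```shell
# Sketch: report how many mRNAs have at least one UTR feature.
# The miniature GFF3 below is hypothetical test data, not ingenannot output.
gff=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr_1 example gene 100 900 . + . 'ID=gene1' \
  chr_1 example mRNA 100 900 . + . 'ID=mrna1;Parent=gene1' \
  chr_1 example five_prime_UTR 100 199 . + . 'Parent=mrna1' \
  chr_1 example CDS 200 800 . + 0 'Parent=mrna1' \
  chr_1 example three_prime_UTR 801 900 . + . 'Parent=mrna1' \
  chr_1 example gene 2000 2600 . - . 'ID=gene2' \
  chr_1 example mRNA 2000 2600 . - . 'ID=mrna2;Parent=gene2' \
  chr_1 example CDS 2000 2600 . - 0 'Parent=mrna2' > "$gff"

# Count mRNA features, and collect the distinct Parent IDs of UTR features.
utr_summary=$(awk -F'\t' '
  $3 == "mRNA" { total++ }
  ($3 == "five_prime_UTR" || $3 == "three_prime_UTR") && match($9, /Parent=[^;]+/) {
    with_utr[substr($9, RSTART + 7, RLENGTH - 7)] = 1
  }
  END {
    n = 0
    for (t in with_utr) n++
    printf "%d/%d\n", n, total
  }' "$gff")
echo "transcripts with UTRs: $utr_summary"
rm -f "$gff"
```

Running the same count on genes.gff3, genes.utrs.gff3 and genes.utrs.all.gff3 would show how many transcripts gained UTRs at each step, provided the assumed feature types match your files.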