Find new potential Small Secreted Proteins (SSPs)

Predicting Small Secreted Proteins (SSPs) or small effectors can be challenging and requires specific tools with adapted training parameters for associated machine learning models. Few tools account for the particularities of these genes. For instance, CodingQuarry [1] may prevent the prediction of such genes despite transcriptional evidence. Here, we propose a naive tool that examines transcriptomic evidence not associated with any coding gene and attempts to predict potential SSPs based on the presence of signal peptides and other elements.

Workflow:

digraph SSP {
   "Assemble transcripts" -> "Prepare / Validate data";
   "Prepare / Validate data" -> "Rescue effector";
   "Rescue effector" -> "Select / Validate effectors";
   "Select / Validate effectors" -> "Merge with other genes";
   "Rescue effector" -> "Merge with other genes";
}

Steps:

1) Generate / Assemble new transcripts from RNA-Seq / long reads

Example: assembly of new transcripts with StringTie [2] from stranded paired-end reads mapped with STAR [3]:

stringtie Aligned.sortedByCoord.out.bam -l 11DPI -o transcripts.gff -m 150 --rf -p 12 -g 0 -f 0.1 -j 4 -a 10

2) Prepare your data and validate them

# validate the data
ingenannot -v 2 validate genes.gff
ingenannot -v 2 validate transcripts.gff

# prepare the data
sort -k1,1 -k4g,4 transcripts.gff > transcripts.sorted.gff
bgzip transcripts.sorted.gff
tabix -p gff transcripts.sorted.gff.gz
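
tabix expects a position-sorted, bgzip-compressed file, which is what the sort command above prepares. As a minimal sketch, the ordering can be checked on a toy file before compressing. The records, the temporary paths and the check_sorted helper are hypothetical and only serve as an illustration; the check does not detect a sequence name that reappears after another one.

```shell
# Toy GFF (hypothetical records) written in deliberately unsorted order.
tmpdir=$(mktemp -d)
printf 'chr2\tStringTie\ttranscript\t500\t900\t.\t+\t.\tID=t3\n'   >  "$tmpdir/toy.gff"
printf 'chr1\tStringTie\ttranscript\t1200\t1800\t.\t-\t.\tID=t2\n' >> "$tmpdir/toy.gff"
printf 'chr1\tStringTie\ttranscript\t100\t400\t.\t+\t.\tID=t1\n'   >> "$tmpdir/toy.gff"

# Same sort keys as in the step above: sequence name, then start coordinate.
LC_ALL=C sort -k1,1 -k4g,4 "$tmpdir/toy.gff" > "$tmpdir/toy.sorted.gff"

# Report the first record that starts before the previous record
# on the same sequence; print "sorted" if there is none.
check_sorted() {
    awk -F'\t' '
        /^#/ { next }
        $1 == prev_seq && $4 + 0 < prev_start { print "unsorted at line " NR; bad = 1; exit }
        { prev_seq = $1; prev_start = $4 + 0 }
        END { if (!bad) print "sorted" }
    ' "$1"
}

sorted_status=$(check_sorted "$tmpdir/toy.sorted.gff")
unsorted_status=$(check_sorted "$tmpdir/toy.gff")
echo "$sorted_status"
echo "$unsorted_status"
```

Running the same helper on transcripts.sorted.gff before bgzip gives a quick sanity check when tabix later complains about the input.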

3) Run `rescue_effectors` on genes with new transcripts

ingenannot -v 2 rescue_effectors genes.gff transcripts.sorted.gff.gz genome.fasta

4) Validate the new potential genes

If no output file is specified, new potential effectors are saved in effectors.gff3. You now have two options.

Manual validation: if you have a limited number of candidates, you can validate them manually.

Selection step: you can filter the candidates using the same criteria as those used for gene selection with ingenannot select. If you do not have protein matches, you can create a fake evidence file (an empty file) and proceed as follows:

# create a fake protein evidence file
touch proteins.empty.gff
bgzip proteins.empty.gff
tabix proteins.empty.gff.gz

# select effectors based on AED score
# write a genes.fof file such as:
#   effectors.gff3<TAB>rescue_effectors

# then run select; without --noaed, AED annotation and selection are performed in one step
# you can set the thresholds for transcript AED and protein AED (see the select documentation)
ingenannot -v 2 select genes.fof effectors.select.gff3 --evtr transcripts.sorted.gff.gz --evpr proteins.empty.gff.gz --penalty_overflow 0.25 --evtrstranded --no_cds_overlap --prefix Effector

The distribution of AED scores is plotted in effectors.select.gff3.scatter_hist_aed.png.
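
The `<TAB>` placeholder in the genes.fof example stands for a literal tab character, which is easy to lose when copying from a web page or typing in an editor that expands tabs. A minimal sketch, assuming the file holds one path and one source label per line; the temporary directory is only used here to keep the example self-contained.

```shell
# Write the file-of-files with a real tab between the path and the source label.
# In practice, write genes.fof in your working directory instead of a temp dir.
workdir=$(mktemp -d)
fof="$workdir/genes.fof"
printf '%s\t%s\n' "effectors.gff3" "rescue_effectors" > "$fof"

# Count lines that do not have exactly two non-empty tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { n++ } END { print n + 0 }' "$fof")
echo "malformed lines: $bad_lines"
```

A count other than 0 usually means a tab was replaced by spaces.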

5) Add the new gene models

You can now merge your two annotation files. The simplest approach is to concatenate your old annotation file with the new effectors. If you are running a full annotation process from raw data, you may want to rank these new effector genes together with the other genes. To do so, run ingenannot select without stringent selection criteria:

# write a genes.fof file such as:
#   genes.gff<TAB>select
#   effectors.select.gff3<TAB>rescue_effectors
ingenannot -v 2 select genes.fof all.genes.gff3 --noaed --no_cds_overlap --clustranded
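
For the simple concatenation route, the following sketch merges two annotation files while keeping a single `##gff-version` header. It assumes both files start with such a header; the toy records, source labels and temporary paths are hypothetical.

```shell
# Toy annotation files standing in for genes.gff and effectors.select.gff3.
tmp=$(mktemp -d)
printf '##gff-version 3\nchr1\tsrc\tgene\t100\t400\t.\t+\t.\tID=gene1\n' > "$tmp/genes.gff"
printf '##gff-version 3\nchr1\trescue\tgene\t900\t1200\t.\t-\t.\tID=Effector1\n' > "$tmp/effectors.select.gff3"

# Emit one header, then every non-header line of both files, in file order.
{
    echo '##gff-version 3'
    grep -h -v '^##gff-version' "$tmp/genes.gff" "$tmp/effectors.select.gff3"
} > "$tmp/all.genes.gff3"

# Quick checks: one header, and as many gene features as in the two inputs.
header_count=$(grep -c '^##gff-version' "$tmp/all.genes.gff3")
gene_count=$(awk -F'\t' '$3 == "gene"' "$tmp/all.genes.gff3" | wc -l | tr -d ' ')
echo "headers: $header_count, genes: $gene_count"
```

A plain concatenation neither ranks nor renames features; the select command above covers that case.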

References