Gene annotation

CGA

This module accepts one or more FASTA files as input and implements the following steps:

1) A tblastn step (cga_expect1=) that, using one or more protein sequences provided by the user in a reference file (cga_reference1=), returns one FASTA file for each combination of input file and query protein. Each file contains all BLAST hit regions plus a user-defined region around the hits (cga_hit_region_window1=). A second tblastn search can be performed using the results of the first one as the database, by declaring the following parameters: cga_expect2=, cga_reference2=, and cga_hit_region_window2=. SEDA-CLI operations ([1]; https://hub.docker.com/r/pegi3s/seda/) are used to perform this step.

2) A grow sequences step, as implemented in SEDA-CLI, that merges sequences in the same file that show a user-defined overlap (cga_min_overlap=). This parameter must be specified in the config file.

3) The CGA pipeline (https://hub.docker.com/r/pegi3s/cga/), which performs the CDS annotations using as reference a single amino acid sequence provided by the user in a FASTA file (cga_reference3=). The following parameters must be declared in the config file:
- cga_max_dist=: the maximum distance between exons from the same gene.
- cga_intron_bp=: the distance around the junction point between two sequences in which to look for splicing signals.
- cga_min_full_nucleotide_size=: the minimum size for reporting a CDS.
- cga_selection_criterion=: the selection model to be used. The following options are available:
  1: similarity with the reference sequence first and, in case of a tie, percentage of gaps relative to the reference sequence.
  2: percentage of gaps relative to the reference sequence first and, in case of a tie, similarity with the reference sequence.
  3: a mixed model with similarity with the reference sequence first, in which a sequence with fewer gaps relative to the reference sequence gets a similarity bonus (selection correction) defined by the user.
If cga_selection_criterion=3, the selection correction (cga_selection_correction=) must be declared as well. Its value is the bonus percentage times 10: for instance, 20 means a 2% bonus, so that a sequence with 18% similarity to the reference sequence is treated as having 20% similarity. This parameter must be declared even if it is not used (in that case the declared value is ignored).
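
The three selection models and the selection correction arithmetic can be illustrated with a short Python sketch. This is not the CGA implementation: the Candidate class and the bonus_percent and pick functions are hypothetical names introduced here, and the way the bonus is applied under model 3 (added to the candidate with strictly fewer gaps when two candidates are compared) is an assumed reading of the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one candidate annotation."""
    name: str
    similarity: float  # percent similarity with the reference sequence
    gaps: float        # percent gaps relative to the reference sequence


def bonus_percent(selection_correction: int) -> float:
    """Convert cga_selection_correction (bonus times 10) to a percentage."""
    return selection_correction / 10


def pick(a: Candidate, b: Candidate, criterion: int,
         selection_correction: int = 0) -> Candidate:
    """Choose between two candidates under an assumed reading of models 1-3."""
    if criterion == 1:
        # Similarity first, gaps as tie-breaker.
        def key(c):
            return (-c.similarity, c.gaps)
    elif criterion == 2:
        # Gaps first, similarity as tie-breaker.
        def key(c):
            return (c.gaps, -c.similarity)
    elif criterion == 3:
        # Assumption: the candidate with strictly fewer gaps gets the bonus.
        bonus = bonus_percent(selection_correction)
        fewest = min(a.gaps, b.gaps)
        gaps_tied = a.gaps == b.gaps

        def key(c):
            boosted = c.similarity
            if not gaps_tied and c.gaps == fewest:
                boosted += bonus
            return (-boosted, c.gaps)
    else:
        raise ValueError("cga_selection_criterion must be 1, 2 or 3")
    return min((a, b), key=key)


if __name__ == "__main__":
    a = Candidate("a", similarity=19.0, gaps=10.0)
    b = Candidate("b", similarity=18.0, gaps=5.0)
    print(bonus_percent(20))          # 2.0, i.e. a 2% bonus
    print(pick(a, b, 1).name)         # a: higher similarity
    print(pick(a, b, 2).name)         # b: fewer gaps
    print(pick(a, b, 3, 20).name)     # b: 18% + 2% bonus beats 19%
```

With cga_selection_correction=20, candidate b is treated as having 20% similarity and is preferred over a; with a smaller correction such as 5 (a 0.5% bonus) the same comparison would still favour a.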