run_dbCAN
The standalone version of the dbCAN3 annotation tool for automated CAZyme annotation
This file type requires the `parsomics-plugin-dbcan` plugin.
As of now, `parsomics` has only been tested with run_dbCAN v3 and v4. Compatibility with v5 or later is unknown and not guaranteed.
File naming
The file names must adhere to one of the following patterns:
- "<MAG-name>.OUT.overview.txt"
- "<MAG-name>.overview.txt"
- "<MAG-name>_rundbcanoverview.txt"
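As a minimal sketch of how these patterns could be checked, the helper below strips a recognized suffix and returns the MAG name. The function name is hypothetical and this is not necessarily how `parsomics` matches files; it only mirrors the three patterns listed above.

```python
# Hypothetical helper, not part of parsomics. The suffixes mirror the three
# file name patterns listed above. ".OUT.overview.txt" is tested before
# ".overview.txt" so that "bin1.OUT.overview.txt" yields "bin1", not "bin1.OUT".
SUFFIXES = (".OUT.overview.txt", "_rundbcanoverview.txt", ".overview.txt")


def mag_name_from_filename(filename):
    """Return the MAG name if the file name matches a pattern, else None."""
    for suffix in SUFFIXES:
        # Require a non-empty MAG name in front of the suffix.
        if filename.endswith(suffix) and len(filename) > len(suffix):
            return filename[: -len(suffix)]
    return None
```

For instance, `mag_name_from_filename("bin1_rundbcanoverview.txt")` gives `"bin1"`, while a file named `bin1.txt` is rejected with `None`.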
File format
For more information on the run_dbCAN overview.txt output file format, check out the run_dbCAN documentation and source code.
The files must include a header (i.e. they should include column names at the top). They must have the following columns, in that exact order:
Column name | Column obligatoriness | Data type | Data nullability |
---|---|---|---|
Gene ID | Mandatory | String | Not nullable |
EC# | Optional | String | Nullable |
HMMER | Optional | String | Nullable |
dbCAN_sub | Optional | String | Nullable |
DIAMOND | Optional | String | Nullable |
eCAMI | Optional | String | Nullable |
Signalp | Optional (ignored) | String | N/A |
#ofTools | Optional (ignored) | Integer | N/A |
A few things to keep in mind:
- The `Gene ID` column should actually contain the name of the protein that the annotation refers to. Also, remember that primary and foreign keys in `parsomics` are named with `key`, not with `id`, to avoid mixing up names and keys.
- Different versions of dbCAN use different sources for the Enzyme Commission Number (EC#) column. From run_dbCAN 4.0.0 onwards, EC numbers are predicted using dbCAN_sub instead of eCAMI. `parsomics` is able to adapt to both cases.
- There are so many optional columns because the run_dbCAN output format is not normalized. This is further explained below.
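The column roles described above can be illustrated with a short sketch that inspects a header line. It assumes the file is tab-separated, and both function names are hypothetical. In particular, inferring the EC# source from which tool columns are present is only a guess at one possible approach, not a description of how `parsomics` adapts to the two cases.

```python
# Column roles taken from the table above; helper names are hypothetical.
MANDATORY = ("Gene ID",)
USED_OPTIONAL = ("EC#", "HMMER", "dbCAN_sub", "DIAMOND", "eCAMI")
IGNORED = ("Signalp", "#ofTools")


def classify_header(header_line, sep="\t"):
    """Split a header line and sort its columns into used, ignored and unknown."""
    # Assumption: columns are separated by tabs.
    columns = [name.strip() for name in header_line.rstrip("\r\n").split(sep)]
    if not columns or columns[0] != "Gene ID":
        raise ValueError("the first column must be 'Gene ID'")
    known = MANDATORY + USED_OPTIONAL + IGNORED
    return {
        "used": [c for c in columns if c in MANDATORY + USED_OPTIONAL],
        "ignored": [c for c in columns if c in IGNORED],
        "unknown": [c for c in columns if c not in known],
    }


def guess_ec_source(columns):
    """Assumption: a dbCAN_sub column signals the 4.0.0+ layout, otherwise eCAMI."""
    if "dbCAN_sub" in columns:
        return "dbCAN_sub"
    if "eCAMI" in columns:
        return "eCAMI"
    return None
```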
Normalization
The run_dbCAN `overview.txt` file is not normalized, because it doesn't have "one property per column, one observation per row".
In the example below, notice how each row contains multiple observations (e.g. the first row contains three annotations from three different sources) and how the property of annotation source is spread across multiple columns (i.e. HMMER, dbCAN_sub, DIAMOND).
Gene ID | EC# | HMMER | dbCAN_sub | DIAMOND | #ofTools |
---|---|---|---|---|---|
AIFGPLGP_01443 | - | CE4(38-162) | CE4_e21 | CE4 | 3 |
AIFGPLGP_01587 | - | GT105(88-212) | GT105_e6 | - | 2 |
AIFGPLGP_00229 | 2.4.2.43:1 | - | CBM48_e59 | - | 1 |
A more normalized representation of the same data would look like this:
Gene ID | Source name | Description | Annotation type |
---|---|---|---|
AIFGPLGP_01443 | HMMER | CE4(38-162) | DOMAIN |
AIFGPLGP_01443 | dbCAN_sub | CE4_e21 | DOMAIN |
AIFGPLGP_01443 | DIAMOND | CE4 | DOMAIN |
AIFGPLGP_01587 | HMMER | GT105(88-212) | DOMAIN |
AIFGPLGP_01587 | dbCAN_sub | GT105_e6 | DOMAIN |
AIFGPLGP_00229 | dbCAN_sub | 2.4.2.43:1 | EC_NUMBER |
AIFGPLGP_00229 | dbCAN_sub | CBM48_e59 | DOMAIN |
That is still not fully normalized, though, because some annotations include start and stop coordinates in their descriptions. For example, in "CE4(38-162)", 38 and 162 are the start and stop coordinates, respectively. As different properties, these should be in their own columns, as shown below:
Gene ID | Source name | Description | Start coordinate | Stop coordinate | Annotation type |
---|---|---|---|---|---|
AIFGPLGP_01443 | HMMER | CE4 | 38 | 162 | DOMAIN |
AIFGPLGP_01443 | dbCAN_sub | CE4_e21 | N/A | N/A | DOMAIN |
AIFGPLGP_01443 | DIAMOND | CE4 | N/A | N/A | DOMAIN |
AIFGPLGP_01587 | HMMER | GT105 | 88 | 212 | DOMAIN |
AIFGPLGP_01587 | dbCAN_sub | GT105_e6 | N/A | N/A | DOMAIN |
AIFGPLGP_00229 | dbCAN_sub | 2.4.2.43:1 | N/A | N/A | EC_NUMBER |
AIFGPLGP_00229 | dbCAN_sub | CBM48_e59 | N/A | N/A | DOMAIN |
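The two normalization steps above (one row per annotation, then coordinates split out of the description) can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the function name and tuple layout are hypothetical, "-" is taken to mean "no annotation", the EC# source defaults to dbCAN_sub as in the example, and each cell is assumed to hold a single annotation.

```python
import re

# Tool columns that act as annotation sources (from the column table above).
TOOL_COLUMNS = ("HMMER", "dbCAN_sub", "DIAMOND", "eCAMI")

# Matches descriptions such as "CE4(38-162)": name, start and stop coordinates.
COORDS = re.compile(r"^(?P<name>.+)\((?P<start>\d+)-(?P<stop>\d+)\)$")


def normalize_row(row, ec_source="dbCAN_sub"):
    """Turn one wide overview.txt row (a dict) into normalized tuples of
    (gene_id, source_name, description, start, stop, annotation_type)."""
    gene_id = row["Gene ID"]
    entries = []

    # Assumption: "-" (or an empty cell) means the column holds no annotation.
    ec_number = row.get("EC#", "-")
    if ec_number not in ("-", ""):
        entries.append((gene_id, ec_source, ec_number, None, None, "EC_NUMBER"))

    for tool in TOOL_COLUMNS:
        value = row.get(tool, "-")
        if value in ("-", ""):
            continue
        match = COORDS.match(value)
        if match:
            # Coordinates are separate properties, so they get their own fields.
            entries.append((gene_id, tool, match["name"],
                            int(match["start"]), int(match["stop"]), "DOMAIN"))
        else:
            entries.append((gene_id, tool, value, None, None, "DOMAIN"))
    return entries
```

Applied to the first example row, this yields three entries, with only the HMMER one carrying coordinates (38 and 162). Cells that hold several annotations at once are not handled by this sketch.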
Mapping to database
ProteinAnnotationFile
Original data | ProteinAnnotationFile field |
---|---|
run_dbCAN overview.txt file path | path |
ProteinAnnotationEntry
Normalized data | ProteinAnnotationEntry field |
---|---|
Gene ID | protein_key (see footnote 1) |
Source name | source_key (see footnote 2) |
Description | description |
Start coordinate | coord_start |
Stop coordinate | coord_stop |
Annotation type | annotation_type |
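To make the mapping concrete, here is a rough sketch that turns one normalized record into an object with the fields listed in the table. The class name, the integer key type, and the two lookup dictionaries are all hypothetical stand-ins. In `parsomics`, the keys come from database queries, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProteinAnnotationEntrySketch:
    """Hypothetical stand-in with the fields from the mapping table above."""
    protein_key: int            # assumption: keys are integers
    source_key: int
    description: str
    coord_start: Optional[int]  # None when the annotation has no coordinates
    coord_stop: Optional[int]
    annotation_type: str


def to_entry(normalized, protein_keys, source_keys):
    """Map a normalized tuple to an entry.

    protein_keys and source_keys are plain dicts standing in for the
    database lookups by protein name and source name.
    """
    gene_id, source_name, description, start, stop, annotation_type = normalized
    return ProteinAnnotationEntrySketch(
        protein_key=protein_keys[gene_id],
        source_key=source_keys[source_name],
        description=description,
        coord_start=start,
        coord_stop=stop,
        annotation_type=annotation_type,
    )
```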
The run_dbCAN and CLEAN plugins treat Enzyme Commission Number (EC#) annotations differently. The former stores EC# as an annotation `description`, while the latter stores EC# as an annotation `accession`. EC numbers are accession strings for the Expasy database.
Footnotes
1. As previously stated, the "Gene ID" column in the run_dbCAN `overview.txt` file contains protein names, which are used to query the primary keys of the corresponding proteins in the database.
2. The "Source name" column of the normalized data, which is derived from the run_dbCAN `overview.txt` file, is used to query the primary keys of the corresponding sources in the database.