run_dbCAN

The standalone version of the dbCAN3 annotation tool for automated CAZyme annotation

Important

This file type requires the parsomics-plugin-dbcan plugin

Important

So far, parsomics has only been tested with run_dbCAN v3 and v4. Compatibility with v5 or later is unknown and not guaranteed.

File naming

The file names must adhere to one of the following patterns:

"<MAG-name>.OUT.overview.txt"
"<MAG-name>.overview.txt"
"<MAG-name>_rundbcanoverview.txt"
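
As a rough illustration of the patterns above, the following Python sketch recovers the MAG name from a file's base name. The function name `mag_name_from_filename` and the suffix-matching approach are assumptions made for this example; they are not taken from the parsomics source code.

```python
from typing import Optional

# Suffixes taken from the naming patterns listed above. The longest one comes
# first so that ".OUT.overview.txt" is tried before ".overview.txt".
SUFFIXES = (".OUT.overview.txt", ".overview.txt", "_rundbcanoverview.txt")


def mag_name_from_filename(filename: str) -> Optional[str]:
    """Return the MAG name if the base file name matches a pattern, else None."""
    for suffix in SUFFIXES:
        # Require at least one character before the suffix (the MAG name).
        if filename.endswith(suffix) and len(filename) > len(suffix):
            return filename[: -len(suffix)]
    return None
```

For example, under these assumptions, `mag_name_from_filename("bin_1.OUT.overview.txt")` returns `"bin_1"`, while a file simply called `overview.txt` is rejected.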

File format

Tip

For more information on the run_dbCAN overview.txt output file format, check out the run_dbCAN documentation and source code.

The files must include a header row (i.e. column names at the top). They must have the following columns, in this exact order:

Column name	Column obligatoriness	Data type	Data nullability
`Gene ID`	Mandatory	String	Not nullable
`EC#`	Optional	String	Nullable
`HMMER`	Optional	String	Nullable
`dbCAN_sub`	Optional	String	Nullable
`DIAMOND`	Optional	String	Nullable
`eCAMI`	Optional	String	Nullable
`Signalp`	Optional (ignored)	String	N/A
`#ofTools`	Optional (ignored)	Integer	N/A

A few things to keep in mind:

Despite its name, the Gene ID column contains the name of the protein that the annotation refers to. Also, remember that primary and foreign keys in parsomics are named with key, not with id, to avoid confusing names with keys.
Different versions of dbCAN use different sources for the Enzyme Commission Number (EC#) column. From run_dbCAN 4.0.0 onwards, EC numbers are predicted using dbCAN_sub instead of eCAMI. parsomics is able to adapt to both cases.
There are so many optional columns because the run_dbCAN output format is not normalized. This is further explained below.
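
To make the column rules more concrete, here is a minimal Python sketch that reads the text of an overview file with the standard `csv` module. It assumes that the file is tab-separated and that a lone `-` marks a missing value, as in the example rows in the Normalization section. The function name `read_overview` is hypothetical, and this is not how parsomics itself is implemented. For brevity, only the position of the mandatory `Gene ID` column is checked.

```python
import csv
import io

# Columns that are accepted in the file but ignored.
IGNORED_COLUMNS = {"Signalp", "#ofTools"}


def read_overview(text: str) -> list:
    """Parse overview.txt content into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # The header is mandatory and must start with the "Gene ID" column.
    if reader.fieldnames is None or reader.fieldnames[0] != "Gene ID":
        raise ValueError("the header must start with the 'Gene ID' column")
    rows = []
    for record in reader:
        rows.append(
            {
                # Assumption: "-" stands for a null value.
                column: (None if value == "-" else value)
                for column, value in record.items()
                if column not in IGNORED_COLUMNS
            }
        )
    return rows
```

Each returned dict keeps only the columns that are not ignored, with `None` in place of missing annotations.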

Info

Normalization

The run_dbCAN overview.txt file is not normalized, because it does not follow the rule of "one property per column, one observation per row".

In the example below, notice how each row contains multiple observations (e.g. the first row contains three annotations from three different sources) and how the property of annotation source is spread across multiple columns (i.e. HMMER, dbCAN_sub, DIAMOND).

Gene ID	EC#	HMMER	dbCAN_sub	DIAMOND	#ofTools
AIFGPLGP_01443	-	CE4(38-162)	CE4_e21	CE4	3
AIFGPLGP_01587	-	GT105(88-212)	GT105_e6	-	2
AIFGPLGP_00229	2.4.2.43:1	-	CBM48_e59	-	1

A more normalized representation of the same data would look like this:

Gene ID	Source name	Description	Annotation type
AIFGPLGP_01443	HMMER	CE4(38-162)	DOMAIN
AIFGPLGP_01443	dbCAN_sub	CE4_e21	DOMAIN
AIFGPLGP_01443	DIAMOND	CE4	DOMAIN
AIFGPLGP_01587	HMMER	GT105(88-212)	DOMAIN
AIFGPLGP_01587	dbCAN_sub	GT105_e6	DOMAIN
AIFGPLGP_00229	dbCAN_sub	2.4.2.43:1	EC_NUMBER
AIFGPLGP_00229	dbCAN_sub	CBM48_e59	DOMAIN

That is still not fully normalized, though, because some annotations include start and stop coordinates in their descriptions. For example, in "CE4(38-162)", 38 and 162 are the start and stop coordinates, respectively. Since these are different properties, they should be in their own columns, as shown below:

Gene ID	Source name	Description	Start coordinate	Stop coordinate	Annotation type
AIFGPLGP_01443	HMMER	CE4	38	162	DOMAIN
AIFGPLGP_01443	dbCAN_sub	CE4_e21	N/A	N/A	DOMAIN
AIFGPLGP_01443	DIAMOND	CE4	N/A	N/A	DOMAIN
AIFGPLGP_01587	HMMER	GT105	88	212	DOMAIN
AIFGPLGP_01587	dbCAN_sub	GT105_e6	N/A	N/A	DOMAIN
AIFGPLGP_00229	dbCAN_sub	2.4.2.43:1	N/A	N/A	EC_NUMBER
AIFGPLGP_00229	dbCAN_sub	CBM48_e59	N/A	N/A	DOMAIN
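
The two normalization steps described above can be sketched in Python as follows. This is only an illustration under a few assumptions: each cell holds at most one annotation, `-` marks a missing value, and the EC number is attributed to dbCAN_sub when that column is present and to eCAMI otherwise. The `Annotation` type and the `normalize_row` function are hypothetical names, not the actual parsomics implementation; the field names follow the `ProteinAnnotationEntry` mapping in the next section.

```python
import re
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    gene_id: str
    source_name: str
    description: str
    coord_start: Optional[int]
    coord_stop: Optional[int]
    annotation_type: str


# Columns that hold one annotation per tool.
TOOL_COLUMNS = ("HMMER", "dbCAN_sub", "DIAMOND", "eCAMI")
# Matches descriptions that embed coordinates, such as "CE4(38-162)".
WITH_COORDS = re.compile(r"^(?P<name>.+)\((?P<start>\d+)-(?P<stop>\d+)\)$")


def normalize_row(row: dict) -> list:
    """Turn one wide overview row into one Annotation per observation."""
    gene_id = row["Gene ID"]
    annotations = []
    ec_number = row.get("EC#")
    if ec_number not in (None, "-"):
        # Assumption: EC numbers come from dbCAN_sub if present, else eCAMI.
        ec_source = "dbCAN_sub" if "dbCAN_sub" in row else "eCAMI"
        annotations.append(
            Annotation(gene_id, ec_source, ec_number, None, None, "EC_NUMBER")
        )
    for source in TOOL_COLUMNS:
        value = row.get(source)
        if value in (None, "-"):
            continue  # this tool produced no annotation for the protein
        match = WITH_COORDS.match(value)
        if match:
            # Split "CE4(38-162)" into description and coordinates.
            annotations.append(
                Annotation(gene_id, source, match["name"],
                           int(match["start"]), int(match["stop"]), "DOMAIN")
            )
        else:
            annotations.append(
                Annotation(gene_id, source, value, None, None, "DOMAIN")
            )
    return annotations
```

Applied to the first example row, this yields three `DOMAIN` annotations, with coordinates 38 and 162 attached only to the HMMER one.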

Mapping to database

`ProteinAnnotationFile`

Original data	`ProteinAnnotationFile` field
run_dbCAN `overview.txt` file path	`path`

`ProteinAnnotationEntry`

Normalized data	`ProteinAnnotationEntry` field
Gene ID	`protein_key` ¹
Source name	`source_key` ²
Description	`description`
Start coordinate	`coord_start`
Stop coordinate	`coord_stop`
Annotation type	`annotation_type`

Important

The run_dbCAN and CLEAN plugins treat Enzyme Commission Number (EC#) annotations differently. The former stores the EC# as an annotation description, while the latter stores it as an annotation accession. EC numbers are accession strings to the Expasy database.

¹ As previously stated, the "Gene ID" column in the run_dbCAN overview.txt file contains protein names, which are used to query the primary keys of the corresponding proteins in the database.
² The "Source name" column in the run_dbCAN overview.txt file is used to query the primary keys of the corresponding sources in the database.
