CLEAN
A contrastive learning model for high-quality functional prediction of proteins
Important
This file type requires the parsomics-plugin-clean plugin.
File naming
The file names must adhere to one of the following patterns:
<MAG-name>_maxsep.csv
<MAG-name>_pvalue.csv
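As an illustration of the two naming patterns, the following minimal sketch splits a file name into its MAG name and its suffix (`maxsep` or `pvalue`). The function name `parse_clean_filename` and the sample file names are hypothetical and are not part of the plugin's API. The sketch also assumes that a MAG name may itself contain underscores.

```python
import re
from typing import Optional, Tuple

# Matches "<MAG-name>_maxsep.csv" or "<MAG-name>_pvalue.csv".
# Assumption: the MAG name is any non-empty string and may contain underscores.
CLEAN_FILENAME_RE = re.compile(r"(?P<mag>.+)_(?P<kind>maxsep|pvalue)\.csv")


def parse_clean_filename(filename: str) -> Optional[Tuple[str, str]]:
    """Return (MAG name, kind) if the name follows a CLEAN pattern, else None."""
    match = CLEAN_FILENAME_RE.fullmatch(filename)
    if match is None:
        return None
    return match.group("mag"), match.group("kind")


# Hypothetical file names, for illustration only.
for name in ["bin_01_maxsep.csv", "bin_01_pvalue.csv", "bin_01.csv"]:
    print(name, "->", parse_clean_filename(name))
```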
File format
The file must NOT include a header (i.e. it should not include column names at the top). It must have the following columns:
Column property | Column obligatoriness | Data type | Data nullability |
---|---|---|---|
Protein name | Mandatory | String | Not nullable |
EC# description | Optional (ignored) | N/A | N/A |
EC#/Score | Mandatory | String | Nullable |
A few things to keep in mind:
- The "EC#/Score" column must be formatted as `EC:<EC#>/<Score>`. For example: `EC:7.6.2.2/0.0321`.
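The format above can be sketched in a few lines of Python. This is not the plugin's actual parser: the helper names `parse_ec_score` and `parse_row` and the protein name `prot_1` are hypothetical. The sketch assumes that the protein name is the first field of a row and that the "EC#/Score" value is the last one, so the optional description column is skipped when present. An empty "EC#/Score" field is treated as null.

```python
import csv
import io
import re
from typing import List, Optional, Tuple

# "EC:<EC#>/<Score>", e.g. "EC:7.6.2.2/0.0321"
EC_SCORE_RE = re.compile(r"EC:(?P<ec>[^/]+)/(?P<score>.+)")


def parse_ec_score(field: str) -> Optional[Tuple[str, float]]:
    """Split an "EC#/Score" field into (EC number, score); empty means null."""
    field = field.strip()
    if not field:
        return None  # the column is nullable
    match = EC_SCORE_RE.fullmatch(field)
    if match is None:
        raise ValueError(f"Malformed EC#/Score field: {field!r}")
    return match.group("ec"), float(match.group("score"))


def parse_row(row: List[str]) -> Tuple[str, Optional[Tuple[str, float]]]:
    """Assumption: protein name is the first field, EC#/Score the last."""
    protein_name = row[0]
    ec_score = parse_ec_score(row[-1]) if len(row) > 1 else None
    return protein_name, ec_score


# Header-less sample content (hypothetical protein name).
sample = io.StringIO("prot_1,EC:7.6.2.2/0.0321\n")
for row in csv.reader(sample):
    print(parse_row(row))
```

Raising on a malformed field, rather than silently skipping it, is a design choice of this sketch; a real importer might prefer to log and continue.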
Mapping to database
ProteinAnnotationFile
Original data | ProteinAnnotationFile field |
---|---|
CLEAN CSV file path | path |
ProteinAnnotationEntry
Original data | ProteinAnnotationEntry field |
---|---|
Protein name | protein_key [1] |
EC#/Score | accession and score |
Important
The run_dbCAN and CLEAN plugins treat Enzyme Commission Number (EC#) annotations
differently. The former stores the EC# as an annotation description, while the
latter stores it as an annotation accession. EC numbers are accession strings in
the Expasy database.
Footnotes
1. The protein name in the CLEAN CSV file is used to query the primary key of the corresponding protein in the database.