CLEAN
A contrastive learning model for high-quality functional prediction of proteins
Important
This file type requires the parsomics-plugin-clean plugin
File naming
The file names must adhere to one of the following patterns:
- <MAG-name>_maxsep.csv
- <MAG-name>_pvalue.csv
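As a rough illustration, the naming rule can be checked with a regular expression. This is a minimal sketch and not the plugin's actual code: the function name `parse_clean_file_name` is hypothetical, and it assumes the MAG name is everything before the final `_maxsep.csv` or `_pvalue.csv` suffix.

```python
import re
from pathlib import Path

# Assumption: the MAG name is whatever precedes the final
# "_maxsep.csv" or "_pvalue.csv" suffix of the file name.
CLEAN_FILE_PATTERN = re.compile(r"^(?P<mag>.+)_(?P<mode>maxsep|pvalue)\.csv$")


def parse_clean_file_name(path):
    """Return (mag_name, mode) if the file name matches a pattern, else None.

    Hypothetical helper for illustration only.
    """
    match = CLEAN_FILE_PATTERN.match(Path(path).name)
    if match is None:
        return None
    return match.group("mag"), match.group("mode")
```

For example, `parse_clean_file_name("results/MAG001_maxsep.csv")` yields the MAG name `MAG001` and the mode `maxsep`, while a file ending in `.tsv` is rejected.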
File format
The file must NOT include a header (i.e. it should not include column names at the top). It must have the following columns:
| Column property | Column obligatoriness | Data type | Data nullability |
|---|---|---|---|
| Protein name | Mandatory | String | Not nullable |
| EC# description | Optional (ignored) | N/A | N/A |
| EC#/Score | Mandatory | String | Nullable |
A few things to keep in mind:
- The "EC#/Score" column must be formatted as EC:<EC#>/<Score>. For example: EC:7.6.2.2/0.0321.
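The "EC#/Score" format can be split into its two parts as sketched below. The function name `parse_ec_score` is hypothetical, and treating an empty field as a null value is an assumption based on the column being listed as nullable.

```python
import re

# Matches "EC:<EC#>/<Score>", e.g. "EC:7.6.2.2/0.0321".
EC_SCORE_PATTERN = re.compile(r"^EC:(?P<ec>[^/]+)/(?P<score>[^/]+)$")


def parse_ec_score(field):
    """Split an "EC#/Score" field into (ec_number, score).

    Hypothetical helper. Assumption: an empty field stands for a
    null value, so None is returned for it.
    """
    if field is None or field.strip() == "":
        return None
    match = EC_SCORE_PATTERN.match(field.strip())
    if match is None:
        raise ValueError(f"Not in EC:<EC#>/<Score> format: {field!r}")
    # float() raises ValueError if the score part is not numeric.
    return match.group("ec"), float(match.group("score"))
```

Raising an error on malformed values, rather than skipping them silently, is a design choice of this sketch that makes formatting problems visible early.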
Mapping to database
ProteinAnnotationFile
| Original data | ProteinAnnotationFile field |
|---|---|
| CLEAN CSV file path | path |
ProteinAnnotationEntry
| Original data | ProteinAnnotationEntry field |
|---|---|
| Protein name | protein_key [1] |
| EC#/Score | accession and score |
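The mapping in the table above could look roughly like the following sketch. It is not the plugin's implementation: `rows_to_entries` is a hypothetical name, the `protein_keys` dictionary stands in for the database query that resolves a protein name to its primary key, and three further assumptions are made: the "EC#/Score" value is the last column of each row, rows with an empty value are skipped, and the accession is stored without the `EC:` prefix.

```python
import csv
import io


def rows_to_entries(csv_text, protein_keys):
    """Map headerless CLEAN CSV rows to entry-like dictionaries.

    Hypothetical helper. protein_keys is a plain dict standing in
    for the database lookup of a protein's primary key by name.
    """
    entries = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Assumption: rows without an EC#/Score value are skipped.
        if len(row) < 2 or not row[-1].strip():
            continue
        protein_name, ec_score = row[0], row[-1].strip()
        if not ec_score.startswith("EC:"):
            raise ValueError(f"Unexpected EC#/Score value: {ec_score!r}")
        # Assumption: the accession is stored without the "EC:" prefix.
        accession, _, score = ec_score[len("EC:"):].partition("/")
        entries.append(
            {
                "protein_key": protein_keys[protein_name],
                "accession": accession,
                "score": float(score),
            }
        )
    return entries
```

Note that the file is read without a header row, so the first line is already treated as data.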
Important
The run_dbCAN and CLEAN plugins treat Enzyme Commission number (EC#) annotations
differently. The former stores the EC# as an annotation description, while the
latter stores the EC# as an annotation accession. EC#s are accession strings for
the Expasy database.
Footnotes
[1] The protein name in the CLEAN CSV file is used to query the primary key of the corresponding protein in the database.