CLEAN

A contrastive learning model for high-quality functional prediction of proteins

Important

This file type requires the parsomics-plugin-clean plugin.

File naming

The file names must adhere to one of the following patterns:

<MAG-name>_maxsep.csv
<MAG-name>_pvalue.csv
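
As a rough illustration of the naming rule above, the following sketch checks a file name against the two patterns and extracts the MAG name. The helper `split_clean_file_name` and the regular expression are hypothetical and are not part of parsomics or the plugin; they only assume that the MAG name is everything before the final `_maxsep` or `_pvalue` suffix.

```python
import re
from pathlib import Path

# Hypothetical pattern (an assumption, not taken from the plugin):
# <MAG-name>_maxsep.csv or <MAG-name>_pvalue.csv
CLEAN_FILE_NAME = re.compile(r"^(?P<mag>.+)_(?P<kind>maxsep|pvalue)\.csv$")


def split_clean_file_name(path):
    """Return (MAG name, kind) if the file name matches, otherwise None."""
    match = CLEAN_FILE_NAME.match(Path(path).name)
    if match is None:
        return None
    return match.group("mag"), match.group("kind")


if __name__ == "__main__":
    # The directory part is ignored; only the file name is checked.
    print(split_clean_file_name("results/bin_001_maxsep.csv"))
    # A name without one of the two suffixes is rejected.
    print(split_clean_file_name("bin_001.csv"))
```

Because the MAG name may itself contain underscores, the sketch anchors on the suffix rather than splitting on the first underscore.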

File format

The file must NOT include a header (i.e. it should not include column names at the top). It must have the following columns:

Column property	Column obligatoriness	Data type	Data nullability
Protein name	Mandatory	String	Not nullable
EC# description	Optional (ignored)	N/A	N/A
EC#/Score	Mandatory	String	Nullable

A few things to keep in mind:

The "EC#/Score" column must be formatted as EC:<EC#>/<Score>, for example: EC:7.6.2.2/0.0321.
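
To make the format concrete, here is a minimal sketch of how a single "EC#/Score" value could be split into an EC number and a numeric score. The function `parse_ec_score` is hypothetical and does not reflect how the plugin is actually implemented; treating an empty value as `None` is an assumption based on the column being nullable.

```python
import re

# Assumed shape of the value: "EC:" + EC number + "/" + score
EC_SCORE = re.compile(r"^EC:(?P<ec>[^/]+)/(?P<score>[^/]+)$")


def parse_ec_score(value):
    """Split 'EC:<EC#>/<Score>' into (EC number, score as float).

    Returns None for an empty value, since the column is nullable.
    Raises ValueError if a non-empty value does not follow the format.
    """
    if value is None or value.strip() == "":
        return None
    match = EC_SCORE.match(value.strip())
    if match is None:
        raise ValueError(f"not in EC:<EC#>/<Score> format: {value!r}")
    return match.group("ec"), float(match.group("score"))


if __name__ == "__main__":
    print(parse_ec_score("EC:7.6.2.2/0.0321"))
    print(parse_ec_score(""))
```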

Mapping to database

`ProteinAnnotationFile`

Original data	`ProteinAnnotationFile` field
CLEAN CSV file path	`path`

`ProteinAnnotationEntry`

Original data	`ProteinAnnotationEntry` field
Protein name	`protein_key` ¹
EC#/Score	`accession` and `score`
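
The mapping above can be pictured with the following sketch, which reads a headerless CSV and produces one record per row. It is only an illustration under several assumptions: the function `read_clean_rows` and the dictionary keys are hypothetical, the protein name is taken from the first column and the "EC#/Score" value from the last one, and the `EC:` prefix is dropped from the accession (the source does not say whether it is kept). Resolving the protein name to a `protein_key` requires a database lookup, which is not shown.

```python
import csv
import io


def read_clean_rows(handle):
    """Yield one dict per row of a headerless CLEAN CSV (illustrative only)."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        protein_name = row[0]
        # Assumption: the EC#/Score value is the last column when present.
        ec_score = row[-1].strip() if len(row) > 1 else ""
        accession, score = None, None
        if ec_score.startswith("EC:") and "/" in ec_score:
            # Split on the last "/" so the EC number stays intact.
            accession, _, score_text = ec_score[len("EC:"):].rpartition("/")
            score = float(score_text)
        yield {"protein_name": protein_name, "accession": accession, "score": score}


if __name__ == "__main__":
    sample = io.StringIO(
        "prot_1,ABC transporter,EC:7.6.2.2/0.0321\n"
        "prot_2,,\n"
    )
    for record in read_clean_rows(sample):
        print(record)
```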

Important

The run_dbCAN and CLEAN plugins treat Enzyme Commission number (EC#) annotations differently. The former stores the EC# as an annotation description, while the latter stores it as an annotation accession. EC numbers are accession strings for the Expasy database.

¹ The protein name in the CLEAN CSV file is used to query the primary key of the corresponding protein in the database.
