CLEAN

A contrastive learning model for high-quality functional prediction of proteins


Important

This file type requires the parsomics-plugin-clean plugin

File naming

The file names must adhere to one of the following patterns:

  • <MAG-name>_maxsep.csv
  • <MAG-name>_pvalue.csv
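
As an illustration of the naming rule above, the following sketch checks a file name against the two patterns and recovers the MAG name. The helper name `split_clean_file_name` and the sample path are hypothetical and are not part of the plugin's API; the sketch also assumes that the MAG name is everything before the final `_maxsep` or `_pvalue` suffix.

```python
import re
from pathlib import Path

# Hypothetical pattern: <MAG-name>_maxsep.csv or <MAG-name>_pvalue.csv
_CLEAN_NAME = re.compile(r"^(?P<mag>.+)_(?P<suffix>maxsep|pvalue)\.csv$")


def split_clean_file_name(path):
    """Return (MAG name, suffix), or None if the name matches neither pattern."""
    match = _CLEAN_NAME.match(Path(path).name)
    if match is None:
        return None
    return match.group("mag"), match.group("suffix")


# Made-up example path, for illustration only.
print(split_clean_file_name("results/bin_42_maxsep.csv"))  # ('bin_42', 'maxsep')
```

A name such as `bin_42.csv` matches neither pattern, so the helper returns `None` for it.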

File format

The file must NOT include a header (i.e. it should not include column names at the top). It must have the following columns:

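| Column property | Column obligatoriness | Data type | Data nullability |
|-----------------|-----------------------|-----------|------------------|
| Protein name    | Mandatory             | String    | Not nullable     |
| EC# description | Optional (ignored)    | N/A       | N/A              |
| EC#/Score       | Mandatory             | String    | Nullable         |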

A few things to keep in mind:

  • The "EC#/Score" column must be formatted as EC:<EC#>/<Score>, for example: EC:7.6.2.2/0.0321.
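
The format rule above can be sketched as a small parser. This is only a minimal illustration, not the plugin's actual implementation: the function name `parse_ec_score` is hypothetical, and treating an empty value as "no annotation" is an assumption based on the column being nullable.

```python
def parse_ec_score(field):
    """Split an 'EC:<EC#>/<Score>' string into (EC number, score).

    Returns None for an empty value (assumed meaning of a null column).
    """
    text = (field or "").strip()
    if not text:
        return None
    if not text.startswith("EC:"):
        raise ValueError(f"expected 'EC:<EC#>/<Score>', got {field!r}")
    # Split on the last slash: the score follows it, the EC number precedes it.
    ec_number, slash, score = text[len("EC:"):].rpartition("/")
    if not slash or not ec_number:
        raise ValueError(f"expected 'EC:<EC#>/<Score>', got {field!r}")
    return ec_number, float(score)


print(parse_ec_score("EC:7.6.2.2/0.0321"))  # ('7.6.2.2', 0.0321)
```

Values that lack the `EC:` prefix or the `/` separator raise a `ValueError` in this sketch, so malformed rows surface early rather than being stored silently.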

Mapping to database

ProteinAnnotationFile

| Original data       | ProteinAnnotationFile field |
|---------------------|-----------------------------|
| CLEAN CSV file path | path                        |

ProteinAnnotationEntry

| Original data | ProteinAnnotationEntry field |
|---------------|------------------------------|
| Protein name  | protein_key [1]              |
| EC#/Score     | accession and score          |

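
To make the mapping concrete, here is a rough sketch that reads a headerless CLEAN CSV and produces one dictionary per row, with keys named after the ProteinAnnotationEntry fields. Several things here are assumptions rather than documented plugin behaviour: the function name `rows_to_entries` is hypothetical, the sample rows are made up, the first column is taken as the protein name and the last as EC#/Score (so the optional description column is skipped), and the accession is stored without the `EC:` prefix. The database lookup that turns a protein name into `protein_key` is not shown.

```python
import csv
import io


def rows_to_entries(handle):
    """Yield a dict per CSV row; the file is assumed to have no header."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        protein_name = row[0]
        # Assumption: EC#/Score is the last column; description, if any, is ignored.
        ec_field = row[-1].strip() if len(row) > 1 else ""
        accession, score = None, None  # EC#/Score is nullable
        if ec_field:
            ec_number, _, raw_score = ec_field[len("EC:"):].rpartition("/")
            accession, score = ec_number, float(raw_score)
        yield {
            "protein_name": protein_name,  # resolved to protein_key by a DB query
            "accession": accession,
            "score": score,
        }


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "prot_1,EC:7.6.2.2/0.0321\n"
    "prot_2,some description,EC:1.1.1.1/0.5\n"
    "prot_3,\n"
)
entries = list(rows_to_entries(sample))
for entry in entries:
    print(entry)
```

The third sample row has an empty EC#/Score value, so both `accession` and `score` come out as `None`.
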
Important

The run_dbCAN and CLEAN plugins treat Enzyme Commission number (EC#) annotations differently. The former stores the EC# as an annotation description, while the latter stores it as an annotation accession. EC#s are accession strings in the Expasy database.

Footnotes

  1. The protein name in the CLEAN CSV file is used to query the primary key of the corresponding protein in the database.