GTDB-Tk
File naming
The file name must end with ".summary.tsv".
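As a minimal sketch of this rule, the check below tests only the suffix of the file name. The function name `has_valid_gtdbtk_name` is hypothetical and is not part of the actual tool.

```python
from pathlib import Path

# Suffix required by the naming rule above.
REQUIRED_SUFFIX = ".summary.tsv"


def has_valid_gtdbtk_name(path: str) -> bool:
    """Return True if the file name (not the directory) ends with the required suffix."""
    return Path(path).name.endswith(REQUIRED_SUFFIX)
```

For example, a file named `gtdbtk.bac120.summary.tsv` passes this check, while `gtdbtk.bac120.summary.tsv.gz` does not.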
File format
tip
For more information on the GTDB-Tk output format, visit the GTDB-Tk documentation.
The file must include a header line (the column names on the first line). The header names themselves can be anything, but the columns must appear in exactly the following order:
Column name | Requirement | Data type | Nullability |
---|---|---|---|
user_genome | Mandatory | String | Not nullable |
classification | Mandatory | String | Not nullable |
closest_genome_reference | Mandatory | String | Nullable |
closest_genome_reference_radius | Mandatory | Float | Nullable |
closest_genome_taxonomy | Mandatory (ignored) | N/A | N/A |
closest_genome_ani | Mandatory | Float | Nullable |
closest_genome_af | Mandatory | Float | Nullable |
closest_placement_reference | Mandatory | String | Nullable |
closest_placement_radius | Mandatory | Float | Nullable |
closest_placement_taxonomy | Mandatory (ignored) | N/A | N/A |
closest_placement_ani | Mandatory | Float | Nullable |
closest_placement_af | Mandatory | Float | Nullable |
pplacer_taxonomy | Mandatory (ignored) | N/A | N/A |
classification_method | Mandatory | String | Not nullable |
note | Mandatory | String | Nullable |
other_related_references | Mandatory (ignored) | N/A | N/A |
msa_percent | Mandatory (ignored) | N/A | N/A |
translation_table | Mandatory (ignored) | N/A | N/A |
red_value | Mandatory | Float | Nullable |
warnings | Mandatory | String | Nullable |
info
Why are there mandatory columns that are ignored?
This is a consequence of how the GTDB-Tk file parser is written. When the file is read, it must comply with a predefined schema (column order and types), even though some of these columns are dropped afterwards.
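The following is a minimal sketch of a positional, schema-driven read of this kind. It is not the actual parser: the names `SCHEMA`, `NULL_TOKENS` and `parse_summary` are hypothetical, and treating an empty cell or `N/A` as null is an assumption. Columns are matched by position only, so the header names are never compared.

```python
import csv

# (column name, converter, nullable). A converter of None marks a column that
# must be present to satisfy the schema but is dropped after reading.
SCHEMA = [
    ("user_genome", str, False),
    ("classification", str, False),
    ("closest_genome_reference", str, True),
    ("closest_genome_reference_radius", float, True),
    ("closest_genome_taxonomy", None, True),
    ("closest_genome_ani", float, True),
    ("closest_genome_af", float, True),
    ("closest_placement_reference", str, True),
    ("closest_placement_radius", float, True),
    ("closest_placement_taxonomy", None, True),
    ("closest_placement_ani", float, True),
    ("closest_placement_af", float, True),
    ("pplacer_taxonomy", None, True),
    ("classification_method", str, False),
    ("note", str, True),
    ("other_related_references", None, True),
    ("msa_percent", None, True),
    ("translation_table", None, True),
    ("red_value", float, True),
    ("warnings", str, True),
]

# Assumption: missing values are written as an empty cell or "N/A".
NULL_TOKENS = {"", "N/A"}


def parse_summary(lines):
    """Parse tab-separated lines (header first) into a list of dicts."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader, None)
    if header is None:
        raise ValueError("file is empty; a header line is required")
    if len(header) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} columns, got {len(header)}")
    rows = []
    for line_no, cells in enumerate(reader, start=2):
        if len(cells) != len(SCHEMA):
            raise ValueError(f"line {line_no}: expected {len(SCHEMA)} columns")
        row = {}
        for (name, convert, nullable), cell in zip(SCHEMA, cells):
            if convert is None:
                continue  # ignored column: validated by position, then dropped
            if cell in NULL_TOKENS:
                if not nullable:
                    raise ValueError(f"line {line_no}: {name} must not be null")
                row[name] = None
            else:
                row[name] = convert(cell)
        rows.append(row)
    return rows
```

`parse_summary` accepts any iterable of lines, such as an open file handle, and each returned dict contains only the 14 columns that are not marked as ignored.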
Mapping to database
GTDBTkTsvFile
Original data | GTDBTkTsvFile field | Notes |
---|---|---|
GTDB-Tk file path | path | |
GTDBTkTsvEntry
Original data | GTDBTkTsvEntry field | Notes |
---|---|---|
user_genome | genome_key | The MAG name in the GFF file name is used to query the primary key of the corresponding genome in the database |
classification | domain , phylum , klass , order , family , genus , species | The classification column is broken down into multiple fields for better readability |
closest_genome_reference or closest_placement_reference column | Mandatory | |
closest_genome_reference_radius or closest_placement_radius column | Mandatory | |
closest_genome_ani or closest_placement_ani column | Mandatory | |
closest_genome_af or closest_placement_af column | Mandatory | |
classification_method | classification_method | |
note | note | |
red_value | red_value | |
warnings | warnings | |
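To illustrate the mapping above, here is a minimal sketch of how a `classification` string could be split into the seven rank fields. It assumes the usual GTDB rank prefixes (`d__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`) separated by semicolons. The helper names are hypothetical, and whether the database stores the values with or without the prefix is not stated here, so stripping it is an assumption. The second helper shows one possible reading of the "genome or placement" rows, where the genome value takes precedence. That precedence is also an assumption.

```python
# GTDB rank prefix -> GTDBTkTsvEntry field name.
RANK_FIELDS = {
    "d": "domain",
    "p": "phylum",
    "c": "klass",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def split_classification(classification: str) -> dict:
    """Split a GTDB classification string into one value per rank field.

    Ranks that are absent or empty (for example a trailing "s__") map to None.
    """
    ranks = dict.fromkeys(RANK_FIELDS.values())
    for part in classification.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANK_FIELDS and name:
            ranks[RANK_FIELDS[prefix]] = name
    return ranks


def first_non_null(genome_value, placement_value):
    """Prefer the closest_genome_* value and fall back to closest_placement_*."""
    return genome_value if genome_value is not None else placement_value
```

With this sketch, a string ending in `g__Escherichia;s__Escherichia coli` yields `genus` set to `Escherichia` and `species` set to `Escherichia coli`, while a string with no rank prefixes leaves every field as `None`.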