GTDB-Tk

A toolkit for assigning objective taxonomic classifications to prokaryote genomes

File naming

The file name must end with ".summary.tsv".
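As a minimal sketch of this rule, the check below tests a path against the required suffix. The helper name `is_gtdbtk_summary` is hypothetical and is not part of parsomics.

```python
from pathlib import Path

# Required suffix, as stated above.
SUMMARY_SUFFIX = ".summary.tsv"


def is_gtdbtk_summary(path: str) -> bool:
    """Return True if the file name ends with '.summary.tsv'."""
    # Only the final path component is checked, not the parent directories.
    return Path(path).name.endswith(SUMMARY_SUFFIX)
```

For example, `is_gtdbtk_summary("results/gtdbtk.bac120.summary.tsv")` is true, while a file named `gtdbtk.bac120.summary.csv` is rejected.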

File format

tip

For more information on the GTDB-Tk output format, visit the GTDB-Tk documentation.

Results Summary (`.tsv`)

The file must include a header row (i.e. the column names on the first line). The header names themselves can be anything, but the columns must appear in exactly the order listed below:

Column name	Column obligatoriness	Data type	Data nullability
`user_genome`	Mandatory	String	Not nullable
`classification`	Mandatory	String	Not nullable
`closest_genome_reference`	Mandatory	String	Nullable
`closest_genome_reference_radius`	Mandatory	Float	Nullable
`closest_genome_taxonomy`	Mandatory (ignored)	N/A	N/A
`closest_genome_ani`	Mandatory	Float	Nullable
`closest_genome_af`	Mandatory	Float	Nullable
`closest_placement_reference`	Mandatory	String	Nullable
`closest_placement_radius`	Mandatory	Float	Nullable
`closest_placement_taxonomy`	Mandatory (ignored)	N/A	N/A
`closest_placement_ani`	Mandatory	Float	Nullable
`closest_placement_af`	Mandatory	Float	Nullable
`pplacer_taxonomy`	Mandatory (ignored)	N/A	N/A
`classification_method`	Mandatory	String	Not nullable
`note`	Mandatory	String	Nullable
`other_related_references`	Mandatory (ignored)	N/A	N/A
`msa_percent`	Mandatory (ignored)	N/A	N/A
`translation_table`	Mandatory (ignored)	N/A	N/A
`red_value`	Mandatory	Float	Nullable
`warnings`	Mandatory	String	Nullable

info

Why are there mandatory columns that are ignored?

This is a consequence of how the GTDB-Tk file parser is written. When the file is read, it must comply with a pre-defined schema (column order and types), even though some of the columns are dropped afterwards.
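The schema-based reading described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual parsomics parser: the names `SCHEMA`, `NULL_TOKENS`, and `read_summary` are hypothetical, and treating an empty string or `N/A` as a null value is an assumption about how missing values are written.

```python
import csv

# (column name, type) in the required order; None marks "mandatory (ignored)".
SCHEMA = [
    ("user_genome", str),
    ("classification", str),
    ("closest_genome_reference", str),
    ("closest_genome_reference_radius", float),
    ("closest_genome_taxonomy", None),
    ("closest_genome_ani", float),
    ("closest_genome_af", float),
    ("closest_placement_reference", str),
    ("closest_placement_radius", float),
    ("closest_placement_taxonomy", None),
    ("closest_placement_ani", float),
    ("closest_placement_af", float),
    ("pplacer_taxonomy", None),
    ("classification_method", str),
    ("note", str),
    ("other_related_references", None),
    ("msa_percent", None),
    ("translation_table", None),
    ("red_value", float),
    ("warnings", str),
]
NOT_NULLABLE = {"user_genome", "classification", "classification_method"}
NULL_TOKENS = {"", "N/A"}  # assumption: how missing values appear in the file


def read_summary(handle):
    """Parse an open summary file into a list of dicts, by column position."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # header text is not inspected, only its width
    if len(header) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} columns, got {len(header)}")
    rows = []
    for record in reader:
        if len(record) != len(SCHEMA):
            raise ValueError(f"expected {len(SCHEMA)} fields, got {len(record)}")
        row = {}
        for (name, kind), raw in zip(SCHEMA, record):
            if kind is None:
                continue  # column must be present, but its value is dropped
            if raw in NULL_TOKENS:
                if name in NOT_NULLABLE:
                    raise ValueError(f"column {name!r} is not nullable")
                row[name] = None
            else:
                row[name] = kind(raw)
        rows.append(row)
    return rows
```

Because values are matched by position rather than by header text, an ignored column cannot simply be left out of the file: removing it would shift every column after it.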

Mapping to database

`GTDBTkTsvFile`

Original data	`GTDBTkTsvFile` field	Notes
GTDB-Tk file path	`path`

`GTDBTkTsvEntry`

Original data	`GTDBTkTsvEntry` field
`user_genome`	`genome_key` ¹
`classification`	`domain`, `phylum`, `klass`, `order`, `family`, `genus`, `species` ²
`closest_genome_reference` or `closest_placement_reference` column	`reference` ³
`closest_genome_reference_radius` or `closest_placement_radius` column	`radius` ³
`closest_genome_ani` or `closest_placement_ani` column	`ani` ³
`closest_genome_af` or `closest_placement_af` column	`af` ³
`classification_method`	`classification_method`
`note`	`note`
`red_value`	`red_value`
`warnings`	`warnings`

¹ The MAG name in the GTDB-Tk file name is used to query the primary key of the corresponding genome in the database.
² The `classification` column is broken down into multiple fields for better readability.
³ The `closest_placement_*` columns are only filled when the classification method used by GTDB-Tk is ANI screen. Otherwise, the `closest_genome_*` columns are filled. With that in mind, parsomics includes only the metrics relevant to each classification method.
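The two transformations described in footnotes 2 and 3 can be sketched as follows. Both helpers are hypothetical and are not the parsomics implementation. The first assumes the usual GTDB lineage string with `d__` to `s__` rank prefixes separated by semicolons, and stores an empty rank such as `s__` as `None`. The second does not inspect `classification_method`; as a simplification, it keeps whichever set of columns has a reference filled in.

```python
# GTDB rank prefixes mapped to the GTDBTkTsvEntry field names listed above.
RANK_FIELDS = {
    "d": "domain",
    "p": "phylum",
    "c": "klass",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def split_classification(classification):
    """Split a lineage string such as 'd__Bacteria;p__...' into rank fields."""
    ranks = dict.fromkeys(RANK_FIELDS.values())  # every rank starts as None
    for token in classification.split(";"):
        prefix, sep, name = token.strip().partition("__")
        if sep and prefix in RANK_FIELDS and name:
            ranks[RANK_FIELDS[prefix]] = name
    return ranks


def pick_reference_metrics(row):
    """Collapse closest_genome_* / closest_placement_* into one set of fields."""
    # The radius columns are named differently in the two sets.
    candidates = (
        ("closest_genome", "closest_genome_reference_radius"),
        ("closest_placement", "closest_placement_radius"),
    )
    for prefix, radius_key in candidates:
        if row.get(f"{prefix}_reference") is not None:
            return {
                "reference": row[f"{prefix}_reference"],
                "radius": row.get(radius_key),
                "ani": row.get(f"{prefix}_ani"),
                "af": row.get(f"{prefix}_af"),
            }
    # Neither set is filled: all four fields stay null.
    return {"reference": None, "radius": None, "ani": None, "af": None}
```

The field is spelled `klass` in the mapping, presumably because `class` is a reserved word in Python.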
