prokka

Rapid prokaryotic genome annotation

File naming

The prokka format requires four files per MAG (metagenome-assembled genome):

<MAG-name>.fna
<MAG-name>.ffn
<MAG-name>.faa
<MAG-name>.gff

info

FASTA files must have unambiguous file extensions that indicate what kind of sequence they hold. The accepted extensions are .fna, .ffn, and .faa, for contig, gene, and protein sequences, respectively.
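As an illustration of the naming rule, here is a minimal sketch that derives the MAG name and the sequence type from a file name. The function name `classify_fasta` and the lookup table are hypothetical and not part of parsomics. The type labels are the ones this page lists for the `sequence_type` field.

```python
from pathlib import Path

# Extension-to-sequence-type table, following the extensions listed above.
# The dictionary itself is illustrative, not taken from the parsomics code base.
SEQUENCE_TYPE_BY_EXTENSION = {
    ".fna": "CONTIG",
    ".ffn": "GENE",
    ".faa": "PROTEIN",
}


def classify_fasta(path: str) -> tuple[str, str]:
    """Return (mag_name, sequence_type) for a prokka FASTA file path."""
    p = Path(path)
    try:
        sequence_type = SEQUENCE_TYPE_BY_EXTENSION[p.suffix]
    except KeyError:
        raise ValueError(f"unrecognized FASTA extension: {p.suffix!r}") from None
    # The file name without its final extension is taken as the MAG name.
    return p.stem, sequence_type
```

For example, `classify_fasta("annotation/MAG_001.faa")` returns `("MAG_001", "PROTEIN")`, while a file ending in `.fasta` raises a `ValueError` because its sequence type would be ambiguous.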

File format

tip

For more information on the prokka output files, visit the prokka repository.

FASTA of contigs (`.fna`)

The file must follow the FASTA format standard, with one additional requirement: each header line must start with the contig name (directly after the `>`), optionally followed by a space and a description.

FASTA of genes (`.ffn`)

The file must follow the FASTA format standard, with one additional requirement: each header line must start with the gene name (directly after the `>`), optionally followed by a space and a description.

FASTA of proteins (`.faa`)

The file must follow the FASTA format standard, with one additional requirement: each header line must start with the protein name (directly after the `>`), optionally followed by a space and a description.
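The header rule shared by the three FASTA files can be sketched as follows. This is a minimal reader written for illustration, not the parser parsomics uses. The function name `parse_fasta` is hypothetical, and returning an empty string for a missing description is an assumption.

```python
from typing import Iterator


def parse_fasta(text: str) -> Iterator[tuple[str, str, str]]:
    """Yield (name, description, sequence) for each record in FASTA text."""
    name = None
    description = ""
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, description, "".join(chunks)
            # The name runs up to the first whitespace; the rest is the description.
            head = line[1:].split(maxsplit=1)
            if not head:
                raise ValueError("FASTA header without a sequence name")
            name = head[0]
            # Assumption: a missing description is represented as "".
            description = head[1] if len(head) > 1 else ""
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data before the first header")
            chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)
```

Given the text `">contig_1 len=8\nACGT\nACGT\n>contig_2\nTTGA\n"`, the generator yields `("contig_1", "len=8", "ACGTACGT")` and then `("contig_2", "", "TTGA")`.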

GFF (`.gff` or `.gff3`)

The file must follow the General Feature Format (GFF). It must contain the following nine tab-separated columns, in this order and without a header row:

Column name	Requirement	Data type	Nullability
`seqid`	Mandatory	String	Not nullable
`source`	Mandatory	String	Nullable
`type`	Mandatory	String	Not nullable
`start`	Mandatory	Integer	Not nullable
`end`	Mandatory	Integer	Not nullable
`score`	Mandatory	Float	Nullable
`strand`	Mandatory	String	Nullable
`phase`	Mandatory	Integer	Nullable
`attributes`	Mandatory	String	Nullable

GFF entries for gene fragments (e.g. CDS, exon) must include either `locus_tag` or `ID` in their `attributes` column. This is what parsomics uses to link GFF entries to genes.
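The column table and the `locus_tag`/`ID` rule can be sketched as a small line parser. This is an illustration only: the names `GFFRow`, `parse_gff_line`, and `gene_name` are hypothetical. It assumes that a null value is written as `.` (the GFF3 convention), that the caller skips lines starting with `#`, and that `locus_tag` is preferred over `ID` when both are present, which the text above does not specify.

```python
from typing import NamedTuple, Optional


class GFFRow(NamedTuple):
    seqid: str
    source: Optional[str]
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: Optional[str]
    phase: Optional[int]
    attributes: Optional[str]


def _nullable(value: str) -> Optional[str]:
    # Assumption: "." marks a null value, as in the GFF3 convention.
    return None if value == "." else value


def parse_gff_line(line: str) -> GFFRow:
    """Split one tab-separated GFF line into its nine typed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqid, source, type_, start, end, score, strand, phase, attributes = fields
    score_value = _nullable(score)
    phase_value = _nullable(phase)
    return GFFRow(
        seqid=seqid,
        source=_nullable(source),
        type=type_,
        start=int(start),
        end=int(end),
        score=float(score_value) if score_value is not None else None,
        strand=_nullable(strand),
        phase=int(phase_value) if phase_value is not None else None,
        attributes=_nullable(attributes),
    )


def gene_name(row: GFFRow) -> Optional[str]:
    """Return the locus_tag (or, failing that, the ID) of a GFF row."""
    if row.attributes is None:
        return None
    pairs = dict(
        item.split("=", 1) for item in row.attributes.split(";") if "=" in item
    )
    # Assumption: locus_tag takes precedence over ID.
    return pairs.get("locus_tag") or pairs.get("ID")
```

A row whose `gene_name` is `None` carries neither attribute, so under the rule above it could not be linked to a gene.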

Mapping to database

`FASTAFile`

Original data	`FASTAFile` field
FASTA file path	`path`
FASTA file extension	`sequence_type` ¹
FASTA file name	`genome_key` ²

`FASTAEntry`

Original data	`FASTAEntry` field
FASTA entry ID	`sequence_name` ³
FASTA entry Description	`description`
FASTA entry Sequence	`sequence`

`GFFFile`

Original data	`GFFFile` field
GFF file path	`path`
GFF file name	`genome_key` ⁴

`GFFEntry`

Original data	`GFFEntry` field
GFF entry `seqid` column	`sequence_name`
GFF entry `source` column	`source_name`
GFF entry `type` column	`fragment_type`
GFF entry `start` column	`coord_start`
GFF entry `end` column	`coord_stop`
GFF entry `score` column	`score`
GFF entry `strand` column	`strand`
GFF entry `phase` column	`phase`
GFF entry `attributes` column	`attributes` ⁵
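To make the mapping concrete, the sketch below renames the nine GFF columns to the `GFFEntry` field names from the table above and turns the `attributes` string into a dictionary that can be serialized as JSON. The helper names are hypothetical, and the way parsomics itself builds the JSONB value may differ. In particular, percent-decoding of attribute values is not handled here.

```python
import json
from typing import Any, Optional, Sequence

# GFFEntry field names in GFF column order, as listed in the table above.
GFF_ENTRY_FIELDS = [
    "sequence_name",
    "source_name",
    "fragment_type",
    "coord_start",
    "coord_stop",
    "score",
    "strand",
    "phase",
    "attributes",
]


def attributes_to_dict(attributes: Optional[str]) -> dict[str, str]:
    """Convert 'key=value;key=value' into a JSON-friendly dictionary."""
    result: dict[str, str] = {}
    if not attributes:
        return result
    for item in attributes.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # ignore fragments that have no "="
            result[key] = value
    return result


def to_gff_entry(columns: Sequence[Any]) -> dict[str, Any]:
    """Map nine already-typed GFF column values to GFFEntry field names."""
    if len(columns) != len(GFF_ENTRY_FIELDS):
        raise ValueError(f"expected {len(GFF_ENTRY_FIELDS)} columns, got {len(columns)}")
    entry = dict(zip(GFF_ENTRY_FIELDS, columns))
    entry["attributes"] = attributes_to_dict(entry["attributes"])
    return entry


entry = to_gff_entry(
    ["contig_1", "Prodigal:2.6", "CDS", 10, 300, None, "+", 0,
     "ID=ABC_00001;product=hypothetical protein"]
)
# The whole entry, including the nested attributes, serializes to JSON.
entry_json = json.dumps(entry)
```

Storing the attributes as a structured value rather than a string means that individual keys such as `ID` can be looked up directly, which is the ease of access the JSONB conversion is meant to provide.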

Footnotes

¹ .fna for SequenceType.CONTIG ("CONTIG"), .ffn for SequenceType.GENE ("GENE"), .faa for SequenceType.PROTEIN ("PROTEIN").
² The MAG name in the FASTA file name is used to query the primary key of the corresponding genome in the database.
³ The "ID" refers to the sequence name, not the primary key! To avoid confusion, primary keys in parsomics are named `key`, not `id`.
⁴ The MAG name in the GFF file name is used to query the primary key of the corresponding genome in the database.
⁵ For easier access to data, this column is converted from a string to a JSONB property in the database.
