prokka
Rapid prokaryotic genome annotation
File naming
The prokka format requires four files per MAG:
- <MAG-name>.fna
- <MAG-name>.ffn
- <MAG-name>.faa
- <MAG-name>.gff
FASTA files must have unambiguous file extensions that indicate what kind of
sequence they hold. The accepted extensions are .fna, .ffn, .faa for
contig, gene, and protein sequences, respectively.
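As a minimal sketch of the naming rule above, the snippet below splits a FASTA file name into its MAG name and the kind of sequence its extension implies. The function `split_prokka_fasta_name` and the `EXTENSION_KINDS` mapping are hypothetical names used for illustration; they are not part of prokka or parsomics.

```python
from pathlib import Path
from typing import Tuple

# Hypothetical mapping from the accepted extensions to the kind of sequence.
EXTENSION_KINDS = {".fna": "contig", ".ffn": "gene", ".faa": "protein"}


def split_prokka_fasta_name(filename: str) -> Tuple[str, str]:
    """Return (MAG name, sequence kind) for a prokka FASTA file name."""
    path = Path(filename)
    kind = EXTENSION_KINDS.get(path.suffix)
    if kind is None:
        # Any other extension is ambiguous and therefore rejected.
        raise ValueError(f"unsupported FASTA extension: {path.suffix!r}")
    # The stem is everything before the final extension, i.e. the MAG name.
    return path.stem, kind
```

For example, `split_prokka_fasta_name("bin_1.faa")` yields the MAG name `bin_1` and the kind `protein`, while a file such as `bin_1.fasta` is rejected.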
File format
For more information on the prokka output files, visit the prokka repository.
FASTA of contigs (.fna)
The file must follow the FASTA format standard, with an additional requirement for the heading: it must start with the contig name, then (optionally) a space followed by a description.
FASTA of genes (.ffn)
The file must follow the FASTA format standard, with an additional requirement for the heading: it must start with the gene name, then (optionally) a space followed by a description.
FASTA of proteins (.faa)
The file must follow the FASTA format standard, with an additional requirement for the heading: it must start with the protein name, then (optionally) a space followed by a description.
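The heading rule shared by the three FASTA files above (a name, then optionally a space and a description) can be sketched as follows. The function name `parse_fasta_header` is hypothetical, and this is only one possible reading of the rule, not the parsomics implementation.

```python
from typing import Optional, Tuple


def parse_fasta_header(line: str) -> Tuple[str, Optional[str]]:
    """Split a FASTA header line into (name, description)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # The name runs up to the first space; the rest is the description.
    name, _, description = line[1:].rstrip().partition(" ")
    if not name:
        # Covers both an empty header and a header that starts with a space.
        raise ValueError("the header must start with the sequence name")
    return name, (description.strip() or None)
```

A header without a description, such as `>ABC_00001`, returns `None` as its description.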
GFF (.gff or .gff3)
The file must follow the General Feature Format (GFF). It must have columns representing the following data, in that order and without a header:
| Column name | Requirement | Data type | Nullability |
|---|---|---|---|
| seqid | Mandatory | String | Not nullable |
| source | Mandatory | String | Nullable |
| type | Mandatory | String | Not nullable |
| start | Mandatory | Integer | Not nullable |
| end | Mandatory | Integer | Not nullable |
| score | Mandatory | Float | Nullable |
| strand | Mandatory | String | Nullable |
| phase | Mandatory | Integer | Nullable |
| attributes | Mandatory | String | Nullable |
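To illustrate the column order, types, and nullability listed in the table, here is a minimal sketch that parses one tab-separated GFF feature line. The names `parse_gff_line`, `GFF_COLUMNS`, `NOT_NULLABLE`, and `CONVERTERS` are hypothetical. The sketch assumes that comment and directive lines (those starting with `#`) have already been filtered out, and it treats `.` as the null value, as the GFF format does.

```python
from typing import Any, Dict

GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")
NOT_NULLABLE = {"seqid", "type", "start", "end"}
# Columns that are not plain strings, with the type each one is converted to.
CONVERTERS = {"start": int, "end": int, "score": float, "phase": int}


def parse_gff_line(line: str) -> Dict[str, Any]:
    """Parse one tab-separated GFF feature line into typed columns."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(GFF_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(values)}")
    record: Dict[str, Any] = {}
    for name, value in zip(GFF_COLUMNS, values):
        if value in (".", ""):  # GFF writes "." for an undefined value
            if name in NOT_NULLABLE:
                raise ValueError(f"column {name!r} must not be empty")
            record[name] = None
        else:
            record[name] = CONVERTERS.get(name, str)(value)
    return record
```

A line whose score column is `.` yields `None` for `score`, while a `.` in `seqid`, `type`, `start`, or `end` raises an error.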
GFF entries for gene fragments (e.g. CDS, exon) must include either
locus_tag or ID in their attributes column. This is what parsomics uses
to link GFF entries to genes.
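The requirement above can be checked with a short sketch like the one below. The helper names `parse_attributes` and `linking_name` are hypothetical, and the order in which the two keys are tried (locus_tag first, then ID) is an assumption: the text only states that either one must be present, not which one parsomics prefers.

```python
from typing import Dict, Optional
from urllib.parse import unquote


def parse_attributes(attributes: Optional[str]) -> Dict[str, str]:
    """Turn 'key=value;key=value' into a dict, decoding %-escapes."""
    result: Dict[str, str] = {}
    for pair in (attributes or "").split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key:
            result[unquote(key)] = unquote(value)
    return result


def linking_name(attributes: Optional[str]) -> str:
    """Return the locus_tag, or the ID if no locus_tag is present."""
    parsed = parse_attributes(attributes)
    # Assumption: locus_tag is tried before ID; the text only says either works.
    for key in ("locus_tag", "ID"):
        if parsed.get(key):
            return parsed[key]
    raise ValueError("attributes must include locus_tag or ID")
```

An entry whose attributes hold neither key (or are empty) is rejected with a `ValueError`.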
Mapping to database
FASTAFile
| Original data | FASTAFile field |
|---|---|
| FASTA file path | path |
| FASTA file extension | sequence_type [1] |
| FASTA file name | genome_key [2] |
FASTAEntry
| Original data | FASTAEntry field |
|---|---|
| FASTA entry ID | sequence_name [3] |
| FASTA entry Description | description |
| FASTA entry Sequence | sequence |
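As a rough illustration of the FASTAEntry mapping, the sketch below reads FASTA lines and yields one dictionary per entry, keyed by the field names from the table. The function `read_fasta_entries` is a hypothetical helper, not the parsomics parser; it assumes that sequences may span several lines and silently ignores any sequence text that appears before the first header.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def read_fasta_entries(lines: Iterable[str]) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per FASTA entry, keyed by the FASTAEntry field names."""
    entry: Optional[Dict[str, Optional[str]]] = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if entry is not None:
                # A new header closes the previous entry.
                entry["sequence"] = "".join(chunks)
                yield entry
            name, _, description = line[1:].partition(" ")
            entry = {"sequence_name": name,
                     "description": description.strip() or None}
            chunks = []
        elif line.strip() and entry is not None:
            chunks.append(line.strip())
    if entry is not None:
        # Emit the last entry once the input is exhausted.
        entry["sequence"] = "".join(chunks)
        yield entry
```

Because the function accepts any iterable of lines, it can be given an open file handle as well as a list of strings.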
GFFFile
| Original data | GFFFile field |
|---|---|
| GFF file path | path |
| GFF file name | genome_key [4] |
GFFEntry
| Original data | GFFEntry field |
|---|---|
| GFF entry seqid column | sequence_name |
| GFF entry source column | source_name |
| GFF entry type column | fragment_type |
| GFF entry start column | coord_start |
| GFF entry end column | coord_stop |
| GFF entry score column | score |
| GFF entry strand column | strand |
| GFF entry phase column | phase |
| GFF entry attributes column | attributes [5] |
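The GFFEntry mapping in the table can be sketched as a plain dictionary lookup, with the attributes string turned into a JSON-ready dictionary. How parsomics performs this conversion is not shown in this document, so `COLUMN_TO_FIELD`, `to_gff_entry_fields`, and the simple `key=value;...` parsing below are illustrative assumptions only.

```python
from typing import Any, Dict

# GFF column name -> GFFEntry field name, as listed in the table.
COLUMN_TO_FIELD = {
    "seqid": "sequence_name", "source": "source_name", "type": "fragment_type",
    "start": "coord_start", "end": "coord_stop", "score": "score",
    "strand": "strand", "phase": "phase", "attributes": "attributes",
}


def to_gff_entry_fields(columns: Dict[str, Any]) -> Dict[str, Any]:
    """Rename GFF columns to GFFEntry fields and structure the attributes."""
    fields = {COLUMN_TO_FIELD[name]: value for name, value in columns.items()}
    raw = fields.get("attributes")
    if isinstance(raw, str):
        # Assumption: attributes are 'key=value' pairs separated by ';'.
        pairs = (item.strip().partition("=") for item in raw.split(";") if item.strip())
        fields["attributes"] = {key: value for key, _, value in pairs}
    return fields
```

The resulting `attributes` dictionary can be serialized with `json.dumps` before being stored in a JSON-typed column.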
Footnotes
1. .fna for SequenceType.CONTIG ("CONTIG"), .ffn for SequenceType.GENE ("GENE"), .faa for FragmentType.PROTEIN ("PROTEIN").
2. The MAG name in the FASTA file name is used to query the primary key of the corresponding genome in the database.
3. The "ID" refers to the sequence name, not the primary key! To avoid confusion, primary keys in parsomics are named key, not id.
4. The MAG name in the GFF file name is used to query the primary key of the corresponding genome in the database.
5. For easier access to data, this column is converted from a string to a JSONB property in the database.