Sequence File Formats
Fast5
The FAST5 format is the standard sequencing output of Oxford Nanopore sequencers such as the MinION. It is based on the hierarchical data format HDF5, which enables storage of large and complex data. In contrast to fasta and fastq files, a FAST5 file is binary and cannot be opened with a normal text editor.
Data stored in nanopore FAST5 files can contain the sequence of a read in fastq format (after basecalling), the raw signal of the pore as well as several log files and other information.
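Because a FAST5 file is an HDF5 container, one quick way to tell it apart from a text-based fasta or fastq file is to look at its first bytes. The following is a minimal sketch using only the Python standard library. It assumes the 8-byte HDF5 signature sits at the very start of the file (HDF5 also allows it at later offsets when a user block is present, which this sketch does not handle). The function name looks_like_hdf5 and the demo file names are hypothetical, and the demo file only carries the signature; it is not a real FAST5 file.

```python
import os
import tempfile

# The 8-byte signature that opens an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` starts with the HDF5 signature.

    Assumption: the signature is at offset 0 (no HDF5 user block).
    """
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Demo with two throwaway files: one that merely starts with the HDF5
# signature (a stand-in, not a real FAST5 file) and one plain-text fasta file.
with tempfile.TemporaryDirectory() as tmp:
    fake_fast5 = os.path.join(tmp, "demo.fast5")
    with open(fake_fast5, "wb") as handle:
        handle.write(HDF5_SIGNATURE + b"\x00" * 8)

    text_fasta = os.path.join(tmp, "demo.fasta")
    with open(text_fasta, "w") as handle:
        handle.write(">seq1\nACGT\n")

    fast5_result = looks_like_hdf5(fake_fast5)
    fasta_result = looks_like_hdf5(text_fasta)

print(fast5_result, fasta_result)
```

To inspect the actual contents (raw signal, basecalls, logs), a dedicated HDF5 reader is needed; the check above only distinguishes the container type.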
FastA
The fasta format is one of the simplest and most common file formats for storing sequence data. A fasta file can contain one or many nucleotide or amino acid sequences. The first line of each sequence in a fasta file starts with a “>” followed by a series of sequence identifiers or attributes. Subsequent lines contain the nucleotide or amino acid sequence.
>NC_012064.1 Thalassiosira pseudonana CCMP1335
TCCAAGAGTCGAAgtagtttcttcttcttatctctTTCAATCAAATAGTGATCTTGGTATGCCAGAAGTTGTGGTTTGTT
TCGTTTATACTCCACAAAACGTCTGTCTAGCTGTGTCATTTCTGATGCAAGAAGGAAGCTATCTGGGCCATGAAGAATTG
TGTTTCGCATCTTCCATTGTCCTTCAAAAATTTTCCATGTTTCCCCGATTAGCACCGTGGAGAGTTCGAAGGGgtctctt
ttcttctccattgtaccatcatcatatcgTCTGGGTGGTATCCACGTAGATTGTAGTGTTTATGCCCATTCCACATGATG
GAATCCCCGGAGAAGTGCATCACTACCGAGAGACTCTTGTCGCTCGATTGCTCGCAACGTATGTGAGCAGTGTAAAGCAT
ATGGATTCCGAGGGAGATGAACAAGTTTGCAAATCGCGTCA
Typical file extensions for fasta files are:
- .fnt
- .fna (nucleotide)
- .faa (amino acid)
- .fasta
- .fa
- .fas
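The fasta layout described in this section (a “>” header line followed by one or more sequence lines) can be read with a few lines of code. The sketch below is one possible approach and not a reference parser: the function name read_fasta and the two demo records are made up for illustration, and it assumes well-formed input in which every sequence is preceded by a header line.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open fasta text handle.

    The header is returned without the leading ">"; sequence lines that
    belong to the same record are joined into one string.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the last record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Demo on an in-memory file with two hypothetical records.
example = io.StringIO(">seq1 demo record\nACGT\nacgt\n>seq2\nMKV\n")
records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

The same function works on a real file by passing an open handle, for example `open("genome.fa")`, where the file name is again only a placeholder.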
FastQ
The fastq format is the de facto standard for second-generation sequencing technologies such as Illumina sequencers. It is similar to the fasta format, but in addition to the sequence itself, a fastq file also stores quality scores for the sequence. A fastq file stores every sequence in exactly four lines:
- The name/ID line starting with “@” followed by a sequence identifier
- The sequence itself
- A separator line starting with “+” (optionally followed by additional information, e.g., the read name again); it is a historical artifact and can be ignored nowadays
- The quality line, with one character per sequence residue, each encoding the probability that the residue was called incorrectly (Phred score)
@4e131bcf-f814-485a-b02f-3d133030b06e
TTGTTATGCCGCTTCGTTCAGTTACGTATTGCTCGACGGTTCCACTTTGAACGTTTGCGTTCAAATACTATAACTAGTTTTGCTCTCGTTTTAATCTTCCCCGTCTCTCCCAAA
+
??????+(+*(%%#()&.67:58D8.10;01.4.8.)(*'(.-,2,?<79C97:?2()((,%()**())'&&IIICCCIIII**(%%#()&.67:58D8.10;01.4.8II?%?
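The four-line record structure and the quality encoding can be made concrete with a short sketch. It assumes the Phred+33 encoding (ASCII code minus 33 gives the score Q), which is an assumption not stated above; older Illumina files used an offset of 64. A Phred score Q corresponds to an error probability of 10^(-Q/10). The names read_fastq, phred_scores, error_probabilities and the demo record are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a strict four-line fastq file."""
    while True:
        name = handle.readline().rstrip("\r\n")
        if not name:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not name.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed fastq record: " + name)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + name)
        yield name[1:], sequence, quality


def phred_scores(quality: str) -> List[int]:
    """Convert a quality string into integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


def error_probabilities(quality: str) -> List[float]:
    """Convert a quality string into per-residue error probabilities."""
    return [10 ** (-score / 10) for score in phred_scores(quality)]


# Demo on a single hypothetical record.
example = io.StringIO("@read1\nACGT\n+\n?I+5\n")
reads = list(read_fastq(example))
name, sequence, quality = reads[0]
scores = phred_scores(quality)
probabilities = error_probabilities(quality)
print(name, scores)
```

Under the Phred+33 assumption, the “?” characters in the record above correspond to Q30 (an error probability of 0.001) and the “I” characters to Q40.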