Sequence File Formats


Fast5

The FAST5 format is the standard sequencing output for Oxford Nanopore sequencers such as the MinION. It is based on the Hierarchical Data Format version 5 (HDF5), which enables storage of large and complex data. In contrast to fasta and fastq files, a FAST5 file is binary and cannot be opened with a normal text editor.

A nanopore FAST5 file can contain the sequence of a read in fastq format (after basecalling), the raw signal of the pore, and several log files and other information.


FastA

The fasta format is one of the simplest and most common file formats for storing sequence data. A fasta file can contain one or many nucleotide or amino acid sequences. The first line of each sequence record starts with a “>” followed by a series of sequence identifiers or attributes. Subsequent lines contain the nucleotide or amino acid sequence.

Some fastA files contain the sequence in one long line. Other files show only 60 or 80 nucleotides per line and thus store the sequence across multiple lines.
>NC_012064.1 Thalassiosira pseudonana CCMP1335 
TCCAAGAGTCGAAgtagtttcttcttcttatctctTTCAATCAAATAGTGATCTTGGTATGCCAGAAGTTGTGGTTTGTT
TCGTTTATACTCCACAAAACGTCTGTCTAGCTGTGTCATTTCTGATGCAAGAAGGAAGCTATCTGGGCCATGAAGAATTG
TGTTTCGCATCTTCCATTGTCCTTCAAAAATTTTCCATGTTTCCCCGATTAGCACCGTGGAGAGTTCGAAGGGgtctctt
ttcttctccattgtaccatcatcatatcgTCTGGGTGGTATCCACGTAGATTGTAGTGTTTATGCCCATTCCACATGATG
GAATCCCCGGAGAAGTGCATCACTACCGAGAGACTCTTGTCGCTCGATTGCTCGCAACGTATGTGAGCAGTGTAAAGCAT
ATGGATTCCGAGGGAGATGAACAAGTTTGCAAATCGCGTCA
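
Because a sequence may be wrapped over several lines, a reader has to join all lines between two “>” headers. The following is a minimal Python sketch of that idea, not a reference implementation; the function name `parse_fasta` and the two example records are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a dict mapping each header (without '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence line: wrapped lines are collected and joined later.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical input: one wrapped nucleotide record, one protein record.
example = """>seq1 hypothetical example record
ACGTAC
GTTT
>seq2
MKV
"""
records = parse_fasta(example.splitlines())
print(records)
```

For a file on disk, the same function can be called with an open file handle, since iterating over a text file yields its lines. This sketch keeps everything in memory and silently overwrites duplicate headers, so it is only suited to small files.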

Typical file extensions for fastA files are


FastQ

The fastq format is the de facto standard output of second-generation sequencing technologies such as Illumina sequencers. It is similar to the fasta format, but in addition to the sequence itself, a fastq file also stores a quality score for each position of the sequence. A fastq file stores every sequence in exactly 4 lines:

  1. The name/ID line starting with “@” followed by a sequence identifier
  2. The sequence itself
  3. A line starting with “+” (optionally followed by additional information, e.g., the read name again); it is a historical artifact that can be ignored nowadays
  4. The quality line, with one character per sequence residue, encoding the probability of a sequencing error at that position (Phred score)
@4e131bcf-f814-485a-b02f-3d133030b06e 
TTGTTATGCCGCTTCGTTCAGTTACGTATTGCTCGACGGTTCCACTTTGAACGTTTGCGTTCAAATACTATAACTAGTTTTGCTCTCGTTTTAATCTTCCCCGTCTCTCCCAAA
+
??????+(+*(%%#()&.67:58D8.10;01.4.8.)(*'(.-,2,?<79C97:?2()((,%()**())'&&IIICCCIIII**(%%#()&.67:58D8.10;01.4.8II?%?
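
The four-line layout and the quality encoding can be illustrated with a short Python sketch. It assumes the widely used Phred+33 encoding, in which the score of a character is its ASCII code minus 33; older files may use a different offset. It also uses the standard Phred relation Q = -10 * log10(P), so the error probability is P = 10^(-Q/10). The function names and the example record are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) for each 4-line record."""
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string; offset=33 is an assumption (Phred+33)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical record, not taken from a real sequencing run.
example = ["@read1", "ACGT", "+", "I?+5"]
records = list(parse_fastq(example))
scores = phred_scores(records[0][2])
print(records[0][0], scores, [error_probability(q) for q in scores])
```

Under this assumption, the character “?” corresponds to a Phred score of 30 (an error probability of 0.001) and “+” to a score of 10 (an error probability of 0.1). The sketch reads all lines into memory first, which is acceptable for small examples but not for large sequencing runs.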