Quality Control using FastQC
FastQC is one of the most common tools for quality control of sequencing data including Illumina, Ion Torrent, Oxford Nanopore and PacBio data. Input to FastQC is a sequencing file in (compressed) fastQ format containing reads and quality information.
fastqc on the command-line to open its graphical user interface and load the fastq file produced by guppy (either a single one or the other fastq files).
After loading the first tab will show you some basic statistics about your fastq file and raw sequences.
The second tab “Per Base sequence quality” shows mean and standard deviation of sequencing quality for each position in all reads of your data set. Low quality regions may have to be removed, low quality reads filtered.
The tab “Per Sequence Quality scores” shows the average quality score distribution of your nanopore reads.
The tab “Per base sequence contents” show the average ratio of As, Ts, Cs and Gs in your data set. ”Clean” data without sequencing should show almost parallel/flat lines for all four nucleotides. Trailing or leading peaks indicate indicate sequencing problems and may have to be trimmed.
Another useful feature of FastQC is the “Overrepresented sequences” feature. FastQC analysis the raw reads and tries to identify identical sub-sequences that are “over-represented” in the data. These could indicate PCR primers, sequencing adapters, multiplexing barcodes etc that are still contained in the data and will have to be trimmed/removed.
Load the provided illumina test data (illumina.fq.gz) located in the practical directory and check the “Overrepresented sequences” tab.