Basecalling using Guppy
Basecalling is the process of translating the raw electrical signal of the sequencer into bases, i.e., A, C, G and T. As with most bioinformatic tasks, there are many different tools for this problem. Here, we will focus only on the state-of-the-art basecaller Guppy, which is the current “official” ONT basecaller. Although basecalling can be performed “live”, i.e., in real time while sequencing, it is often useful to separate sequencing from basecalling. The basecaller can use significant amounts of compute and read/write resources. One advantage of “offline” basecalling is that this load no longer competes with the sequencing run, where it may slow the sequencing process and, in rare cases, even lead to loss of sequencing data.
Guppy is a neural-network-based basecaller that, in addition to basecalling, also filters out low-quality reads, clips Oxford Nanopore adapters and estimates methylation probabilities per base.
Guppy, like the now deprecated Albacore and all other basecallers, takes files in fast5 format as input.
First, change into the directory /course_data/practicals/basecalling_practical. It contains a sub-directory called fast5 with fast5 files of a recent MinION run.
Apart from the input fast5 files, Guppy requires a configuration file that sets the basecalling parameters depending on which flowcell and library preparation kit were used to produce the data. To list all supported flowcells and library kits, type:
guppy_basecaller --print_workflows
The provided library was prepared with the SQK-LSK108 library preparation kit and sequenced on a FLO-MIN106 flowcell.
To start the basecalling you can either specify the flowcell and the library kit using the parameters -f FLO-MIN106 and -k SQK-LSK108, or use the option -c with the name of the configuration file. For convenience we’ll use the option -c.
guppy_basecaller -i ./fast5 -s ./guppy_out -c dna_r9.4.1_450bps_hac.cfg \
--num_callers 2 --cpu_threads_per_caller 1
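The command above runs Guppy on the fast5 files in the input directory (option -i) and writes the output to the directory given with option -s. The options --num_callers and --cpu_threads_per_caller set how many basecallers run in parallel and how many CPU threads each of them may use.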
The above command will run for several hours! Therefore, please stop Guppy now by pressing Ctrl-c (hold the control key and then press ‘c’) and copy the folder guppy_output from the directory course_data/precompiled into the folder basecalling_practical.
Before copying, make sure you know:
- what directory you are in (can check with command pwd)
- where the directory is that you want to copy
- what the correct path for the destination is
To copy a directory, use the following command:
cp -R SOURCE_TO_COPY DESTINATION
where
- SOURCE_TO_COPY = the directory you want to copy
- DESTINATION = the folder/directory you want to copy it to
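As a minimal sketch of this copy pattern, the commands below build a throwaway directory tree and copy a folder into it. All names under the temporary directory (precompiled, basecalling_practical, example.fastq) are placeholders chosen for illustration and are not the real course paths.

```shell
# Create a scratch area so the example does not touch real course data.
# The directory and file names below are hypothetical placeholders.
workdir=$(mktemp -d)
mkdir -p "$workdir/precompiled/guppy_output" "$workdir/basecalling_practical"
touch "$workdir/precompiled/guppy_output/example.fastq"

# Check which directory you are currently in.
pwd

# Copy the directory and everything inside it (-R = recursive).
# Because the destination already exists, the source is placed inside it.
cp -R "$workdir/precompiled/guppy_output" "$workdir/basecalling_practical"

# Confirm that the copy arrived at the destination.
ls "$workdir/basecalling_practical/guppy_output"
```

In the practical itself, SOURCE_TO_COPY and DESTINATION would be the guppy_output folder and the basecalling_practical folder described above.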
Change into the directory guppy_output and have a look at what is in there. As you can see, there are several .fastq files with the basecalled reads, one or more .log files that contain log messages from Guppy, as well as a sequencing_summary.txt file.
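The sequencing_summary.txt file is a table with one row per read, so simple per-run statistics can be pulled out with standard tools. The sketch below is an illustration only: it builds a tiny stand-in file rather than reading real Guppy output, and it assumes the file is tab-separated with a column named sequence_length_template. Check the header of your own file with head -n 1 sequencing_summary.txt before relying on that name.

```shell
# Build a tiny stand-in for sequencing_summary.txt (made-up values).
# ASSUMPTION: tab-separated, with a column called sequence_length_template.
summary=$(mktemp)
printf 'read_id\tsequence_length_template\tmean_qscore_template\n' > "$summary"
printf 'read1\t1000\t10.5\n' >> "$summary"
printf 'read2\t3000\t12.5\n' >> "$summary"

# Find the length column by its header name, then report the number of
# reads and their mean length.
stats=$(awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "sequence_length_template") col = i; next }
    { n++; total += $col }
    END { printf "reads=%d mean_length=%.1f", n, total / n }
' "$summary")
echo "$stats"
```

Looking the column up by name rather than by position keeps the command working if the column order differs between Guppy versions.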
It is often useful to concatenate all the different fastq files into one big file for downstream analyses. From within the guppy_output directory use the cat command to do this:
cat *.fastq > all_guppy.fastq
This will create one fastq file called all_guppy.fastq with all your basecalled reads in it.
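A quick sanity check after concatenating is to count the reads in the combined file. The sketch below uses made-up reads in a scratch directory (the file names part_0.fastq and part_1.fastq are hypothetical) and assumes each read occupies exactly four lines, so the read count is the line count divided by four.

```shell
# Create two small FASTQ files in a scratch directory (made-up reads).
fastq_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$fastq_dir/part_0.fastq"
printf '@read2\nGGCC\n+\nIIII\n@read3\nTTAA\n+\nIIII\n' > "$fastq_dir/part_1.fastq"

# Concatenate all FASTQ files into one, as in the step above.
cat "$fastq_dir"/*.fastq > "$fastq_dir/all_guppy.fastq"

# ASSUMPTION: four lines per read (header, sequence, '+', qualities).
n_lines=$(wc -l < "$fastq_dir/all_guppy.fastq")
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```

Note that once all_guppy.fastq exists, it also matches *.fastq, so if you need to rerun the concatenation it is safest to delete the combined file first.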