Genome assembly using Flye
Another assembler that can be used for long reads, such as PacBio and Oxford Nanopore reads, is Flye. In contrast to the minimap and miniasm pipeline, Flye also produces a polished consensus sequence for the assembly, which significantly reduces the error rate (more about consensus sequences and polishing in the next practicals).
Change into the Flye directory in the assembler_practical folder and run flye on the raw basecalled reads:
flye --nano-raw \
~/course_data/precompiled/all_guppy.fastq \
--genome-size 1m --out-dir ./flye_output
As you can see, flye requires the input reads (--nano-raw), an output directory (--out-dir) and the expected size of the final assembly (--genome-size), which in this case is set to 1 megabase (1,000,000 bases). Flye writes several output files, including the assembly in FASTA format (assembly.fasta).
When Flye has finished, use assembly-stats to get a first overview of the assembly.
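Before moving on, it can help to confirm that the run actually produced its main files. The sketch below is only an illustration: check_flye_output is a hypothetical helper (not part of Flye), and the file names it looks for are an assumption based on what recent Flye versions usually write to the output directory.

```shell
# Hypothetical helper (not part of Flye): report which of the files that Flye
# commonly writes are present and non-empty in the given output directory.
# The list of file names is an assumption and may differ between Flye versions.
check_flye_output() {
    outdir="${1:-./flye_output}"
    for f in assembly.fasta assembly_info.txt assembly_graph.gfa flye.log; do
        if [ -s "$outdir/$f" ]; then
            echo "found    $f"
        else
            echo "missing  $f"
        fi
    done
}

# Example call once flye has finished:
#   check_flye_output ./flye_output
```

If assembly.fasta is reported as missing, the flye.log file in the same directory is usually the first place to look.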
- Does the assembly differ from the miniasm assembly, e.g., with respect to total length, number of contigs and length of the contigs?
Now align the flye assembly to the reference chromosome using dnadiff:
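To make the comparison with the miniasm assembly concrete, the same few numbers can be computed for both FASTA files. The following is a minimal sketch and not a replacement for assembly-stats: fasta_summary is a hypothetical helper that uses only awk and sort, and it reports the number of contigs, the total length, the longest contig and the N50 (the length of the contig at which the contigs, sorted from longest to shortest, first cover half of the total length).

```shell
# Hypothetical helper: summarise a FASTA file with awk and sort only.
fasta_summary() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header: emit length of previous contig
        { gsub(/[ \t\r]/, ""); len += length($0) }       # sequence line: add its length
        END { if (len > 0) print len }                   # emit length of the last contig
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }                   # lengths arrive longest first
        END {
            if (NR == 0) { print "no sequences found"; exit }
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs\t%d\ntotal_length\t%d\nlongest\t%d\nN50\t%d\n", NR, total, lens[1], n50
        }'
}

# Example call:
#   fasta_summary flye_output/assembly.fasta
```

Running the helper on both assemblies and placing the two outputs side by side gives a quick view of the differences asked about above.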
dnadiff -p flye_dnadiff ~/course_data/precompiled/chr17.fasta \
flye_output/assembly.fasta
Open the flye_dnadiff.report file (e.g. double-click on the file).
- How many contigs aligned with the reference? What is the error rate?
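The two numbers asked for above can also be pulled out of the report on the command line. This is a sketch under assumptions: both helpers are hypothetical, and they assume the usual dnadiff report layout, in which each row holds a label followed by a reference column and a query column (for example an "AlignedSeqs" row and an "AvgIdentity" row). Because the assembly was given as the second input to dnadiff, it is the query column here. The error rate is approximated as 100 minus the first AvgIdentity value in the report.

```shell
# Hypothetical helpers; they assume rows of the form "<label> <REF value> <QRY value>".

# Number (and percentage) of assembly contigs that aligned: query column of AlignedSeqs.
dnadiff_aligned_contigs() {
    awk '$1 == "AlignedSeqs" { print $3; exit }' "$1"
}

# Approximate error rate in percent: 100 minus the first AvgIdentity value.
dnadiff_error_rate() {
    awk '$1 == "AvgIdentity" { printf "%.2f\n", 100 - $2; exit }' "$1"
}

# Example calls:
#   dnadiff_aligned_contigs flye_dnadiff.report
#   dnadiff_error_rate flye_dnadiff.report
```

It is worth opening the report as well, since the mismatch and indel counts further down show what kind of errors contribute to this rate.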
Now upload the flye_dnadiff.delta to Assemblytics and inspect the dot plots.
- How many contigs align well with the reference?
- Is the Flye assembly more or less fragmented than the miniasm assembly? Why?
- Does the alignment differ from the reference, e.g., does the Flye assembly extend the start or stop of the reference? Are there inversions?
To take a closer look at the dot plot, you can zoom in by:
- clicking the name of a contig: this shows only the dot plot for this contig against the reference
- double-clicking on a specific region in the dot plot
To zoom out, simply right-click (click with two fingers on a touchpad).