QUAST stands for QUality ASsessment Tool.
The tool evaluates genome assemblies by computing various metrics. This document provides instructions
for the general QUAST tool for genome assemblies, MetaQUAST, the extension for metagenomic datasets,
and Icarus, an interactive visualizer for these tools.
You can find all project news and the latest version of the tool at http://quast.sf.net/.
QUAST utilizes E-MEM (an improvement over MUMmer), GeneMarkS, GeneMark-ES, GlimmerHMM, GAGE and Gnuplot. In addition, MetaQUAST uses MetaGeneMark, Krona tools, BLAST, and the SILVA 16S rRNA database. Starting from version 3.2, the QUAST package includes read-processing tools for finding structural variants between the reference genome and the actual organism. These tools are BWA, Sambamba, and Manta. We also use bedtools for calculating raw and physical read coverage, which is shown in the Icarus contig alignment viewer.
All tools above are built into the QUAST package, which is ready for use by academic, non-profit institutions and U.S. Government agencies. If you are not in one of these categories, please refer to LICENSE section 'Third-party tools incorporated into QUAST' for guidelines on how to complete the licensing process.
Version 4.6.1 of QUAST was released under GPL v2 on 6 December 2017. Note that some of the built-in third-party tools are not under GPL v2. See LICENSE for details.
QUAST can be run on Linux or macOS (OS X).
Its default pipeline requires:
In addition, QUAST submodules require:
ar, so you will have to install Xcode (or only Command Line Tools for Xcode) to make them available.
sudo apt-get install -y pkg-config libfreetype6-dev libpng-dev python-matplotlib
wget https://downloads.sourceforge.net/project/quast/quast-4.6.1.tar.gz
tar -xzf quast-4.6.1.tar.gz
cd quast-4.6.1
QUAST automatically compiles all its sub-parts when needed (on the first use).
Thus, installation is not required. However, if you want to precompile everything and add quast.py to your
PATH, you may choose either:
Basic installation (about 120 MB):
Full installation (about 540 MB, additionally includes (1) tools for SV detection based on read pairs, which is used for more precise misassembly detection, and (2) tools/data for reference genome detection in metagenomic datasets):
./setup.py install_full
The default installation location is /usr/local/bin/ for the executable scripts, and /usr/local/lib/ for the python modules and auxiliary files. If you are getting a permission error during the installation, consider running with sudo, or creating a virtual python environment and installing into it. Alternatively, you may use old-style installation scripts (./install_full.sh), which build the QUAST package in place.
./quast.py test_data/contigs_1.fasta \
    test_data/contigs_2.fasta \
    -R test_data/reference.fasta.gz \
    -G test_data/genes.gff
View the summary of the evaluation results with the less utility:
The test_data directory contains examples of assemblies, reference genomes, gene and operon annotations, and raw reads files.
The tool accepts assemblies and reference genomes in FASTA format. Files may be compressed with zip, gzip, or bzip2.
A reference genome with multiple chromosomes can be provided as a single FASTA file with a separate sequence for each chromosome inside.
Maximum total assembly length is 4.29 Gbp.
Maximum length of a reference sequence (e.g. a chromosome) is 536 Mbp. The number of sequences in a reference file is not limited.
Genes and operons
One can also specify files with gene and operon positions in the reference genome. QUAST will count fully and partially aligned regions, and output total values as well as cumulative plots.
The following file formats are supported:
GAGE is an assessment tool used in the well-known homonymous evaluation study (Salzberg et al., 2011). However, it has several important limitations:
--gage). QUAST filters contigs according to a specified threshold and runs GAGE on each assembly. GAGE statistics (see GAGE website and GAGE paper for the descriptions) are reported in addition to standard QUAST report (saved in
python quast.py [options] <contig_file(s)>
Options:
The metaquast.py script accepts multiple reference genomes.
One can provide several files or directories with multiple reference files inside with -R. The -R option may be specified multiple times, or all references may be specified as a comma-separated list (without spaces!) with a single -R option beforehand. Another way is to use
python metaquast.py contigs_1 contigs_2 ... -R reference_1,reference_2,reference_3,...
The tool partitions all contigs into groups aligned to each reference genome.
Note that a contig may belong to several groups simultaneously if it aligns to several references.
MetaQUAST runs quast.py for each of the following:
--ambiguity-usage 'all' when running
the combined reference unless
--unique-mapping is specified.
For gene prediction (--gene-finding), MetaQUAST uses the MetaGeneMark software.
If you run MetaQUAST without providing reference genomes, the tool will try to identify genome content of the metagenome.
MetaQUAST uses BLASTN for aligning contigs to the SILVA 16S rRNA database, i.e. a FASTA file containing small subunit ribosomal RNA sequences.
For each assembly, 50 reference genomes with top scores are chosen.
Maximum number of references to download can be specified with
Reference genomes for the chosen genomes are downloaded from the NCBI database to
After that, MetaQUAST runs quast.py on all of them, removes reference genomes with low genome fraction (less than 10%), and proceeds with the usual MetaQUAST analysis using the remaining references.
In addition to standard QUAST options,
metaquast.py also accepts:
If an output path is not specified manually (with -o), QUAST writes its output into the quast_results/results_<DATE> directory and creates a latest symlink to it under
QUAST output contains:
|report.txt||assessment summary in plain text format,|
|report.tsv||tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc),|
|report.tex||LaTeX version of the summary,|
|icarus.html||Icarus main menu with links to interactive viewers. See section 3.4 for details,|
|report.pdf||all other plots combined with all tables (file is created if matplotlib python library is installed),|
|report.html||HTML version of the report with interactive plots inside,|
|misassemblies_report||detailed report on misassemblies. See section 3.1.2 for details,|
|unaligned_report||detailed report on unaligned and partially unaligned contigs. See section 3.1.3 for details.|
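Because report.tsv is tab-separated, it is easy to load into a script. The sketch below is a minimal example, assuming that the first row holds the word "Assembly" followed by one column per assembly and that every other row holds a metric name followed by one value per assembly. That layout, the function name read_quast_tsv, and the numbers in the toy input are assumptions made for illustration; check them against your own report.

```python
import csv
import io


def read_quast_tsv(handle):
    """Return {assembly_name: {metric_name: value_string}} from a TSV report.

    Assumed layout: a header row naming the assemblies, then one row per metric.
    """
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    assemblies = header[1:]
    table = {name: {} for name in assemblies}
    for row in body:
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            table[name][metric] = value
    return table


# Toy input with invented numbers, laid out as assumed above.
sample = (
    "Assembly\tcontigs_1\tcontigs_2\n"
    "# contigs\t3\t11\n"
    "Total length\t6710\t6650\n"
)
report = read_quast_tsv(io.StringIO(sample))
print(report["contigs_1"]["# contigs"])
```

For a real run, open the report.tsv file from the QUAST output directory and pass the file handle instead of the toy string. Values are kept as strings because some metrics are not plain numbers (for example "X + Y part").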
# contigs (≥ x bp) is the total number of contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).
Total length (≥ x bp) is the total number of bases in contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).
All remaining metrics are computed for contigs that exceed the threshold specified with --min-contig (see section 2.4, default is 500 bp).
# contigs is the total number of contigs in the assembly.
Reference length is the total number of bases in the reference genome.
Reference GC (%) is the percentage of G and C nucleotides in the reference genome.
L50 (L75, LG50, LG75) is the number of contigs equal to or longer than N50 (N75, NG50, NG75).
In other words, L50, for example, is the minimal number of contigs that cover half the assembly.
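The relation between Nx and Lx can be illustrated with a short sketch. It uses the "minimal number of contigs" reading of Lx given above and the usual definition of Nx as the length of the contig at which the running total first reaches x % of the total. The function name nx_and_lx and the optional total argument are hypothetical and are not part of QUAST. Passing a reference length as total gives NGx and LGx under the same assumptions.

```python
def nx_and_lx(contig_lengths, x=50, total=None):
    """Return (Nx, Lx) for a list of contig lengths.

    Contigs are taken from longest to shortest until they cover x % of
    `total` (the assembly length by default). Nx is the length of the last
    contig taken and Lx is how many contigs were taken.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if total is None:
        total = sum(lengths)
    threshold = total * x / 100.0
    covered = 0
    for count, length in enumerate(lengths, start=1):
        covered += length
        if covered >= threshold:
            return length, count
    # The contigs never reach the threshold (possible when `total` is a
    # reference length larger than the assembly).
    return None, None


lengths = [80, 70, 50, 40, 30, 20, 10]  # toy assembly, 300 bp in total
print(nx_and_lx(lengths, x=50))
print(nx_and_lx(lengths, x=75))
```

In the toy assembly the two longest contigs (80 + 70 bp) already cover half of the 300 bp, so N50 is 70 and L50 is 2.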
--extensive-mis-size. See more details about misassemblies in section 3.1.2. Important note: this metric does not sum up
# local misassemblies, # scaffold gap size misassemblies, # structural variants, and # unaligned mis. contigs described below.
# misassembled contigs is the number of contigs that contain misassembly events
# misassemblies above).
Misassembled contigs length is the total number of bases in misassembled contigs.
# local misassemblies is the number of positions in the contigs (breakpoints) that satisfy the following conditions:
# scaffold gap size misassemblies is the number of positions in the scaffolds (breakpoints) where the flanking sequences are combined in the scaffold at the wrong distance (
The maximum allowed distance inconsistency is controlled by the --scaffold-gap-max-size option (default is 10 kbp).
# unaligned mis. contigs is the number of contigs in which more than 50% of the bases are unaligned and whose aligned fragment contains at least one misassembly event. Such contigs are probably not related to the reference genome, so their misassemblies may not be real errors but differences between the assembled organism and the reference.
# unaligned contigs is the number of contigs that have no alignment to the
reference sequence. The value "X + Y part" means X totally unaligned contigs plus Y partially unaligned contigs.
This metric sums up
# unaligned mis. contigs described above.
Unaligned length is the total length of all unaligned regions in the assembly (sum of lengths of fully unaligned contigs and unaligned parts of partially unaligned ones).
Genome fraction (%) is the percentage of aligned bases in the reference genome.
A base in the reference genome is aligned if there is at least one contig with at least one alignment to this base.
Contigs from repetitive regions may map to multiple places, and thus may be counted multiple times (see
Duplication ratio is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome (see Genome fraction (%) for the 'aligned base' definition). If the assembly contains many contigs that cover the same regions of the reference, its duplication ratio may be much larger than 1. This may occur due to overestimating repeat multiplicities and due to small overlaps between contigs, among other reasons.
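The two definitions above can be turned into a small worked example. The sketch below assumes a single reference sequence and alignments given as (ref_start, ref_end, aligned_contig_bases) tuples with 1-based inclusive coordinates. This input format and the function names are hypothetical and do not describe how QUAST stores alignments internally.

```python
def covered_length(intervals):
    """Number of reference positions covered by at least one closed interval."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # A new, disjoint stretch starts here: close the previous one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


def genome_fraction_and_duplication(alignments, reference_length):
    """Return (genome fraction in %, duplication ratio)."""
    alignments = list(alignments)
    aligned_ref = covered_length((s, e) for s, e, _ in alignments)
    aligned_asm = sum(n for _, _, n in alignments)
    fraction = 100.0 * aligned_ref / reference_length
    ratio = float(aligned_asm) / aligned_ref if aligned_ref else None
    return fraction, ratio


# Two overlapping alignments and one separate alignment on a 2000 bp reference.
toy = [(1, 600, 600), (401, 1000, 600), (1501, 1800, 300)]
fraction, ratio = genome_fraction_and_duplication(toy, 2000)
print(fraction, ratio)
```

Here 1300 reference bases are covered (65 % of the reference), while the assembly contributes 1500 aligned bases, so the duplication ratio is 1500 / 1300, slightly above 1 because of the 200 bp overlap.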
# N's per 100 kbp is the average number of uncalled bases (N's) per 100000 assembly bases.
# mismatches per 100 kbp is the average number of mismatches per 100000 aligned bases. True SNPs and sequencing errors are not distinguished and are counted equally.
# indels per 100 kbp is the average number of indels per 100000 aligned bases. Several consecutive single nucleotide indels are counted as one indel.
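The three "per 100 kbp" metrics are plain normalizations. A minimal sketch, with a hypothetical helper name and invented counts:

```python
def per_100_kbp(event_count, base_count):
    """Average number of events per 100000 bases (None if there are no bases)."""
    if base_count == 0:
        return None
    return event_count * 100000.0 / base_count


# 25 mismatches over 500000 aligned bases.
print(per_100_kbp(25, 500000))
```

For # N's per 100 kbp the denominator is the number of assembly bases, while for mismatches and indels it is the number of aligned bases, as stated in the definitions above.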
# genes is the number of genes in the assembly (complete and partial), based on a user-provided list of gene positions in the reference genome. A gene is 'partially covered' if the assembly contains at least 100 bp of this gene but not the whole gene.
This metric is computed only if a reference genome and an annotated list of gene positions are provided (see section 2.4).
# operons is defined similarly to # genes, but an operon positions file is required instead.
# predicted genes is the number of genes in the assembly
found by GeneMarkS, GeneMark-ES, MetaGeneMark, or GlimmerHMM. See the description of
--gene-finding option for details.
Total aligned length is the total number of aligned bases in the assembly. This value is usually smaller than the total length because some of the contigs may be unaligned or partially unaligned.
Largest alignment is the length of the largest continuous alignment in the assembly. This value can be smaller than the largest contig if the largest contig is misassembled or partially unaligned.
NA50, NGA50, NA75, NGA75, LA50, LA75, LGA50, LGA75 ("A" stands for "aligned") are similar to
the corresponding metrics without "A", but in this case aligned blocks instead of contigs are considered.
Aligned blocks are obtained by breaking contigs at misassembly events and removing all unaligned bases.
# misassemblies is the same as # misassemblies from section 3.1.1. However, this report also contains a classification of all misassembly events into three groups: relocations, translocations, and inversions (see below). For metagenomic assemblies, this classification also includes interspecies translocation.
Relocation is a misassembly event (breakpoint) where the left flanking sequence aligns over 1 kbp away from the right flanking
sequence on the reference genome, or they overlap by more than 1 kbp, and both flanking sequences align on the same chromosome. Note that default threshold of 1 kbp can be
Translocation is a misassembly event (breakpoint) where the flanking sequences align on different chromosomes.
Interspecies translocation is a misassembly event (breakpoint) where the flanking sequences align on different reference genomes (MetaQUAST only).
Inversion is a misassembly event (breakpoint) where the flanking sequences align on opposite strands of the same chromosome.
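The definitions above can be summarized as a small decision procedure. The sketch below is a simplified illustration and not QUAST's actual implementation: the Flank record, the function name, and the order in which the checks are applied are assumptions. It looks only at reference coordinates, so it ignores the distance between the two flanks inside the contig.

```python
from collections import namedtuple

# Hypothetical record for one flanking alignment: reference sequence name,
# 1-based inclusive reference coordinates (ref_start <= ref_end), and strand.
Flank = namedtuple("Flank", "chrom ref_start ref_end strand")


def classify_breakpoint(left, right, threshold=1000):
    """Return 'translocation', 'inversion', 'relocation', or None."""
    if left.chrom != right.chrom:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    overlap = min(left.ref_end, right.ref_end) - max(left.ref_start, right.ref_start) + 1
    if overlap > 0:
        # The flanks overlap on the reference.
        distance = overlap
    else:
        # The flanks are separated by a gap on the reference.
        distance = max(left.ref_start, right.ref_start) - min(left.ref_end, right.ref_end) - 1
    return "relocation" if distance > threshold else None


a = Flank("chr1", 1, 1000, "+")
print(classify_breakpoint(a, Flank("chr1", 5000, 6000, "+")))
print(classify_breakpoint(a, Flank("chr2", 1, 900, "+")))
print(classify_breakpoint(a, Flank("chr1", 1100, 2000, "-")))
```

A breakpoint whose flanks lie on the same chromosome and strand and are within the threshold of each other returns None here. Depending on its size, such an event may fall into one of the categories that are counted separately from # misassemblies.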
# misassembled contigs and misassembled contigs length are the same as the metrics from section 3.1.1 and are counted among all contigs with any type of a misassembly event described above (relocation, translocation, interspecies translocation or inversion).
# possibly misassembled contigs is the number of contigs that contain a large unaligned fragment and thus could possibly contain an interspecies translocation with an unknown reference (MetaQUAST only, combined reference only). The minimal length of the consecutive unaligned fragment (excluding N's) is controlled by --unaligned-part-size; the default value is 500 bp.
# possible misassemblies is the number of putative interspecies translocations in possibly misassembled contigs if each large unaligned fragment is supposed to be a fragment of unknown reference (MetaQUAST only, combined reference only).
The following metrics are the same as the homonymous metrics from section 3.1.1. Note that all of them are excluded from # misassemblies and related metrics:
# mismatches is the number of mismatches in all aligned bases.
# indels is the number of indels in all aligned bases. Several consecutive single nucleotide indels are counted as one indel. Note: default maximum length of indel is 85 bp. All indels larger than 85 bp are considered as local misassemblies.
# indels (≤ 5 bp) is the number of indels of length ≤ 5 bp.
# indels (> 5 bp) is the number of indels of length > 5 bp.
Indels length is the total number of bases contained in all indels.
# fully unaligned contigs is the number of contigs that have no alignment to the reference sequence.
Fully unaligned length is the total number of bases in all unaligned contigs.
# partially unaligned contigs is the number of contigs that are not fully unaligned (i.e. have at least one alignment), but have at least one unaligned fragment ≥ the threshold defined by --unaligned-part-size (default value is 500 bp).
Partially unaligned length is the total number of unaligned bases in all partially unaligned contigs.
# N's is the total number of uncalled bases (N's) in the assembly.
This section describes PDF and HTML plots. For Icarus interactive contig alignment and size visualization see section 3.4.
Cumulative length plot shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to smallest. The y-axis gives the size of the x largest contigs in the assembly.
Nx plot shows Nx values as x varies from 0 to 100 %.
NGx plot shows NGx values as x varies from 0 to 100 %.
GC content plot shows the distribution of GC content in the contigs.
The x value is the GC percentage (0 to 100 %).
The y value is the number of non-overlapping 100 bp windows whose GC content equals x %.
For a single genome, the distribution is typically Gaussian. However, for assemblies with contaminants, the GC distribution appears to be a superposition of Gaussian distributions, giving a plot with multiple peaks.
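The data behind this plot can be sketched in a few lines. The example below counts non-overlapping 100 bp windows per GC percentage for a single sequence. How QUAST treats N's and a trailing window shorter than 100 bp is not stated here, so the choices in this sketch (N's count as non-GC, the incomplete last window is dropped) are assumptions, as is the function name.

```python
from collections import Counter


def gc_window_histogram(sequence, window=100):
    """Map GC percentage (rounded to an integer) to the number of windows."""
    counts = Counter()
    seq = sequence.upper()
    # Step through full, non-overlapping windows only.
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        counts[int(round(100.0 * gc / window))] += 1
    return counts


# Three full windows (100 %, 0 % and 25 % GC) plus a 40 bp tail that is dropped.
toy = "GC" * 50 + "AT" * 50 + "G" * 25 + "A" * 75 + "ACGT" * 10
hist = gc_window_histogram(toy)
print(sorted(hist.items()))
```

Summing such histograms over all contigs of an assembly gives the y values of the plot. Two well-separated peaks would be the multi-peak pattern described above.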
GC content plot (by contigs) shows the distribution of # contigs with GC percentage in a certain range.
The x value is the GC percentage intervals (width is 5 %).
The y value is the number of contigs whose GC content lies in the corresponding interval.
These plots are particularly useful for metagenome assemblies, but can also be a good indicator of contamination in single-organism assemblies.
Coverage histogram shows distribution of total contig lengths (y-axis) at different read coverage depths (x-axis, grouped in bins). Coverage bin size is automatically selected based on the number of contigs and coverage deviation.
Note: these histograms are only available for assemblies with SPAdes/Velvet-like contig naming style (..._length_X_cov_Y_...).
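To check whether your contig names follow this style, a regular expression is enough. The pattern and the function name below are a hypothetical sketch that looks for the _length_X_cov_Y fragment. QUAST's own matching rules may differ in detail.

```python
import re

# Matches "..._length_<integer>_cov_<number>..." anywhere in a contig name.
NAME_RE = re.compile(r"_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_length_cov(contig_name):
    """Return (length, coverage) parsed from the name, or None if absent."""
    match = NAME_RE.search(contig_name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


print(parse_length_cov("NODE_1_length_140518_cov_23.75_ID_1"))
print(parse_length_cov("contig_15"))
```

Contigs for which the function returns None carry no coverage information in their names, so they could not contribute to a coverage histogram.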
Cumulative length plot for aligned contigs shows the growth of lengths of aligned blocks.
If a contig has a misassembly event, QUAST breaks it into smaller pieces called aligned blocks.
On the x-axis, blocks are ordered from the largest to smallest. The y-axis gives the size of the x largest aligned blocks.
This plot is created only if a reference genome is provided.
NAx and NGAx plots
These plots are similar to the Nx and NGx plots but for the NAx and NGAx metrics respectively. These plots are created only if a reference genome is provided.
Genes plot shows the growth rate of full genes in assemblies.
The y-axis is the number of full genes in the assembly, and the x-axis is the number of contigs in the assembly (from the largest one to the smallest one).
This plot is created only if a reference genome and a gene annotation file are given.
Operons plot is similar to the previous one but for operons.
Feature-Response Curves (FRCurves)
Our FRCurves are inspired by the AMOS FRCurve definition: Given any such set of features, the response (quality) of the assembler output is then analyzed as a function of the maximum number of possible errors (features) allowed in the contigs. We plot FRCurves as follows:
The x value (Feature space) is the total maximum number of features in the contigs.
The y value (Genome coverage %) is the total number of aligned bases in the contigs, divided by the reference length.
Note: since some contigs may overlap with other contigs or map to the same regions of the reference (see Duplication ratio), the total number of aligned bases may exceed the reference length and cause Genome coverage % larger than 100%.
FRCurves plots are currently available for # misassemblies (both PDF and interactive HTML formats) and # genes/operons (PDF version only). We probably will extend this set in the future.
MUMmer plots are alignment dotplots where a sequence is laid out on each axis and a point is plotted at every position where the two sequences show similarity. See mummerplot utility description for more details.
Note: these plots are saved to
as gnuplot requires additional dependencies for outputting in PDF/PNG and many other image formats. However,
this type of plot is suppressed by the
--no-plots option (like other plots)
and NOT suppressed by
Output for combined reference genome is located inside
combined_reference subdirectory of the output directory provided with
-o (or in quast_results/latest).
An output for each reference genome is placed into separate directory inside
Also, plots and reports for key metrics are saved under
Combined HTML report is saved to
These plots are created for each key metric to show its values for all assemblies vs all reference genomes. References on the plot are sorted by the mean value of this metric in all assemblies. References are always sorted from the best results to the worst ones, so the plot can be descending or ascending depending on the metric.
Metric-level reports (TXT, TSV and TEX versions)
These files contain the same information as the metric-level plots, but in different formats: simple text format, tab-separated format, and LaTeX.
Summary HTML-report is created on the basis of HTML-report in
combined_quast_output/. Each row is expandable and contains data for all reference genomes.
You can view results separately for each reference genome by clicking on a row preceded by plus sign:
Note that values for some metrics like # contigs may not sum up, because one contig may be aligned to multiple reference genomes.
Krona pie charts show assemblies and dataset taxonomic profiles. Relative species abundance is calculated based on the total aligned length of contigs aligned to corresponding reference genome. Charts are created for each assembly and one additional chart is created for all assemblies altogether.
Note: these plots are created only in de novo evaluation mode (MetaQUAST without reference genomes).
Icarus generates a contig size viewer and one or more contig alignment viewers (if one or more reference genomes are provided).
All of them are located in
<quast_output_dir>/icarus_viewers/. The links to the viewers and other auxiliary
information are provided in Icarus main menu which is saved in
<quast_output_dir>/icarus.html. Note that
QUAST HTML report also contains a link to Icarus output.
All Icarus viewers contain a legend describing the color scheme. To move and zoom the interactive window, you can use the mouse, the Icarus controls (top panel), or keyboard shortcuts (+, -, ←, →; use Shift to speed up the action).
Contig size viewer
This type of viewer draws contigs ordered from longest to shortest. This ordering is suitable for comparing only largest contigs or number of contigs longer than a specific threshold. The viewer shows N50 and N75 with color and textual indication. If the reference genome is available or at least approximate genome length is known (see
NG50 and NG75 are also shown.
You can also tone down contigs shorter than a specified threshold using Icarus control panel.
Contig alignment viewer
This type of viewer is available only if a reference genome is provided. For large genomes (≥ 50 Mbp) each chromosome is displayed in a separate viewer. This is also true for multiple reference genomes (see section 2.5).
The viewer places contigs according to their mapping to the reference genome. The viewer can additionally visualize genes, operons, and read coverage distribution along the genome, if any of those were fed to QUAST.
Note: We recommend using Icarus in Chrome; however, it was tested in other popular web browsers as well (see FAQ, Q9 for the exact list with versions).
You can easily change content, order of metrics, and metric names in all QUAST reports. In order to do this,
CONFIGURABLE PARAMETERS section in
quast_libs/reporting.py. It contains a lot of informative comments,
which will help you to adjust QUAST reports easily even if you are new to Python.
You can also adjust plot colors, style and width of lines, legend font, etc.
CONFIGURABLE PARAMETERS section in
Note: if you restart QUAST on the same directory with new parameters, it will reuse existing alignments and run much faster.
See the description of
-o option in section 2.4.
We will be thankful if you help us make QUAST better by sending your comments, bug reports, and suggestions to firstname.lastname@example.org.
We kindly ask you to attach the
quast.log file from the output directory (or an entire archive of the folder) if you have trouble running QUAST.
Note that if you didn't specify the output directory manually, it is going to be automatically set to
quast_results/results_<date_time>, with a symbolic link
quast_results/latest to that directory.
This section contains frequently asked questions about QUAST. Read the answers below for a deeper understanding of the results generated by the tool.
For the simplicity of explanation we further refer to the directory containing all results as
If you use the command-line version of QUAST you can specify
<quast_output_dir> with -o option (
"quast_results/latest" if not specified).
If you use http://quast.bioinf.spbau.ru/ you should download full report by pressing
"Download report" button (at the top-right corner),
decompress result and go to
--scaffolds option but got counter-intuitive results. The number of misassemblies in the "broken" version of the assembly is higher than in the original one, while I expected the opposite or at most similar numbers of misassemblies. Could you explain this?
Q1. It seems that QUAST is giving me a differing number of misassemblies and misassembled contigs. Does this imply that QUAST looks for multiple misassemblies within one contig?
Yes, you are right, QUAST looks for multiple misassembly events within one contig. Thus, the number of misassembled contigs is always less than or equal to the number of misassemblies.
Q2. Is there a way to get only misassembled contigs of the assembly?
Yes, there is.
QUAST copies all misassembled contigs of
"<assembly_name>" assembly into
E.g. if your assembly is "contigs.fasta" then the file is "contigs.mis_contigs.fa", if your assembly is "ecoli_assembly_1.fa.gz" then the file is "ecoli_assembly_1.mis_contigs.fa".
Q3. Is it possible to find which misassembly corresponds to each contig and which kind of a misassembly event it is?
Yes, it is possible. QUAST produces a detailed report on the alignments of each contig and a short version containing only the extensive misassembly records.
Let's start with the short one. It is saved to
E.g. if your assembly is "contigs.fasta" then the file is "contigs_report_contigs.mis_contigs.info",
if your assembly is "ecoli_assembly_1.fasta" then the file is "contigs_report_ecoli_assembly_1.mis_contigs.info".
The content of this file looks like this:
Extensive misassembly ( inversion ) between 287 575 and 296 1
Extensive misassembly ( relocation, inconsistency = 2655 ) between 16800 18907 and 18905 20382
Let's move to the detailed report. Here you can find information about all misassembled, unaligned and correctly aligned contigs. This report is saved to
<quast_output_dir>/contigs_reports/contigs_report_<assembly_name>.stdout file. E.g. if your assembly is "contigs.fasta" then the file is "contigs_report_contigs.stdout", if your assembly is "ecoli_assembly_1.fasta" then the file is "contigs_report_ecoli_assembly_1.stdout".
To get info about misassemblies, search the report for the words "Extensive misassembly" and look at the surrounding lines to find the name of the contig that corresponds to this misassembly.
Look at the following example:
CONTIG: NODE_772 (575bp)
Top Length: 296 Top ID: 100.0
Skipping redundant alignment 1096745 1096882 | 138 1 | 138 138 | 98.55 | Escherichia_coli NODE_772
This contig is misassembled. 3 total aligns.
Real Alignment 1: 924846 925134 | 287 575 | 289 289 | 100.0 | Escherichia_coli NODE_772
Extensive misassembly ( inversion ) between these two alignments
Real Alignment 2: 924906 925201 | 296 1 | 296 296 | 100.0 | Escherichia_coli NODE_772
Here is another example:
CONTIG: Contig_753 (140518bp)
Top Length: 121089 Top ID: 99.98
Skipping redundant alignments after choosing the best set of alignments
Skipping redundant alignment 273398 273468 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
Skipping redundant alignment 3363797 3363867 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
This contig is misassembled. 14 total aligns.
Real Alignment 1: 1425621 1426074 | 19431 18978 | 454 454 | 100.0 | Escherichia_coli Contig_753
Gap between these two alignments (local misassembly). Inconsistency = 148
Real Alignment 2: 1426295 1426818 | 18905 18382 | 524 524 | 100.0 | Escherichia_coli Contig_753
Extensive misassembly ( relocation, inconsistency = 2224055 ) between these two alignments
Real Alignment 3: 3650278 3650348 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
Extensive misassembly ( relocation, inconsistency = 236807 ) between these two alignments
Real Alignment 4: 3765544 3886652 | 140518 19430 | 121109 121089 | 99.98 | Escherichia_coli Contig_753
Extensive misassembly ( relocation, inconsistency = -1052 ) between these two alignments
Real Alignment 5: 3886649 3905037 | 18381 1 | 18389 18381 | 99.96 | Escherichia_coli Contig_753
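Looking around each "Extensive misassembly" line by hand becomes tedious for large reports. The sketch below groups misassembly types by contig, assuming the report layout matches the two examples above: a "CONTIG: <name>" line opens each contig and every event line contains "Extensive misassembly ( <type> ...". The function name and regular expressions are hypothetical and are not part of QUAST.

```python
import re
from collections import defaultdict

CONTIG_RE = re.compile(r"^\s*CONTIG:\s+(\S+)")
EVENT_RE = re.compile(r"Extensive misassembly \(\s*([^,)]+?)\s*[,)]")


def misassemblies_by_contig(lines):
    """Return {contig_name: [misassembly_type, ...]} in order of appearance."""
    events = defaultdict(list)
    current = None
    for line in lines:
        contig = CONTIG_RE.match(line)
        if contig:
            current = contig.group(1)
            continue
        event = EVENT_RE.search(line)
        if event and current is not None:
            events[current].append(event.group(1))
    return dict(events)


# Lines taken from the two examples above (alignment lines omitted).
sample = [
    "CONTIG: NODE_772 (575bp)",
    "This contig is misassembled. 3 total aligns.",
    "Extensive misassembly ( inversion ) between these two alignments",
    "CONTIG: Contig_753 (140518bp)",
    "Gap between these two alignments (local misassembly). Inconsistency = 148",
    "Extensive misassembly ( relocation, inconsistency = 2224055 ) between these two alignments",
    "Extensive misassembly ( relocation, inconsistency = -1052 ) between these two alignments",
]
found = misassemblies_by_contig(sample)
print(found)
```

To process a real report, pass an open file handle for the .stdout file instead of the list. Local misassemblies are skipped on purpose, since their lines do not contain the words "Extensive misassembly".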
Q4. Could you explain the format of Real Alignments in contigs report files (see the answer for Q3)?
Yes, sure. Let's look at the following example:
Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | ENA|U00096|U00096.2_Escherichia_coli contig-710
The next two numbers (in this case: 718 718) mean "the number of aligned bases on the target" and "the number of aligned bases on the query". They are usually equal to each other but they can be slightly different because of short insertions and deletions. Actually, these numbers are redundant because they can be easily calculated from the first two pairs of numbers (positions on the target and positions on the query). However, sometimes it is convenient to look at these numbers.
The last number (in this case: 100.0) is the aligner quality metric. It is called "identity %" (IDY %) and it describes the quality of the alignment (the number of mismatches and indels between the target and the query). If IDY% = 100.0 then the alignment is perfect, i.e. all bases on the target and on the query are equal to each other. If IDY% is less than 100.0 then the target and the query are slightly different. QUAST has a threshold on IDY% which is 95%. Thus we don't use alignments with IDY% less than 95% (they are considered to be relatively bad).
And finally, the last two columns are the name of the target sequence (i.e. reference genome name) and the name of the query (i.e. contig name).
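The field layout described in this answer can be parsed mechanically. The sketch below is a hypothetical helper that splits a "Real Alignment" line on the " | " separators (with surrounding spaces), so that pipe characters inside a reference name, as in the example above, do not break the parsing. It assumes that neither the reference name nor the contig name contains whitespace.

```python
def parse_real_alignment(line):
    """Parse one 'Real Alignment N: ...' line into a dictionary."""
    _, _, payload = line.partition(":")
    # At most four splits: the fifth field holds both names and may contain '|'.
    fields = [field.strip() for field in payload.split(" | ", 4)]
    ref_start, ref_end = (int(v) for v in fields[0].split())
    contig_start, contig_end = (int(v) for v in fields[1].split())
    ref_bases, contig_bases = (int(v) for v in fields[2].split())
    ref_name, contig_name = fields[3 + 1].rsplit(None, 1)
    return {
        "ref_start": ref_start, "ref_end": ref_end,
        "contig_start": contig_start, "contig_end": contig_end,
        "ref_bases": ref_bases, "contig_bases": contig_bases,
        "identity": float(fields[3]),
        "reference": ref_name, "contig": contig_name,
    }


example = ("Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | "
           "ENA|U00096|U00096.2_Escherichia_coli contig-710")
aln = parse_real_alignment(example)
print(aln["reference"], aln["contig"], aln["identity"])
```

As noted above, the two base counts are redundant: here 20513 - 19796 + 1 = 718 on the target, and 30228 - 29511 + 1 = 718 on the query.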
Q5. Where does QUAST save information about SNPs?
There are two output files containing SNP information. Both of them are saved in
The first one has extension ".all_snps" and it is raw aligner output. Its format is:
[P1] [SUB] [SUB] [P2] [BUFF] [DIST] [R] [Q] [FRM] [TAGS]
15383 T G 3339560 1 15383 3 2 1 -1 Escherichia_coli contig_15
R and Q specify the number of other alignments, which overlap this position (in Reference and Query (i.e. contig) respectively). FRM and TAGS are not documented in Nucmer help message, and the last two columns are reference name and contig name.
The second file ("*.used_snps") is generated by QUAST.
We analyse all alignments and filter them by skipping some "uninformative" alignments (redundant, duplicated). After that, we include in the ".used_snps" file only those SNPs that actually appear in the filtered alignments. Thus, the values of "# mismatches per 100 kbp" and "# indels per 100 kbp" reported by QUAST are based on the USED SNPs, not ALL SNPs.
In addition, we use our own format of ".used_snps" file.
Escherichia_coli contig_15 728803 C . 3217983
Q6. What does "broken" version of an assembly refer to while assessing scaffolds' quality (
Actually, the difference between the "broken" and the original assembly (scaffolds) is very simple. QUAST splits the input FASTA at continuous fragments of N's of length ≥ 10 and calls the result a "_broken" assembly. By doing this we try to reconstruct the "contigs" that were used to construct the scaffolds. After that, the user can compare results for the real scaffolds and the "reconstructed contigs" and find out whether the scaffolding step was useful or not.
If you have both contigs.fasta and scaffolds.fasta, it is better to provide both files to QUAST and not to set
The comparison of real contigs vs real scaffolds is more honest and informative than scaffolds vs scaffolds_broken.
To sum up, you should use
--scaffolds option if you don't have original file with contigs but want to compare your scaffolds with it.
Also note that the
--scaffolds option makes QUAST search for scaffold gap size misassemblies.
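The splitting rule from this answer (break at runs of at least 10 N's) can be sketched with a regular expression. The function name is hypothetical, and treating lowercase n like N and dropping empty pieces are assumptions. The sketch shows the idea and is not QUAST's code.

```python
import re


def break_scaffold(sequence, min_n_run=10):
    """Split a scaffold at every run of at least `min_n_run` N's."""
    pieces = re.split(r"[Nn]{%d,}" % min_n_run, sequence)
    # Runs at the very start or end of the scaffold yield empty pieces.
    return [piece for piece in pieces if piece]


scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 9 + "TT" + "N" * 25 + "A"
print(break_scaffold(scaffold))
```

The run of 9 N's is shorter than the threshold, so it stays inside the second reconstructed contig, while the runs of 10 and 25 N's act as break points.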
Q7. Can I add new assemblies to existing QUAST report without need to realign already processed assemblies? Or can I at least rerun existing QUAST report with slightly modified options set?
Yes, sure! You just need to specify the existing QUAST output directory with the -o option. Our tool will reuse the already generated alignments and will run the alignment process only for new assemblies.
Note that none of the QUAST options except --min-contig affects the alignment process, so you can rerun a previous QUAST command with modified options and QUAST will also reuse the existing alignments.
Hint: if you did not specify QUAST output dir with
-o option you can rerun QUAST on the same directory
Q8. Which types of structural variations (SV) are handled by QUAST?
Can you give examples of correct BEDPE files for
QUAST can detect and correctly resolve inversions, deletions, and translocations. We also plan to add support for insertions soon.
BEDPE format specification is here. We process the first seven columns of the file (chrom1, start1, end1, chrom2, start2, end2, name); the rest are optional and not read by QUAST. Note that columns should be tab-separated!
Chrom1, start1, end1 define the confidence interval around the SV start; chrom2, start2, end2 define the confidence interval around the SV end. Name defines the SV type: it should contain the 'INV' substring for inversions or 'DEL' for deletions; translocations are automatically identified if chrom1 is not equal to chrom2.
Example of BEDPE line for inversion on positions 1000-1200 of 'E.coli' chromosome (confidence interval is 11 bp long):
E.coli 995 1010 E.coli 1195 1205 This_is_INVersion The Rest Columns Are Optional
Example of BEDPE line for deletion of fragment between 1000 and 1200 of 'S.aureus' chromosome:
S.aureus 995 1010 S.aureus 1195 1205 DEL
Example of BEDPE line for translocation from position 500 of 'chr1' chromosome to position 100 of 'chr2' chromosome (confidence interval is different for both ends):
chr1 450 550 chr2 100 100 name_does_not_matter_here
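The rules above for reading a BEDPE line can be sketched as follows. This is an illustration only, not QUAST's parser: the function name classify_bedpe_line is hypothetical, and checking for a translocation before looking at the name column is an assumption about precedence.

```python
def classify_bedpe_line(line):
    """Parse the first seven tab-separated BEDPE columns and guess the SV type.

    Hypothetical helper illustrating the rules stated in the text.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError("expected at least 7 tab-separated columns")
    chrom1, start1, end1, chrom2, start2, end2, name = fields[:7]

    if chrom1 != chrom2:
        sv_type = "translocation"  # the name column is ignored in this case
    elif "INV" in name:
        sv_type = "inversion"
    elif "DEL" in name:
        sv_type = "deletion"
    else:
        sv_type = None  # not one of the types described above

    return {
        "type": sv_type,
        "start_interval": (chrom1, int(start1), int(end1)),
        "end_interval": (chrom2, int(start2), int(end2)),
    }


# The three example lines, joined with tabs as the format requires.
inversion = classify_bedpe_line(
    "\t".join(["E.coli", "995", "1010", "E.coli", "1195", "1205",
               "This_is_INVersion", "The", "Rest", "Columns", "Are", "Optional"]))
deletion = classify_bedpe_line(
    "\t".join(["S.aureus", "995", "1010", "S.aureus", "1195", "1205", "DEL"]))
translocation = classify_bedpe_line(
    "\t".join(["chr1", "450", "550", "chr2", "100", "100", "name_does_not_matter_here"]))
print(inversion["type"], deletion["type"], translocation["type"])
```

A line separated by spaces instead of tabs is seen as a single column and raises ValueError, which is why the tab requirement matters.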
Q9. Which versions of web browsers are suitable for Icarus output?
We recommend using Icarus in Chrome (tested with v49.0.x); however, it also works properly in Safari (tested with v8.0.x) and Firefox (tested with v41.0.x and v45.0.x). Most of the functionality works in Internet Explorer 9 and higher, but we do not recommend this browser due to slow animation.
Q10. Could you show a sample file suitable for
--references-list MetaQUAST option?
The file is just a list of reference names (one per line) to be searched in the NCBI database.
Feel free to use spaces or underscores inside these names. Correct and working example is below:
Lactobacillus reuteri DSM 20016
Note that the first three references should normally be found, downloaded and used for the evaluation of your assemblies.
At the same time you will be notified that the Harry Potter reference genome is not found in the NCBI database yet.
Q11. Sometimes the "# contigs" and the "# contigs ≥ 0 bp" do not agree. Should not these be equal? What is the difference?
# contigs reports the number of contigs above the specified threshold.
The default threshold is 500 bp and it can be changed with --min-contig.
Most of the other statistics are also based on all contigs larger than this threshold
(we actually remove all short contigs at the beginning of the pipeline). For example, # misassemblies is essentially
# misassemblies in contigs ≥ min-contig threshold.
However, all metrics containing a length specification in parentheses are not affected by
--min-contig! For example, # contigs ≥ 0 bp is the number of contigs before the filtration.
The value of the threshold is written in the very first line of the text report (like "All statistics are based on contigs of size ≥ 500 bp, unless otherwise noted"). It is also present in the header section of the HTML report and at the bottom of the PDF report.
To sum up, by default, # contigs is the same as # contigs ≥ 500 bp, so there will be a difference between # contigs and # contigs ≥ 0 bp if your assembly has contigs shorter than 500 bp.
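The difference between the two metrics can be shown with a small sketch. The contig lengths below are made up for illustration, and count_contigs is a hypothetical helper, not part of QUAST.

```python
DEFAULT_MIN_CONTIG = 500  # default threshold stated above


def count_contigs(lengths, threshold):
    """Count contigs whose length is at least the given threshold (in bp)."""
    return sum(1 for length in lengths if length >= threshold)


# Made-up contig lengths: two of them are shorter than 500 bp.
lengths = [120, 480, 500, 2300, 15000]

n_contigs = count_contigs(lengths, DEFAULT_MIN_CONTIG)  # "# contigs"
n_contigs_0bp = count_contigs(lengths, 0)               # "# contigs >= 0 bp"
print(n_contigs, n_contigs_0bp)
```

The two values differ exactly by the number of contigs shorter than the threshold.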
Q12. Can I use custom BLAST database instead of SILVA 16S rRNA for reference searching?
Yes. If you want to blast your contigs against a local BLAST database, you can specify path to the database with
To create a BLAST database, you need makeblastdb from the BLAST+ package.
You can also use makeblastdb from <quast_installation_dir>/blast/ or ~/.quast/blast/ (depending on your installation).
MetaQUAST automatically creates this directory and downloads the binary into it when you run the full QUAST installation or run metaquast.py without a reference for the first time.
You can create a BLAST database from your FASTA file by running
makeblastdb -in <path_to_fasta_file> -dbtype nucl
If you have multiple FASTA files, you should concatenate them into one.
Note: MetaQUAST will try to search for references in the NCBI database based on the headers from your FASTA file. Ensure that the headers contain species names in a simple parsable format without spaces, for example:
>Escherichia_coli, complete genome
>NZ_CP015308.1|Lactobacillus_plantarum_strain_LY-78, complete genome
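The preparation steps above (concatenate several FASTA files, then run makeblastdb on the result) can be sketched as follows. The helper names concatenate_fasta and makeblastdb_command are hypothetical; only the makeblastdb arguments -in and -dbtype nucl come from the text, and the sketch assumes makeblastdb is on your PATH when the command is actually executed.

```python
def concatenate_fasta(fasta_paths, combined_path):
    """Concatenate several FASTA files into one file.

    A newline is added after any file that does not end with one,
    so that the next header always starts on its own line.
    """
    with open(combined_path, "w") as out:
        for path in fasta_paths:
            with open(path) as src:
                text = src.read()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")
    return combined_path


def makeblastdb_command(fasta_path, makeblastdb="makeblastdb"):
    """Build the argument list for the makeblastdb call given in the text."""
    return [makeblastdb, "-in", fasta_path, "-dbtype", "nucl"]


# To actually build the database (requires BLAST+ to be installed):
#   import subprocess
#   subprocess.check_call(makeblastdb_command("all_references.fasta"))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.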
Q13. Where can I find details about unaligned fragments of my assembly?
Starting from v.4.4, we have added detailed reports with this information. These reports are generated for all assemblies and saved to
E.g. if your assembly is "contigs.fasta" then the file is "contigs_report_contigs.unaligned.info",
if your assembly is "ecoli_assembly_1.fasta" then the file is "contigs_report_ecoli_assembly_1.unaligned.info".
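The naming pattern in the two examples above can be sketched as a small helper. This is inferred from the examples only: unaligned_report_name is a hypothetical function, and the sketch assumes the assembly name is simply the file name without its extension (QUAST may normalize names differently).

```python
import os


def unaligned_report_name(assembly_path):
    """Derive the report file name from an assembly file name.

    Assumption: the assembly label is the base name without the extension.
    """
    label = os.path.splitext(os.path.basename(assembly_path))[0]
    return "contigs_report_%s.unaligned.info" % label


print(unaligned_report_name("contigs.fasta"))
print(unaligned_report_name("ecoli_assembly_1.fasta"))
```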
The report includes all fully unaligned and partially unaligned contigs, i.e. contigs that have at least one unaligned fragment ≥ 500 bp. This is the default threshold and it can be changed with
The following values are reported for each contig:
Q14. I have very large assemblies and reference genome but my computational resources are limited. Could you recommend an optimal set of QUAST options to use?
The current QUAST version is not optimised for large genomes yet. However, we have some useful suggestions. You may be interested in adding the following options to your command line to reduce RAM consumption, disk space usage, and running time. Note that each of them has some negative effect, so you should choose what is more important for you and maybe try them one by one until you find the optimal set for your particular case.
--memory-efficient (drawback: increased running time); set lower number of threads (-t, drawback: increased running time)
--space-efficient (drawback: removed aux files with some details for advanced analysis)
--fast (drawback: missed plots, HTML reports, some metrics); set greater number of threads (-t, drawback: increased memory consumption)
--no-snps (drawback: missed SNP-related metrics in reports); set greater minimal contig length threshold (--min-contig, drawback: not analysed short contigs); set greater minimal alignment length and quality thresholds (--min-identity for the latter, drawback: not analysed contigs with relatively bad alignments)
Q15. I evaluated my assembly using the --scaffolds option but got counter-intuitive results. The number of misassemblies in the "broken" version of the assembly is higher than in the original one, while I expected the opposite or at most similar numbers of misassemblies. Could you explain this?
You are absolutely right that normally a broken version of an assembly should have a smaller (or the same) number of misassemblies
compared to the scaffolded version. Scaffolding is an additional step and it could introduce new errors
by connecting unrelated contigs into a single scaffold. At the same time, it cannot fix misassemblies already present in contigs.
Thus, the number of misassemblies may only increase compared to contigs, i.e. the broken version.
However, your case is probably a little bit more complicated. If your reference genome is not very close to the sequenced organism, you usually get many partially unaligned contigs (scaffolds). If such a scaffold has more than 50% of unaligned bases, its misassemblies are excluded from # misassemblies as untrustworthy and counted in the # unaligned mis. contigs metric. At the same time, the broken version may include this scaffold as a set of short contigs split by continuous fragments of N's. Some of these contigs may be fully unaligned while others may be considered normal (less than 50% is unaligned). The misassemblies in normal contigs will be counted in # misassemblies.
To sum up, please take a look at the # unaligned mis. contigs values. We suspect that the number of such contigs is higher for the scaffolded version and that this is the probable source of the higher number of misassemblies in the broken assembly. You may also be interested in the Icarus visualization, where all unaligned mis. contigs are highlighted with grey-red color.
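The effect described above can be reproduced with a toy model. This is a simplified sketch, not QUAST's code: the function summarize, the tuple layout, and all numbers are made up, and counting one "unaligned mis. contig" per affected sequence is an assumption about how the metric is aggregated.

```python
def summarize(sequences):
    """Return (# misassemblies, # unaligned mis. contigs) for a toy assembly.

    sequences: list of (length, unaligned_bases, misassemblies) tuples.
    Misassemblies in sequences with more than 50% unaligned bases are
    not added to the main counter (rule stated in the answer above).
    """
    misassemblies = 0
    unaligned_mis_contigs = 0
    for length, unaligned_bases, n_mis in sequences:
        if unaligned_bases > 0.5 * length:
            if n_mis > 0:
                unaligned_mis_contigs += 1
        else:
            misassemblies += n_mis
    return misassemblies, unaligned_mis_contigs


# One scaffold: 10000 bp, 6000 bp unaligned, 2 misassemblies.
scaffolded = [(10000, 6000, 2)]
# The same scaffold broken at a 100 bp gap of N's:
# a mostly aligned piece carrying both misassemblies and a fully unaligned piece.
broken = [(4500, 500, 2), (5400, 5400, 0)]

scaffolded_stats = summarize(scaffolded)
broken_stats = summarize(broken)
print(scaffolded_stats, broken_stats)
```

In this toy case the scaffold reports no misassemblies but one unaligned mis. contig, while the broken version reports both misassemblies in the main counter.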
Q16. I evaluated multiple assemblies against the reference genome and opened Icarus Contig Alignment Viewer. The color scheme is almost self-explanatory, but could you explain the meaning of "similar correct contigs" (colored blue) and "similar misassembled blocks" (colored orange)? How do you define the similarity?
The algorithm is described in detail in the Icarus paper
(see Supplementary Material, Section 1.2). The brief definition is below.
Two blocks are considered "similar" if they satisfy all of the following conditions: