Denoising 16S Sequence Reads using UNOISE3 via VSEARCH and QIIME2

Previously, I shared some workflows for denoising 16S rRNA gene sequence data used by our lab. One of the examples used the popular UNOISE3 algorithm developed by Robert Edgar and implemented in USEARCH. However, use of the 64-bit version of USEARCH requires a license.

VSEARCH is an open-source alternative to USEARCH developed by Rognes, Mahe, and others, and is available as a binary file for Linux, Mac, or Windows and as a QIIME2 plug-in. The motivation for developing VSEARCH and related details are described here.

While VSEARCH is impressive software, I have hesitated providing examples of its use. However, I have received numerous inquires from investigators who, for one reason or another, are unable to access a licensed version of USEARCH. Given VSEARCH is, at least in part, now integrated with QIIME2 (and Mothur), I figured implementing the UNOISE3 algorithm would be straight forward within QIIME2 (i.e., this has already been done). However, looking over the documentation it was not clear whether the –cluster_unoise command is actually available to users.

Therefore, I provide an example script to denoise 16S rRNA reads using UNOISE3 via VSEARCH and taxonomic classification and fragment insertion via QIIME2 below. The workflow assumes 2x250 PE sequencing and use of the V4 515F-806R primer set. Modification of the commands below is recommended for other set ups. VSEARCH accepts compressed files (which is nice), so the workflow expects gzipped files. I wrote the script using the example data from Pat Schloss’s Mothur MiSeq SOP where the files all end with “_001.fastq”, and since the VSEARCH –fastq_mergepairs does not look to directly accept files from multiple samples, the for loop used in the merge reads step assumes the files have this naming convention. So one may need to modify the file names, or the for loop, for your specific use case. In addition, the script assumes that primers and all non-biologic sequences have been removed and that all the PE files reside in a single folder (here named “seqs” on my desktop). You will also want to modify the number of threads used in each step depending on the number of cores you have available.

Since I am working on a mac, I installed the program into a new conda environment (named vsearch) from the (bioconda repository)[]. If you are working on another platform, you may need to make the appropriate modifications. Also, if working interactively, one would want to load the vsearch environment via “conda activate vsearch” rather than “source”. I performed the taxonomic classification and fragment insertion into the GGv13.8 tree within QIIME2 to correspond with the previous workflow, but make no claim that this workflow is optimal (see here for example). See the appropriate Q2 links to access the taxonomy training data and phylogeny.

The number of zOTUS called by this VSEARCH workflow, and the number of non-zero cells, were slightly higher than I obtained using similar, but not identical, commands in USEARCH. They are much higher than what I obtained using dada2, but performance is expected to vary across datasets. Some perhaps minor improvements may be made by more stringent filtering reads in the merged data based the max expected errors…or when defining the centroids…or by imposing a stricter threshold for sequence similarity when mapping reads back to centroids…or by imposing stricter thresholds when merging the PE reads…etc.

Given that I do not typically use VSEARCH in our work, please let me know if I missed any steps or settings that should be further modified from the defaults.


#Script to run VSEARCH UNOISE3 workflow for 16S V4 reads on IMAC (desktop)
# - Denoising: VSEARCH (version 2.15.2) UNOISE3
# - Taxonomic classification: GG13_8 Naive Bayesian classifier
# - Phylogenetic tree: SEPP

#Program Assumes:
# - demultiplexed F and R fastq files are placed in "seqs" folder folder on desktop
# - files are compressed (.gz)
# - primers and all non-biologic sequences have been removed
# - files end in "R1_001.fastq.gz" format
# - use following if need to modify: for file in *; do mv "$file" ${file//_R1/_R1_001}; done

#To run the script:
# - open terminal window
# - navigate to the script (after copied to desktop)
# - run: caffeinate sh

##Navigate to folder with PE fastq.gz files (CHANGE ME if needed)
cd /users/olljt2/desktop/seqs

##Load vsearch
source activate vsearch

##Run vsearch workflow
#Merge reads
for R1 in *_R1_*.fastq.gz; do
    vsearch \
        --fastq_mergepairs ${R1} \
        --reverse ${R1/_R1_/_R2_} \
        --fastqout ${R1/_R1_*/merged.fq} \
        --fastaout ${R1/_R1_*/merged.fa} \
        --fastq_maxdiffs 10 --fastq_minmergelen 240 --fastq_maxmergelen 270 \
        --relabel ${R1} \
         --threads 14

cat *fa > my.fasta
cat *fq > my.fastq
rm *.fa
rm *fq

sed 's/fastq.gz//g' < my.fasta > merged.fasta
sed 's/fastq.gz//g' < my.fastq > merged.fastq
rm my.fasta
rm my.fastq

#Filter low quality reads
vsearch --fastq_filter merged.fastq --fastq_maxee 1.0 --fastaout filtered.fa --relabel filt --log filter_log.txt

#Get unquies
vsearch  --derep_fulllength filtered.fa --sizein --fasta_width 0 --sizeout --output uniques.fa --minuniquesize 1 --relabel unique --log uniques_log.txt

#Get zOTUs
vsearch --cluster_unoise uniques.fa --centroids zotus.fa --uc uc_zotus.uc --log unoise_log.txt --threads 14

#Remove chimeras
vsearch  --uchime3_denovo zotus.fa --nonchimeras zotus_nochime.fa --fasta_width 0 --relabel zotu --xsize --log unchime3_log.txt

#Removing any masking (convert any lower to upper case)
awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' zotus_nochime.fa > zotus_nochime_nomask.fa

#Map reads back to centriods
vsearch --usearch_global merged.fasta --db zotus_nochime_nomask.fa --id 0.97 --otutabout unoise3_zotu_table.txt --biomout unoise3_zotu_table.biom  --threads 14 --log otutab_log.txt

#Move log files
mkdir logs
mv *log.txt logs

#Deactivating vseach conda env.
conda deactivate

##Load QIIME2
source activate qiime2-2022.8

##Convert zOTU table and rep-seqs to q2 artifacts
mkdir q2
mkdir q2/reports

qiime tools import \
--input-path zotus_nochime_nomask.fa \
--output-path q2/rep-seqs.qza \
--type "FeatureData[Sequence]"

qiime tools import \
  --input-path unoise3_zotu_table.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path q2/table.qza

##Assign taxonomy using pre-trained (515F-806R) GG ref db
cd q2

qiime feature-classifier classify-sklearn \
  --i-classifier /Users/olljt2/Documents/q2_db/gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza \
  --p-confidence .8 \
  --p-n-jobs -14

qiime metadata tabulate --m-input-file taxonomy.qza --o-visualization reports/taxonomy.qzv

##Fragment insertion via SEPP for tree
qiime fragment-insertion sepp \
  --i-representative-sequences rep-seqs.qza \
  --i-reference-database /Users/olljt2/Documents/q2_db/sepp-refs-gg-13-8.qza \
  --o-tree tree.qza \
  --o-placements insertion-placements.qza \
  --p-threads 14

#Deactivating conda env.
source deactivate

One could now begin to work with these data in QIIME2. For those who prefer to work in R, I like to use the qiime2R package developed by Jordan Bisanz to convert the QIIME2 artifacts to a phyloseq object.

Below is some example code that can be used to build the phyloseq object. Note that the .qza objects are compressed files containing the data and additional provenance information etc. (which is very nice). So here I just manually decompress the rep-seqs.qza file and grab the dna-sequences.fasta file from the data folder before running the code and all works as expected.

library.path <- .libPaths()
library(qiime2R, lib.loc = library.path); packageVersion("qiime2R")      #version: 0.99.6
library(phyloseq); packageVersion("phyloseq")                            #version: 1.36.0
library(Biostrings); packageVersion("Biostrings")                        #version: 2.60.2

(ps_object <- qza_to_phyloseq(features = "/Users/olljt2/desktop/seqs/q2/table.qza",
                        tree = "/Users/olljt2/desktop/seqs/q2/tree.qza",
                        taxonomy = "/Users/olljt2/desktop/seqs/q2/taxonomy.qza"))

#need to first manually decompress qza file
rep <- Biostrings::readDNAStringSet("/Users/olljt2/desktop/seqs/q2/dna-sequences.fasta")
(ps_object <- merge_phyloseq(ps_object, rep))

saveRDS(ps_object, "/Users/olljt2/desktop/ps_vsearch_unoise3_gg138_sepp.rds")

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## loaded via a namespace (and not attached):
##  [1] bookdown_0.24   digest_0.6.28   R6_2.5.1        jsonlite_1.7.2 
##  [5] magrittr_2.0.3  evaluate_0.16   blogdown_1.7    stringi_1.7.5  
##  [9] rlang_1.0.4     cli_3.3.0       rstudioapi_0.13 jquerylib_0.1.4
## [13] bslib_0.3.1     rmarkdown_2.11  tools_4.1.1     stringr_1.4.0  
## [17] xfun_0.29       yaml_2.2.1      fastmap_1.1.0   compiler_4.1.1 
## [21] htmltools_0.5.2 knitr_1.40      sass_0.4.2
Associate Professor of Pediatrics