Taxonomic and Functional Profiling using Biobakery Workflows


Below I provide scripts to implement the current default workflow for taxonomic and functional profiling using the Huttenhower Lab’s Biobakery Tool Suite used by the Microbial Metagenomics Analysis Center (MMAC) at CCHMC for paired-end data. These scripts are written to run on the CCHMC high-performance computing (HPC) cluster. This post is intended to provide collaborators working with the MMAC detailed information on how their sequence data was processed. If we did not discuss any modifications to this workflow for your project, this is how we generated the taxonomic and functional profiles. We continually work to improve our workflows, so please check with us before citing these packages and versions to ensure the website is up-to-date.

For those investigators looking to run this workflow on your own, you would simply need to convert the scripts to .bat files to be run on the cluster. See here for some instructions or reach out to us if you have questions.

The scripts below perform QC and profiling of compressed paired-end fastq files (.fastq.gz) obtained from next-generation sequencing using the biobakery workflows framework for simple and reproducible execution developed by Lauren McIver and colleagues as described here.

The primary steps include:

  • Removal of reads mapping to the human reference database and basic quality control using KneadData
  • Taxonomic profiling with MetAPhlAn 3
  • Functional profiling with HUMAnN 3

Detailed information on each process can be found at the links above and in the published papers.

Strain and pangenome profiling is not currently supported on the cluster, but can be run on our local workstations. Please reach out to us if this might be of interest.

The scripts assume you are working with paired-end files with all non-biologic sequences already removed since this is the most common set up used by the Microbial Genomics and Metagenomics Laboratory. However, the biobakery workflow steps will also run on single-end files. Just be sure to follow the expected naming conventions. You may also need to change the paths of the BSUB -e and -o if you do not have access to the MMAC drive to access your log and error files.

I have annotated the script below to provide information on versioning, accessing commands, and rough estimates on expected run times, but you may need to modify this to meet your needs. The program will return processed/cleaned fastq files; phylum-, genus-, and species-level taxonomic profiles; UniRef90 gene families; and metacyc and enzyme commission number pathways in stratified and unstratified form.


Running Biobakery Workflows

#!/bin/bash
#BSUB -W 96:00
#BSUB -M 120000
#BSUB -n 32
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out


#notes:
#       -installed versions are:
#          metaphlan v3.0.7
#          humann v3.0.0
#          kneaddata v0.10.0

#      - get list of available workflows with: biobakery_workflows --help
#      - get workflow options with: biobakery_workflows wmgx --help

#      - expected input files are fastq.gz (single-end or paired-end)
#      - the naming convention is: $SAMPLE.fastq.gz if single-end and $SAMPLE.R1.fastq.gz and $SAMPLE.R2.fastq.gz if paired-ending
#      - $SAMPLE can contain any characters except spaces or periods.
#      - the workflow by default expects input files with the extension "fastq.gz". If your files are not gzipped, run with the option --input-extension fastq

#      - to rename files use something like:
#       for file in *; do mv "$file" "${file//_R1./.R1.}"; done
#       for file in *; do mv "$file" "${file//_R2./.R2.}"; done

#     - 8 samples at ~4Gb per sample takeas ~4-5 hours per sample given these settings
#     - so plan time accordingly and build in buffer
#     - if increase threads above 8 may need additional RAM

#     - script assumes fastq.gz files are located in seqs folder (change as needed)



#Navigate to directory: (CHANGE ME)
cd /scratch/olljt2


#Load biobakery3
module load anaconda3
source activate biobakery
module load bowtie2
module load trimmomatic
export BIOBAKERY_WORKFLOWS_DATABASES=/database/biobakery_workflows_databases/


#Taxonomic and functional profiling
biobakery_workflows wmgx --input seqs --output biobakery_output \
 --bypass-strain-profiling \
 --qc-options="--trimmomatic-options "MINLEN:90"" \
 --taxonomic-profiling-options="--add_viruses" \
 --local-jobs 4 --threads 8


#Modifying taxonomic output
cd biobakery_output
cd metaphlan
cd merged
mkdir mpa3_out
grep -E "(s__)|(taxonomy)" metaphlan_taxonomic_profiles.tsv > mpa3_out/mpa3_species_relab.txt
grep -E "(g__)|(taxonomy)" metaphlan_taxonomic_profiles.tsv | grep -v "s__" > mpa3_out/mpa3_genus_relab.txt
grep -E "(p__)|(taxonomy)" metaphlan_taxonomic_profiles.tsv | grep -v "c__" | grep -v "o__" | grep -v "f__" | grep -v "g__" |grep -v "s__" > mpa3_out/mpa3_phylum_relab.txt
cd ..
cd ..


#Modifying functional output
cd humann/merged
mkdir hmn3_out

humann_split_stratified_table --input pathabundance_relab.tsv --output hmn3_out/.
humann_split_stratified_table --input ecs_relab.tsv --output hmn3_out/.
humann_split_stratified_table --input genefamilies_relab.tsv --output hmn3_out/.

cd hmn3_out
rm ecs_relab_stratified.tsv
rm genefamilies_relab_stratified.tsv
rm pathabundance_relab_stratified.tsv

humann_rename_table --input genefamilies_relab_unstratified.tsv --output genefamilies_relab_unstratified_with_names.tsv --names uniref90
humann_rename_table --input ecs_relab_unstratified.tsv --output ecs_relab_unstratified.tsv_with_names.tsv --names ec


Running Kraken 2 and Bracken

For those who would also like to obtain taxonomic profiles using Kraken 2 and Bracken the script below can be run after the biobakery workflows run completes. We have trimmed the reads using KneadData to a minimum of 90bp to limit errant classifications to coincide with the min length used to build our default Bracken database. So no additional modifications are required to run Kraken and Bracken.

However, we now see the workflow described in Profiling of Shotgun Metagenomic Sequence Data using Kraken2/Bracken and HUMAnN2 as deprecated, but will retain the post for now given ongoing projects using this workflow.

#!/bin/bash
#BSUB -W 24:00
#BSUB -M 120000
#BSUB -n 32
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out


#Script to run kraken2/bracken pipeline for paired-end reads as part of biobakery 3 workflow on cluster

#notes: - program assumes QC passed fastq files with MINLEN 90 residing in: biobakery_output/kneaddata/main/
#       - program generates k2 read mappings and bracken abundance estimation at the phylum, genus, and species levels
#       - k2 confidnce set to 0.1 and bracken threashold to t-10; consider modifying to reduce FPR as needed
#       - requesting 32 cores, 120G RAM, and 24 hr wall time; see: https://bmi.cchmc.org/resources/software/lsf-examples
#       - to run job: bsub < biobakery3_bracken.bat
#       - should be run after kneaddata


#Navigate to biobakery_output folder: (CHANGE ME)
cd /scratch/olljt2/biobakery_output/kneaddata/main/


#Load modules
module load kraken/2.0.8
module load bracken/2.5.0


#Classify reads with kraken2
mkdir k2_outputs
mkdir k2_reports

for i in *_paired_1.fastq
do
  filename=$(basename "$i")
  fname="${filename%_paired_*.fastq}"
  kraken2 --db /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 \
  --confidence 0.1 \
  --threads 32 \
  --use-names \
  --output k2_outputs/${fname}_output.txt \
  --report k2_reports/${fname}_report.txt \
  --paired ${fname}_paired_1.fastq ${fname}_paired_2.fastq
done


#Abundance estimation with braken (phylum, genus, species)
cd k2_reports
mkdir braken
mkdir braken/species
mkdir braken/genus
mkdir braken/phylum

for i in *_report.txt
do
  filename=$(basename "$i")
  fname="${filename%_report.txt}"
  bracken -d /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 -i $i -r 90 -t 10 -l S -o ${fname}_report_species.txt
done
rm *_species.txt
mv *_bracken.txt braken/species/.

for i in *_report.txt
do
  filename=$(basename "$i")
  fname="${filename%_report.txt}"
  bracken -d /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 -i $i -r 90 -t 10 -l G -o ${fname}_report_genus.txt
done
rm *_genus.txt
mv *_bracken.txt braken/genus/.

for i in *_report.txt
do
  filename=$(basename "$i")
  fname="${filename%_report.txt}"
  bracken -d /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 -i $i -r 90 -t 10 -l P -o ${fname}_report_phylum.txt
done
rm *_phylum.txt
mv *_bracken.txt braken/phylum/.


#Generating combined abundance tables in mpa format
mkdir braken/species/mpa
mkdir braken/genus/mpa
mkdir braken/phylum/mpa

for i in braken/species/*_report_bracken.txt
do
  filename=$(basename "$i")
  fname="${filename%_report_bracken.txt}"
  python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/kreport2mpa.py -r $i -o braken/species/mpa/${fname}_mpa.txt --display-header
done

mkdir braken/species/mpa/combined
python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/combine_mpa.py -i braken/species/mpa/*_mpa.txt -o braken/species/mpa/combined/combined_species_mpa.txt
grep -E "(s__)|(#Classification)" braken/species/mpa/combined/combined_species_mpa.txt > braken/species/mpa/combined/bracken_abundance_species_mpa.txt


for i in braken/genus/*_report_bracken.txt
do
  filename=$(basename "$i")
  fname="${filename%_report_bracken.txt}"
  python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/kreport2mpa.py -r $i -o braken/genus/mpa/${fname}_mpa.txt --display-header
done

mkdir braken/genus/mpa/combined
python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/combine_mpa.py -i braken/genus/mpa/*_mpa.txt -o braken/genus/mpa/combined/combined_genus_mpa.txt
grep -E "(g__)|(#Classification)" braken/genus/mpa/combined/combined_genus_mpa.txt > braken/genus/mpa/combined/bracken_abundance_genus_mpa.txt


for i in braken/phylum/*_report_bracken.txt
do
  filename=$(basename "$i")
  fname="${filename%_report_bracken.txt}"
  python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/kreport2mpa.py -r $i -o braken/phylum/mpa/${fname}_mpa.txt --display-header
done

mkdir braken/phylum/mpa/combined
python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/combine_mpa.py -i braken/phylum/mpa/*_mpa.txt -o braken/phylum/mpa/combined/combined_phylum_mpa.txt
grep -E "(p__)|(#Classification)" braken/phylum/mpa/combined/combined_phylum_mpa.txt > braken/phylum/mpa/combined/bracken_abundance_phylum_mpa.txt


#Cleaning up sample names
sed -i -e 's/_report_bracken.txt//g' braken/species/mpa/combined/bracken_abundance_species_mpa.txt
sed -i -e 's/_report_bracken.txt//g' braken/genus/mpa/combined/bracken_abundance_genus_mpa.txt
sed -i -e 's/_report_bracken.txt//g' braken/phylum/mpa/combined/bracken_abundance_phylum_mpa.txt


#Cleaning up top-level folders
cd ..
mkdir bracken_abundance_files
cp k2_reports/braken/species/mpa/combined/bracken_abundance_species_mpa.txt bracken_abundance_files/.
cp k2_reports/braken/genus/mpa/combined/bracken_abundance_genus_mpa.txt bracken_abundance_files/.
cp k2_reports/braken/phylum/mpa/combined/bracken_abundance_phylum_mpa.txt bracken_abundance_files/.


Building phyloseq objects for downstream analyses in R

If you plan to conduct your analyses in R, we recommend phyloseq as the tool to maintain your microbiome data structures. The script below provides some example code of how a phyloseq object may be built from the biobakery and bracken output files.

We typically perform downstream statistical analyses on local workstations rather than the cluster (easier to update and maintain packages, workflows, etc.), so here we assume you have downloaded the files referenced below to your local workstation. If you are interested in including the MetAPhlAn 3 phylogenetic tree described here, to allow for phylogenetic aware analyses (i.e., UniFrac, phylofactor, etc.), you can download it from the link provided by Doyle Ward (thanks Doyle!). You will need to monitor this tree file and/or obtain an updated version as the chocophlan database expands. We also multiply the MPA3 relative abundance by the final read count to generate expected counts from MetAphlAn.

The MetAphlAn output generated from biobakery workflows is clean and below I use tidyr::separate to parse the taxonomy information. The bracken output is not. So here I only retain the species-level classification in the ps object since the taxonomic levels are skipped in various places. This can be cleaned up, but might take a little time and I have not be able to get to it yet. If you have any suggestions or examples of parsing these taxonomic lineage strings please let me know and I will be happy to incorporate it here.

library(tidyverse); packageVersion("tidyverse")     #version:1.3.0 
library(phyloseq); packageVersion("phyloseq")       #version:1.32.0


#Building MetaPhlAn species abundance ps object
s_abund <- read_tsv("data/mpa3_species_relab.txt")

s_tax_tab <- s_abund %>%
  dplyr::rename("taxonomy" = "# taxonomy") %>%
  dplyr::select(taxonomy) %>%
  tidyr::separate(taxonomy, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::mutate(spec_row = Species) %>%
  tibble::column_to_rownames(var = "spec_row")

s_otu_tab <- s_abund %>%
  dplyr::rename("taxonomy" = "# taxonomy") %>%
  tidyr::separate(taxonomy, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::select(-Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
  tibble::column_to_rownames(var = "Species")

names(s_otu_tab) <- gsub(names(s_otu_tab), pattern = "_taxonomic_profile", replacement = "") 

head(colSums(s_otu_tab))
s_otu_tab <- s_otu_tab / 100                                                   #convert to proportion with unit sum of 1
head(colSums(s_otu_tab))

s_meta <- data.frame(seq_id = names(s_otu_tab))
s_meta <- s_meta %>%
  dplyr::mutate(sampleNames_row = seq_id) %>%
  tibble::column_to_rownames(var = "sampleNames_row")

(ps_mpa3_species <- phyloseq(sample_data(s_meta),
                             otu_table(s_otu_tab, taxa_are_rows = TRUE),
                             tax_table(as.matrix(s_tax_tab))))

mpa_tree <- read_tree("mpa_tree/20201013_mpav30_speciesMod.nwk")
(ps_mpa3_species <- merge_phyloseq(ps_mpa3_species, mpa_tree))                 #confirm do not lose too many species (tree will become outdated)
taxa_names(ps_mpa3_species) <- gsub("s__", "", taxa_names(ps_mpa3_species))    #cleaning up names
#plot_tree(ps_mpa3_species, label.tips = "taxa_names", ladderize = "left", color = "Phylum")

read_count_df <- read_tsv("data/kneaddata_read_count_table.tsv")
read_count_df <- read_count_df %>%
  dplyr::rename("number_reads" = "final pair1") %>%
  dplyr::select(Sample, `number_reads`)

otu_tab <- data.frame(otu_table(ps_mpa3_species))
otu_tab <- otu_tab %>%
  t(.) %>%
  data.frame(.) %>%
  rownames_to_column(var = "Sample")

otu_tab %>% select(-Sample) %>% rowSums(.)                                  #confirm each row sums as expected

otu_tab <- left_join(otu_tab, read_count_df)

otu_tab <- otu_tab %>% 
  mutate_at(vars(-number_reads, -Sample), funs(. * number_reads)) %>%       #convert to counts and round up
  mutate_at(vars(-number_reads, -Sample), funs(ceiling(.))) %>%
  select(-number_reads) %>%
  column_to_rownames(var = "Sample")

otu_tab <- t(otu_tab)
otu_table(ps_mpa3_species) <- otu_table(otu_tab, taxa_are_rows = TRUE)

read_count_df
sample_sums(ps_mpa3_species)

saveRDS(ps_mpa3_species, "ps_objects/ps_mpa3_species.rds")
rm(list=ls())



#Building Bracken species abundance ps object
s_abund <- read_tsv("data/bracken_abundance_species_mpa.txt")

s_tax_tab <- s_abund %>%
  dplyr::rename("taxonomy" = "#Classification") %>%
  dplyr::select(taxonomy) %>%
  dplyr::mutate(Species = sub('.*\\|', '', taxonomy),
                Species = gsub("s__", "", Species),
                spec_row = Species) %>%
  dplyr::select(-taxonomy) %>%
  tibble::column_to_rownames(var = "spec_row")

s_otu_tab <- s_abund %>%
  dplyr::rename("taxonomy" = "#Classification") %>%
  dplyr::mutate(taxonomy = sub('.*\\|', '', taxonomy),
                taxonomy = gsub("s__", "", taxonomy)) %>%
  tibble::column_to_rownames(var = "taxonomy")

s_meta <- data.frame(seq_id = names(s_otu_tab))
s_meta <- s_meta %>%
  dplyr::mutate(sampleNames_row = seq_id) %>%
  tibble::column_to_rownames(var = "sampleNames_row")

(ps_bracken_species <- phyloseq(sample_data(s_meta),
                                otu_table(s_otu_tab, taxa_are_rows = TRUE),
                                tax_table(as.matrix(s_tax_tab))))

saveRDS(ps_bracken_species, "ps_objects/ps_bracken_species.rds")
rm(list=ls())



#Building HUMAnN pathway abundance ps object
s_abund <- read_tsv("data/pathabundance_relab_unstratified.tsv")

s_tax_tab <- s_abund %>%
  dplyr::rename("Pathway" = "# Pathway") %>%
  dplyr::select(Pathway) %>%
  dplyr::mutate(spec_row = Pathway) %>%
  tibble::column_to_rownames(var = "spec_row")

s_otu_tab <- s_abund %>%
  dplyr::rename("Pathway" = "# Pathway") %>%
  tibble::column_to_rownames(var = "Pathway")

names(s_otu_tab) <- gsub(names(s_otu_tab), pattern = "_Abundance", replacement = "") 
colSums(s_otu_tab)

s_meta <- data.frame(seq_id = names(s_otu_tab))
s_meta <- s_meta %>%
  dplyr::mutate(sampleNames_row = seq_id) %>%
  tibble::column_to_rownames(var = "sampleNames_row")

(ps_metacyc_pathways <- phyloseq(sample_data(s_meta),
                                otu_table(s_otu_tab, taxa_are_rows = TRUE),
                                tax_table(as.matrix(s_tax_tab))))

saveRDS(ps_metacyc_pathways, "ps_objects/ps_metacyc_pathways.rds")
rm(list=ls())



#Building HUMAnN EC ps object
s_abund <- read_tsv("data/ecs_relab_unstratified.tsv_with_names.tsv")

s_tax_tab <- s_abund %>%
  dplyr::rename("ECN" = "# Gene Family") %>%
  dplyr::select(ECN) %>%
  dplyr::mutate(spec_row = ECN) %>%
  tibble::column_to_rownames(var = "spec_row")

s_otu_tab <- s_abund %>%
  dplyr::rename("ECN" = "# Gene Family") %>%
  tibble::column_to_rownames(var = "ECN")

names(s_otu_tab) <- gsub(names(s_otu_tab), pattern = "_Abundance-RPKs", replacement = "") 
colSums(s_otu_tab)

s_meta <- data.frame(seq_id = names(s_otu_tab))
s_meta <- s_meta %>%
  dplyr::mutate(sampleNames_row = seq_id) %>%
  tibble::column_to_rownames(var = "sampleNames_row")

(ps_enzymes <- phyloseq(sample_data(s_meta),
                        otu_table(s_otu_tab, taxa_are_rows = TRUE),
                        tax_table(as.matrix(s_tax_tab))))

saveRDS(ps_enzymes, "ps_objects/ps_enzymes.rds")
rm(list=ls())



#Building HUMAnN gene families (uniref90) ps object
s_abund <- read_tsv("data/genefamilies_relab_unstratified_with_names.tsv")

s_tax_tab <- s_abund %>%
  dplyr::rename("UniRef90" = "# Gene Family") %>%
  dplyr::select(UniRef90) %>%
  dplyr::mutate(spec_row = UniRef90) %>%
  tibble::column_to_rownames(var = "spec_row")

s_otu_tab <- s_abund %>%
  dplyr::rename("UniRef90" = "# Gene Family") %>%
  tibble::column_to_rownames(var = "UniRef90")

names(s_otu_tab) <- gsub(names(s_otu_tab), pattern = "_Abundance-RPKs", replacement = "") 
colSums(s_otu_tab)

s_meta <- data.frame(seq_id = names(s_otu_tab))
s_meta <- s_meta %>%
  dplyr::mutate(sampleNames_row = seq_id) %>%
  tibble::column_to_rownames(var = "sampleNames_row")

(ps_uniref90_families <- phyloseq(sample_data(s_meta),
                                  otu_table(s_otu_tab, taxa_are_rows = TRUE),
                                  tax_table(as.matrix(s_tax_tab))))

saveRDS(ps_uniref90_families, "ps_objects/ps_uniref90_families.rds")
rm(list=ls())


I hope this helps and don’t hesitate to reach out to us if you have any questions.


Session Info

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.4   magrittr_2.0.1   bookdown_0.21    tools_4.0.4     
##  [5] htmltools_0.5.0  yaml_2.2.1       stringi_1.5.3    rmarkdown_2.5   
##  [9] blogdown_0.21.45 knitr_1.30       stringr_1.4.0    digest_0.6.27   
## [13] xfun_0.19        rlang_0.4.8      evaluate_0.14
Associate Professor of Pediatrics

Related