Profiling of Shotgun Metagenomic Sequence Data using Kraken2/Bracken and HUMAnN2

Below I provide scripts to implement the current default workflow for taxonomic profiling using Kraken2 and Bracken and functional profiling using HUMAnN2 used by the Microbial Metagenomics Analysis Center (MMAC) at CCHMC for paired-end data. These scripts are written to run on the CCHMC high-performance computing (HPC) cluster. This post is intended to provide collaborators working with the MMAC detailed information on how their sequence data was processed. If we did not discuss any modifications to this workflow for your project, this is how we generated the taxonomic and functional profiles. We continually work to improve our workflows, so please check with us before citing these packages and versions to ensure the website is up-to-date.

For those investigators looking to run this workflow on your own, you would simply need to convert the scripts to .bat files to be run on the cluster. See here for some instructions or reach out to us if you have questions.

The scripts below perform QC and profiling of paired-end fastq files obtained from next-generation sequencing using well-established tools. The primary steps include:

  • Removal of reads mapping to the human reference database and and basic quality control using kneaddata
  • Assigning each read to the lowest common ancestor using exact kmer matching as implemented by kraken2
  • Generation of phylum-, genus-, and species-level taxonomic profiles using bayesian reestimation of kraken2 classifications via bracken
  • Generation of gene family and metacyc pathways using HUMAnN2

Detailed information on each process can be found at the links above and in the published papers.

The scripts assume you are working with paired-end files since this is the most common set up used by the Microbial Genomics and Metagenomics Laboratory and the DNA Core. If you have single-end files (e.g., forward reads only) send us an email and we can provide scripts to run these same tools on your samples.

Quality Control with Kneaddata

The script below:

  • calls bowtie2 in “very-sensitive” mode to remove human reads mapping to the default human reference and contaminant database
  • calls trimmomatic to trim reads where the average quality score over four reads drops below 20 and filters out any read shorter than 90bp (SLIDINGWINDOW:4:20 MINLEN:90)
  • 90 bp selected as shorter reads found to reduce performance in functional profiling

It also requests the process be run on 10 cores with 64G RAM and 10 hr of wall time. You will need to modify this as needed. You will also need to change the path below to point to your fastq files and follow the instructions provided at the top of the script properly set up the file structure etc. You will also need to change the path for the error (-e) and output (-o) files from /data/MMAC-Shared-Drive/scratch/%J.err to wherever you want your log files to be saved (unless you have access to the MMAC-Shared-Drive). Otherwise you will not be able to access them. You will need to do this for all three scripts.

This step can take a while to run if you have large files (I like to run multiple concurrent batches of samples for large projects space permitting).

#BSUB -W 48:00
#BSUB -M 64000
#BSUB -n 10
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out

#Script to remove human contaminant sequences and quality filter paired-end reads

#Program assumes:
#       - compressed fastq files are provided (.gz) and end in fastq.gz
#       - files are provided in a single folder
#       - any non-biological sequences have been removed

#To run the script
# - create a new folder for the project
# - drop compressed fastq files in new project folder
# - change first line of code below to navigate to the project folder
# - enter bsub at command line to run: bsub < kneaddata_trim90bp_pe.bat
# - may need to modify wall time etc. given size of project

#Navigate to project folder (change me)
cd /your_path/seqs

#Load modules
module load kneaddata/0.7.3
module load trimmomatic

#Running kneaddata on PE reads
for i in *R1.fastq.gz
  filename=$(basename "$i")
  kneaddata --i ${fname}_R1.fastq.gz --i ${fname}_R2.fastq.gz --output kneaddata_out \
  --reference-db /data/MMAC-Shared-Drive/ref_databases/kneaddata/drupal \
  --remove-intermediate-output -t 5 -p 2 \
  --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:90"

#Output kneaddata report
kneaddata_read_count_table --input kneaddata_out --output kneaddata_out/kneaddata_read_counts.txt

#Clean up file structure
cd kneaddata_out

mkdir homo_sap
mkdir unmatched
mkdir logs

mv *Homo_sapiens* homo_sap/.
mv *unmatched* unmatched/.
mv *log logs/.

The output should be a kneaddata_out folder containing “kneaded” (i.e., quality controlled) forward and reverse read files for each sample, as well as a detailed report indicating how many reads made it through each filter.

Taxonomic Classification and Relative Abundance Estimation using Kraken2 and Bracken

The script below:

  • classifies reads by matching each kmer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given kmer (via the kraken2 algorithm)
  • generates estimates of the relative abundance for each taxa using the bayesian reestimation approach implemented by bracken (at the phylum, genus and species levels)

The script assumes fastq files residing in single folder (typically this will be kneaddata_out folder). It also assumes that the minium read length is not shorter than 90 bp. If you have shorter reads you will want/need to make a new bracken database. The kraken2 confidence is set to 0.1 and bracken threshold to t-10; consider modifying to reduce FPR as needed.

Kraken2 is a RAM intensive program (but better and faster than the previous version). Here I am requesting 120 GB of RAM, 32 cores, and 8 hours of wall time. Modify as needed. The profiling is actually quite fast…so eight hours is likley overkill depending on how many sample you have.

#BSUB -W 8:00
#BSUB -M 120000
#BSUB -n 32
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out

#Script to run kraken2 bracken pipeline for single-end reads

#notes: - program assumes fastq files residing in single folder (typically will be kneaddata_out folder)
#       - modify if have fasta or other file name (e.g., .fna, .fq, etc.)
#       - program assumes reads have already been quality filtered and trimmed to a min of 90bp
#       - if have shorter min read length then need to build and use bracken db built for shorter read length

#       - program generates k2 read mappings and bracken abundance estimation at the phylum, genus, and species levels
#       - modify if want output for other levels
#       - k2 confidence set to 0.1 and bracken threshold to t-10; consider modifying to reduce FPR as needed

#       - requesting 32 cores, 120G RAM, and 8 hr wall time; see:
#       - to run job: bsub < k2_bracken_pipeline_single_end.bat

#Navigate to folder containing fastq files: (CHANGE ME)
cd /your_path/kneaddata_out

#Load modules
module load kraken/2.0.8
module load bracken/2.5.0

#Classify reads with kraken2
mkdir k2_outputs
mkdir k2_reports

for i in *R1_kneaddata_paired_1.fastq
  filename=$(basename "$i")
  kraken2 --db /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 \
  --confidence 0.1 \
  --threads 32 \
  --use-names \
  --output k2_outputs/${fname}_output.txt \
  --report k2_reports/${fname}_report.txt \
  --paired ${fname}_R1_kneaddata_paired_1.fastq ${fname}_R1_kneaddata_paired_2.fastq

#Abundance estimation with braken (phylum, genus, species)
cd k2_reports
mkdir braken
mkdir braken/species
mkdir braken/genus
mkdir braken/phylum

for i in *_report.txt
  filename=$(basename "$i")
  bracken -d /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 -i $i -r 90 -t 10 -l S -o ${fname}_report_species.txt
rm *_species.txt
mv *_bracken.txt braken/species/.

for i in *_report.txt
  filename=$(basename "$i")
  bracken -d /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 -i $i -r 90 -t 10 -l G -o ${fname}_report_genus.txt
rm *_genus.txt
mv *_bracken.txt braken/genus/.

for i in *_report.txt
  filename=$(basename "$i")
  bracken -d /data/MMAC-Shared-Drive/ref_databases/kraken/db_refseq_20200725 -i $i -r 90 -t 10 -l P -o ${fname}_report_phylum.txt
rm *_phylum.txt
mv *_bracken.txt braken/phylum/.

#Generating combined abundance tables in mpa format
mkdir braken/species/mpa
mkdir braken/genus/mpa
mkdir braken/phylum/mpa

for i in braken/species/*_report_bracken.txt
  filename=$(basename "$i")
  python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/ -r $i -o braken/species/mpa/${fname}_mpa.txt --display-header

mkdir braken/species/mpa/combined
python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/ -i braken/species/mpa/*_mpa.txt -o braken/species/mpa/combined/combined_species_mpa.txt
grep -E "(s__)|(#Classification)" braken/species/mpa/combined/combined_species_mpa.txt > braken/species/mpa/combined/bracken_abundance_species_mpa.txt

for i in braken/genus/*_report_bracken.txt
  filename=$(basename "$i")
  python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/ -r $i -o braken/genus/mpa/${fname}_mpa.txt --display-header

mkdir braken/genus/mpa/combined
python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/ -i braken/genus/mpa/*_mpa.txt -o braken/genus/mpa/combined/combined_genus_mpa.txt
grep -E "(g__)|(#Classification)" braken/genus/mpa/combined/combined_genus_mpa.txt > braken/genus/mpa/combined/bracken_abundance_genus_mpa.txt

for i in braken/phylum/*_report_bracken.txt
  filename=$(basename "$i")
  python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/ -r $i -o braken/phylum/mpa/${fname}_mpa.txt --display-header

mkdir braken/phylum/mpa/combined
python /data/MMAC-Shared-Drive/ref_databases/kraken/python_scripts/ -i braken/phylum/mpa/*_mpa.txt -o braken/phylum/mpa/combined/combined_phylum_mpa.txt
grep -E "(p__)|(#Classification)" braken/phylum/mpa/combined/combined_phylum_mpa.txt > braken/phylum/mpa/combined/bracken_abundance_phylum_mpa.txt

#Cleaning up sample names
sed -i -e 's/_report_bracken.txt//g' braken/species/mpa/combined/bracken_abundance_species_mpa.txt
sed -i -e 's/_report_bracken.txt//g' braken/genus/mpa/combined/bracken_abundance_genus_mpa.txt
sed -i -e 's/_report_bracken.txt//g' braken/phylum/mpa/combined/bracken_abundance_phylum_mpa.txt

#Cleaning up top-level folders
cd ..
mkdir bracken_abundance_files
cp k2_reports/braken/species/mpa/combined/bracken_abundance_species_mpa.txt bracken_abundance_files/.
cp k2_reports/braken/genus/mpa/combined/bracken_abundance_genus_mpa.txt bracken_abundance_files/.
cp k2_reports/braken/phylum/mpa/combined/bracken_abundance_phylum_mpa.txt bracken_abundance_files/.

The main folder of interest will be the bracken_abundance_files folder and it will contain tab delimited files of the relative abundance of each profiled taxa for each sample.

Functional Profiling using HUMAnN2

HUMAnN 2.0 is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data. This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. More generally, functional profiling answers the question “What are the microbes in my community-of-interest doing (or capable of doing)?” (description from developer’s web page)

There is a lot going on under the hood here, but basically the program is mapping reads to functional gene family (e.g., UniRef 90) and metabolic (e.g., metacyc) databases. I recommend taking a look at the tutorial and paper linked to above for those interested in more detail.

HUMAnN2 does not run on paired-end reads by default, so the script below first concatenates the F and R reads for each sample then conducts the profiling. This program takes a while to run on large samples given multiple mapping steps need to occur.

#BSUB -W 12:00
#BSUB -M 64000
#BSUB -n 16
#BSUB -e /data/MMAC-Shared-Drive/scratch/%J.err
#BSUB -o /data/MMAC-Shared-Drive/scratch/%J.out

#Script for functional profiling of paired-end reads using HUMANn2

#notes: - program assumes fastq files residing in single folder (typically will be kneaddata_out folder)
#       - modify if have fasta or other file name (e.g., .fna, .fq, etc.)
#         - program assumes reads have already been quality filtered and contaminants (human) removed
#       - script outputs: stratified and unstratifed pathways and gene family files in CPM
#       - HUMANn2 does not handle PE reads directly; so this script concatenates the F and R reads for each sample before running

#       - requesting 16 cores, 64G RAM, and 12 hr wall time; see:
#       - to run job: bsub < humann2_single_end.bat

#Navigate to folder containing fastq files: (CHANGE ME)
cd /your_path/kneaddata_out

#Load modules
module load anaconda3
source activate metaphlan2

#Concatenate files
mkdir output_cat

for name in *_kneaddata_paired_1.fastq; do
    cat "$name" "$other" > output_cat/"$name"

cd output_cat

for file in *; do mv "${file}" "${file/_R1_kneaddata_paired_1/}"; done

#Run HUMAnN2
for i in *.fastq
  humann2 --input $i \
    --output hmn2_output \
    --metaphlan /data/MMAC-Shared-Drive/ref_databases/humann2/metaphlan2 \
    --nucleotide-database /data/MMAC-Shared-Drive/ref_databases/humann2/chocophlan \
    --protein-database /data/MMAC-Shared-Drive/ref_databases/humann2/uniref \
    --threads 16

#Join all gene family and pathway abudance files
humann2_join_tables --input hmn2_output/ --file_name pathabundance --output humann2_pathabundance.tsv
humann2_join_tables --input hmn2_output/ --file_name genefamilies --output humann2_genefamilies.tsv

#Normalizing RPKs to CPM
humann2_renorm_table --input humann2_pathabundance.tsv --units cpm --output humann2_pathabundance_cpm.tsv
humann2_renorm_table --input humann2_genefamilies.tsv --units cpm --output humann2_genefamilies_cpm.tsv

#Generate stratified tables
humann2_split_stratified_table --input humann2_pathabundance_cpm.tsv --output ./
humann2_split_stratified_table --input humann2_genefamilies_cpm.tsv --output ./

#Cleaning up file structure
mkdir hmn2_pathway_abundance_files
mkdir hmn2_genefamily_abundance_files

mv *pathabundance* hmn2_pathway_abundance_files/.
mv *genefamilies* hmn2_genefamily_abundance_files/.

The output of this program will be stratified (i.e., provides estimated species contribution) and unstratified gene family and pathway abundance files.

And that is it! Once these scripts are run the taxonomic and functional profiling is complete. The next step for us is typically to merge the microbiome data with the relevant metadata files using the phyloseq package in R to facilitate downstream statistical analyses (or to read these files into QIIME2, etc.).

Please reach out to us if you have any questions and we will have a MMAC member respond.

Bonus: Generating the Kraken2 and Bracken Databases

For those interested in how the databases were generated the code is provided below. We were not able build the default database due to unresolved issues described here: The workaround provided by Robyn Wright here was extremely helpful and resolved our issue.

#BSUB -W 48:00
#BSUB -M 120000
#BSUB -n 32
#BSUB -e /scratch/olljt2/%J.err
#BSUB -o /scratch/olljt2/%J.out

#Building kraken2 database
#notes: - built from all complete refseq genomes for bacteria, archaea, virus, fungi, and the GRCh38.p13 human reference
#       - data pulled down 7/25/2020
#       - kraken2-build --download-library command failing/erroring on cluster
#       - used script provided @ as work around
#       - github issue described @
#       - no murine reference included
#       - refseq fungal genomes limited

#       - requesting 32 cores, 120G RAM, and 48 hr wall time
#       - to run job just navigate to file and run: bsub < kraken_refseq_db_20200725.bat
#       - outputting err and out to scratch space

#Load modules
module load kraken/2.0.8
module load blast/2.6.0
module load python3/3.6.3

#Make database folder for database
mkdir /your_path/db_refseq_20200725     #change me (path and date)

#Download taxonomy files
kraken2-build --download-taxonomy --db /your_path/db_refseq_20200725/ --threads 32 --use-ftp

#Download genomes
cp /MMAC-Shared-Drive/ref_databases/kraken/python_scripts/ /your_path/db_refseq_20200725/.
cd /your_path/db_refseq_20200725

python --domain archaea --complete True --ext dna
python --domain bacteria --complete True --ext dna
python --domain viral --complete True --ext dna
python --domain fungi --complete True --ext dna
python --domain vertebrate_mammalian --complete True --ext dna --human True

#Add genomes to library
find /your_path/db_refseq_20200725/archaea/ -name '*.fna' -print0 | xargs -P32 -0 -I{} -n1 kraken2-build --add-to-library {} --db /your_path/db_refseq_20200725
find /your_path/db_refseq_20200725/bacteria/ -name '*.fna' -print0 | xargs -P32 -0 -I{} -n1 kraken2-build --add-to-library {} --db /your_path/db_refseq_20200725
find /your_path/db_refseq_20200725/viral/ -name '*.fna' -print0 | xargs -P32 -0 -I{} -n1 kraken2-build --add-to-library {} --db /your_path/db_refseq_20200725
find /your_path/db_refseq_20200725/fungi/ -name '*.fna' -print0 | xargs -P32 -0 -I{} -n1 kraken2-build --add-to-library {} --db /your_path/db_refseq_20200725
find /your_path/db_refseq_20200725/vertebrate_mammalian/ -name '*.fna' -print0 | xargs -P32 -0 -I{} -n1 kraken2-build --add-to-library {} --db /your_path/db_refseq_20200725

#Build database
kraken2-build --build --db /your_path/db_refseq_20200725/. --threads 32
kraken2-build --clean --db /your_path/db_refseq_20200725/.

#Clean up files
mkdir logs
mv *.txt logs/.

mkdir refseqs
mv archaea refseqs/.
mv bacteria refseqs/.
mv viral refseqs/.
mv fungi refseqs/.
mv vertebrate_mammalian refseqs/.
tar -czvf refseqs.tar.gz refseqs
rm -r refseqs

#Build braken database
module load bracken/2.5.0
bracken-build -d /your_path/db_refseq_20200725 -t 32 -k 35 -l 90;

#notes: - if need for alternative bp lengths repeat build specifying the same database but different read lengths.
#       - script will skip any step already complete
#       -  per J Lu @;#
#        - 90bp selected as expect to discard reads < 90 bp during QC (min expected read length)

The script is provided below for completeness.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
Created on Wed Jul  8 10:55:58 2020

@author: robynwright

Modified from the scripts at

This requires installation of Biopython: conda install -c conda-forge biopython
and also pandas: conda install pandas

Note that this seems to struggle with finding paths if they have any spaces. If this is the case, you will get something like the following error:
    FileNotFoundError: [Errno 2] No such file or directory: '/path/to/directory with spaces/archaea'
To fix, just make the directory already. In this case, I would need to make the archaea directory inside the one the script is running from.

The script will also create a log file (.txt) that you can search to check whether there were any sequences that didn't download/unzip/get the taxids changes to kraken ones.


import os
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import pandas as pd
import argparse

parser = argparse.ArgumentParser(description='This script is to download all sequences from a certain domain that are present in NCBI refseq (\nThis requires the additional packages pandas and Biopython\nAs long as the script is able to download the assembly summary file, then it will create a log_file that tells you about whether each sequence was downloaded or not')
parser.add_argument('--domain', dest='domain', default='bacteria',
                    help='pick which domain to download, deafult is bacteria. Other options include: archaea, fungi, protozoa, viral, vertebrate_mammalian (or any others in NCBI refseq)')
parser.add_argument('--human', dest='human', default=False, 
                    help="add true here to download only human samples, default is False. Note that this only works if the domain is set to vertebrate_mammalian")
parser.add_argument('--ext', dest='ext', default='DNA', 
                    help="choose whether to download DNA or protein sequences, default is DNA. Valid options are DNA, dna, protein or Protein")
parser.add_argument('--complete', dest='complete', default=False, 
                    help="choose whether to only download complete genomes, or all genomes. Default is False, meaning all genomes are downloaded")

args = parser.parse_args()

wd = os.getcwd()+"/"
domain = args.domain
human = args.human
if args.ext == 'DNA' or args.ext == 'dna':
    ext1, ext2 = '_genomic', '.fna'
elif args.ext == 'Protein' or args.ext == 'protein':
    ext1, ext2 = '_protein', '.faa'
    print('Unknown extension given. Valid options are DNA, dna, Protein or protein')
complete = args.complete

#Create a directory
if not os.path.isdir(wd+domain):
  os.system("mkdir "+wd+domain)

#Get the assembly file
  if not os.path.exists("assembly_summary.txt"):
    os.system("wget -q"+domain+"/assembly_summary.txt")
  print("Unable to download assembly_summary.txt for "+domain)

#Parse the file
  f = pd.read_csv("assembly_summary.txt", header=1, index_col=0, sep="\t")
  print("Unable to open assembly_summary.txt for "+domain)

#Parse each of the lines in the file, getting the sequences and changing the IDs of all sequences to match kraken tax ids
#Output what happens for each line of the file - whether it works, and also which point it fails at
log_file = []
for l in list(f.index.values):
  line = f.loc[l, :]
  if human:
      if line["organism_name"] != "Homo sapiens":
  if complete:
      if not human:
          if not line['assembly_level'] == 'Complete Genome':
  ftppath = line['ftp_path']
  aname = ftppath.split('/')[-1]
  if os.path.exists(aname+ext1+".tax."+ext2):
    log_file.append("Already got this sequence: "+aname+ext1+".tax."+ext2+"\n")
    print("Already had this sequence: "+aname+ext1+".tax."+ext2)
      fullpath = ftppath+"/"+aname+ext1+ext2+".gz"
      if not os.path.exists(aname+ext1+ext2):
            os.system("wget -q "+fullpath)
            log_file.append("Didn't download "+aname+ext1+ext2+"\n")
            print("Didn't download "+aname)
            os.system("gunzip "+aname+ext1+ext2+".gz")
            log_file.append("Didn't unzip "+aname+ext1+ext2+".gz\n")
            print("Didn't unzip "+aname+ext1+ext2+".gz")
          log_file.append("Already had the unzipped file: "+aname+ext1+ext2+".gz\n")
          print("Already had the unzipped file: "+aname+ext1+ext2+".gz")
          taxid = line['taxid']  
          new_records = []
          records = SeqIO.parse(aname+ext1+ext2, "fasta")
          for record in records:
              newid ="|kraken:taxid|"+str(taxid)
              newseq = SeqRecord(record.seq, id=newid, description=record.description)
          SeqIO.write(new_records, aname+ext1+".tax"+ext2, "fasta")
          os.system("rm "+aname+ext1+ext2)
          log_file.append("Got this sequence with kraken taxid: "+aname+ext1+".tax"+ext2+"\n")
          print("Got this sequence with kraken taxid: "+aname+ext1+".tax"+ext2)
          log_file.append("Didn't manage to change the taxids for this file: "+aname+ext1+ext2+"\n")
          print("Didn't manage to change the taxids for this file: "+aname+ext1+ext2)
with open("logfile_"+domain+".txt", 'w') as f:
    for row in log_file:
Associate Professor of Pediatrics