Applying Topic Models to Microbiome Data in R


Inherent limitations with one-at-a-time (OaaT) feature testing (i.e., single feature differential abundance analysis) have contributed to the increasing popularity of mixture models for correlating microbial features with factors of interest (i.e., phenotypes, environments, clinical measures, etc.). Latent Dirichlet allocation (LDA), popularized in the area of natural language processing (NLP) for topic modeling, seems to be among the most popular.

The application of LDA modeling to microbiome data is described in an excellent paper by Kris Sankaran and Susan Holmes, “Latent variable modeling for the microbiome”. A key advantage of LDA they highlight is that, opposed to Dirichlet multinomial mixture modeling, documents are allowed to have fractional membership across a set of topics. Thus, when applied to microbiome studies, samples are not exclusively classified to a given community type, but rather allowed to have fractional membership based on the probability of assignment.

To make the connection with text analyses clear (detailed description in the linked Sankaran and Holmes paper above) a:

  • document = single microbiome sample
  • term = single feature of the microbial samples. These are commonly taxa (i.e., species), genes, or ASVs/OTUs
  • topic = community type (latent factor representing a community of features)

So at a high-level, the first goal of an LDA analysis applied to microbiome data is to derive topic-document probabilities and to convert them to observed read counts. Then one can apply existing tools developed for OaaT feature testing to identity differently abundant topics that (hopefully) capture associations for collections of microbes reflecting a given topic/community type.

The purpose of this post is not to get into the weeds of LDA (there are many good ones already out there), but rather to provide a high-level introduction as to to how one might implement such an approach. As new tools come online for this purpose, we expect to continue to update our workflows (for example utilizing alto for topic alignment, etc.). So check back with us for details.

In this example, we attempt to identify topics/community types associated with colorectal cancer (CRC) status using a popular, publicly available dataset contained in the curatedMetagenomicData package.


Loading libraries

First, we load the libraries we will require for the analysis.

library(curatedMetagenomicData); packageVersion("curatedMetagenomicData")
## [1] '3.0.10'
library(tidyverse); packageVersion("tidyverse")                                                 
## [1] '1.3.1'
library(phyloseq); packageVersion("phyloseq")  
## [1] '1.36.0'
library(LinDA); packageVersion("LinDA")  
## [1] '0.1.0'
library(topicmodels); packageVersion("topicmodels")     
## [1] '0.2.12'
library(tidytext); packageVersion("tidytext")            
## [1] '0.3.3'
library(ldatuning); packageVersion("ldatuning")   
## [1] '1.0.2'
library(cowplot); packageVersion("cowplot")  
## [1] '1.1.1'


Downloading Yachida et. al. data

Then we grab an example data set. Here I use the YachidaS_2019 data provided in the curatedMetagenomicData package. For those of you who have read through some of my other posts, you can see that I use this dataset often. This is not for any particular reason other than it has a large number of samples and I now have some familiarity with it.

The code below pulls down the dataset and converts it into a phyloseq object for analysis. Then I filter out a few samples with lower sequencing depths and less common species.

meta_df <- curatedMetagenomicData::sampleMetadata

mydata <- meta_df %>%
  dplyr::filter(study_name == "YachidaS_2019") %>%
  dplyr::select(sample_id, subject_id, body_site, disease, age, gender, BMI) %>%
  dplyr::mutate(case_status = ifelse(disease == "CRC", "CRC", 
                                     ifelse(disease == "healthy", "Control", "Other"))) %>%
  dplyr::filter(case_status != "Other") %>%
  na.omit(.)

keep_id <- mydata$sample_id

crc_se_sp <- sampleMetadata %>%
  dplyr::filter(sample_id %in% keep_id) %>%
  curatedMetagenomicData::returnSamples("relative_abundance", counts = TRUE)

crc_sp_df <- as.data.frame(as.matrix(assay(crc_se_sp)))

crc_sp_df <- crc_sp_df %>%
  rownames_to_column(var = "Species")

otu_tab <- crc_sp_df %>%
  tidyr::separate(Species, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::select(-Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
  dplyr::mutate(Species = gsub("s__", "", Species)) %>%
  tibble::column_to_rownames(var = "Species")

tax_tab <- crc_sp_df %>%
  dplyr::select(Species) %>%
  tidyr::separate(Species, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::mutate(Kingdom = gsub("k__", "", Kingdom),
                Phylum = gsub("p__", "", Phylum),
                Class = gsub("c__", "", Class),
                Order = gsub("o__", "", Order),
                Family = gsub("f__", "", Family),
                Genus = gsub("g__", "", Genus),
                Species = gsub("s__", "", Species)) %>%
  dplyr::mutate(spec_row = Species) %>%
  tibble::column_to_rownames(var = "spec_row")

rownames(mydata) <- NULL
mydata <- mydata %>%
  tibble::column_to_rownames(var = "sample_id")

(ps <- phyloseq(sample_data(mydata),
                otu_table(otu_tab, taxa_are_rows = TRUE),
                tax_table(as.matrix(tax_tab))))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 717 taxa and 503 samples ]
## sample_data() Sample Data:       [ 503 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 717 taxa by 7 taxonomic ranks ]
sample_data(ps)$read_count <- sample_sums(ps)
ps <- subset_samples(ps, read_count > 10^7)

minTotRelAbun <- 1e-5           
x <- taxa_sums(ps)
keepTaxa <- (x / sum(x)) > minTotRelAbun
(ps <- prune_taxa(keepTaxa, ps))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 392 taxa and 500 samples ]
## sample_data() Sample Data:       [ 500 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 392 taxa by 7 taxonomic ranks ]
head(sample_data(ps))
##              subject_id body_site disease age gender      BMI case_status
## SAMD00115014  sub_10021     stool     CRC  57   male 26.88095         CRC
## SAMD00115015  sub_10023     stool healthy  65   male 26.56250     Control
## SAMD00115025  sub_10025     stool healthy  40   male 25.00000     Control
## SAMD00115027  sub_10029     stool healthy  67 female 20.17325     Control
## SAMD00114971  sub_10031     stool healthy  77   male 24.46460     Control
## SAMD00114972  sub_10033     stool     CRC  54 female 24.74745         CRC
##              read_count
## SAMD00115014   51563815
## SAMD00115015   53302208
## SAMD00115025   44350874
## SAMD00115027   51661218
## SAMD00114971   41370786
## SAMD00114972   40155877
table(sample_data(ps)$case_status)
## 
## Control     CRC 
##     243     257
rm(list= ls()[!(ls() %in% c("ps"))])


As you can see, we end up with a total of 392 species and 500 samples (n=243 controls and n=257 with CRC). This reflects a decent sized microbiome study by current standards.


Aggregating reads to genera

For this example, I reduce the number of taxa by agglomerating reads to genera to speed up the run time. However, this approach could be applied at any taxonomic level of interest (you just might spin out a larger number of topics when modeling species or genes).

(ps_g <- tax_glom(ps, taxrank = "Genus"))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 131 taxa and 500 samples ]
## sample_data() Sample Data:       [ 500 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 131 taxa by 7 taxonomic ranks ]
taxa_names(ps_g) <- tax_table(ps_g)[, 6]
head(taxa_names(ps_g))
## [1] "Faecalibacterium" "Eubacterium"      "Parabacteroides"  "Blautia"         
## [5] "Agathobaculum"    "Bifidobacterium"


Selecting the number of topics

When fitting the LDA model we must specify the number of latent topics to model. There are various ways to go about this, but here we use the FindTopicsNumber function in the the ldatuning package (and two of the included metrics) to try to select an ideal number of features. More details can be found via the package and referenced papers. For these metrics a lower value suggests an optimal topic structure.

We also use the VEM method for speed. In practice, you may want to consider Gibbs sampling.

I modify the plotting function below to allow it to play well with cowplot.

count_matrix <- data.frame(t(data.frame(otu_table(ps_g))))

result <- FindTopicsNumber(
  count_matrix,
  topics = seq(from = 3, to = 50, by = 3),
  metrics = c("CaoJuan2009", "Arun2010"),
  method = "VEM",
  control = list(seed = 243),
  mc.cores = 4,
  verbose = TRUE
)
## fit models... done.
## calculate metrics:
##   CaoJuan2009... done.
##   Arun2010... done.
my_plot <- function (values)  #updating to allow plotting via cowplot::plot_grid
{
  if ("LDA_model" %in% names(values)) {
    values <- values[!names(values) %in% c("LDA_model")]
  }
  columns <- base::subset(values, select = 2:ncol(values))
  values <- base::data.frame(values["topics"], base::apply(columns, 
                                                           2, function(column) {
                                                             scales::rescale(column, to = c(0, 1), from = range(column))
                                                           }))
  values <- reshape2::melt(values, id.vars = "topics", na.rm = TRUE)
  values$group <- values$variable %in% c("Griffiths2004", "Deveaud2014")
  values$group <- base::factor(values$group, levels = c(FALSE, 
                                                        TRUE), labels = c("Minimize", "Maximize"))
  p <- ggplot(values, aes_string(x = "topics", y = "value", 
                                 group = "variable"))
  p <- p + geom_line()
  p <- p + geom_point(aes_string(shape = "variable"), size = 3)
  p <- p + guides(size = FALSE, shape = guide_legend(title = "Metrics:"))
  p <- p + scale_x_continuous(breaks = values$topics)
  p <- p + labs(x = "\nNumber of Topics", y = NULL)
  p <- p + facet_grid(group ~ .)
  p <- p + theme_bw() %+replace% theme(panel.grid.major.y = element_blank(), 
                                       panel.grid.minor.y = element_blank(), panel.grid.major.x = element_line(colour = "grey70"), 
                                       panel.grid.minor.x = element_blank(), legend.key = element_blank(), 
                                       strip.text.y = element_text(angle = 90),
                                       legend.position = "bottom")
  return(p)
}

(p1 <- my_plot(result))


These metrics suggest that somewhere around 30 to 40 topics may be ideal. I go with 36 in this example.


Deriving topics via LDA

Now we use the LDA function in the topicmodels package to perform the LDA.

I use the tidytext package to extract the per-topic-per-word probabilities via the matrix = “beta” and the per-document-per-topic probabilities via the matrix = “gamma” arguments.

We can see that the per-document-per-topic probabilities shown below allow us to assign a probability for each topic to each sample. This is the fractional membership advantage described above.

lda_k36 <- LDA(count_matrix, k = 36, method = "VEM", control = list(seed = 243))

b_df <- data.frame(tidy(lda_k36, matrix = "beta"))

g_df <- data.frame(tidy(lda_k36, matrix = "gamma")) %>%
  arrange(document, topic)

head(b_df)
##   topic             term         beta
## 1     1 Faecalibacterium 3.599927e-03
## 2     2 Faecalibacterium 4.468011e-04
## 3     3 Faecalibacterium 1.795527e-02
## 4     4 Faecalibacterium 2.448136e-05
## 5     5 Faecalibacterium 1.273613e-05
## 6     6 Faecalibacterium 1.403159e-05
head(g_df)
##       document topic        gamma
## 1 SAMD00114718     1 2.094965e-02
## 2 SAMD00114718     2 3.613434e-06
## 3 SAMD00114718     3 8.211247e-10
## 4 SAMD00114718     4 8.211251e-10
## 5 SAMD00114718     5 8.211255e-10
## 6 SAMD00114718     6 1.676575e-02


Building the topic model phyloseq object

Now we just need to multiply the per-document-per-topic probabilities by the read count for each sample.

The code below does this and stores the data as a phyloseq object.

lib_size_df <- data.frame(sample_sums(ps_g)) %>%
  dplyr::rename("read_count" = "sample_sums.ps_g.") %>%
  rownames_to_column(var = "document")

tm_df <- left_join(lib_size_df, g_df) %>%
  mutate(topic_count = read_count * gamma,
         topic_count = round(topic_count, 0)) %>%
  dplyr::select(-read_count, -gamma) %>%
  pivot_wider(names_from = topic, values_from = topic_count) %>%
  dplyr::rename_with(~ paste0("Topic_", .), -document) %>%
  column_to_rownames(var = "document") %>%
  t(.) %>%
  data.frame(.)

(ps_topic_g <- phyloseq(
  sample_data(ps_g),
  otu_table(tm_df, taxa_are_rows = TRUE)))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 36 taxa and 500 samples ]
## sample_data() Sample Data:       [ 500 samples by 8 sample variables ]


Fitting a DA model to the topics

Now that we have the data as read counts, we can use existing tools designed for differential abundance analysis to test for differentially abundant topics.

Here I use LinDA since it is fast, accommodates covariates, and has shown good performance when benchmarked against other methods.

FYI - I am using a slightly older version here that I load via the LinDA package. The function can now be found in the MicrobiomStat package. I include some prevalence filtering, trimming of extreme values, and covariate adjustment to highlight these options.

linda <- linda(otu.tab = data.frame(otu_table(ps_topic_g)), meta = data.frame(sample_data(ps_topic_g)), 
               formula = '~ case_status + age + gender + BMI', alpha = 0.05, winsor.quan = 0.97,
               prev.cut = 0.3, lib.cut = 0, n.cores = 1)
## Pseudo-count approach is used.
linda_res_df <- data.frame(linda$output$case_statusCRC) %>%
  dplyr::arrange(padj) %>%
  rownames_to_column(var = "Topic")

fdr_linda <- linda_res_df %>%
  mutate(reject = ifelse(padj < 0.05, "Yes", "No")) %>%
  separate(Topic, into = c("t", "t_n"), remove = FALSE, convert = TRUE) %>%
  mutate(Topic = gsub("_", " ", Topic)) %>%
  arrange(desc(t_n)) %>%
  mutate(Topic = factor(Topic, levels = Topic))

(p2 <- ggplot(data = fdr_linda, aes(x = Topic, y = log2FoldChange, fill = reject)) +
    geom_col(alpha = .8) +
    labs(y = "\nLog2 Fold-Change", x = "") +
    theme(axis.text.x = element_text(color = "black", size = 8),
          axis.text.y = element_text(color = "black", size = 8),
          axis.title.y = element_text(size = 8),
          axis.title.x = element_text(size = 10),
          legend.text = element_text(size = 8),
          legend.title = element_text(size = 8)) +
    coord_flip() +
    scale_fill_manual(values = c("grey", "dodgerblue")) +
    geom_hline(yintercept = 0, linetype="dotted") +
    theme(legend.position = "none"))


From the plot we can see that we have three topics that have positive log2 fold changes and low FDR p-values (< 0.05) for samples obtained from CRC patients when compared to controls. We have two that have negative log2 fold values and low FDR p-values highlighting enrichment of these topics in control samples.

In the next two sections, we will follow-up a bit on topic five.


Plotting per-topic-word-probabilities

Selecting topic five as an example, we can use the information in the per-topic-per-word probabilities matrix to examine which genera have the highest probabilities of assignment to this topic/community type.

Below we can see that Porphyromonas has the highest probability of assignment followed by several others with lower values.

b_plot <- b_df %>%
  dplyr::filter(topic == 5) %>%
  arrange(desc(beta)) %>%
  slice_head(n = 20) %>%
  arrange(desc(term)) %>%
  mutate(term = factor(term, levels = term))

(p3 <- ggplot(data = b_plot, aes(x = beta, y = term, color = term)) +
    geom_point(aes(size = beta)) +
    labs(x = "\nTopic 5: Genus-Topic Probabilities (beta)", y = "") +
    theme(legend.position = "none"))


Plotting associations for the top contributing genera

We can also fit the LinDA model to each genera individually to get some idea of their association with CRC status when we do not directly account for the other features (for better or worse).

Here we see that nearly all of these genera are enriched in CRC samples (these plots could benefit from some focus on the variation as well). Thus, it makes sense that the community type would also have a strong, positive association as well.

linda_g_all <- linda(otu.tab = data.frame(otu_table(ps_g)), meta = data.frame(sample_data(ps_g)), 
                      formula = '~ case_status + age + gender + BMI', alpha = 0.05, winsor.quan = 0.97,
                      prev.cut = 0, lib.cut = 0, n.cores = 1)
## Imputation approach is used.
linda_res_g_df <- data.frame(linda_g_all$output$case_statusCRC) %>%
  dplyr::arrange(padj) %>%
  rownames_to_column(var = "Genus")

keep_g <- b_plot$term

log2_df <- linda_res_g_df %>%
  filter(Genus %in% keep_g) %>%
  arrange(desc(Genus)) %>%
  mutate(Genus = factor(Genus, levels = Genus))

(p4 <- ggplot(data = log2_df, aes(x = Genus, y = log2FoldChange, fill = Genus)) +
    geom_col(alpha = .8) +
    labs(y = "\nLog2 Fold-Change", x = "") +
    theme(axis.text.x = element_text(color = "black", size = 8),
          axis.text.y = element_text(color = "black", size = 8),
          axis.title.y = element_text(size = 8),
          axis.title.x = element_text(size = 10),
          legend.text = element_text(size = 8),
          legend.title = element_text(size = 8)) +
    coord_flip() +
    geom_hline(yintercept = 0, linetype="dotted") +
    theme(legend.position = "none"))


Example combined LDA figure

In this block I include some code to show how you might put all of this together in one figure.

cowplot::plot_grid(p1, p2, p3, p4, ncol = 2, nrow = 2, scale = .9, labels = c("A", "B", "C", "D")) +
  theme(plot.background = element_rect(fill = "white"))


Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cowplot_1.1.1                  ldatuning_1.0.2               
##  [3] tidytext_0.3.3                 topicmodels_0.2-12            
##  [5] LinDA_0.1.0                    phyloseq_1.36.0               
##  [7] forcats_0.5.1                  stringr_1.4.0                 
##  [9] dplyr_1.0.7                    purrr_0.3.4                   
## [11] readr_2.0.2                    tidyr_1.1.4                   
## [13] tibble_3.1.5                   ggplot2_3.3.5                 
## [15] tidyverse_1.3.1                curatedMetagenomicData_3.0.10 
## [17] TreeSummarizedExperiment_2.0.3 Biostrings_2.60.2             
## [19] XVector_0.32.0                 SingleCellExperiment_1.14.1   
## [21] SummarizedExperiment_1.22.0    Biobase_2.52.0                
## [23] GenomicRanges_1.44.0           GenomeInfoDb_1.28.4           
## [25] IRanges_2.26.0                 S4Vectors_0.30.2              
## [27] BiocGenerics_0.38.0            MatrixGenerics_1.4.3          
## [29] matrixStats_0.61.0            
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                    tidyselect_1.1.1             
##   [3] lme4_1.1-27.1                 RSQLite_2.2.8                
##   [5] AnnotationDbi_1.54.1          grid_4.1.1                   
##   [7] BiocParallel_1.26.2           modeest_2.4.0                
##   [9] munsell_0.5.0                 codetools_0.2-18             
##  [11] withr_2.4.2                   colorspace_2.0-2             
##  [13] filelock_1.0.2                NLP_0.2-1                    
##  [15] highr_0.9                     knitr_1.36                   
##  [17] rstudioapi_0.13               labeling_0.4.2               
##  [19] slam_0.1-48                   GenomeInfoDbData_1.2.6       
##  [21] bit64_4.0.5                   farver_2.1.0                 
##  [23] rhdf5_2.36.0                  fBasics_3042.89.1            
##  [25] vctrs_0.3.8                   treeio_1.16.2                
##  [27] generics_0.1.1                xfun_0.29                    
##  [29] BiocFileCache_2.0.0           R6_2.5.1                     
##  [31] clue_0.3-60                   bitops_1.0-7                 
##  [33] rhdf5filters_1.4.0            cachem_1.0.6                 
##  [35] DelayedArray_0.18.0           assertthat_0.2.1             
##  [37] promises_1.2.0.1              scales_1.1.1                 
##  [39] gtable_0.3.0                  spatial_7.3-14               
##  [41] timeDate_3043.102             rlang_0.4.12                 
##  [43] splines_4.1.1                 lazyeval_0.2.2               
##  [45] broom_0.7.9                   BiocManager_1.30.16          
##  [47] yaml_2.2.1                    reshape2_1.4.4               
##  [49] modelr_0.1.8                  backports_1.2.1              
##  [51] httpuv_1.6.3                  tokenizers_0.2.1             
##  [53] tools_4.1.1                   bookdown_0.24                
##  [55] ellipsis_0.3.2                jquerylib_0.1.4              
##  [57] biomformat_1.20.0             stabledist_0.7-1             
##  [59] Rcpp_1.0.7                    plyr_1.8.6                   
##  [61] zlibbioc_1.38.0               RCurl_1.98-1.5               
##  [63] rpart_4.1-15                  statip_0.2.3                 
##  [65] haven_2.4.3                   ggrepel_0.9.1                
##  [67] cluster_2.1.2                 fs_1.5.0                     
##  [69] magrittr_2.0.1                data.table_1.14.2            
##  [71] timeSeries_3062.100           blogdown_1.7                 
##  [73] lmerTest_3.1-3                reprex_2.0.1                 
##  [75] SnowballC_0.7.0               hms_1.1.1                    
##  [77] mime_0.12                     evaluate_0.14                
##  [79] xtable_1.8-4                  readxl_1.3.1                 
##  [81] compiler_4.1.1                crayon_1.4.1                 
##  [83] minqa_1.2.4                   htmltools_0.5.2              
##  [85] mgcv_1.8-36                   later_1.3.0                  
##  [87] tzdb_0.1.2                    lubridate_1.8.0              
##  [89] DBI_1.1.1                     ExperimentHub_2.0.0          
##  [91] rmutil_1.1.5                  dbplyr_2.1.1                 
##  [93] MASS_7.3-54                   rappdirs_0.3.3               
##  [95] boot_1.3-28                   Matrix_1.3-4                 
##  [97] ade4_1.7-18                   permute_0.9-5                
##  [99] cli_3.1.0                     igraph_1.2.7                 
## [101] pkgconfig_2.0.3               numDeriv_2016.8-1.1          
## [103] xml2_1.3.2                    foreach_1.5.1                
## [105] bslib_0.3.1                   multtest_2.48.0              
## [107] rvest_1.0.2                   janeaustenr_0.1.5            
## [109] digest_0.6.28                 vegan_2.5-7                  
## [111] tm_0.7-8                      rmarkdown_2.11               
## [113] cellranger_1.1.0              tidytree_0.3.5               
## [115] curl_4.3.2                    shiny_1.7.1                  
## [117] modeltools_0.2-23             nloptr_1.2.2.2               
## [119] lifecycle_1.0.1               nlme_3.1-152                 
## [121] jsonlite_1.7.2                Rhdf5lib_1.14.2              
## [123] fansi_0.5.0                   pillar_1.6.4                 
## [125] lattice_0.20-44               KEGGREST_1.32.0              
## [127] fastmap_1.1.0                 httr_1.4.2                   
## [129] survival_3.2-11               interactiveDisplayBase_1.30.0
## [131] glue_1.4.2                    png_0.1-7                    
## [133] iterators_1.0.13              BiocVersion_3.13.1           
## [135] bit_4.0.4                     stringi_1.7.5                
## [137] sass_0.4.0                    blob_1.2.2                   
## [139] stable_1.1.4                  AnnotationHub_3.0.2          
## [141] memoise_2.0.0                 ape_5.5
Associate Professor of Pediatrics

Related