Multiple Imputation with Microbiome Data


The motivation for this post was a recent twitter exchange discussing difficulties faced when analyzing microbiome data with missing covariate values (i.e., incomplete sample metadata fields). I have run into this issue as well, and given the increasing size of microbiome studies, expect so to have many others. Even if one were to assume that the missing values were missing completely at random (MCAR), which is often a strong assumption, someone went through a lot of effort (and cost) to obtain the samples. Thus, excluding sequenced samples from an analysis due to missing values on one or more covariates is painful and has the potential to lead to bias and reduced precision.

Here I provide some example code to perform multiple imputation (MI) with predictive mean matching (PMM) using Stef van Buuren’s MICE package when performing a differential abundance analysis (i.e., one-at-a-time feature testing). For those less familiar with MI, I recommend first reading through Stef van Buuren’s textbook (free online) Flexible Imputation of Missing Data. In general, the approach is to [modified from text]:

  • Start with the observed, incomplete data
  • Impute several complete versions of the data by replacing the missing values with plausible data values drawn from a distribution specifically modeled for each missing entry
  • Estimate the parameters of interest from each of the imputed datasets
  • Pool the parameter estimates into one estimate properly accounting for the multiple sources of variability

A common recommendation for the variables to include in an imputation model are those variables included in the analysis model (including the exposure and outcome of interest), as well as any auxiliary variables thought to extend the MAR assumption. Often these auxiliary variables are those with a strong correlation with incomplete variables. There are also many ways one can develop an imputation model and then carry out the imputation of missing values. More information on these approaches and decisions can be found in Parts I, II, and III of van Buuren’s text.

In this example, I use centered log-ratio (CLR) regression to obtain estimates of species differential relative abundance (log2 scale) between samples collected from patients with colorectal cancer versus healthy controls. I first induce some missing values on three covariates: age, gender, and BMI, and then fit the default imputation model in MICE for each species. The exposure (case_status), outcome (species), and covariates (age, gender, BMI) are included in each imputation model.


Load libraries

First, I load all required libraries.

library(tidyverse); packageVersion("tidyverse")                             
## [1] '1.3.1'
library(phyloseq); packageVersion("phyloseq")                               
## [1] '1.36.0'
library(microbiome); packageVersion("microbiome")                           
## [1] '1.14.0'
library(mice); packageVersion("mice")                                       
## [1] '3.14.0'
library(broom); packageVersion("broom")                                     
## [1] '1.0.1'
library(Hmisc); packageVersion("Hmisc")                                     
## [1] '4.6.0'
library(furrr); packageVersion("furrr")                                     
## [1] '0.2.3'
library(curatedMetagenomicData); packageVersion("curatedMetagenomicData")
## [1] '3.0.10'


Grab Yachida data

Then I download the Yachida data using the curatedMetagenomicData package and convert it to a phyloseq object.

meta_df <- curatedMetagenomicData::sampleMetadata

mydata <- meta_df %>%
  dplyr::filter(study_name == "YachidaS_2019") %>%
  dplyr::select(sample_id, subject_id, body_site, disease, age, gender, BMI) %>%
  dplyr::mutate(case_status = ifelse(disease == "CRC", "CRC", 
                                     ifelse(disease == "healthy", "Control", "Other"))) %>%
  dplyr::filter(case_status != "Other")

keep_id <- mydata$sample_id

crc_se_sp <- sampleMetadata %>%
  dplyr::filter(sample_id %in% keep_id) %>%
  curatedMetagenomicData::returnSamples("relative_abundance", counts = TRUE)

crc_sp_df <- as.data.frame(as.matrix(assay(crc_se_sp)))

crc_sp_df <- crc_sp_df %>%
  rownames_to_column(var = "Species")

otu_tab <- crc_sp_df %>%
  tidyr::separate(Species, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::select(-Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
  dplyr::mutate(Species = gsub("s__", "", Species)) %>%
  tibble::column_to_rownames(var = "Species")

tax_tab <- crc_sp_df %>%
  dplyr::select(Species) %>%
  tidyr::separate(Species, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::mutate(Kingdom = gsub("k__", "", Kingdom),
                Phylum = gsub("p__", "", Phylum),
                Class = gsub("c__", "", Class),
                Order = gsub("o__", "", Order),
                Family = gsub("f__", "", Family),
                Genus = gsub("g__", "", Genus),
                Species = gsub("s__", "", Species)) %>%
  dplyr::mutate(spec_row = Species) %>%
  tibble::column_to_rownames(var = "spec_row")

rownames(mydata) <- NULL
mydata <- mydata %>%
  tibble::column_to_rownames(var = "sample_id")

ps <- phyloseq(sample_data(mydata),
                otu_table(otu_tab, taxa_are_rows = TRUE),
                tax_table(as.matrix(tax_tab)))

sample_data(ps)$read_count <- sample_sums(ps)
ps <- subset_samples(ps, read_count > 10^7)

ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 717 taxa and 501 samples ]
## sample_data() Sample Data:       [ 501 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 717 taxa by 7 taxonomic ranks ]
rm(list= ls()[!(ls() %in% c("ps"))])

You can see that we end up with 717 species on 501 samples and 8 metadata fields.


CLR transform and filter low prevalence taxa

Now I transform the sequence read counts to the centered log-ratio scale using the microbiome package and filter out low prevalence species.

Here this results in an appreciable loss of species due to the the high-degree of sparseness. One might consider other filtering thresholds (or approaches), but I use this here to keep the number of species low to help speed up the computations.

ps_clr <- microbiome::transform(ps, 'clr')

ps_filt <- filter_taxa(ps, function(x) sum(x > 0) > (0.3*length(x)), TRUE)

keep_taxa <- taxa_names(ps_filt)
(ps_clr_filt <- prune_taxa(keep_taxa, ps_clr))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 107 taxa and 501 samples ]
## sample_data() Sample Data:       [ 501 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 107 taxa by 7 taxonomic ranks ]


Generate missing covariate data

Since there is very little missing data in these example data, here I generate some missing values for the available covariates. In other scripts (not shown here), I induced some conditional missing data patterns meeting the MAR assumption and was able to accurately reproduce the results obtained from the full dataset (as expected).

ps_clr_filt %>%
  sample_data(.) %>%
  data.frame(.) %>%
  dplyr::select(age, gender, BMI, case_status) %>%
  describe(.)
## . 
## 
##  4  Variables      501  Observations
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      501        0       52    0.999     61.9    12.41       41       46 
##      .25      .50      .75      .90      .95 
##       55       64       70       75       76 
## 
## lowest : 21 27 28 30 31, highest: 75 76 77 78 79
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##      501        0        2 
##                         
## Value      female   male
## Frequency     211    290
## Proportion  0.421  0.579
## --------------------------------------------------------------------------------
## BMI 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      500        1      434        1    22.86    3.432    18.18    19.15 
##      .25      .50      .75      .90      .95 
##    20.70    22.59    24.61    26.54    28.41 
## 
## lowest : 16.14153 16.42365 16.79688 16.88019 17.08744
## highest: 33.14000 35.20105 36.52301 37.55102 39.32834
## --------------------------------------------------------------------------------
## case_status 
##        n  missing distinct 
##      501        0        2 
##                           
## Value      Control     CRC
## Frequency      244     257
## Proportion   0.487   0.513
## --------------------------------------------------------------------------------
ps_to_df <- function(ps_obj){
  my_otus <- data.frame(t(data.frame(otu_table(ps_obj))))
  my_otus <- my_otus %>%
    rownames_to_column(var = "id")
  
  meta_df <- data.frame(sample_data(ps_obj)) %>%
    rownames_to_column(var = "id") %>%
    rename_all(tolower)
  
  out_df <- left_join(meta_df, my_otus)
  return(out_df)
}

clr_df <- ps_to_df(ps_clr_filt) %>%
  dplyr::select(-subject_id, -body_site, -disease, -read_count, -id) %>%
  mutate(id = row_number()) %>%
  dplyr::select(id, everything())
  
clr_miss_df <- clr_df %>%
  mutate(age = ifelse(id > 1 & id < 40, NA, age),
         gender = ifelse(id > 100 & id < 130, NA, gender),
         bmi = ifelse(id > 110 & id < 170, NA, bmi)) %>%
  mutate(case_status = factor(case_status),
         gender = factor(gender))

clr_miss_df %>%
  sample_data(.) %>%
  data.frame(.) %>%
  dplyr::select(age, gender, bmi, case_status) %>%
  describe(.)
## . 
## 
##  4  Variables      501  Observations
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      463       38       52    0.999    61.66    12.61       41       46 
##      .25      .50      .75      .90      .95 
##       54       64       70       75       76 
## 
## lowest : 21 27 28 30 31, highest: 75 76 77 78 79
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##      472       29        2 
##                         
## Value      female   male
## Frequency     196    276
## Proportion  0.415  0.585
## --------------------------------------------------------------------------------
## bmi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      441       60      389        1    22.92    3.491    18.20    19.15 
##      .25      .50      .75      .90      .95 
##    20.70    22.60    24.61    26.64    29.00 
## 
## lowest : 16.42365 16.79688 16.88019 17.08744 17.18750
## highest: 33.14000 35.20105 36.52301 37.55102 39.32834
## --------------------------------------------------------------------------------
## case_status 
##        n  missing distinct 
##      501        0        2 
##                           
## Value      Control     CRC
## Frequency      244     257
## Proportion   0.487   0.513
## --------------------------------------------------------------------------------
table(apply(is.na(clr_miss_df), 1, any))
## 
## FALSE  TRUE 
##   393   108


CLR regression with multiple imputation

Now I fit the CLR regressions using MI with PMM to impute the missing covariate values.

The steps are to:

  • Generate a long form dataset retaining only the variables needed for the modeling
  • Write a function to perform 30 imputations (How many imputations?) of the observed data using the default settings and PMM (I set the seed to be able to reproduce the results); fit a LM to each of the CLR transformed datasets; pool the results obtained from each dataset; and return the parameter estimate, SE, etc. for case_status
  • Nest each species in its own dataset/tibble and then iterate the function over each nested dataset
  • Obtain an FDR corrected p-value and format the output

I use the future_map function in furrr to run the computations in parallel. This approach could get computationally intensive for very large datasets with many features. However, one can spread them out over as many available cores as possible (i.e., run on HPC cluster if available).

Here I use 6 threads and the computations run in 1 minute on a MacBook Pro. So for most datasets, while the imputations may add some considerable time, I do not expect this approach will present an insurmountable computational bottleneck.

clr_miss_long_df <- clr_miss_df %>%
  dplyr::select(-id) %>%
  pivot_longer(cols = c(-age, -bmi, -case_status, -gender), 
               names_to = "species", values_to = "abundance") %>%
  arrange(species)

my_imp <- function(my_df){
  imp_obj <- mice(data = my_df, m = 30, method = "pmm", seed = 123, printFlag = FALSE)
  fit_imp <- with(imp_obj, lm(abundance ~ case_status + age + gender + bmi))
  imp_res <- summary(pool(fit_imp))
  imp_res <- data.frame(imp_res)
  my_out <- imp_res[2, ]
  return(my_out)
}

plan(multisession, workers = 6)

res_imp_df <- clr_miss_long_df %>%
  tidyr::nest(data = c(-species)) %>%
  dplyr::mutate(model = future_map(data, ~ my_imp(my_df = .))) %>%
  select(-data) %>%
  unnest(cols = c(model)) %>%
  arrange(p.value) %>%
  mutate(padj = p.adjust(p.value, method = "fdr"),
         log2FoldChange = log2(exp(estimate)),
         se = log2(exp(std.error))) %>%
  dplyr::select(species, term, log2FoldChange, se, statistic, p.value, padj)

head(res_imp_df, 10)
## # A tibble: 10 × 7
##    species                       term      log2F…¹    se stati…² p.value    padj
##    <chr>                         <fct>       <dbl> <dbl>   <dbl>   <dbl>   <dbl>
##  1 Roseburia_faecis              case_sta…   -2.61 0.612   -4.26 2.40e-5 0.00257
##  2 Collinsella_aerofaciens       case_sta…    1.90 0.528    3.60 3.55e-4 0.0190 
##  3 Roseburia_intestinalis        case_sta…   -1.97 0.578   -3.41 7.09e-4 0.0222 
##  4 Holdemania_filiformis         case_sta…    1.32 0.392    3.36 8.31e-4 0.0222 
##  5 Bacteroides_cellulosilyticus  case_sta…    1.68 0.530    3.18 1.57e-3 0.0335 
##  6 Eubacterium_eligens           case_sta…   -2.00 0.640   -3.13 1.88e-3 0.0335 
##  7 Streptococcus_anginosus_group case_sta…    1.09 0.357    3.05 2.38e-3 0.0364 
##  8 Streptococcus_salivarius      case_sta…   -1.17 0.395   -2.96 3.24e-3 0.0433 
##  9 Bacteroides_plebeius          case_sta…    2.15 0.738    2.91 3.72e-3 0.0443 
## 10 Anaerotruncus_colihominis     case_sta…    1.08 0.395    2.73 6.48e-3 0.0622 
## # … with abbreviated variable names ¹​log2FoldChange, ²​statistic


Summary

Multiple imputation may provide a rigorous strategy to deal with missing covariate data in microbiome studies. I used CLR regression to run a differential abundance analysis in this example, but the approach could easily be applied to other measures (i.e., Shannon diversity, etc.) or DA workflows. In the future, I hope to extend the code to use Massalin2 and/or LinDA as the regression engine, but this would likely entail some additional complexity and computational time. Investigation into the performance of generating a single set of imputed datasets to be used for all analyses, perhaps conditional on all relevant covariates and some type of summary measures of the microbiome (i.e., alpha-diversity and species community types or principal components, etc.) would be interesting, and may perform sufficiently well and require much less computational effort.

Inverse probability weighting in another approach that could be applied should the computational demand of MI create an issue; however, MI may have some advantages in terms of not requiring the missing data model to be developed only on those with fully observed data, typically can provide more precise estimates, and is easy to incorporate auxiliary variables extending the MAR assumption. Generative adversarial imputation nets (GAIN) or methods using non-parametric statistical learning approaches like random forests may also perform well.

Thus, MI should be considered in lieu of (or at least in addition to) conducting a complete case (CC) analysis (i.e., dropping any samples where a metadata value is missing). While imputing values for unobserved variables may at first seem questionable, the assumptions required for MI to provide unbiased/less biased results are often much more plausible than those required for the CC analysis.


Session info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] curatedMetagenomicData_3.0.10  TreeSummarizedExperiment_2.0.3
##  [3] Biostrings_2.60.2              XVector_0.32.0                
##  [5] SingleCellExperiment_1.14.1    SummarizedExperiment_1.22.0   
##  [7] Biobase_2.52.0                 GenomicRanges_1.44.0          
##  [9] GenomeInfoDb_1.28.4            IRanges_2.26.0                
## [11] S4Vectors_0.30.2               BiocGenerics_0.38.0           
## [13] MatrixGenerics_1.4.3           matrixStats_0.61.0            
## [15] furrr_0.2.3                    future_1.22.1                 
## [17] Hmisc_4.6-0                    Formula_1.2-4                 
## [19] survival_3.2-11                lattice_0.20-44               
## [21] broom_1.0.1                    mice_3.14.0                   
## [23] microbiome_1.14.0              phyloseq_1.36.0               
## [25] forcats_0.5.1                  stringr_1.4.0                 
## [27] dplyr_1.0.9                    purrr_0.3.4                   
## [29] readr_2.0.2                    tidyr_1.2.0                   
## [31] tibble_3.1.8                   ggplot2_3.3.5                 
## [33] tidyverse_1.3.1               
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                  backports_1.4.1              
##   [3] AnnotationHub_3.0.2           BiocFileCache_2.0.0          
##   [5] plyr_1.8.6                    igraph_1.2.7                 
##   [7] lazyeval_0.2.2                splines_4.1.1                
##   [9] BiocParallel_1.26.2           listenv_0.8.0                
##  [11] digest_0.6.28                 foreach_1.5.1                
##  [13] htmltools_0.5.2               fansi_0.5.0                  
##  [15] memoise_2.0.0                 magrittr_2.0.3               
##  [17] checkmate_2.0.0               cluster_2.1.2                
##  [19] tzdb_0.1.2                    globals_0.14.0               
##  [21] modelr_0.1.8                  jpeg_0.1-9                   
##  [23] colorspace_2.0-2              rappdirs_0.3.3               
##  [25] blob_1.2.2                    rvest_1.0.2                  
##  [27] haven_2.4.3                   xfun_0.29                    
##  [29] crayon_1.4.1                  RCurl_1.98-1.5               
##  [31] jsonlite_1.7.2                iterators_1.0.13             
##  [33] ape_5.5                       glue_1.6.2                   
##  [35] gtable_0.3.0                  zlibbioc_1.38.0              
##  [37] DelayedArray_0.18.0           Rhdf5lib_1.14.2              
##  [39] scales_1.2.1                  DBI_1.1.1                    
##  [41] Rcpp_1.0.7                    xtable_1.8-4                 
##  [43] htmlTable_2.3.0               tidytree_0.3.5               
##  [45] bit_4.0.4                     foreign_0.8-81               
##  [47] htmlwidgets_1.5.4             httr_1.4.2                   
##  [49] RColorBrewer_1.1-2            ellipsis_0.3.2               
##  [51] pkgconfig_2.0.3               nnet_7.3-16                  
##  [53] sass_0.4.2                    dbplyr_2.1.1                 
##  [55] utf8_1.2.2                    AnnotationDbi_1.54.1         
##  [57] later_1.3.0                   tidyselect_1.1.2             
##  [59] rlang_1.0.4                   reshape2_1.4.4               
##  [61] BiocVersion_3.13.1            cachem_1.0.6                 
##  [63] munsell_0.5.0                 cellranger_1.1.0             
##  [65] tools_4.1.1                   cli_3.3.0                    
##  [67] ExperimentHub_2.0.0           generics_0.1.3               
##  [69] RSQLite_2.2.8                 ade4_1.7-18                  
##  [71] evaluate_0.16                 biomformat_1.20.0            
##  [73] fastmap_1.1.0                 yaml_2.2.1                   
##  [75] bit64_4.0.5                   knitr_1.40                   
##  [77] fs_1.5.2                      KEGGREST_1.32.0              
##  [79] nlme_3.1-152                  mime_0.12                    
##  [81] xml2_1.3.2                    compiler_4.1.1               
##  [83] rstudioapi_0.13               interactiveDisplayBase_1.30.0
##  [85] filelock_1.0.2                curl_4.3.2                   
##  [87] png_0.1-7                     reprex_2.0.1                 
##  [89] treeio_1.16.2                 bslib_0.3.1                  
##  [91] stringi_1.7.5                 blogdown_1.7                 
##  [93] Matrix_1.3-4                  vegan_2.5-7                  
##  [95] permute_0.9-5                 multtest_2.48.0              
##  [97] vctrs_0.4.1                   pillar_1.8.0                 
##  [99] lifecycle_1.0.1               rhdf5filters_1.4.0           
## [101] BiocManager_1.30.16           jquerylib_0.1.4              
## [103] data.table_1.14.2             bitops_1.0-7                 
## [105] httpuv_1.6.3                  R6_2.5.1                     
## [107] latticeExtra_0.6-29           promises_1.2.0.1             
## [109] bookdown_0.24                 gridExtra_2.3                
## [111] parallelly_1.28.1             codetools_0.2-18             
## [113] MASS_7.3-54                   assertthat_0.2.1             
## [115] rhdf5_2.36.0                  withr_2.4.2                  
## [117] GenomeInfoDbData_1.2.6        mgcv_1.8-36                  
## [119] hms_1.1.1                     grid_4.1.1                   
## [121] rpart_4.1-15                  rmarkdown_2.11               
## [123] Rtsne_0.15                    shiny_1.7.1                  
## [125] lubridate_1.8.0               base64enc_0.1-3
Associate Professor of Pediatrics

Related