Differential Feature Abundance Meta-Analysis


In a previous post, I highlighted some difficulties one faces with differential abundance (DA) testing when trying to identify those features (i.e., OTUs/ASVs, species, pathways, etc.) that “differ the most” according to some condition of interest. I examined the use of bootstrap ranking to assess the uncertainty in rankings when we have access to only a single study. In this post, I highlight how we can combine the results of multiple studies to obtain a pooled/summary estimate for a given feature when we have access to the primary data. While obtaining such data can often be difficult, one sometimes gets lucky given the increasing number of publicly available datasets.

Individual participant data meta-analysis (IPDMA) can be undertaken when one has access to the actual data for each study. Riley, Tierney & Stewart have an excellent textbook, materials, workshops, etc. on the topic on their website that I highly recommend. When we do not have access to the primary data, pooled/summary estimates can still be obtained if we can extract the required information (i.e., effect sizes and standard errors) from published studies (often this information is provided as a supplemental table). Of course in the latter case, we do not have the ability to standardize approaches (i.e., bioinformatic processing, variable harmonization, covariate adjustment, etc.) as we would when conducting an IPDMA.

In this post, I provide an example of how one can conduct a differential feature abundance meta-analysis. I use the curatedMetagenomicData package to access multiple studies examining differences in the stool microbiome of patients with colorectal cancer (CRC) when compared to otherwise healthy controls. There is not data on all relevant factors one might wish to adjust for to limit confounding bias (not sure I even have a good idea of what a sufficient adjustment set might look like), but these data were processed in a uniform fashion (all taxonomic profiles classified via MetaPhlAn3) and collected information on the condition of interest. So at a minimum we have a great dataset for a didactic example.

I will also look at the extent to which the results of a single large study align with the pooled estimates, since often, one does not have access to additional studies and published reports are based on the results of single (often underpowered) study. While I see the goals of meta-analysis and validation studies as being different, synthesizing the results of studies conducted in diverse cohorts can be informative in understanding variability in feature rankings (i.e, how do the results compare across studies, which features have the most “consistent” effects, etc.). In a future posts, I hope to use some simulated and real data sets to highlight the extent to which we might be able to drive down false positive results by including a validation cohort in our study design and what reasonable sample size requirements might actually be for differential abundance testing.


Load libraries

library(curatedMetagenomicData); packageVersion("curatedMetagenomicData")  
## [1] '3.0.10'
library(tidyverse); packageVersion("tidyverse")                                                  
## [1] '1.3.1'
library(phyloseq); packageVersion("phyloseq")                               
## [1] '1.36.0'
library(Maaslin2); packageVersion("Maaslin2")                                
## [1] '1.6.0'
library(metafor); packageVersion("metafor")                               
## [1] '3.0.2'
library(broom); packageVersion("broom")                                   
## [1] '0.7.9'


Identify CRC studies

First, we need to identify those studies that have data on both CRC cases and controls in the curatedMetagenomicData package. The sampleMetadata object is fantastic and makes such a search quite accessible. Here is a link to the curatedMetagenomicData package tutorial that explains in greater detail how to access data available through the package.

Basically, the code chunk below just runs a text search for studies with CRC in the disease field and retains those participant sample IDs where they provided a stool sample and were enrolled in a study containing both CRC cases and controls. This clearly looks much “cleaner” than the first time I ran this search and had to figure out what terms to use etc. This always takes me a little trial and error since there is a ton of information in the sampleMetadata object.

meta_df <- curatedMetagenomicData::sampleMetadata

crc_studies <- meta_df %>%
  dplyr::filter(grepl("CRC", disease)) %>%
  dplyr::distinct(study_name)
keep_studies <- crc_studies$study_name

crc_df <- meta_df %>%
  dplyr::filter(study_name %in% keep_studies) %>%
  dplyr::filter(study_condition %in% c("CRC", "control")) %>%
  dplyr::filter(body_site == "stool")

crc_df <- crc_df %>%
  dplyr::mutate(case_status = ifelse(str_detect(disease, "CRC"), "CRC", 
                                     ifelse(disease == "healthy", "Control", "Other"))) %>%
  dplyr::filter(case_status != "Other")

mydata <- crc_df %>%
  dplyr::select(study_name, sample_id, subject_id, body_site, disease, age, gender, 
                BMI, country, sequencing_platform, DNA_extraction_kit)
head(mydata)
##   study_name sample_id subject_id body_site                      disease age
## 1 FengQ_2015  SID31004   SID31004     stool CRC;fatty_liver;hypertension  64
## 2 FengQ_2015  SID31021   SID31021     stool                      healthy  60
## 3 FengQ_2015  SID31159   SID31159     stool                          CRC  73
## 4 FengQ_2015  SID31188   SID31188     stool CRC;fatty_liver;hypertension  65
## 5 FengQ_2015  SID31223   SID31223     stool                          CRC  65
## 6 FengQ_2015  SID31237   SID31237     stool CRC;fatty_liver;hypertension  67
##   gender   BMI country sequencing_platform DNA_extraction_kit
## 1   male 29.35     AUT       IlluminaHiSeq              MoBio
## 2 female 22.10     AUT       IlluminaHiSeq              MoBio
## 3   male 17.99     AUT       IlluminaHiSeq              MoBio
## 4   male 28.03     AUT       IlluminaHiSeq              MoBio
## 5 female 23.31     AUT       IlluminaHiSeq              MoBio
## 6   male 29.40     AUT       IlluminaHiSeq              MoBio
mydata %>%
  count(study_name)
##         study_name   n
## 1       FengQ_2015  62
## 2      GuptaA_2019  60
## 3  HanniganGD_2017  55
## 4   ThomasAM_2018a  43
## 5   ThomasAM_2018b  60
## 6  ThomasAM_2019_c  80
## 7   VogtmannE_2016 104
## 8     WirbelJ_2018 125
## 9    YachidaS_2019 504
## 10        YuJ_2015 112
## 11    ZellerG_2014 114
rm(list= ls()[!(ls() %in% c("mydata"))])


We end up with 11 studies meeting these criteria and a fairly large number of samples (> 1300).


Download curated data

The returnSamples function allows us to pull down the processed microbiome data. Here I just pass the set of IDs I want to the function and request the taxonomic abundances as counts (as opposed to relative abundance as proportions).

Then I convert the data into a single phyloseq object. One could perform the subsequent analysis using the SummarizedExperiment object if preferred. I am just more comfortable working with phyloseq objects.

keep_id <- mydata$sample_id

crc_se_sp <- sampleMetadata %>%
  dplyr::filter(sample_id %in% keep_id) %>%
  curatedMetagenomicData::returnSamples("relative_abundance", counts = TRUE)

crc_sp_df <- as.data.frame(as.matrix(assay(crc_se_sp)))

crc_sp_df <- crc_sp_df %>%
  rownames_to_column(var = "Species")

otu_tab <- crc_sp_df %>%
  tidyr::separate(Species, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::select(-Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
  dplyr::mutate(Species = gsub("s__", "", Species)) %>%
  tibble::column_to_rownames(var = "Species")

tax_tab <- crc_sp_df %>%
  dplyr::select(Species) %>%
  tidyr::separate(Species, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = "\\|") %>%
  dplyr::mutate(Kingdom = gsub("k__", "", Kingdom),
                Phylum = gsub("p__", "", Phylum),
                Class = gsub("c__", "", Class),
                Order = gsub("o__", "", Order),
                Family = gsub("f__", "", Family),
                Genus = gsub("g__", "", Genus),
                Species = gsub("s__", "", Species)) %>%
  dplyr::mutate(spec_row = Species) %>%
  tibble::column_to_rownames(var = "spec_row")

meta_df <- mydata %>%
  tibble::column_to_rownames(var = "sample_id")
(ps <- phyloseq(sample_data(meta_df),
                otu_table(otu_tab, taxa_are_rows = TRUE),
                tax_table(as.matrix(tax_tab))))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 934 taxa and 1319 samples ]
## sample_data() Sample Data:       [ 1319 samples by 10 sample variables ]
## tax_table()   Taxonomy Table:    [ 934 taxa by 7 taxonomic ranks ]
rm(list= ls()[!(ls() %in% c("ps"))])


We can see that we now have a phyloseq object containing 934 taxa on 1319 samples with information on 7 taxonomic ranks and 10 sample variables. This is the information we need to conduct the meta-analysis.


Prepare data for DA testing

These code chunks clean up the metadata a bit, remove samples with low read counts, and remove taxa seen at less than 0.0001% relative abundance.

mydata <- data.frame(sample_data(ps)) %>%
  dplyr::select(-subject_id, -body_site, -DNA_extraction_kit) %>%
  na.omit(.) %>%
  dplyr::mutate(case_status = ifelse(str_detect(disease, "CRC"), "CRC", 
                                     ifelse(disease == "healthy", "Control", "Other"))) %>%
  dplyr::filter(case_status != "Other") %>%
  dplyr::mutate(study_name = recode(study_name, ThomasAM_2019_c = "ThomasAM_2019"),
                study_name = factor(study_name, 
                                    levels = c("ZellerG_2014", "FengQ_2015", "YuJ_2015", 
                                               "VogtmannE_2016", "HanniganGD_2017", "ThomasAM_2018a", 
                                               "ThomasAM_2018b", "WirbelJ_2018", "GuptaA_2019",  
                                               "ThomasAM_2019", "YachidaS_2019")))

sample_data(ps)$study_name <- NULL

merge_meta <- mydata %>%
  dplyr::select(study_name, case_status)

ps <- merge_phyloseq(ps, sample_data(merge_meta))
(ps <- subset_samples(ps, !(is.na(case_status))))      
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 934 taxa and 1304 samples ]
## sample_data() Sample Data:       [ 1304 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 934 taxa by 7 taxonomic ranks ]
head(sort(sample_sums(ps))) 
##     MG100142     MG100163 SAMD00164886     MG100162     MG100164     MG100160 
##        17146       330750       722269      2291325      2521950      2956655
(ps <- subset_samples(ps, sample_sums(ps) > 10^6)) 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 934 taxa and 1301 samples ]
## sample_data() Sample Data:       [ 1301 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 934 taxa by 7 taxonomic ranks ]
minTotRelAbun <- 1e-5           
x <- taxa_sums(ps)
keepTaxa <- (x / sum(x)) > minTotRelAbun
(ps <- prune_taxa(keepTaxa, ps))       
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 1301 samples ]
## sample_data() Sample Data:       [ 1301 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
rm(x, keepTaxa, minTotRelAbun, mydata, merge_meta)


Differential abundance estimates for Yachida et. al.

In a previous post, I presented results using the same workflow for the Yachida et. al. 2019 data alone.

Here we reproduce those results to see how how the findings from a single, large study agree with the pooled/summary estimates obtained across all studies.

We will perform the DA testing using the popular Maaslin2 package with the default settings (as recommended here). See my previous post for a few other workflows (among the list of many dozens of available) that could be used if interested.

ps_y <- subset_samples(ps, study_name == "YachidaS_2019")

m_y <- Maaslin2(
  input_data = data.frame(otu_table(ps_y)),
  input_metadata = data.frame(sample_data(ps_y)),
  output = "/Users/olljt2/desktop/maaslin2_output",
  min_abundance = 0.0,
  min_prevalence = 0.3,
  normalization = "TSS",
  transform = "LOG",
  analysis_method = "LM",
  fixed_effects = c("age", "gender", "BMI", "case_status"),
  correction = "BH",
  standardize = FALSE,
  cores = 1,
  plot_heatmap = FALSE,
  plot_scatter = FALSE)
  
mas_y_out <- m_y$results %>%
  dplyr::filter(metadata == "case_status") %>%
  dplyr::select(-qval, -name, -metadata, -N, -N.not.zero)
head(mas_y_out)
##                         feature value       coef    stderr         pval
## 1       Collinsella_aerofaciens   CRC  1.7140080 0.4373069 0.0001011899
## 2              Roseburia_faecis   CRC -1.7735569 0.4542806 0.0001076316
## 3         Holdemania_filiformis   CRC  1.1320742 0.3153403 0.0003634875
## 4  Bacteroides_cellulosilyticus   CRC  1.6529415 0.4741172 0.0005330389
## 5 Streptococcus_anginosus_group   CRC  0.8565514 0.2613467 0.0011207718
## 6        Mogibacterium_diversum   CRC  0.9207161 0.2961708 0.0019866209


As before, we see that Collinsella aerofaciens has the lowest p-value and thus may be considered (depending on one’s criteria) among the most differentially abundant species. We also see Roseburia faecis, Holdemania filiformis, Bacteroides cellulosilyticus, Streptococcus anginosus, and Mogibacterium diversum among the “top hits”.


Meta-Analysis: Stage 1

In the next two sections, I show how one can conduct a two-stage meta-analysis for each species that is seen in at least three studies. In stage 1, we obtain the estimated coefficient values and standard errors for each species in each study. In stage 2, we pool the results to obtain a summary estimate.

To obtain the stage 1 estimates we:

  • generate a phyloseq object for each study and combine them all in single a list object
  • write a function to run the Maaslin2 workflow and return the results for each species as a data.frame
  • apply the function to each ps object/study in the list
  • combine the results as a data.frame

Here the estimate for case_status is of interest and we include age, gender, and BMI as model covariates. All other selections are set to the Maaslin2 defaults other than suppressing some of the output. I use only one core here since these models take only a few moments to run.

The first line of code below just cleans up the sample names so they do not cause us problems later on (best practice IMO is never to use leading numbers or to include special characters other than underscores in your sample names or they can/will cause you issues/headaches later on).

sample_names(ps) <- gsub("\\-", "_", sample_names(ps))    

FengQ_2015 <- subset_samples(ps, study_name == "FengQ_2015")
GuptaA_2019 <- subset_samples(ps, study_name == "GuptaA_2019")
HanniganGD_2017 <- subset_samples(ps, study_name == "HanniganGD_2017")
ThomasAM_2018a <- subset_samples(ps, study_name == "ThomasAM_2018a")
ThomasAM_2018b <- subset_samples(ps, study_name == "ThomasAM_2018b")
ThomasAM_2019 <- subset_samples(ps, study_name == "ThomasAM_2019")
VogtmannE_2016 <- subset_samples(ps, study_name == "VogtmannE_2016") 
WirbelJ_2018 <- subset_samples(ps, study_name == "WirbelJ_2018")   
YachidaS_2019 <- subset_samples(ps, study_name == "YachidaS_2019")   
YuJ_2015 <- subset_samples(ps, study_name == "YuJ_2015")        
ZellerG_2014 <- subset_samples(ps, study_name == "ZellerG_2014")

ps_list <- list(FengQ_2015 = FengQ_2015,
                GuptaA_2019 = GuptaA_2019,
                HanniganGD_2017 = HanniganGD_2017,
                ThomasAM_2018a = ThomasAM_2018a,
                ThomasAM_2018b = ThomasAM_2018b,
                ThomasAM_2019 = ThomasAM_2019,
                VogtmannE_2016 = VogtmannE_2016,
                WirbelJ_2018 = WirbelJ_2018, 
                YachidaS_2019 = YachidaS_2019,  
                YuJ_2015 = YuJ_2015,       
                ZellerG_2014 = ZellerG_2014)

ps_list
## $FengQ_2015
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 62 samples ]
## sample_data() Sample Data:       [ 62 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $GuptaA_2019
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 60 samples ]
## sample_data() Sample Data:       [ 60 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $HanniganGD_2017
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 52 samples ]
## sample_data() Sample Data:       [ 52 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $ThomasAM_2018a
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $ThomasAM_2018b
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 58 samples ]
## sample_data() Sample Data:       [ 58 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $ThomasAM_2019
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 80 samples ]
## sample_data() Sample Data:       [ 80 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $VogtmannE_2016
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 101 samples ]
## sample_data() Sample Data:       [ 101 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $WirbelJ_2018
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 125 samples ]
## sample_data() Sample Data:       [ 125 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $YachidaS_2019
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 502 samples ]
## sample_data() Sample Data:       [ 502 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $YuJ_2015
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 111 samples ]
## sample_data() Sample Data:       [ 111 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
## 
## $ZellerG_2014
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 462 taxa and 110 samples ]
## sample_data() Sample Data:       [ 110 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 462 taxa by 7 taxonomic ranks ]
get_s1_coef <- function(ps_object){
  m <- Maaslin2(
    input_data = data.frame(otu_table(ps_object)),
    input_metadata = data.frame(sample_data(ps_object)),
    output = "/Users/olljt2/desktop/maaslin2_output",
    min_abundance = 0.0,
    min_prevalence = 0.3,
    normalization = "TSS",
    transform = "LOG",
    analysis_method = "LM",
    fixed_effects = c("age", "gender", "BMI", "case_status"),
    correction = "BH",
    standardize = FALSE,
    cores = 1,
    plot_heatmap = FALSE,
    plot_scatter = FALSE)
  
  mas_out <- m$results %>%
    dplyr::filter(metadata == "case_status") %>%
    dplyr::select(-qval, -name, -metadata, -N, -N.not.zero)
  
  return(mas_out)
}

stage_1_res <- lapply(ps_list, get_s1_coef)
s1_res_df <- bind_rows(stage_1_res, .id = "column_label")
print.data.frame(head(s1_res_df))
##   column_label                         feature value      coef    stderr
## 1   FengQ_2015                Escherichia_coli   CRC  5.882174 1.3664527
## 2   FengQ_2015                Prevotella_copri   CRC  6.079525 1.4439969
## 3   FengQ_2015          Eubacterium_sp_CAG_251   CRC -7.014662 1.8576141
## 4   FengQ_2015 Fusicatenibacter_saccharivorans   CRC -1.800875 0.5530868
## 5   FengQ_2015               Coprococcus_comes   CRC -3.200692 1.0025953
## 6   FengQ_2015            Alistipes_finegoldii   CRC  3.949649 1.2227965
##           pval
## 1 6.650347e-05
## 2 9.159924e-05
## 3 3.821179e-04
## 4 1.904298e-03
## 5 2.297237e-03
## 6 2.056689e-03


Meta-Analysis: Stage 2

You can see from the rows printed above that we now have a data.frame that contains the effect size and SE for each species, in each study, that we need to generate the pooled/summary estimates.

In stage 2, we generate pooled estimates using the random-effects MA model fitted via restricted maximum likelihood provided in the metafor package. I only generate pooled estimates for species seen in at least 3 studies and use the Knapp and Hartung adjustment for the standard errors to help account for uncertainty in the estimate of the amount of (residual) heterogeneity.

I use a bit of tidyverse code with purr::map to iterate over nested data.frames, organize the results using broom, and calculate an FDR p-value for each species (should one consider it to be of use).

I also print the first few rows of the output data.frame so we can see the results. Often I will convert the Maaslin2 default output to the log2 scale for comparisons to other models [e.g., log2(exp(coef))]. However, here I just use the default output and refer to the log relative abundances as the “effect size” measure.

prev_taxa <- s1_res_df %>%
  dplyr::count(feature) %>%
  dplyr::filter(n > 2)

keep_taxa <- prev_taxa$feature

s1_res_filt_df <- s1_res_df %>%
  dplyr::filter(feature %in% keep_taxa)     #retain species seen in at least 3 studies for MA


s2_res_df <- s1_res_filt_df %>%
  dplyr::rename("study_name" = "column_label") %>%
  dplyr::select(-pval, -value) %>%
  tidyr::nest(data = c(study_name, coef, stderr)) %>%
  dplyr::mutate(model = map(data, ~rma(yi = coef, sei = stderr, data = ., method = "REML", knha = TRUE))) %>%
  dplyr::mutate(tidy_model = map(model, tidy, conf.int = TRUE, exponentiate = FALSE)) %>%
  select(-data, -model) %>%
  unnest(cols = c(tidy_model)) %>%
  dplyr::select(-type, -term) %>%
  dplyr::arrange(p.value) %>%
  dplyr::mutate(fdr_pval = p.adjust(p.value, method = "fdr"))

print.data.frame(head(s2_res_df))
##                  feature   estimate std.error statistic      p.value   conf.low
## 1  Clostridium_symbiosum  2.0741083 0.3646613  5.687767 0.0007447156  1.2118215
## 2   Bacteroides_fragilis  1.6558011 0.3639436  4.549609 0.0013867145  0.8325034
## 3  Bilophila_wadsworthia  0.7938644 0.1691665  4.692799 0.0015558825  0.4037657
## 4    Eisenbergiella_tayi  1.6555579 0.2715053  6.097699 0.0017171969  0.9576312
## 5 Roseburia_intestinalis -1.4338628 0.3306763 -4.336152 0.0018880488 -2.1819046
## 6   Hungatella_hathewayi  1.6063510 0.3615452  4.443016 0.0021592643  0.7726263
##   conf.high   fdr_pval
## 1  2.936395 0.03871462
## 2  2.479099 0.03871462
## 3  1.183963 0.03871462
## 4  2.353485 0.03871462
## 5 -0.685821 0.03871462
## 6  2.440076 0.03871462


Comparing Yachida to pooled estimates

Collinsella aerofaciens had the lowest p-value (low FDR p-value if calculated) and thus, by that metric, might be considered the “top hit”.

This code chunk lets us see how this species stacks up across studies.

ca_df <- s1_res_df %>% 
  filter(feature == "Collinsella_aerofaciens")

(ca_res <- rma(yi = coef, sei = stderr, method = "REML", knha = TRUE, data = ca_df))
## 
## Random-Effects Model (k = 11; tau^2 estimator: REML)
## 
## tau^2 (estimated amount of total heterogeneity): 1.2185 (SE = 0.8904)
## tau (square root of estimated tau^2 value):      1.1038
## I^2 (total heterogeneity / total variability):   64.13%
## H^2 (total variability / sampling variability):  2.79
## 
## Test for Heterogeneity:
## Q(df = 10) = 28.6296, p-val = 0.0014
## 
## Model Results:
## 
## estimate      se    tval  df    pval    ci.lb   ci.ub 
##   0.2929  0.4237  0.6912  10  0.5051  -0.6511  1.2369    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
forest(ca_res, addfit = TRUE, showweights = TRUE, header = TRUE, 
       slab = ca_df$column_label,
       xlab = "Coef [95% CI]", mlab = "Random Effects Model:\nCollinsella aerofaciens")


We see less impressive results in the pooled analysis for this particular species. This may not be entirety unexpected since, when we try to “pick winners” this way from a single study, we are by definition selecting the most extreme results without any attempt to account for the fact that the effect might be expected to regress back toward some less extreme value if we ran this same process again (or many times) in a new set of samples drawn from the same study population of interest. Of course here we have samples from diverse cohorts, so the attenuation may also reflect true differences in the correlation across these different “study populations”. I do not dig into that here.

Let’s look at the other top hits in the Yachida et. al. data.


my_taxa <- c("Collinsella_aerofaciens", "Roseburia_faecis", "Holdemania_filiformis",   
             "Bacteroides_cellulosilyticus", "Streptococcus_anginosus_group", "Mogibacterium_diversum")

my_res <- s2_res_df %>%
  dplyr::filter(feature %in% my_taxa)

print.data.frame(head(my_res))
##                         feature   estimate std.error  statistic     p.value
## 1 Streptococcus_anginosus_group  1.0697916 0.2147448  4.9816887 0.004169671
## 2        Mogibacterium_diversum  0.8322284 0.1786965  4.6572164 0.005545968
## 3         Holdemania_filiformis  0.6255553 0.3987170  1.5689208 0.160655175
## 4              Roseburia_faecis -1.0002745 0.6799560 -1.4710871 0.172021690
## 5  Bacteroides_cellulosilyticus  0.4047686 0.4269421  0.9480643 0.370851454
## 6       Collinsella_aerofaciens  0.2928582 0.4236712  0.6912392 0.505144955
##     conf.low conf.high   fdr_pval
## 1  0.5177726 1.6218107 0.04875308
## 2  0.3728744 1.2915825 0.05619914
## 3 -0.3172605 1.5683712 0.31307162
## 4 -2.5153109 0.5147619 0.32280613
## 5 -0.5797617 1.3892988 0.58112805
## 6 -0.6511401 1.2368565 0.67352661


Streptococcus anginosus and Mogibacterium diversum still look like good candidates based on the pooled results. The other four less so. Let’s generate a forest plot for Streptococcus anginosus.


sa_df <- s1_res_df %>% 
  filter(feature == "Streptococcus_anginosus_group")

(sa_res <- rma(yi = coef, sei = stderr, method = "REML", knha = TRUE, data = sa_df))
## 
## Random-Effects Model (k = 6; tau^2 estimator: REML)
## 
## tau^2 (estimated amount of total heterogeneity): 0.0144 (SE = 0.1866)
## tau (square root of estimated tau^2 value):      0.1199
## I^2 (total heterogeneity / total variability):   4.05%
## H^2 (total variability / sampling variability):  1.04
## 
## Test for Heterogeneity:
## Q(df = 5) = 5.1779, p-val = 0.3946
## 
## Model Results:
## 
## estimate      se    tval  df    pval   ci.lb   ci.ub 
##   1.0698  0.2147  4.9817   5  0.0042  0.5178  1.6218  ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
forest(sa_res, addfit = TRUE, showweights = TRUE, header = TRUE, 
       slab = sa_df$column_label,
       xlab = "Coef [95% CI]", mlab = "Random Effects Model:\nStreptococcus anginosus")


Here we see the summary estimate is dominated by the Yachida data (large sample size and low variance), but the effects are in the same direction across all studies providing additional support for perhaps a more “generalizable” association.


Species with low pooled FDR p-values

In this code chunk, I print the top taxa with the lowest FDR p-values based on the pooled estimates, plot the “top hits”, and generate a forest plot for Bacteroides fragilis.

print.data.frame(head(s2_res_df))
##                  feature   estimate std.error statistic      p.value   conf.low
## 1  Clostridium_symbiosum  2.0741083 0.3646613  5.687767 0.0007447156  1.2118215
## 2   Bacteroides_fragilis  1.6558011 0.3639436  4.549609 0.0013867145  0.8325034
## 3  Bilophila_wadsworthia  0.7938644 0.1691665  4.692799 0.0015558825  0.4037657
## 4    Eisenbergiella_tayi  1.6555579 0.2715053  6.097699 0.0017171969  0.9576312
## 5 Roseburia_intestinalis -1.4338628 0.3306763 -4.336152 0.0018880488 -2.1819046
## 6   Hungatella_hathewayi  1.6063510 0.3615452  4.443016 0.0021592643  0.7726263
##   conf.high   fdr_pval
## 1  2.936395 0.03871462
## 2  2.479099 0.03871462
## 3  1.183963 0.03871462
## 4  2.353485 0.03871462
## 5 -0.685821 0.03871462
## 6  2.440076 0.03871462
my_plot <- s2_res_df %>%
  dplyr::filter(fdr_pval < 0.1) %>%
  mutate(high = ifelse(estimate > 0, "high", "low"),
         feature = fct_reorder(feature, estimate))

ggplot(data = my_plot, aes(x = feature, y = estimate, fill = high)) +
  geom_col(alpha = .8) +
  labs(y = "\nEffect Size", x = "") +
  theme(axis.text.x = element_text(color = "black", size = 8),
        axis.text.y = element_text(color = "black", size = 8),
        axis.title.y = element_text(size = 8),
        axis.title.x = element_text(size = 10),
        legend.text = element_text(size = 8),
        legend.title = element_text(size = 8)) +
  coord_flip() +
  geom_hline(yintercept = 0, linetype="dotted") +
  scale_fill_manual("legend", values = c("high" = "#00BFC4", "low" = "#F8766D")) +
  theme(legend.position = "none")

bf_df <- s1_res_df %>% 
  filter(feature == "Bacteroides_fragilis")

(bf_res <- rma(yi = coef, sei = stderr, method = "REML", knha = TRUE, data = bf_df))
## 
## Random-Effects Model (k = 10; tau^2 estimator: REML)
## 
## tau^2 (estimated amount of total heterogeneity): 0.6474 (SE = 0.7253)
## tau (square root of estimated tau^2 value):      0.8046
## I^2 (total heterogeneity / total variability):   43.83%
## H^2 (total variability / sampling variability):  1.78
## 
## Test for Heterogeneity:
## Q(df = 9) = 16.3556, p-val = 0.0598
## 
## Model Results:
## 
## estimate      se    tval  df    pval   ci.lb   ci.ub 
##   1.6558  0.3639  4.5496   9  0.0014  0.8325  2.4791  ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
forest(bf_res, addfit = TRUE, showweights = TRUE, header = TRUE, 
       slab = bf_df$column_label,
       xlab = "Coef [95% CI]", mlab = "Random Effects Model:\nBacteroides fragilis")


The main point of this post is that synthesizing the results of multiple studies may provide us with some additional insight into the potential variability in feature rankings that is missed when we have access to only a single study. As more and more results and datasets from microbiome studies are published, the opportunities to learn from the application of meta-analytic approaches is increased.


Session Info

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] broom_0.7.9                    metafor_3.0-2                 
##  [3] Matrix_1.3-4                   Maaslin2_1.6.0                
##  [5] phyloseq_1.36.0                forcats_0.5.1                 
##  [7] stringr_1.4.0                  dplyr_1.0.7                   
##  [9] purrr_0.3.4                    readr_2.0.2                   
## [11] tidyr_1.1.4                    tibble_3.1.5                  
## [13] ggplot2_3.3.5                  tidyverse_1.3.1               
## [15] curatedMetagenomicData_3.0.10  TreeSummarizedExperiment_2.0.3
## [17] Biostrings_2.60.2              XVector_0.32.0                
## [19] SingleCellExperiment_1.14.1    SummarizedExperiment_1.22.0   
## [21] Biobase_2.52.0                 GenomicRanges_1.44.0          
## [23] GenomeInfoDb_1.28.4            IRanges_2.26.0                
## [25] S4Vectors_0.30.2               BiocGenerics_0.38.0           
## [27] MatrixGenerics_1.4.3           matrixStats_0.61.0            
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                  backports_1.2.1              
##   [3] AnnotationHub_3.0.2           BiocFileCache_2.0.0          
##   [5] plyr_1.8.6                    igraph_1.2.7                 
##   [7] lazyeval_0.2.2                splines_4.1.1                
##   [9] BiocParallel_1.26.2           lpsymphony_1.20.0            
##  [11] digest_0.6.28                 foreach_1.5.1                
##  [13] htmltools_0.5.2               fansi_0.5.0                  
##  [15] magrittr_2.0.1                memoise_2.0.0                
##  [17] cluster_2.1.2                 tzdb_0.1.2                   
##  [19] modelr_0.1.8                  colorspace_2.0-2             
##  [21] blob_1.2.2                    rvest_1.0.2                  
##  [23] rappdirs_0.3.3                haven_2.4.3                  
##  [25] xfun_0.29                     crayon_1.4.1                 
##  [27] RCurl_1.98-1.5                jsonlite_1.7.2               
##  [29] biglm_0.9-2.1                 hash_2.2.6.1                 
##  [31] survival_3.2-11               iterators_1.0.13             
##  [33] ape_5.5                       glue_1.4.2                   
##  [35] gtable_0.3.0                  zlibbioc_1.38.0              
##  [37] DelayedArray_0.18.0           Rhdf5lib_1.14.2              
##  [39] DEoptimR_1.0-9                scales_1.1.1                 
##  [41] mvtnorm_1.1-3                 DBI_1.1.1                    
##  [43] Rcpp_1.0.7                    xtable_1.8-4                 
##  [45] tidytree_0.3.5                bit_4.0.4                    
##  [47] getopt_1.20.3                 httr_1.4.2                   
##  [49] ellipsis_0.3.2                farver_2.1.0                 
##  [51] pkgconfig_2.0.3               sass_0.4.0                   
##  [53] dbplyr_2.1.1                  utf8_1.2.2                   
##  [55] labeling_0.4.2                tidyselect_1.1.1             
##  [57] rlang_0.4.12                  reshape2_1.4.4               
##  [59] later_1.3.0                   AnnotationDbi_1.54.1         
##  [61] munsell_0.5.0                 BiocVersion_3.13.1           
##  [63] cellranger_1.1.0              tools_4.1.1                  
##  [65] cachem_1.0.6                  cli_3.1.0                    
##  [67] generics_0.1.1                RSQLite_2.2.8                
##  [69] ExperimentHub_2.0.0           ade4_1.7-18                  
##  [71] mathjaxr_1.4-0                evaluate_0.14                
##  [73] biomformat_1.20.0             fastmap_1.1.0                
##  [75] yaml_2.2.1                    knitr_1.36                   
##  [77] bit64_4.0.5                   fs_1.5.0                     
##  [79] robustbase_0.93-9             KEGGREST_1.32.0              
##  [81] pbapply_1.5-0                 nlme_3.1-152                 
##  [83] mime_0.12                     xml2_1.3.2                   
##  [85] compiler_4.1.1                rstudioapi_0.13              
##  [87] filelock_1.0.2                curl_4.3.2                   
##  [89] png_0.1-7                     interactiveDisplayBase_1.30.0
##  [91] reprex_2.0.1                  treeio_1.16.2                
##  [93] pcaPP_1.9-74                  bslib_0.3.1                  
##  [95] stringi_1.7.5                 highr_0.9                    
##  [97] blogdown_1.7                  lattice_0.20-44              
##  [99] vegan_2.5-7                   permute_0.9-5                
## [101] multtest_2.48.0               vctrs_0.3.8                  
## [103] pillar_1.6.4                  lifecycle_1.0.1              
## [105] rhdf5filters_1.4.0            optparse_1.7.1               
## [107] BiocManager_1.30.16           jquerylib_0.1.4              
## [109] data.table_1.14.2             bitops_1.0-7                 
## [111] httpuv_1.6.3                  R6_2.5.1                     
## [113] bookdown_0.24                 promises_1.2.0.1             
## [115] codetools_0.2-18              MASS_7.3-54                  
## [117] assertthat_0.2.1              rhdf5_2.36.0                 
## [119] withr_2.4.2                   GenomeInfoDbData_1.2.6       
## [121] mgcv_1.8-36                   hms_1.1.1                    
## [123] grid_4.1.1                    rmarkdown_2.11               
## [125] logging_0.10-108              shiny_1.7.1                  
## [127] lubridate_1.8.0
Associate Professor of Pediatrics

Related