If you use decemedip in published research, please cite Shen et al. (2025), or:
Shen N, Zhang Z, Baca S, Korthauer K. decemedip: hierarchical Bayesian modeling for cell type deconvolution of immunoprecipitation-based DNA methylomes. bioRxiv. 2025:2025-05.
This package will be submitted to Bioconductor. For now, you can install the development version from GitHub:
After installation, load the decemedip package:
Cell-free and bulk DNA methylation data obtained through MeDIP-seq
reflect a mixture of methylation signals across multiple cell types.
Decomposing these signals to infer cell type composition can provide
valuable insights for cancer diagnosis, immune response monitoring, and
other biomedical applications. However, challenges like
enrichment-induced biases and sparse reference data make this task
complex. The decemedip package addresses these challenges
through a hierarchical Bayesian framework that estimates cell type
proportions and models the relationship between MeDIP-seq counts and
reference methylation data.
decemedip couples a logit-normal model with a generalized additive model (GAM) framework. For each site \(i \in \{1, ..., N\}\) in the reference panel, the model takes as input the fractional methylation level \(x_{ik}\) for each cell type \(k \in \{1, ..., K\}\), the CpG density level \(z_i\), and the MeDIP-seq read count \(y_i\), where \(K\) is the total number of cell types. The proportions of the cell types in the reference panel are described by a unit simplex variable \(\boldsymbol{\pi} = (\pi_1, ..., \pi_K)\), with \(\pi_k > 0\) and \(\sum_{k=1}^K \pi_k = 1\), which follows a logit-normal prior that takes into account the correlations between these cell types.
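To make the simplex constraint concrete, the following base R sketch draws one composition vector from a logit-normal distribution. It is an illustration under stated assumptions, not the package's implementation: the additive logistic transform, the zero mean, the 0.3 correlation, and all variable names are hypothetical choices, since this vignette does not spell out the parameterization used inside decemedip.

```r
# Illustrative sketch only: one draw from a logit-normal distribution on the
# simplex via the additive logistic transform. All names and values here are
# hypothetical and are not taken from the decemedip source code.
set.seed(1)

K <- 8  # number of cell types (illustrative)

# Covariance of the latent Gaussian: unit variances, pairwise correlation 0.3.
Sigma <- matrix(0.3, nrow = K - 1, ncol = K - 1)
diag(Sigma) <- 1
mu <- rep(0, K - 1)

# Correlated normal draw using the Cholesky factor (Sigma = t(U) %*% U).
U <- chol(Sigma)
eta <- as.vector(mu + t(U) %*% rnorm(K - 1))

# Additive logistic transform: the last category serves as the reference.
expo <- exp(c(eta, 0))
pi_draw <- expo / sum(expo)

print(round(pi_draw, 3))
print(sum(pi_draw))
```

Every component of `pi_draw` is strictly positive and the components sum to one, while the off-diagonal entries of `Sigma` are what allow proportions of related cell types to vary together.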
decemedip requires three primary inputs:
Our package provides default reference matrices for hg19 and hg38,
along with the corresponding CpG density information, as objects of
class SummarizedExperiment. The objects can be accessed by
calling data(hg19.ref.cts.se) and
data(hg19.ref.anc.se), or
data(hg38.ref.cts.se) and
data(hg38.ref.anc.se). By default, the main function
decemedip() applies hg19.ref.cts.se and
hg19.ref.anc.se as the reference panels. Please refer to
the manuscript for details of how the default reference panels were
constructed.
As a side note, we provide the function
makeReferencePanel() to allow users to build their own
reference panels, which requires only the reference CpGs and the
corresponding matrix of fractional methylation levels as input. The function
computes CpG density on its own. Note that the cell type-specific sites and
anchor sites need to be stored in two separate
SummarizedExperiment objects to serve as inputs to the main
function decemedip(). See ?makeReferencePanel
for more information.
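As a rough illustration of what a CpG density such as n_cpgs_100bp measures, the sketch below counts the CpGs that fall in a window centred on each query site. The helper `count_cpgs_in_window()` is hypothetical and is not part of decemedip; the boundary convention (50 bp on either side, inclusive) is likewise an assumption, and makeReferencePanel() may compute density differently.

```r
# Hypothetical helper, not part of decemedip: count the CpGs whose position
# lies within `window / 2` bp on either side of each query site. All positions
# are assumed to be on the same chromosome.
count_cpgs_in_window <- function(query_pos, all_cpg_pos, window = 100) {
  all_cpg_pos <- sort(all_cpg_pos)
  half <- window / 2
  # findInterval(x, v) returns the number of elements of sorted v that are <= x.
  upper <- findInterval(query_pos + half, all_cpg_pos)
  lower <- findInterval(query_pos - half - 1, all_cpg_pos)
  upper - lower
}

# Toy CpG positions on a single chromosome.
all_cpgs <- c(100, 120, 149, 151, 400, 1000)
density <- count_cpgs_in_window(c(100, 400), all_cpgs)
print(density)
```

In this toy example the site at position 100 has three CpGs in its window (itself, 120 and 149), whereas the isolated site at position 400 counts only itself.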
The following chunks show what the reference panels look like:
data(hg19.ref.cts.se)
print(hg19.ref.cts.se)
## class: RangedSummarizedExperiment
## dim: 2500 25
## metadata(0):
## assays(1): beta
## rownames(2500): cg18856478 cg20820767 ... cg01071459 cg20726993
## rowData names(4): probe label pos n_cpgs_100bp
## colnames(25): Monocytes_EPIC B-cells_EPIC ... Upper_GI Uterus_cervix
## colData names(0):
head(granges(hg19.ref.cts.se))
## GRanges object with 6 ranges and 4 metadata columns:
## seqnames ranges strand | probe label
## <Rle> <IRanges> <Rle> | <character> <character>
## cg18856478 chr1 43814358 * | cg18856478 Monocytes_EPIC hyper..
## cg20820767 chr1 45082840 * | cg20820767 Monocytes_EPIC hyper..
## cg14855367 chr3 191048308 * | cg14855367 Monocytes_EPIC hyper..
## cg25913761 chr15 90727560 * | cg25913761 Monocytes_EPIC hyper..
## cg21546950 chr1 77904032 * | cg21546950 Monocytes_EPIC hyper..
## cg14981189 chr10 5113871 * | cg14981189 Monocytes_EPIC hyper..
## pos n_cpgs_100bp
## <integer> <integer>
## cg18856478 43814358 7
## cg20820767 45082840 12
## cg14855367 191048308 1
## cg25913761 90727560 4
## cg21546950 77904032 1
## cg14981189 5113871 1
## -------
## seqinfo: 22 sequences from hg19 genome; no seqlengths
data(hg19.ref.anc.se)
print(hg19.ref.anc.se)
## class: RangedSummarizedExperiment
## dim: 1000 25
## metadata(0):
## assays(1): beta
## rownames(1000): 353534 76294 ... 73948 87963
## rowData names(7): probe pos ... avg_beta_rank n_cpgs_100bp
## colnames(25): Monocytes_EPIC B-cells_EPIC ... Upper_GI Uterus_cervix
## colData names(0):
head(granges(hg19.ref.anc.se))
## GRanges object with 6 ranges and 7 metadata columns:
## seqnames ranges strand | probe pos label
## <Rle> <IRanges> <Rle> | <character> <integer> <character>
## 353534 chr19 19496477 * | cg15264323 19496477 All-tissue U
## 76294 chr1 154193656 * | cg13576006 154193656 All-tissue U
## 300825 chr12 860787 * | cg13284045 860787 All-tissue U
## 369864 chr5 43514988 * | cg20545087 43514988 All-tissue U
## 247131 chr16 278788 * | cg06819375 278788 All-tissue U
## 308091 chr12 96429111 * | cg09725090 96429111 All-tissue U
## margin avg_beta avg_beta_rank n_cpgs_100bp
## <numeric> <numeric> <numeric> <integer>
## 353534 0.1 0.063240 70720.0 10
## 76294 0.1 0.013492 19877.5 7
## 300825 0.1 0.029155 36307.5 1
## 369864 0.1 0.038771 45387.5 6
## 247131 0.1 0.032970 39748.5 2
## 308091 0.1 0.076555 77269.0 7
## -------
## seqinfo: 22 sequences from hg19 genome; no seqlengths
The main function decemedip() fits the decemedip model.
It allows two types of input:
We provide instructions for both input types as follows.
Note: By default, the decemedip() function uses the hg19
reference panels. Users may add the arguments
ref_cts = hg38.ref.cts.se, ref_anc = hg38.ref.anc.se to
apply read count extraction to hg38 data.
We use built-in objects of the package that contain read counts of the prostate tumor patient-derived xenograft (PDX) samples from the Berchuck et al. (2022) study to demonstrate the output and diagnostics in this vignette.
Note that in this example, we use only a subset of the cell types from the default reference panel to reduce the build time of the vignette. Only blood cells and prostate are included.
data(example.hg19.ref.cts.se)
data(example.hg19.ref.anc.se)
data(example.pdx.counts.cts.se)
data(example.pdx.counts.anc.se)
We extract the sample LuCaP_147CR from this example
dataset for the following illustration:
# read counts of cell type-specific CpGs of the sample 'LuCaP_147CR'
counts_cts <- assay(example.pdx.counts.cts.se)[, "LuCaP_147CR"]
# read counts of anchor CpGs of the sample 'LuCaP_147CR'
counts_anc <- assay(example.pdx.counts.anc.se)[, "LuCaP_147CR"]
Due to the vignette running time limit imposed by Bioconductor, we run only
500 iterations (iter = 500) for demonstration purposes,
which triggers the warnings below that the Effective Samples Size (ESS)
is too low. In regular use, we recommend running 2000 iterations
(the default) for stable posterior inference.
output <- decemedip(
  counts_cts = counts_cts,
  counts_anc = counts_anc,
  ref_cts = example.hg19.ref.cts.se,
  ref_anc = example.hg19.ref.anc.se,
  diagnostics = TRUE,
  cores = 4,
  iter = 500
)
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
## MCMC converged with seed 2024
The output is a list containing two elements:
data_list: An organized list of variables used as input to the Stan posterior sampling function.
posterior: A stanfit object produced by Stan representing the fitted posteriors.
After running the model, you may extract and save the summary of the
fitted posteriors using the monitor() and
extract() functions provided by the RStan
package. See the RStan documentation for details of these
functions.
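A posterior summary of this kind reduces a matrix of draws (one row per draw, one column per parameter) to a few statistics per parameter. The base R sketch below shows the idea on simulated draws. The draws, the helper `summarize_column()`, and the parameter names are all hypothetical, and nothing here is output from decemedip or RStan. Diagnostics such as n_eff and Rhat depend on chain structure and autocorrelation, so they are left to RStan.

```r
# Illustrative sketch on simulated draws; not output from decemedip or RStan.
set.seed(42)

n_draws <- 1000
# Hypothetical draws for three proportions: gamma variates normalized by row,
# so every row is positive and sums to 1. With byrow = TRUE the recycled
# shape vector gives each column its own shape parameter.
raw <- matrix(rgamma(n_draws * 3, shape = c(1, 2, 7)), ncol = 3, byrow = TRUE)
draws <- raw / rowSums(raw)
colnames(draws) <- c("pi[1]", "pi[2]", "pi[3]")

# Hypothetical helper: posterior mean, standard deviation and percentiles.
summarize_column <- function(x) {
  c(mean = mean(x), sd = sd(x),
    quantile(x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)))
}

# One row per parameter, one column per statistic.
smr <- t(apply(draws, 2, summarize_column))
print(round(smr, 3))
```

The 2.5% and 97.5% columns of `smr` bound the 95% credible interval of each simulated proportion.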
Extract the fitted posterior of cell type proportions (\(\boldsymbol\pi\)):
smr_pi.df <- getSummaryOnPi(output$posterior, cell_type_names = colnames(example.hg19.ref.cts.se))
print(smr_pi.df)
## cell_type mean se_mean sd 2.5% 25% 50%
## pi[1] Monocytes_EPIC 0.040 0.0033 0.072 1.5e-08 1.3e-04 0.00288
## pi[2] B-cells_EPIC 0.018 0.0020 0.045 6.4e-08 5.2e-05 0.00102
## pi[3] CD4T-cells_EPIC 0.015 0.0020 0.040 2.8e-08 3.7e-05 0.00083
## pi[4] NK-cells_EPIC 0.012 0.0014 0.031 4.3e-08 7.6e-05 0.00080
## pi[5] CD8T-cells_EPIC 0.013 0.0018 0.040 2.0e-08 5.0e-05 0.00073
## pi[6] Neutrophils_EPIC 0.021 0.0020 0.046 1.2e-07 1.4e-04 0.00124
## pi[7] Erythrocyte_progenitors 0.014 0.0014 0.033 1.0e-07 9.5e-05 0.00120
## pi[8] Prostate 0.867 0.0073 0.122 5.8e-01 7.9e-01 0.89643
## 75% 97.5% n_eff Rhat valid
## pi[1] 0.0527 0.25 459 1 1
## pi[2] 0.0100 0.17 489 1 1
## pi[3] 0.0080 0.13 415 1 1
## pi[4] 0.0065 0.11 501 1 1
## pi[5] 0.0057 0.11 473 1 1
## pi[6] 0.0160 0.18 551 1 1
## pi[7] 0.0086 0.12 509 1 1
## pi[8] 0.9743 1.00 282 1 1
Columns of smr_pi.df:
cell_type: The name of the cell type whose proportion is summarized.
mean: The posterior mean, representing the point estimate of the parameter.
se_mean: The standard error of the mean, calculated as sd / sqrt(n_eff), indicating the precision of the mean estimate.
sd: The posterior standard deviation, representing the spread or uncertainty of the parameter estimate.
2.5%, 25%, 50% (median), 75%, 97.5%: Percentiles of the posterior distribution, providing a summary of parameter uncertainty. The 2.5% and 97.5% percentiles define the 95% credible interval.
n_eff: The effective sample size, indicating how many independent samples the chain produced after accounting for autocorrelation.
Rhat: The potential scale reduction factor, measuring chain convergence. Values close to 1.00 suggest good convergence.
valid: A flag indicating whether diagnostic checks (e.g., Rhat and n_eff) passed for this parameter (1 = passed, 0 = potential issues).
Plot the fitted cell type proportions with credible intervals:
labels <- gsub("_", " ", smr_pi.df$cell_type)
labels <- gsub("(.*) EPIC", "\\1", labels)
# Set the factor levels explicitly (in row order) so that each label is
# matched to its own cell type rather than to the alphabetically sorted levels.
smr_pi.df |>
  mutate(cell_type = factor(cell_type, levels = as.character(cell_type), labels = labels)) |>
  ggplot(aes(cell_type, mean)) +
  geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`),
    position = position_dodge2(width = 0.035),
    linewidth = 7, alpha = 0.3
  ) +
  geom_linerange(aes(ymin = `25%`, ymax = `75%`),
    position = position_dodge2(width = 0.035),
    linewidth = 7, alpha = 1
  ) +
  geom_point(
    position = position_dodge2(width = 0.035),
    fill = "white", shape = 21, size = 8
  ) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Note that the diagnostic plot of fitted read counts is only accessible when
diagnostics is set to TRUE in the decemedip() function. The
actual read counts (orange) and the fitted counts predicted by the GAM
component (black) are shown in the figure across varying levels of CpG
density. The grey area represents the 95% credible intervals of the
predicted counts. 'CpG density: x' means that there are x CpGs in the
100-bp window surrounding the CpG.
The decemedip package provides a robust framework for
cell type deconvolution from MeDIP-seq data. By following this vignette,
users can apply the method to their own datasets, extract key model
outputs, and generate diagnostic plots for analysis.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] rstan_2.32.7 StanHeaders_2.32.10
## [3] decemedip_0.99.8 ggplot2_4.0.1
## [5] dplyr_1.1.4 SummarizedExperiment_1.41.0
## [7] Biobase_2.71.0 GenomicRanges_1.63.1
## [9] Seqinfo_1.1.0 IRanges_2.45.0
## [11] S4Vectors_0.49.0 BiocGenerics_0.57.0
## [13] generics_0.1.4 MatrixGenerics_1.23.0
## [15] matrixStats_1.5.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9 gridExtra_2.3
## [4] httr2_1.2.2 inline_0.3.21 biomaRt_2.67.1
## [7] rlang_1.1.7 magrittr_2.0.4 MEDIPS_1.63.0
## [10] otel_0.2.0 compiler_4.5.2 RSQLite_2.4.5
## [13] loo_2.9.0 png_0.1-8 vctrs_0.6.5
## [16] stringr_1.6.0 pkgconfig_2.0.3 crayon_1.5.3
## [19] fastmap_1.2.0 dbplyr_2.5.1 XVector_0.51.0
## [22] labeling_0.4.3 Rsamtools_2.27.0 rmarkdown_2.30
## [25] preprocessCore_1.73.0 purrr_1.2.1 bit_4.6.0
## [28] xfun_0.55 cachem_1.1.0 cigarillo_1.1.0
## [31] jsonlite_2.0.0 progress_1.2.3 blob_1.3.0
## [34] DelayedArray_0.37.0 BiocParallel_1.45.0 parallel_4.5.2
## [37] prettyunits_1.2.0 R6_2.6.1 bslib_0.9.0
## [40] stringi_1.8.7 RColorBrewer_1.1-3 rtracklayer_1.71.3
## [43] DNAcopy_1.85.0 jquerylib_0.1.4 Rcpp_1.1.1
## [46] knitr_1.51 R.utils_2.13.0 Matrix_1.7-4
## [49] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.12
## [52] codetools_0.2-20 curl_7.0.0 pkgbuild_1.4.8
## [55] lattice_0.22-7 tibble_3.3.1 withr_3.0.2
## [58] KEGGREST_1.51.1 S7_0.2.1 evaluate_1.0.5
## [61] RcppParallel_5.1.11-1 BiocFileCache_3.1.0 Biostrings_2.79.4
## [64] pillar_1.11.1 BiocManager_1.30.27 filelock_1.0.3
## [67] RCurl_1.98-1.17 hms_1.1.4 rstantools_2.6.0
## [70] scales_1.4.0 gtools_3.9.5 glue_1.8.0
## [73] maketools_1.3.2 tools_4.5.2 BiocIO_1.21.0
## [76] sys_3.4.3 BSgenome_1.79.1 GenomicAlignments_1.47.0
## [79] buildtools_1.0.0 XML_3.99-0.20 grid_4.5.2
## [82] QuickJSR_1.8.1 AnnotationDbi_1.73.0 restfulr_0.0.16
## [85] cli_3.6.5 rappdirs_0.3.3 S4Arrays_1.11.1
## [88] V8_8.0.1 gtable_0.3.6 R.methodsS3_1.8.2
## [91] sass_0.4.10 digest_0.6.39 SparseArray_1.11.10
## [94] rjson_0.2.23 farver_2.1.2 R.oo_1.27.1
## [97] memoise_2.0.1 htmltools_0.5.9 lifecycle_1.0.5
## [100] httr_1.4.7 bit64_4.6.0-1