betterChromVAR 0.99.35
The chromVAR R package was originally developed by Alicia Schep and colleagues from the Greenleaf lab (Schep et al. 2017). Although originally designed for single-cell ATAC-seq data, it has been shown to be highly sensitive for bulk data as well (Gerbaldo et al. 2024). The aim of the method is to infer, based on the accessibility of motif matches, the relative activity of transcription factors (TFs) in each sample or cell. It is recommended that you read the chromVAR documentation before using this package.
betterChromVAR is first and foremost a considerably faster, analytical re-implementation of the original method (it is also considerably faster than the C++ reimplementation in ArchR (Granja et al. 2021). Contrarily to the original chromVAR, it is entirely deterministic, and achieves much higher efficiency by computing expectations and variance at the level of bias bins, instead of in the peak-space.
In addition, betterChromVAR included a few additions, such as simpler weighted expectations, bias shrinkage, and an ATAC-seq normalization method based on the chromVAR logic.
betterChromVARbetterChromVAR is a R package available via the
Bioconductor repository. It can be installed using
the following commands in your R session:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("betterChromVAR")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
betterChromVAR takes two primary inputs: 1) the counts in peaks
across cells or samples, and 2) an annotation of which TFs/motifs match which
peak1 The annotation can be binary or probabilistic (i.e. from 0 to 1). If using
motif matches, however, it is recommended to input binary matches, as the
magnitude of motif scores is generally poorly correlated to actual binding..
Both can be provided as matrices or, better, as RangedSummarizedExperiment
objects2 See SummarizedExperiment if you’re not familiar with
those.
There are multiple ways of generating your peak counts; the original
chromVAR package includes such a function (getCounts()), and the
epiwraps package has some with more
options (functions peakCountsFromBAM() and peakCountsFromFrags()). Similarly,
the motifmatchr package can be used to generate the motif
matching annotation. An important consideration is that the peaks or regions
used for the purpose of this analysis should have similar widths. It is thus
highly recommended, before generating the count and annotation matrices, to
resize your regions (e.g. using peaks <- resize(peaks, width=300, fix="center"))
– the exact size can be something not too large (otherwise the presence/absence
of a motif becomes meaningless) and ideally close to the median size of your
original peaks. In addition, is it advisable to restrict your peaks to those
that are on standard chromosomes3 See keepStandardChromosomes() from the
GenomeInfoDb package.
Here we’ll use dummy data as example:
suppressPackageStartupMessages({
library(SummarizedExperiment)
library(betterChromVAR)
})
attach(getDummyData())
counts
## class: RangedSummarizedExperiment
## dim: 500 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(1): bias
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):
head(motifMatches)
## 6 x 5 sparse Matrix of class "dgCMatrix"
## motif1 motif2 motif3 motif4 motif5
## [1,] . . . . .
## [2,] . 1 . . .
## [3,] 1 . . . .
## [4,] . . . 1 .
## [5,] . . . . 1
## [6,] . . . . .
In this case, we can see that rowData(counts) already has a bias column
indicating the GC content of the regions:
rowData(counts)
## DataFrame with 500 rows and 1 column
## bias
## <numeric>
## 1 0.500409
## 2 0.487818
## 3 0.436871
## 4 0.528755
## 5 0.645516
## ... ...
## 496 0.419372
## 497 0.441628
## 498 0.531503
## 499 0.513935
## 500 0.511592
Had this not been the case, we would first need to add is using:
# not run
counts <- addGCBias(counts, genome=my_genome)
where my_genome is a BSgenome object or similar (e.g. an
FaFile4 See the Rsamtools package.).
Once we have this, we can launch the computation of the deviations :
dev <- betterChromVAR(counts, motifMatches)
dev
## class: SummarizedExperiment
## dim: 5 10
## metadata(0):
## assays(2): deviations z
## rownames(5): motif1 motif2 motif3 motif4 motif5
## rowData names(5): N total variability var.pval var.adjPval
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):
The resulting dev object is a SummarizedExperiment with the same columns
(and colData) as the original counts object, but with the motifs as rows
(instead of the original peaks). The values can be interpreted as the relative
activity, across samples or cells, of the corresponding TFs5 Note, however,
that similar motifs will be given similar activity estimations, so that it is
often hard to know which of a set of highly-similar motifs is in fact
responsible for the observed signal.. When the function is used with default
arguments, the results are virtually the same as the original chromVAR
(averaging over the noise coming from the random background selection of the
original method).
The object contains two assays: the deviations assay contains the
bias-adjusted deviations from the expectation (by default, the average), i.e.
the difference to the expectation divided by the expectation, and the z assay
contains z-scores, i.e. the difference to the expectation divided by the
variance of the expectation.
If your samples/cells have similar library sizes, it is recommended that you
use the z assay for downstream analysis (such as differential TF activity
using limma). If the samples have very different library sizes, the z scores
will be influenced by that, and it might be preferable to use the deviations
assay.
The variability of each motif across the dataset is stored in the rowData of the object :
rowData(dev)
## DataFrame with 5 rows and 5 columns
## N total variability var.pval var.adjPval
## <numeric> <numeric> <numeric> <numeric> <numeric>
## motif1 73 38037 0.537774 0.977990 0.995756
## motif2 86 43452 0.475605 0.990889 0.995756
## motif3 79 36337 0.581370 0.962610 0.995756
## motif4 68 36796 0.723354 0.858888 0.995756
## motif5 89 50301 0.429746 0.995756 0.995756
If bootstrap confidence intervals of the variability are additionally desired,
one can simply use the computeVariability function of chromVAR
on the present output (i.e. dev).
The betterChromVAR() function is actually a wrapper around three steps, which
can also be executed individually for more customization, or to avoid repeating
some computation multiple times.
The first step is creating the bias bin and their pairwise sampling probabilities. This is achieve with:
bg <- getBackgroundBins(counts)
## Creating 50*50=2500 bias bins and computing their sampling distances
bg
## bcvBackground object with 500 peaks,
## split into 50*50 ( 2500 ) bins.
The second step is computing the per-bin background expectations and variances for each sample or cell:
bg <- computeBackgrounds(counts, bg)
bg
## bcvBackground object with 500 peaks,
## split into 50*50 ( 2500 ) bins.
## Background data filled for 10 samples.
The bg object has now been filled with the additional information. It can now
be used for:
dev <- computeDeviationsAnalytic(counts, background=bg, annotations=motifMatches)
dev
## class: SummarizedExperiment
## dim: 5 10
## metadata(0):
## assays(2): deviations z
## rownames(5): motif1 motif2 motif3 motif4 motif5
## rowData names(5): N total variability var.pval var.adjPval
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):
This last call could be made with different annotations, without needed to recompute the previous steps.
In addition to the enrichment and GC bias taken into account by the original
chromVAR, betterChromVAR supports an optional
third dimension: fragment length bias (see ?getBackgroundBins() for more
information). This requires the compilation of an additional bias component
(the log10-transformed mean or median fragment length per region). If desired,
desired, we recommended using the counting functions from the
epiwraps package, which can provide this
information.
If you are using betterChromVAR (or ChromVAR, for that matter) on single-cell
data with multiple cell types, the way to best run it depends on whether your
are interested in differences within cell types, or between cell types. If
interested in differences between cell types, run it on the entire dataset,
specifying the cell types as grouping argument to betterChromVAR(). In this
way, rare and abundant cell types will be given the same weight in computing
the expectation.
If you are interested in differences within cell types (e.g. between samples/conditions), you should instead run the method separately for each cell type. In this way, the background bins will be defined based on the enrichment in the given cell type, leading to better capture of the bias. The drawback, however, is that the values won’t be comparable across cell types. To test across conditions, we also recommend using it on pseudobulk data.
The only step of the process that needs to be done across the entire dataset is
the computing of the expectation, i.e. the mean (or weighted mean) for each
region across all cells, and the creation of the background bins, which depends
on it. Everything else can easily be executed in chunks (of cells), as is done
for multi-threading in the betterChromVAR() function. This can be done
manually on the full dataset, and then passed to downstream functions for
computations on subsets of the data:
ex <- getExpectation(counts)
bg <- getBackgroundBins(ex, bias=rowData(counts)$bias)
## Creating 50*50=2500 bias bins and computing their sampling distances
# this is equivalent to bg <- getBackgroundBins(counts)
Then we can apply the next steps only on subset of the data:
bg2 <- computeBackgrounds(counts[,1:3], bg, expectation = ex)
dev2 <- computeDeviationsAnalytic(counts[,1:3], bg2, motifMatches)
# this should be identical to what we had run on the whole object:
identical(assay(dev)[,1:3], assay(dev2))
## [1] TRUE
If data is on disk, rather than in memory, and you are adjusting the expectation
based on cell-type abundance using the grouping argument, getBackgroundBins
will load the entire count matrix into memory. To avoid this, compute the
mean per group using block-wise operations, for instance using the
aggregateAcrossCells() function of the scrapper package, and
then use the rowMeans() of that as expectation.
The general chromVAR approach, and in particular the deterministic version
implemented here, is also amenable to be used to normalize GC- and enrichment
bias out of bulk ATAC-seq data. This is implemented in the CVnorm() function.
In addition, if given a grouping of the samples (e.g. experimental conditions),
the function applies a variance-based smoothing of the bias correction,
inspired by smoothed quantile normalization (see the qsmooth
package or Hicks et al. (2018) ).
In a nutshell, if the bias in a certain background bin is explained by
experimental groups, it will be less corrected than if it varies across samples
of the groups6 Note that this is different from the original
qsmooth approach – see ?CVnorm for more detail..
Example usage:
# we assign arbitrary groups to the samples:
counts$group <- rep(LETTERS[1:2], each=5)
# we run the smoothed CVnorm:
counts <- CVnorm(counts, grouping=counts$group)
# (the normal CVnorm could be run by omitting the grouping)
counts
## class: RangedSummarizedExperiment
## dim: 500 10
## metadata(0):
## assays(2): counts corrected
## rownames: NULL
## rowData names(1): bias
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(1): group
This adds a corrected assay to the object. Note that although the assay is
corrected for enrichment- and GC-bias, it is not corrected for library size
differences. Rather, it is on the original count scale (although not integer
anymore), so that it is amenable to use in downstream count-based analysis
methods such as edgeR.
Gerbaldo, Felix Ezequiel, Emanuel Sonder, Vincent Fischer, Selina Frei, Jiayi Wang, Katharina Gapp, Mark D Robinson, and Pierre-Luc Germain. 2024. “On the Identification of Differentially-Active Transcription Factors from Atac-Seq Data.” PLoS Computational Biology 20 (10): e1011971.
Granja, Jeffrey M, M Ryan Corces, Sarah E Pierce, S Tansu Bagdatli, Hani Choudhry, Howard Y Chang, and William J Greenleaf. 2021. “ArchR Is a Scalable Software Package for Integrative Single-Cell Chromatin Accessibility Analysis.” Nature Genetics 53 (3): 403–11.
Hicks, Stephanie C, Kwame Okrah, Joseph N Paulson, John Quackenbush, Rafael A Irizarry, and Héctor Corrada Bravo. 2018. “Smooth Quantile Normalization.” Biostatistics 19 (2): 185–98.
Schep, Alicia N, Beijing Wu, Jason D Buenrostro, and William J Greenleaf. 2017. “ChromVAR: Inferring Transcription-Factor-Associated Accessibility from Single-Cell Epigenomic Data.” Nature Methods 14 (10): 975–78.
## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] betterChromVAR_0.99.35 SummarizedExperiment_1.41.1
## [3] Biobase_2.71.0 GenomicRanges_1.63.2
## [5] Seqinfo_1.1.0 IRanges_2.45.0
## [7] S4Vectors_0.49.1 BiocGenerics_0.57.0
## [9] generics_0.1.4 MatrixGenerics_1.23.0
## [11] matrixStats_1.5.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-5 jsonlite_2.0.0 crayon_1.5.3
## [4] compiler_4.6.0 BiocManager_1.30.27 Biostrings_2.79.5
## [7] parallel_4.6.0 jquerylib_0.1.4 BiocParallel_1.45.0
## [10] yaml_2.3.12 fastmap_1.2.0 lattice_0.22-9
## [13] R6_2.6.1 XVector_0.51.0 S4Arrays_1.11.1
## [16] knitr_1.51 DelayedArray_0.37.1 bookdown_0.46
## [19] bslib_0.10.0 rlang_1.2.0 cachem_1.1.0
## [22] xfun_0.57 sass_0.4.10 otel_0.2.0
## [25] SparseArray_1.11.13 cli_3.6.6 digest_0.6.39
## [28] grid_4.6.0 lifecycle_1.0.5 evaluate_1.0.5
## [31] codetools_0.2-20 abind_1.4-8 rmarkdown_2.31
## [34] tools_4.6.0 htmltools_0.5.9