Introduction to betterChromVAR

0.1 Introduction

The chromVAR R package was originally developed by Alicia Schep and colleagues from the Greenleaf lab (Schep et al. 2017). Although originally designed for single-cell ATAC-seq data, it has been shown to be highly sensitive for bulk data as well (Gerbaldo et al. 2024). The aim of the method is to infer, based on the accessibility of motif matches, the relative activity of transcription factors (TFs) in each sample or cell. It is recommended that you read the chromVAR documentation before using this package.

betterChromVAR is first and foremost a considerably faster, analytical re-implementation of the original method (it is also considerably faster than the C++ reimplementation in ArchR (Granja et al. 2021). Contrarily to the original chromVAR, it is entirely deterministic, and achieves much higher efficiency by computing expectations and variance at the level of bias bins, instead of in the peak-space.

In addition, betterChromVAR included a few additions, such as simpler weighted expectations, bias shrinkage, and an ATAC-seq normalization method based on the chromVAR logic.

0.2 Installing `betterChromVAR`

betterChromVAR is a R package available via the Bioconductor repository. It can be installed using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("betterChromVAR")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

0.3 Computing motif deviations with betterChromVAR

betterChromVAR takes two primary inputs: 1) the counts in peaks across cells or samples, and 2) an annotation of which TFs/motifs match which peak111 The annotation can be binary or probabilistic (i.e. from 0 to 1). If using motif matches, however, it is recommended to input binary matches, as the magnitude of motif scores is generally poorly correlated to actual binding.. Both can be provided as matrices or, better, as RangedSummarizedExperiment objects222 See SummarizedExperiment if you’re not familiar with those.

There are multiple ways of generating your peak counts; the original chromVAR package includes such a function (getCounts()), and the epiwraps package has some with more options (functions peakCountsFromBAM() and peakCountsFromFrags()). Similarly, the motifmatchr package can be used to generate the motif matching annotation. An important consideration is that the peaks or regions used for the purpose of this analysis should have similar widths. It is thus highly recommended, before generating the count and annotation matrices, to resize your regions (e.g. using peaks <- resize(peaks, width=300, fix="center")) – the exact size can be something not too large (otherwise the presence/absence of a motif becomes meaningless) and ideally close to the median size of your original peaks. In addition, is it advisable to restrict your peaks to those that are on standard chromosomes333 See keepStandardChromosomes() from the GenomeInfoDb package.

Here we’ll use dummy data as example:

suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(betterChromVAR)
})
attach(getDummyData())
counts

## class: RangedSummarizedExperiment 
## dim: 500 10 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(1): bias
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):

head(motifMatches)

## 6 x 5 sparse Matrix of class "dgCMatrix"
##      motif1 motif2 motif3 motif4 motif5
## [1,]      .      .      .      .      .
## [2,]      .      1      .      .      .
## [3,]      1      .      .      .      .
## [4,]      .      .      .      1      .
## [5,]      .      .      .      .      1
## [6,]      .      .      .      .      .

In this case, we can see that rowData(counts) already has a bias column indicating the GC content of the regions:

rowData(counts)

## DataFrame with 500 rows and 1 column
##          bias
##     <numeric>
## 1    0.500409
## 2    0.487818
## 3    0.436871
## 4    0.528755
## 5    0.645516
## ...       ...
## 496  0.419372
## 497  0.441628
## 498  0.531503
## 499  0.513935
## 500  0.511592

Had this not been the case, we would first need to add is using:

# not run
counts <- addGCBias(counts, genome=my_genome)

where my_genome is a BSgenome object or similar (e.g. an FaFile444 See the Rsamtools package.).

Once we have this, we can launch the computation of the deviations :

dev <- betterChromVAR(counts, motifMatches)
dev

## class: SummarizedExperiment 
## dim: 5 10 
## metadata(0):
## assays(2): deviations z
## rownames(5): motif1 motif2 motif3 motif4 motif5
## rowData names(5): N total variability var.pval var.adjPval
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):

The resulting dev object is a SummarizedExperiment with the same columns (and colData) as the original counts object, but with the motifs as rows (instead of the original peaks). The values can be interpreted as the relative activity, across samples or cells, of the corresponding TFs555 Note, however, that similar motifs will be given similar activity estimations, so that it is often hard to know which of a set of highly-similar motifs is in fact responsible for the observed signal.. When the function is used with default arguments, the results are virtually the same as the original chromVAR (averaging over the noise coming from the random background selection of the original method).

The object contains two assays: the deviations assay contains the bias-adjusted deviations from the expectation (by default, the average), i.e. the difference to the expectation divided by the expectation, and the z assay contains z-scores, i.e. the difference to the expectation divided by the variance of the expectation.

If your samples/cells have similar library sizes, it is recommended that you use the z assay for downstream analysis (such as differential TF activity using limma). If the samples have very different library sizes, the z scores will be influenced by that, and it might be preferable to use the deviations assay.

The variability of each motif across the dataset is stored in the rowData of the object :

rowData(dev)

## DataFrame with 5 rows and 5 columns
##                N     total variability  var.pval var.adjPval
##        <numeric> <numeric>   <numeric> <numeric>   <numeric>
## motif1        73     38037    0.537774  0.977990    0.995756
## motif2        86     43452    0.475605  0.990889    0.995756
## motif3        79     36337    0.581370  0.962610    0.995756
## motif4        68     36796    0.723354  0.858888    0.995756
## motif5        89     50301    0.429746  0.995756    0.995756

If bootstrap confidence intervals of the variability are additionally desired, one can simply use the computeVariability function of chromVAR on the present output (i.e. dev).

0.3.1 Individual steps of betterChromVAR

The betterChromVAR() function is actually a wrapper around three steps, which can also be executed individually for more customization, or to avoid repeating some computation multiple times.

The first step is creating the bias bin and their pairwise sampling probabilities. This is achieve with:

bg <- getBackgroundBins(counts)

## Creating 50*50=2500 bias bins and computing their sampling distances

bg

## bcvBackground object with 500 peaks,
##  split into  50*50  ( 2500 ) bins.

The second step is computing the per-bin background expectations and variances for each sample or cell:

bg <- computeBackgrounds(counts, bg)
bg

## bcvBackground object with 500 peaks,
##  split into  50*50  ( 2500 ) bins.
## Background data filled for  10  samples.

The bg object has now been filled with the additional information. It can now be used for:

dev <- computeDeviationsAnalytic(counts, background=bg, annotations=motifMatches)
dev

## class: SummarizedExperiment 
## dim: 5 10 
## metadata(0):
## assays(2): deviations z
## rownames(5): motif1 motif2 motif3 motif4 motif5
## rowData names(5): N total variability var.pval var.adjPval
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):

This last call could be made with different annotations, without needed to recompute the previous steps.

0.3.2 Including fragment length bias

In addition to the enrichment and GC bias taken into account by the original chromVAR, betterChromVAR supports an optional third dimension: fragment length bias (see ?getBackgroundBins() for more information). This requires the compilation of an additional bias component (the log10-transformed mean or median fragment length per region). If desired, desired, we recommended using the counting functions from the epiwraps package, which can provide this information.

0.3.3 Note on single-cell data

If you are using betterChromVAR (or ChromVAR, for that matter) on single-cell data with multiple cell types, the way to best run it depends on whether your are interested in differences within cell types, or between cell types. If interested in differences between cell types, run it on the entire dataset, specifying the cell types as grouping argument to betterChromVAR(). In this way, rare and abundant cell types will be given the same weight in computing the expectation.

If you are interested in differences within cell types (e.g. between samples/conditions), you should instead run the method separately for each cell type. In this way, the background bins will be defined based on the enrichment in the given cell type, leading to better capture of the bias. The drawback, however, is that the values won’t be comparable across cell types. To test across conditions, we also recommend using it on pseudobulk data.

0.3.3.1 Execution on Atlas-scale datasets

The only step of the process that needs to be done across the entire dataset is the computing of the expectation, i.e. the mean (or weighted mean) for each region across all cells, and the creation of the background bins, which depends on it. Everything else can easily be executed in chunks (of cells), as is done for multi-threading in the betterChromVAR() function. This can be done manually on the full dataset, and then passed to downstream functions for computations on subsets of the data:

ex <- getExpectation(counts)
bg <- getBackgroundBins(ex, bias=rowData(counts)$bias)

## Creating 50*50=2500 bias bins and computing their sampling distances

# this is equivalent to bg <- getBackgroundBins(counts)

Then we can apply the next steps only on subset of the data:

bg2 <- computeBackgrounds(counts[,1:3], bg, expectation = ex)
dev2 <- computeDeviationsAnalytic(counts[,1:3], bg2, motifMatches)
# this should be identical to what we had run on the whole object:
identical(assay(dev)[,1:3], assay(dev2))

## [1] TRUE

If data is on disk, rather than in memory, and you are adjusting the expectation based on cell-type abundance using the grouping argument, getBackgroundBins will load the entire count matrix into memory. To avoid this, compute the mean per group using block-wise operations, for instance using the aggregateAcrossCells() function of the scrapper package, and then use the rowMeans() of that as expectation.

0.4 Bias-normalization of bulk ATAC-seq data

The general chromVAR approach, and in particular the deterministic version implemented here, is also amenable to be used to normalize GC- and enrichment bias out of bulk ATAC-seq data. This is implemented in the CVnorm() function. In addition, if given a grouping of the samples (e.g. experimental conditions), the function applies a variance-based smoothing of the bias correction, inspired by smoothed quantile normalization (see the qsmooth package or Hicks et al. (2018) ). In a nutshell, if the bias in a certain background bin is explained by experimental groups, it will be less corrected than if it varies across samples of the groups666 Note that this is different from the original qsmooth approach – see ?CVnorm for more detail..

Example usage:

# we assign arbitrary groups to the samples:
counts$group <- rep(LETTERS[1:2], each=5)
# we run the smoothed CVnorm:
counts <- CVnorm(counts, grouping=counts$group)
# (the normal CVnorm could be run by omitting the grouping)
counts

## class: RangedSummarizedExperiment 
## dim: 500 10 
## metadata(0):
## assays(2): counts corrected
## rownames: NULL
## rowData names(1): bias
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(1): group

This adds a corrected assay to the object. Note that although the assay is corrected for enrichment- and GC-bias, it is not corrected for library size differences. Rather, it is on the original count scale (although not integer anymore), so that it is amenable to use in downstream count-based analysis methods such as edgeR.

0.5 References

Gerbaldo, Felix Ezequiel, Emanuel Sonder, Vincent Fischer, Selina Frei, Jiayi Wang, Katharina Gapp, Mark D Robinson, and Pierre-Luc Germain. 2024. “On the Identification of Differentially-Active Transcription Factors from Atac-Seq Data.” PLoS Computational Biology 20 (10): e1011971.

Granja, Jeffrey M, M Ryan Corces, Sarah E Pierce, S Tansu Bagdatli, Hani Choudhry, Howard Y Chang, and William J Greenleaf. 2021. “ArchR Is a Scalable Software Package for Integrative Single-Cell Chromatin Accessibility Analysis.” Nature Genetics 53 (3): 403–11.

Hicks, Stephanie C, Kwame Okrah, Joseph N Paulson, John Quackenbush, Rafael A Irizarry, and Héctor Corrada Bravo. 2018. “Smooth Quantile Normalization.” Biostatistics 19 (2): 185–98.

Schep, Alicia N, Beijing Wu, Jason D Buenrostro, and William J Greenleaf. 2017. “ChromVAR: Inferring Transcription-Factor-Associated Accessibility from Single-Cell Epigenomic Data.” Nature Methods 14 (10): 975–78.

0.6 Session info

## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] betterChromVAR_0.99.35      SummarizedExperiment_1.41.1
##  [3] Biobase_2.71.0              GenomicRanges_1.63.2       
##  [5] Seqinfo_1.1.0               IRanges_2.45.0             
##  [7] S4Vectors_0.49.1            BiocGenerics_0.57.0        
##  [9] generics_0.1.4              MatrixGenerics_1.23.0      
## [11] matrixStats_1.5.0           BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-5        jsonlite_2.0.0      crayon_1.5.3       
##  [4] compiler_4.6.0      BiocManager_1.30.27 Biostrings_2.79.5  
##  [7] parallel_4.6.0      jquerylib_0.1.4     BiocParallel_1.45.0
## [10] yaml_2.3.12         fastmap_1.2.0       lattice_0.22-9     
## [13] R6_2.6.1            XVector_0.51.0      S4Arrays_1.11.1    
## [16] knitr_1.51          DelayedArray_0.37.1 bookdown_0.46      
## [19] bslib_0.10.0        rlang_1.2.0         cachem_1.1.0       
## [22] xfun_0.57           sass_0.4.10         otel_0.2.0         
## [25] SparseArray_1.11.13 cli_3.6.6           digest_0.6.39      
## [28] grid_4.6.0          lifecycle_1.0.5     evaluate_1.0.5     
## [31] codetools_0.2-20    abind_1.4-8         rmarkdown_2.31     
## [34] tools_4.6.0         htmltools_0.5.9