From the perspective of metabolites as the continuation of the central dogma of biology, metabolomics provides the closest link to many phenotypes of interest. This makes untargeted LC-MS metabolomics data promising in teasing apart the complexities of living systems. However, due to experimental reasons, the data includes non-wanted variation which limits quality and reproducibility, especially if the data is obtained from several batches.
The batchCorr package reduces unwanted variation by way of between-batch alignment, within-batch drift correction and between-batch normalization using batch-specific quality control (QC) samples and long-term reference QC samples. Please see the associated article (Brunius, Shi, and Landberg 2016) for more thorough descriptions of algorithms.
To install batchCorr, install BiocManager first, if it
is not installed. Afterwards use the install function from
BiocManager.
The example data allows for demonstration of all core
batchCorr functionality: between-batch alignment,
within-batch drift correction and between-batch normalization. The
example data consisting of three batches from a single analytical mode
consists of three objects: PTnofill (non-imputed/filled abundances,
matrix), PTfilled (imputed/filled abundances) and meta (sample and
feature metadata, data.frame).
batchCorr was originally designed to work with basic
data structures but the three main methods, alignBatches,
correctDrift and normalizeBatches also support
SummarizedExperiment. This improves interoperability with other
Bioconductor packages. This includes xcms for preprocessing, the
qmtools, phenomis and/or pmp packages to complement normalization and
quality control as well as statistical tests, machine learning and
annotation.
Abundances are included in a matrix, while sample and feature data is
included in a data.frame. batchCorr works best as per the
chronology presented below, where both the original functionality and
SummarizedExperiment-functionality is demonstrated.
Utility functions for the original functionality include:
peakInfo to extract m/z and rt from peak table based on
a separator in rownamesgetBatch to extract specific batch from a list with a
peak table and metadatamergeBatches to merge batches after drift
correctionImportant analytical background includes batch-specific QC samples and long-term reference QC samples, which are regularly interspersed in the injection sequence. Batch-specific QC samples are typically pooled aliquots of study samples, and are used for within-batch drift correction. Long-term reference QC samples are not of the same biological origin as the batch-specific QC samples, and are therefore not directly representative of the sample population. Long-term reference QC samples are used for between-batch alignment, within-batch drift correction and between-batch normalization.
Let’s create a SummarizedExperiment object in order to demonstrate
the original functionality and the new SummarizedExperiment methods in
parallel. SummarizedExperiment may also be output from xcms-based
preprocessing using xcms::quantify().
peaks <- SimpleList(t(PTnofill), t(PTfill))
sampleData <- meta
featureData <- peakInfo(PT = PTnofill, sep = "@", start = 3)
rownames(featureData) <- rownames(peaks[[1]])
se <- SummarizedExperiment(assays = peaks, colData = sampleData,
rowData = featureData)
names(assays(se)) <- c("nofill", "fill")Below, we focus on basic usage. The list output for basic data structures includes processing metadata which can be used for troubleshooting. The SummarizedExperiment methods return objects with modified peak tables. Please see the documentation for more information (for example, ?alignBatches).
Shifts in retention time (RT) and mass-to-charge ratio (m/z) across
batches results in some metabolites being redundantly represented in the
dataset. To rectify this, between-batch alignment of features is
performed using alignBatches, which encompasses the
following steps:
NAs to all samples > 80% based on long-term reference QC
samples# Extract peakinfo (i.e. m/z and rt of features),
# These column names have 2 leading characters describing LC-MS mode
# -> start at 3
peakIn <- peakInfo(PT = PTnofill, sep = "@", start = 3)
# Perform multi-batch alignment
alignBat <- alignBatches(
peakInfo = peakIn, PeakTabNoFill = PTnofill,
PeakTabFilled = PTfill, batches = meta$batch,
sampleGroups = meta$grp, selectGroup = "QC",
report = FALSE
)
# Extract new peak table
PT <- alignBat$PTalignBelow, we use SummarizedExperiment with multiple peak tables. When
using SummarizedExperiment sequentially such that a single peak table is
replaced by a new one, one doesn’t need to specify the
assay.type or name parameters. The assay is
added to the SummarizedExperiment supplied to
PeakTabFilled.
Drift in abundance within a batch gives rise to unwanted variation
which can be modelled in terms of injection order. Many methods fail to
take into account different drift patterns in features or are prone to
overfitting. Herein, within-batch drift correction is performed using
the wrapper correctDrift, which involves:
Fitting a cubic spline and calculation of correction factor
Correction of the abundances using correction factor
The mixture models used to cluster the drift patterns in
correctDrift() can fail to converge for some combinations
of geometry and cluster number. This results in missing Bayesian
Information Criterion (BIC) measures for some models. You can check from
which converged models the final model was selected using the “BIC”
element in the output for basic data structures. To learn more about the
model-based clustering used herein, refer to the mclust (e)book (Scrucca et al. 2023).
# Batch B
batchB <- getBatch(
peakTable = PT, meta = meta,
batch = meta$batch, select = "B"
)
BCorr <- correctDrift(
peakTable = batchB$peakTable,
injections = batchB$meta$inj,
sampleGroups = batchB$meta$grp, QCID = "QC",
G = seq(5, 35, by = 3), modelNames = c("VVE", "VEE"),
report = FALSE
)
# Batch F
batchF <- getBatch(
peakTable = PT, meta = meta,
batch = meta$batch, select = "F"
)
FCorr <- correctDrift(
peakTable = batchF$peakTable,
injections = batchF$meta$inj,
sampleGroups = batchF$meta$grp,
QCID = "QC", G = seq(5, 35, by = 3),
modelNames = c("VVE", "VEE"),
report = FALSE
)
# Batch H
batchH <- getBatch(
peakTable = PT, meta = meta,
batch = meta$batch, select = "H"
)
HCorr <- correctDrift(
peakTable = batchH$peakTable,
injections = batchH$meta$inj,
sampleGroups = batchH$meta$grp,
QCID = "QC", G = seq(5, 35, by = 3),
modelNames = c("VVE", "VEE"),
report = FALSE
)
HCorr$BIC
## Bayesian Information Criterion (BIC):
## VVE VEE
## 5 7471.557 7028.343
## 8 7608.446 7460.302
## 11 7471.978 7456.081
##
## Top 3 models based on the BIC criterion:
## VVE,8 VVE,11 VVE,5
## 7608.446 7471.978 7471.557Similarly, we subset the SummarizedExperiment by batch and perform drift correction, but for this example using long-term reference samples for quality indicators.
batch_labels <- unique(colData(se)$batch)
batches <- lapply(batch_labels, function(batch_label) {
se[, colData(se)$batch == batch_label]
})
batches <- lapply(batches, correctDrift, injections = "inj",
sampleGroups = "grp", RefID = "Ref", G = seq(5, 35, by = 3),
modelNames = c("VVE", "VEE"), report = FALSE,
assay.type = "aligned", name = "corrected")normalizeBatches performs between-batch normalization
either based on long-term reference QC samples or median batch intensity
depending on the following dual criterion:
If the long-term QC samples are not considered reliable according to
the above dual criterion for a specific feature, batches were normalized
by sample population median, where a sample population can be specified
explicitly to the population argument. Features not present
in all batches are also excluded from the dataset.
mergedData <- mergeBatches(list(BCorr, FCorr, HCorr), qualRatio = 0.5)
normData <- normalizeBatches(
peakTableCorr = mergedData$peakTableCorr,
batches = meta$batch, sampleGroup = meta$grp,
refGroup = "Ref", population = "all"
)
PTnorm <- normData$peakTableMerging with mergeBatches, as above, includes a quality
control step, where features with CV > the limit (supplied to
correctDrift()) in a specified proportion of batches (default = 0.5) are
excluded. For SummarizedExperiment, we join the batches, keeping
features shared across all batches but without filtering features by
proportion of quality batches.
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] batchCorr_1.1.0 SummarizedExperiment_1.41.0
## [3] Biobase_2.71.0 GenomicRanges_1.63.0
## [5] Seqinfo_1.1.0 IRanges_2.45.0
## [7] S4Vectors_0.49.0 BiocGenerics_0.57.0
## [9] generics_0.1.4 MatrixGenerics_1.23.0
## [11] matrixStats_1.5.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-4 jsonlite_2.0.0 compiler_4.5.1
## [4] BiocManager_1.30.26 Rcpp_1.1.0 parallel_4.5.1
## [7] jquerylib_0.1.4 BiocParallel_1.45.0 yaml_2.3.10
## [10] fastmap_1.2.0 lattice_0.22-7 plyr_1.8.9
## [13] R6_2.6.1 XVector_0.51.0 S4Arrays_1.11.0
## [16] knitr_1.50 DelayedArray_0.37.0 maketools_1.3.2
## [19] bslib_0.9.0 rlang_1.1.6 reshape_0.8.10
## [22] cachem_1.1.0 xfun_0.54 sass_0.4.10
## [25] sys_3.4.3 SparseArray_1.11.1 cli_3.6.5
## [28] digest_0.6.37 grid_4.5.1 mclust_6.1.2
## [31] lifecycle_1.0.4 evaluate_1.0.5 codetools_0.2-20
## [34] buildtools_1.0.0 abind_1.4-8 rmarkdown_2.30
## [37] tools_4.5.1 htmltools_0.5.8.1