BERT (Batch-Effect Removal with Trees) offers flexible and efficient batch effect correction of omics data, while providing maximum tolerance to missing values. Tested on multiple datasets from proteomic analyses, BERT offered a typical 5-10x runtime improvement over existing methods, while retaining more numeric values and preserving batch effect reduction quality.
As such, BERT is a valuable preprocessing tool for data analysis workflows, in particular for proteomic data. By providing BERT via Bioconductor, we make this tool available to a wider research community. An accompanying research paper is currently under preparation and will be made public soon.
BERT addresses the same fundamental data integration challenges than the [HarmonizR][https://github.com/HSU-HPC/HarmonizR] package, which is released on Bioconductor in November 2023. However, various algorithmic modications and optimizations of BERT provide better execution time and better data coverage than HarmonizR. Moreover, BERT offers a more user-friendly design and a less error-prone input format.
Please note that our package BERT is neither affiliated with nor related to Bidirectional Encoder Representations from Transformers as published by Google.
Please report any questions and issues in the GitHub forum, the BioConductor forum or directly contact the authors,
Please download and install a current version of R (Windows binaries). You might want to consider installing a development environment as well, e.g. RStudio. Finally, BERT can be installed via Bioconductor using
if (!require("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
BiocManager::install("BERT")which will install all required dependencies. To install the development version of BERT, you can use devtools as follows
which may require the manual installation of the dependencies
sva and limma.
As input, BERT requires a dataframe1 with samples in rows
and features in columns. For each sample, the respective batch should be
indicated by an integer or string in a corresponding column labelled
Batch. Missing values should be labelled as NA. A
valid example dataframe could look like this:
example = data.frame(feature_1 = stats::rnorm(5), feature_2 = stats::rnorm(5), Batch=c(1,1,2,2,2))
example
#> feature_1 feature_2 Batch
#> 1 -1.3520630 -0.8586532 1
#> 2 -1.3031982 -0.9456793 1
#> 3 -0.2093192 0.1091705 2
#> 4 0.3622840 0.1747156 2
#> 5 -2.1444545 0.5311959 2Note that each batch should contain at least two samples. Optional columns that can be passed are
Label A column with integers or strings indicating
the (known) class for each sample. NA is not allowed. BERT
may use this columns and Batch to compute quality metrics
after batch effect correction.
Sample A sample name. This column is ignored by BERT
and can be used to provide meta-information for further
processing.
Cov_1, Cov_2, …, Cov_x:
One or multiple columns with integers, indicating one or several
covariate levels. NA is not allowed. If this(these)
column(s) is present, BERT will pass them as covariates to the the
underlying batch effect correction method. As an example, this
functionality can be used to preserve differences between
healthy/tumorous samples, if some of the batches exhibit strongly
variable class distributions. Note that BERT requires at least two
numeric values per batch and unique covariate level to adjust a feature.
Features that don’t satisfy this condition in a specific batch are set
to NA for that batch.
Reference A column with integers or strings from
\(\mathbb{N}_0\) that indicate, whether
a sample should be used for “learning” the transformation for batch
effect correction or whether the sample should be co-adjusted using the
learned transformation from the other samples.NA is not
allowed. This feature can be used, if some batches contain unique
classes or samples with unknown classes which would prohibit the usage
of covariate columns. If the column contains a 0 for a
sample, this sample will be co-adjusted. Otherwise, the sample should
contain the respective class (encoded as integer or string). Note that
BERT requires at least two references of common class per adjustment
step and that the Reference column is mutually exclusive
with covariate columns.
Note that BERT tries to find all metadata information for a
SummarizedExperiment, including the mandatory batch
information, using colData. For instance, a valid
SummarizedExperiment might be defined as
nrows <- 200
ncols <- 8
expr_values <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
# colData also takes all other metadata information, such as Label, Sample,
# Covariables etc.
colData <- data.frame(Batch=c(1,1,1,1,2,2,2,2), Reference=c(1,1,0,0,1,1,0,0))
dataset_raw = SummarizedExperiment::SummarizedExperiment(assays=list(expr=expr_values), colData=colData)BERT can be invoked by importing the BERT library and
calling the BERT function. The batch effect corrected data
is returned as a dataframe that mirrors the input dataframe2.
library(BERT)
# generate test data with 10% missing values as provided by the BERT library
dataset_raw <- generate_dataset(features=60, batches=10, samplesperbatch=10, mvstmt=0.1, classes=2)
# apply BERT
dataset_adjusted <- BERT(dataset_raw)
#> 2025-11-11 13:17:42.143684 INFO::Formatting Data.
#> 2025-11-11 13:17:42.150595 INFO::Replacing NaNs with NAs.
#> 2025-11-11 13:17:42.15738 INFO::Removing potential empty rows and columns
#> 2025-11-11 13:17:42.362897 INFO::Found 600 missing values.
#> 2025-11-11 13:17:42.377245 INFO::Introduced 0 missing values due to singular proteins at batch/covariate level.
#> 2025-11-11 13:17:42.377716 INFO::Done
#> 2025-11-11 13:17:42.37815 INFO::Acquiring quality metrics before batch effect correction.
#> 2025-11-11 13:17:42.388057 INFO::Starting hierarchical adjustment
#> 2025-11-11 13:17:42.388722 INFO::Found 10 batches.
#> 2025-11-11 13:17:42.389147 INFO::Cores argument is not defined or BPPARAM has been specified. Argument corereduction will not be used.
#> 2025-11-11 13:17:44.041162 INFO::Using default BPPARAM
#> 2025-11-11 13:17:44.041659 INFO::Processing subtree level 1
#> 2025-11-11 13:17:45.373398 INFO::Processing subtree level 2
#> 2025-11-11 13:17:46.718878 INFO::Adjusting the last 1 batches sequentially
#> 2025-11-11 13:17:46.7202 INFO::Done
#> 2025-11-11 13:17:46.720599 INFO::Acquiring quality metrics after batch effect correction.
#> 2025-11-11 13:17:46.723815 INFO::ASW Batch was 0.437685527534883 prior to batch effect correction and is now -0.0993476078864879 .
#> 2025-11-11 13:17:46.724272 INFO::ASW Label was 0.362517823923227 prior to batch effect correction and is now 0.829234555589712 .
#> 2025-11-11 13:17:46.724985 INFO::Total function execution time is 4.59973621368408 s and adjustment time is 4.33169484138489 s ( 94.17 )BERT uses the logging library to convey live information
to the user during the adjustment procedure. The algorithm first
verifies the shape and suitability of the input dataframe (lines 1-6)
before continuing with the actual batch effect correction (lines 8-14).
BERT measure batch effects before and after the correction step by means
of the average silhouette score (ASW) with respect to batch and labels
(lines 7 and 15). The ASW Label should increase in a successful batch
effect correction, whereas low values (\(\leq
0\)) are desireable for the ASW Batch3. Finally, BERT prints
the total function execution time (including the computation time for
the quality metrics).
BERT offers a large number of parameters to customize the batch effect adjustment. The full function call, including all defaults is
BERT(data, cores = NULL, combatmode = 1, corereduction=2, stopParBatches=2, backend="default", method="ComBat", qualitycontrol=TRUE, verify=TRUE, labelname="Label", batchname="Batch", referencename="Reference", samplename="Sample", covariatename=NULL, BPPARAM=NULL, assayname=NULL)In the following, we list the respective meaning of each parameter: -
data: The input dataframe/matrix/SummarizedExperiment to
adjust. See Data Preparation for
detailed formatting instructions. - data The data for
batch-effect correction. Must contain at least two samples per batch and
2 features.
cores: BERT uses BiocParallel
for parallelization. If the user specifies a value cores,
BERT internally creates and uses a new instance of
BiocParallelParam, which is however not exhibited to the
user. Setting this parameter can speed up the batch effect adjustment
considerably, in particular for large datasets and on unix-based
operating systems. A value between \(2\) and \(4\) is a reasonable choice for typical
commodity hardware. Multi-node computations are not supported as of now.
If, however, cores is not specified, BERT will default to
BiocParallel::bpparam(), which may have been set by the
user or the system. Additionally, the user can directly specify a
specific instance of BiocParallelParam to be used via the
BPPARAM argument.combatmode An integer that encodes the parameters to
use for ComBat.| Value | par.prior | mean.only |
|---|---|---|
| 1 | TRUE | FALSE |
| 2 | TRUE | TRUE |
| 3 | FALSE | FALSE |
| 4 | FALSE | TRUE |
The value of this parameter will be ignored, if
method!="ComBat".
corereduction Positive integer indicating the factor
by which the number of processes should be reduced, once no further
adjustment is possible for the current number of batches.4 This parameter is used
only, if the user specified a custom value for parameter
cores.
stopParBatches Positive integer indicating the
minimum number of batches required at a hierarchy level to proceed with
parallelized adjustment. If the number of batches is smaller, adjustment
will be performed sequentially to avoid communication
overheads.
backend: The backend to use for inter-process
communication. Possible choices are default and
file, where the former refers to the default communication
backend of the requested parallelization mode and the latter will create
temporary .rds files for data communication. ‘default’ is
usually faster for small to medium sized datasets.
method: The method to use for the underlying batch
effect correction steps. Should be either ComBat,
limma for limma::removeBatchEffects or
ref for adjustment using specified references (cf. Data Preparation). The underlying batch
effect adjustment method for ref is a modified version of
the limma method.
qualitycontrol: A boolean to (de)activate the ASW
computation. Deactivating the ASW computations accelerates the
computations.
verify: A boolean to (de)activate the initial format
check of the input data. Deactivating this verification step accelerates
the computations.
labelname: A string containing the name of the
column to use as class labels. The default is “Label”.
batchname: A string containing the name of the
column to use as batch labels. The default is “Batch”.
referencename: A string containing the name of the
column to use as reference labels. The default is “Reference”.
covariatename: A vector containing the names of
columns with categorical covariables.The default is NULL, in which case
all column names are matched agains the pattern “Cov”.
BPPARAM: An instance of
BiocParallelParam that will be used for parallelization.
The default is null, in which case the value of cores
determines the behaviour of BERT.
assayname: If the user chooses to pass a
SummarizedExperiment object, they need to specify the name
of the assay that they want to apply BERT to here. BERT then returns the
input SummarizedExperiment with an additional assay labeled
assayname_BERTcorrected.
BERT utilizes the logging package for output. The user
can easily specify the verbosity of BERT by setting the global logging
level in the script. For instance
BERT exhibits a large number of parameters for parallelisation as to provide users with maximum flexibility. For typical scenarios, however, the default parameters are well suited. For very large experiments (\(>15\) batches), we recommend to increase the number of cores (a reasonable value is \(4\) but larger values may be possible on your hardware). Most users should leave all parameters to their respective default.
In the following, we present simple cookbook examples for BERT usage. Note that ASWs (and runtime) will most likely differ on your machine, since the data generating process involves multiple random choices.
Here, BERT uses limma as underlying batch effect correction algorithm
(method='limma') and performs all computations on a single
process (cores parameter is left on default).
# import BERT
library(BERT)
# generate data with 30 batches, 60 features, 15 samples per batch, 15% missing values and 2 classes
dataset_raw <- generate_dataset(features=60, batches=20, samplesperbatch=15, mvstmt=0.15, classes=2)
# BERT
dataset_adjusted <- BERT(dataset_raw, method="limma")
#> 2025-11-11 13:17:46.765938 INFO::Formatting Data.
#> 2025-11-11 13:17:46.766507 INFO::Replacing NaNs with NAs.
#> 2025-11-11 13:17:46.767304 INFO::Removing potential empty rows and columns
#> 2025-11-11 13:17:46.76914 INFO::Found 2700 missing values.
#> 2025-11-11 13:17:46.787622 INFO::Introduced 0 missing values due to singular proteins at batch/covariate level.
#> 2025-11-11 13:17:46.788022 INFO::Done
#> 2025-11-11 13:17:46.788411 INFO::Acquiring quality metrics before batch effect correction.
#> 2025-11-11 13:17:46.79623 INFO::Starting hierarchical adjustment
#> 2025-11-11 13:17:46.796749 INFO::Found 20 batches.
#> 2025-11-11 13:17:46.79713 INFO::Cores argument is not defined or BPPARAM has been specified. Argument corereduction will not be used.
#> 2025-11-11 13:17:46.79756 INFO::Using default BPPARAM
#> 2025-11-11 13:17:46.797917 INFO::Processing subtree level 1
#> 2025-11-11 13:17:47.120733 INFO::Processing subtree level 2
#> 2025-11-11 13:17:47.428303 INFO::Processing subtree level 3
#> 2025-11-11 13:17:47.751024 INFO::Adjusting the last 1 batches sequentially
#> 2025-11-11 13:17:47.752742 INFO::Done
#> 2025-11-11 13:17:47.753319 INFO::Acquiring quality metrics after batch effect correction.
#> 2025-11-11 13:17:47.764064 INFO::ASW Batch was 0.49521695684342 prior to batch effect correction and is now -0.133139409052052 .
#> 2025-11-11 13:17:47.764525 INFO::ASW Label was 0.274273131149433 prior to batch effect correction and is now 0.851292712593726 .
#> 2025-11-11 13:17:47.765146 INFO::Total function execution time is 0.999224662780762 s and adjustment time is 0.956088781356812 s ( 95.68 )Here, BERT uses ComBat as underlying batch effect correction
algorithm (method is left on default) and performs all
computations on a 2 processes (cores=2).
# import BERT
library(BERT)
# generate data with 30 batches, 60 features, 15 samples per batch, 15% missing values and 2 classes
dataset_raw <- generate_dataset(features=60, batches=20, samplesperbatch=15, mvstmt=0.15, classes=2)
# BERT
dataset_adjusted <- BERT(dataset_raw, cores=2)
#> 2025-11-11 13:17:47.791062 INFO::Formatting Data.
#> 2025-11-11 13:17:47.791684 INFO::Replacing NaNs with NAs.
#> 2025-11-11 13:17:47.792464 INFO::Removing potential empty rows and columns
#> 2025-11-11 13:17:47.79417 INFO::Found 2700 missing values.
#> 2025-11-11 13:17:47.812737 INFO::Introduced 0 missing values due to singular proteins at batch/covariate level.
#> 2025-11-11 13:17:47.813157 INFO::Done
#> 2025-11-11 13:17:47.813526 INFO::Acquiring quality metrics before batch effect correction.
#> 2025-11-11 13:17:47.821797 INFO::Starting hierarchical adjustment
#> 2025-11-11 13:17:47.82233 INFO::Found 20 batches.
#> 2025-11-11 13:17:48.312427 INFO::Set up parallel execution backend with 2 workers
#> 2025-11-11 13:17:48.313463 INFO::Processing subtree level 1 with 20 batches using 2 cores.
#> 2025-11-11 13:17:50.291597 INFO::Adjusting the last 2 batches sequentially
#> 2025-11-11 13:17:50.292627 INFO::Adjusting sequential tree level 1 with 2 batches
#> 2025-11-11 13:17:51.354261 INFO::Done
#> 2025-11-11 13:17:51.354724 INFO::Acquiring quality metrics after batch effect correction.
#> 2025-11-11 13:17:51.361847 INFO::ASW Batch was 0.507201042658448 prior to batch effect correction and is now -0.137630422551799 .
#> 2025-11-11 13:17:51.362252 INFO::ASW Label was 0.281403084359587 prior to batch effect correction and is now 0.809343322818682 .
#> 2025-11-11 13:17:51.36272 INFO::Total function execution time is 3.5717396736145 s and adjustment time is 3.53179335594177 s ( 98.88 )Here, BERT takes the input data using a
SummarizedExperiment instead. Batch effect correction is
then performed using ComBat as underlying algorithm (method
is left on default) and all computations are performed on a single
process (cores parameter is left on default).
nrows <- 200
ncols <- 8
# SummarizedExperiments store samples in columns and features in rows (in contrast to BERT).
# BERT will automatically account for this.
expr_values <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
# colData also takes further metadata information, such as Label, Sample,
# Reference or Covariables
colData <- data.frame("Batch"=c(1,1,1,1,2,2,2,2), "Label"=c(1,2,1,2,1,2,1,2), "Sample"=c(1,2,3,4,5,6,7,8))
dataset_raw = SummarizedExperiment::SummarizedExperiment(assays=list(expr=expr_values), colData=colData)
dataset_adjusted = BERT(dataset_raw, assayname = "expr")
#> 2025-11-11 13:17:51.399459 INFO::Formatting Data.
#> 2025-11-11 13:17:51.39996 INFO::Recognized SummarizedExperiment
#> 2025-11-11 13:17:51.400318 INFO::Typecasting input to dataframe.
#> 2025-11-11 13:17:51.420494 INFO::Replacing NaNs with NAs.
#> 2025-11-11 13:17:51.421151 INFO::Removing potential empty rows and columns
#> 2025-11-11 13:17:51.423267 INFO::Found 0 missing values.
#> 2025-11-11 13:17:51.427264 INFO::Introduced 0 missing values due to singular proteins at batch/covariate level.
#> 2025-11-11 13:17:51.42761 INFO::Done
#> 2025-11-11 13:17:51.427935 INFO::Acquiring quality metrics before batch effect correction.
#> 2025-11-11 13:17:51.430387 INFO::Starting hierarchical adjustment
#> 2025-11-11 13:17:51.430831 INFO::Found 2 batches.
#> 2025-11-11 13:17:51.43118 INFO::Cores argument is not defined or BPPARAM has been specified. Argument corereduction will not be used.
#> 2025-11-11 13:17:51.431541 INFO::Using default BPPARAM
#> 2025-11-11 13:17:51.431847 INFO::Adjusting the last 2 batches sequentially
#> 2025-11-11 13:17:51.432403 INFO::Adjusting sequential tree level 1 with 2 batches
#> 2025-11-11 13:17:51.461372 INFO::Done
#> 2025-11-11 13:17:51.461743 INFO::Acquiring quality metrics after batch effect correction.
#> 2025-11-11 13:17:51.464062 INFO::ASW Batch was -0.00511194229366883 prior to batch effect correction and is now -0.0913301408618857 .
#> 2025-11-11 13:17:51.464446 INFO::ASW Label was 0.00109858657551083 prior to batch effect correction and is now 0.0142058483943652 .
#> 2025-11-11 13:17:51.464903 INFO::Total function execution time is 0.0654449462890625 s and adjustment time is 0.0306291580200195 s ( 46.8 )BERT can utilize categorical covariables that are specified in
columns Cov_1, Cov_2, .... These columns are automatically
detected and integrated into the batch effect correction process.
# import BERT
library(BERT)
# set seed for reproducibility
set.seed(1)
# generate data with 5 batches, 60 features, 30 samples per batch, 15% missing values and 2 classes
dataset_raw <- generate_dataset(features=60, batches=5, samplesperbatch=30, mvstmt=0.15, classes=2)
# create covariable column with 2 possible values, e.g. male/female condition
dataset_raw["Cov_1"] = sample(c(1,2), size=dim(dataset_raw)[1], replace=TRUE)
# BERT
dataset_adjusted <- BERT(dataset_raw)
#> 2025-11-11 13:17:51.506027 INFO::Formatting Data.
#> 2025-11-11 13:17:51.506568 INFO::Replacing NaNs with NAs.
#> 2025-11-11 13:17:51.507164 INFO::Removing potential empty rows and columns
#> 2025-11-11 13:17:51.508372 INFO::Found 1350 missing values.
#> 2025-11-11 13:17:51.508928 INFO::BERT requires at least 2 numeric values per batch/covariate level. This may reduce the number of adjustable features considerably, depending on the quantification technique.
#> 2025-11-11 13:17:51.520209 INFO::Introduced 0 missing values due to singular proteins at batch/covariate level.
#> 2025-11-11 13:17:51.520576 INFO::Done
#> 2025-11-11 13:17:51.520895 INFO::Acquiring quality metrics before batch effect correction.
#> 2025-11-11 13:17:51.524008 INFO::Starting hierarchical adjustment
#> 2025-11-11 13:17:51.524472 INFO::Found 5 batches.
#> 2025-11-11 13:17:51.524789 INFO::Cores argument is not defined or BPPARAM has been specified. Argument corereduction will not be used.
#> 2025-11-11 13:17:51.52516 INFO::Using default BPPARAM
#> 2025-11-11 13:17:51.525466 INFO::Processing subtree level 1
#> 2025-11-11 13:17:51.673615 INFO::Adjusting the last 2 batches sequentially
#> 2025-11-11 13:17:51.67524 INFO::Adjusting sequential tree level 1 with 2 batches
#> 2025-11-11 13:17:51.7162 INFO::Done
#> 2025-11-11 13:17:51.716687 INFO::Acquiring quality metrics after batch effect correction.
#> 2025-11-11 13:17:51.720557 INFO::ASW Batch was 0.492773245691086 prior to batch effect correction and is now -0.0377157224767566 .
#> 2025-11-11 13:17:51.720987 INFO::ASW Label was 0.40854766060101 prior to batch effect correction and is now 0.895560693013661 .
#> 2025-11-11 13:17:51.721591 INFO::Total function execution time is 0.215595006942749 s and adjustment time is 0.191797018051147 s ( 88.96 )In rare cases, class distributions across experiments may be severely
skewed. In particular, a batch might contain classes that other batches
don’t contain. In these cases, samples of common conditions may serve as
references (bridges) between the batches
(method="ref"). BERT utilizes those samples as references
that have a condition specified in the “Reference” column of the input.
All other samples are co-adjusted. Please note, that this strategy
implicitly uses limma as underlying batch effect correction
algorithm.
# import BERT
library(BERT)
# generate data with 4 batches, 6 features, 15 samples per batch, 15% missing values and 2 classes
dataset_raw <- generate_dataset(features=6, batches=4, samplesperbatch=15, mvstmt=0.15, classes=2)
# create reference column with default value 0. The 0 indicates, that the respective sample should be co-adjusted only.
dataset_raw[, "Reference"] <- 0
# randomly select 2 references per batch and class - in practice, this choice will be determined by external requirements (e.g. class known for only these samples)
batches <- unique(dataset_raw$Batch) # all the batches
for(b in batches){ # iterate over all batches
# references from class 1
ref_idx = sample(which((dataset_raw$Batch==b)&(dataset_raw$Label==1)), size=2, replace=FALSE)
dataset_raw[ref_idx, "Reference"] <- 1
# references from class 2
ref_idx = sample(which((dataset_raw$Batch==b)&(dataset_raw$Label==2)), size=2, replace=FALSE)
dataset_raw[ref_idx, "Reference"] <- 2
}
# BERT
dataset_adjusted <- BERT(dataset_raw, method="ref")
#> 2025-11-11 13:17:51.758523 INFO::Formatting Data.
#> 2025-11-11 13:17:51.759046 INFO::Replacing NaNs with NAs.
#> 2025-11-11 13:17:51.759753 INFO::Removing potential empty rows and columns
#> 2025-11-11 13:17:51.760503 INFO::Found 60 missing values.
#> 2025-11-11 13:17:51.763371 INFO::Introduced 0 missing values due to singular proteins at batch/covariate level.
#> 2025-11-11 13:17:51.763775 INFO::Done
#> 2025-11-11 13:17:51.764174 INFO::Acquiring quality metrics before batch effect correction.
#> 2025-11-11 13:17:51.766324 INFO::Starting hierarchical adjustment
#> 2025-11-11 13:17:51.766837 INFO::Found 4 batches.
#> 2025-11-11 13:17:51.767285 INFO::Cores argument is not defined or BPPARAM has been specified. Argument corereduction will not be used.
#> 2025-11-11 13:17:51.76772 INFO::Using default BPPARAM
#> 2025-11-11 13:17:51.768121 INFO::Processing subtree level 1
#> 2025-11-11 13:17:51.838898 INFO::Adjusting the last 2 batches sequentially
#> 2025-11-11 13:17:51.840474 INFO::Adjusting sequential tree level 1 with 2 batches
#> 2025-11-11 13:17:51.85993 INFO::Done
#> 2025-11-11 13:17:51.860406 INFO::Acquiring quality metrics after batch effect correction.
#> 2025-11-11 13:17:51.862681 INFO::ASW Batch was 0.440355021914032 prior to batch effect correction and is now -0.087480278736629 .
#> 2025-11-11 13:17:51.863125 INFO::ASW Label was 0.373906827748893 prior to batch effect correction and is now 0.919791677398366 .
#> 2025-11-11 13:17:51.863684 INFO::Total function execution time is 0.105177402496338 s and adjustment time is 0.0931856632232666 s ( 88.6 )Issues can be reported in the GitHub forum, the BioConductor forum or directly to the authors.
This code is published under the GPLv3.0 License and is available for non-commercial academic purposes.
Please cite our manuscript, if you use BERT for your research: Schumann Y, Gocke A, Neumann J (2024). Computational Methods for Data Integration and Imputation of Missing Values in Omics Datasets. PROTEOMICS. ISSN 1615-9861, doi:10.1002/pmic.202400100
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BERT_1.6.0 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.51.0 SummarizedExperiment_1.41.0
#> [3] xfun_0.54 bslib_0.9.0
#> [5] Biobase_2.71.0 lattice_0.22-7
#> [7] vctrs_0.6.5 tools_4.5.2
#> [9] generics_0.1.4 stats4_4.5.2
#> [11] parallel_4.5.2 AnnotationDbi_1.72.0
#> [13] RSQLite_2.4.4 cluster_2.1.8.1
#> [15] blob_1.2.4 logging_0.10-108
#> [17] Matrix_1.7-4 S4Vectors_0.49.0
#> [19] lifecycle_1.0.4 compiler_4.5.2
#> [21] stringr_1.6.0 Biostrings_2.79.2
#> [23] statmod_1.5.1 janitor_2.2.1
#> [25] Seqinfo_1.1.0 codetools_0.2-20
#> [27] snakecase_0.11.1 htmltools_0.5.8.1
#> [29] sys_3.4.3 buildtools_1.0.0
#> [31] sass_0.4.10 yaml_2.3.10
#> [33] crayon_1.5.3 jquerylib_0.1.4
#> [35] comprehenr_0.6.10 BiocParallel_1.45.0
#> [37] limma_3.67.0 DelayedArray_0.37.0
#> [39] cachem_1.1.0 iterators_1.0.14
#> [41] abind_1.4-8 foreach_1.5.2
#> [43] nlme_3.1-168 sva_3.59.0
#> [45] genefilter_1.93.0 locfit_1.5-9.12
#> [47] tidyselect_1.2.1 digest_0.6.37
#> [49] stringi_1.8.7 splines_4.5.2
#> [51] maketools_1.3.2 fastmap_1.2.0
#> [53] grid_4.5.2 cli_3.6.5
#> [55] invgamma_1.2 SparseArray_1.11.1
#> [57] magrittr_2.0.4 S4Arrays_1.11.0
#> [59] survival_3.8-3 XML_3.99-0.20
#> [61] edgeR_4.9.0 bit64_4.6.0-1
#> [63] lubridate_1.9.4 timechange_0.3.0
#> [65] rmarkdown_2.30 XVector_0.51.0
#> [67] httr_1.4.7 matrixStats_1.5.0
#> [69] bit_4.6.0 png_0.1-8
#> [71] memoise_2.0.1 evaluate_1.0.5
#> [73] knitr_1.50 GenomicRanges_1.63.0
#> [75] IRanges_2.45.0 mgcv_1.9-4
#> [77] rlang_1.1.6 xtable_1.8-4
#> [79] glue_1.8.0 DBI_1.2.3
#> [81] BiocManager_1.30.26 BiocGenerics_0.57.0
#> [83] annotate_1.88.0 jsonlite_2.0.0
#> [85] R6_2.6.1 MatrixGenerics_1.23.0Matrices and SummarizedExperiments work as well, but will automatically be converted to dataframes.↩︎
In particular, the row and column names are in the same order and the optional columns are preserved.↩︎
The optimum of ASW Label is 1, which is typically however not achieved on real-world datasets. Also, the optimum of ASW Batch can vary, depending on the class distributions of the batches.↩︎
E.g. consider a BERT call with 8 batches and 8
processes. Further adjustment is not possible with this number of
processes, since batches are always processed in pairs. With
corereduction=2, the number of processes for the following
adjustment steps would be set to \(8/2=4\), which is the maximum number of
usable processes for this example.↩︎