ProBatchFeatures

Overview

ProBatchFeatures is an extension of proBatch based on QFeatures (1), which extends it’s functionality by storing each processing stage (for example, raw data → log-transformation → normalization → batch effect correction → …) as a new SummarizedExperiment assay, and logging every step. It maintains full compatibility with QFeatures functions, and facilitates quick benchmarking of data preprocessing pipelines.

This vignette demonstrates how to:

  1. Construct a ProBatchFeatures object from matrix-formatted quantification data.

  2. Apply and record standard preprocessing steps (filtering features with too many missing values, log-transformation, normalization, and batch-effect correction) in a chained workflow.

  3. Perform diagnostic checks.

  4. Retrieve processed assay data and the corresponding preprocessing pipeline for benchmarking or reuse.

Setup

Installation

To install the latest version of proBatch package, use BiocManager package:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("proBatch")

Loading required packages

This vignette uses dplyr, tibble, and ggplot2 (alongside proBatch) for data manipulation and visualization.

library(proBatch)

library(dplyr)
library(tibble)
library(ggplot2)
library(QFeatures)

Loading and preparing the example dataset

As an example, we will use the E. coli dataset described in the FedProt paper (2):

data("example_ecoli_data", package = "proBatch")

The dataset consists of data from five centers that were measured and analyzed independently. For simplicity, data from all centers have been merged into a single dataset, and a subset containing 400 protein groups and their corresponding peptides is used in this example.

# Extracting data
all_metadata <- example_ecoli_data$all_metadata
all_precursors <- example_ecoli_data$all_precursors
all_protein_groups <- example_ecoli_data$all_protein_groups
all_precursor_pg_match <- example_ecoli_data$all_precursor_pg_match

# Keeping only essential
rm(example_ecoli_data)

Constructing a ProBatchFeatures object

The dataset contains two hierarchical levels. While additional levels can be added in a manner similar to QFeatures, this example focuses on the peptide and protein group levels. The all_precursor_pg_match dataframe stores the mapping between these levels. We begin by using the ProBatchFeatures() constructor to create an object representing the peptide-level data:

# Building from a (features x samples) matrix
pbf <- ProBatchFeatures(
    data_matrix = all_precursors, # matrix of peptide intensities (features x samples)
    sample_annotation = all_metadata, # data.frame of sample metadata
    sample_id_col = "Run", # column in metadata that matches sample IDs
    level = "peptide" # label this assay level as "peptide"
)

pbf # show basic info about the ProBatchFeatures object
#> An instance of class ProBatchFeatures (type: bulk) with 1 set:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 1463 rows and 118 columns 
#>   Processing chain: unprocessed data (raw)

Assay names follow the convention “::”, for example, “peptide::raw”, or “protein::median_on_log”.

Next, we add the protein-level data as a second assay in the same ProBatchFeatures instance. Internally, this is achieved by creating a SummarizedExperiment for the protein-level data and adding it to the existing object using QFeatures functionality. The same sample annotation should be used for the new assay to ensure correct column alignment:

# Adding proteins as a new level and mapping them to precursors
#    all_precursor_pg_match has columns: "Precursor.Id", "Protein.Ids"
pbf <- pb_add_level(
    object = pbf,
    from = "peptide::raw",
    new_matrix = all_protein_groups,
    to_level = "protein", # will name "protein::raw" by default
    mapping_df = all_precursor_pg_match,
    from_id = "Precursor.Id",
    to_id = "Protein.Ids",
    map_strategy = "as_is"
)

pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 1463 rows and 118 columns 
#>  [2] protein::raw: SummarizedExperiment with 400 rows and 118 columns 
#>   Processing chain:
#>   [1] add_level(protein)_byVar
#>   Steps logged: 1 (see get_operation_log())

If a precursor (peptide) maps to multiple protein groups, map_strategy parameter specifies how to resolve multimappers. Available options are “first”, “longest”, and “as_is”. The “first” and “longest” strategies select a single ProteinID per peptide, where first takes the first match found in the mapping dataframe, and longest selects the ProteinID with the longest sequence, while “as_is” assumes a one-to-one mapping between identifiers.

# Check
validObject(pbf) # should be TRUE
#> [1] TRUE
assayLink(pbf, "protein::raw") # verify link summary
#> AssayLink for assay <protein::raw>
#> [from:peptide::raw|fcol:ProteinID|hits:1463]

# Remove unused variables to clean up the workspace
rm(all_metadata, all_precursor_pg_match, all_precursors, all_protein_groups)

Processing pipeline with diagnostics

With the data now loaded into a ProBatchFeatures object with peptide- and protein-level assays, we can demonstrate an example of typical preprocessing workflow. This pipeline includes filtering low-quality features, applying a log2-transformation, median normalization, and batch effect correction.

Step 1 - filtering features with too many missing values

Features detected in only a small subset of samples are unreliable and may obscure systematic effects in subsequent analyses. To examine the proportion of missing values across datasets, pb_nNA() function is used:

Function pb_nNA() summarizes missingness per assay:

# pb_nNA returns a DataFrame with counts of NAs per feature/sample
na_counts <- pb_nNA(pbf)

# na_counts contains info about total number of NAs and % in the data:
na_counts[["nNA"]] %>% as_tibble()
#> # A tibble: 2 × 3
#>   assay          nNA   pNA
#>   <chr>        <int> <dbl>
#> 1 peptide::raw 78870 0.457
#> 2 protein::raw  9163 0.194

Here nNA is a number of missing values in the assay and pNA is its proportion. It is possible to inspect the per-feature and per-sample breakdown for each assay:

# check NAs per sample:
head(na_counts[["peptide::raw"]]$nNAcols) %>% as_tibble()
#> # A tibble: 6 × 4
#>   assay        name       nNA   pNA
#>   <chr>        <chr>    <int> <dbl>
#> 1 peptide::raw lab_A_S1   496 0.339
#> 2 peptide::raw lab_A_S2   483 0.330
#> 3 peptide::raw lab_A_S3   524 0.358
#> 4 peptide::raw lab_A_S4   525 0.359
#> 5 peptide::raw lab_A_S5   541 0.370
#> 6 peptide::raw lab_A_S6   555 0.379

For each assay the following tables are availabe:

print(names(na_counts[["peptide::raw"]]))
#> [1] "nNA"     "nNArows" "nNAcols"

Visualisations helps explore batch-specific patterns of missingness. Firstly, we define a color scheme which will be used in all diagnostic plots:

# color scheme for lab and condition labels
color_scheme <- sample_annotation_to_colors(
    pbf,
    sample_id_col = "Run",
    factor_columns = c("Lab", "Condition"),
    numeric_columns = NULL
)

Displaying features that contain at least one missing as a heatmap:

plot_NA_heatmap(
    pbf, # by default the last created assay
    show_rownames = FALSE, show_row_dend = FALSE,
    color_by = "Lab"
)

To display all features, set the parameter drop_complete=F. When the number of features or samples exceeds 5000, a random subset of 5000 rows/columns is used for visualization by default. To display the entire dataset, set use_subset = T.

To display multiple assays in one plot, specify them using pbf_name parameter:

plot_NA_heatmap(
    pbf,
    pbf_name = c("peptide::raw", "protein::raw"),
    show_rownames = FALSE, show_row_dend = FALSE,
    color_by = "Lab"
)

The overlap between peptide and protein identifications across samples can be visualized as follows:

plot_NA_frequency(
    pbf,
    pbf_name = c("peptide::raw", "protein::raw"),
    show_percent = FALSE
)

In the resulting plot, particularly for peptides, batch-specific patterns are evident, reflecting the 19-20 samples contributed by each laboratory.

Here, we apply a simple filter that retains only features quantified in at least 40% of the samples. This threshold can be adjusted if necessary.

pbf <- pb_filterNA(
    pbf, # without specifying, the filter will be applied to all assays in the object
    inplace = TRUE, # if false, filtered matrix will be saved as a new assay.
    pNA = 0.6 # the maximal proportion of missing values per feature
)
pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns 
#>  [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns 
#>   Processing chain:
#>   [1] add_level(protein)_byVar filterNA                 filterNA                
#>   Steps logged: 3 (see get_operation_log())

After filtering, the ProBatchFeatures object stores fewer features, which sharpens downstream diagnostics.

plot_NA_heatmap(
    pbf,
    pbf_name = c("peptide::raw", "protein::raw"),
    show_rownames = FALSE, show_row_dend = FALSE,
    color_by = "Lab", # currently only one level is supported
    border_color = NA # pheatmap parameter to remove cell borders
)

However, as the plots indicate, the remaining patterns of missingness are still associated with the lab.
To remove the residual systematic effects from the data, we further perform normalization and batch-effect correction.

Step 2 - log2-transformation

Proteomics data are commonly log-transformed to stabilize variance as required for furter statistical analyzes and data visualization. Here, we perform a log2-transformation of both peptide- and protein-level data:

pbf <- log_transform_dm(
    pbf,
    log_base = 2, offset = 1,
    pbf_name = "protein::raw"
)

pbf <- log_transform_dm(
    pbf,
    log_base = 2, offset = 1,
    pbf_name = "peptide::raw"
)

pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns 
#>  [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns 
#>   Processing chain:
#>   [1] add_level(protein)_byVar filterNA                 filterNA                ; [4] log2                     log2                    
#>    - peptide: log2_on_raw
#>    - protein: log2_on_raw
#>   Steps logged: 5 (see get_operation_log())

After log-transformation, the ProBatchFeatures instance pbf contains additional assays (peptide::log2_on_raw and protein::log2_on_raw). If the assay was transformed using “fast to recompute” step (for exmaple, log-transformation or median normalization), the transformed data will not be saved in the list of assays, but only re-computed for plotting. To force storing the results of this steps store_fast_steps = T parameter can be added.

The intensity distributions of each sample after log-transformation can be visualized as boxplots to identify potential batch-specific shifts:

plot_boxplot(
    pbf,
    pbf_name = c("peptide::log2_on_raw", "protein::log2_on_raw"), # plot multiple on one
    sample_id_col = "Run",
    order_col = "Run",
    batch_col = "Lab",
    color_by_batch = TRUE,
    color_scheme = color_scheme,
    base_size = 7,
    plot_ncol = 1, # how many columns use for plotting multy-assay plot
    plot_title = c( # title for each assay
        "Peptide level - log2 scale",
        "Protein level - log2 scale"
    )
)

Additionally, the intensity distribution of peptides and proteins with and without NAs can be plotted:

plot_NA_density(
    pbf,
    pbf_name = c("peptide::log2_on_raw", "protein::log2_on_raw")
)

The comparison confirms that log2 stabilises the variance for both peptides and proteins, yet the cross-lab shifts in median intensity persist and will need to be addressed downstream.

Next, we will perform a Principal Component Analysis (PCA) to get a high-level overview of the sample clustering. We expect to see a strong batch effect, where samples cluster by their center of origin (“Lab”) rather than their biological condition (“Condition”).

pca1 <- plot_PCA(
    pbf,
    pbf_name = "protein::log2_on_raw",
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = NULL, # remove all rows with missing values. Can be also "remove"
    plot_title = "NA rows removed, protein, log2",
    base_size = 10, point_size = 3, point_alpha = 0.5
)

pca2 <- plot_PCA(
    pbf,
    pbf_name = "protein::log2_on_raw",
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Condition",
    shape_by = "Lab",
    fill_the_missing = 0, # default value is -1
    plot_title = "NA replaced with 0, protein, log2",
    base_size = 10, point_size = 3, point_alpha = 0.5
)

grid.arrange(pca1, pca2, ncol = 2, nrow = 1)

Simiarly, the PCA plots clearly demonstrate that the primary source of variance in the raw data is the lab.

Step 3 - median normalization

To make the samples more comparable by removing the effect of sample load on intensities, we apply median normalization:

pbf <- pb_transform(
    object = pbf,
    from = "peptide::log2_on_raw",
    steps = "medianNorm"
)

pbf <- pb_transform(
    object = pbf,
    from = "protein::log2_on_raw",
    steps = "medianNorm"
)

pbf
#> An instance of class ProBatchFeatures (type: bulk) with 2 sets:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns 
#>  [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns 
#>   Processing chain:
#>   [1] add_level(protein)_byVar filterNA                 filterNA                ; [4] log2                     log2                     medianNorm              ; [7] medianNorm              
#>    - peptide: log2_on_raw, medianNorm_on_log2_on_raw
#>    - protein: log2_on_raw, medianNorm_on_log2_on_raw
#>   Steps logged: 7 (see get_operation_log())

pb_transform() appends peptide::medianNorm_on_log2_on_raw and protein::medianNorm_on_log2_on_raw. We can now plot and compare these assays before and after normalization:

# boxplot
plot_boxplot(
    pbf,
    sample_id_col = "Run",
    order_col = "Run",
    batch_col = "Lab",
    color_by_batch = TRUE,
    color_scheme = color_scheme,
    base_size = 7,
    pbf_name = c("protein::log2_on_raw", "protein::medianNorm_on_log2_on_raw"),
    plot_ncol = 1,
    plot_title = c(
        "Before Median Normalization, protein level",
        "After Median Normalization, protein level"
    )
)

# plot PCA plot
plot_PCA(
    pbf,
    pbf_name = c("protein::log2_on_raw", "protein::medianNorm_on_log2_on_raw"),
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = NULL, # remove all rows with missing values. Can be also "remove"
    plot_title = c(
        "Before Median Normalization, protein level",
        "After Median Normalization, protein level"
    ),
    base_size = 10, point_size = 3, point_alpha = 0.5
)

Although median normalization made itensity distributions more similar, the residual lab-specific batch effects remain evident on the PCA plots and need to be corrected.

Step 4 - batch-effect correction

The user can choose from two methods for batch effects correction: ComBat (3) from sva and removeBatchEffect() from limma (4). Both methods can be accessed via the same function pb_transform().

# sample annotations used by both correction methods
sa <- colData(pbf) %>% as.data.frame()

Apply limma’s removeBatchEffect() to log-transformed and normalized data:

# limma::removeBatchEffect method
pbf <- pb_transform(
    pbf,
    from = "protein::log2_on_raw",
    steps = "limmaRBE",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        covariates_cols = c("Condition"),
        fill_the_missing = FALSE
    ))
)

pbf <- pb_transform(
    pbf,
    from = "protein::medianNorm_on_log2_on_raw",
    steps = "limmaRBE",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        covariates_cols = c("Condition"),
        fill_the_missing = FALSE
    ))
)

As batch effect correction does not listed as a fast step, two new protein-level assay are added to the pbd: - [3] protein::limmaRBE_on_log2_on_raw and - [4] protein::limmaRBE_on_medianNorm_on_log2_on_raw.

print(pbf)
#> An instance of class ProBatchFeatures (type: bulk) with 4 sets:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns 
#>  [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns 
#>  [3] protein::limmaRBE_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns 
#>  [4] protein::limmaRBE_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns 
#>   Processing chain:
#>   [1] add_level(protein)_byVar filterNA                 filterNA                ; [4] log2                     log2                     medianNorm              ; [7] medianNorm               limmaRBE                 limmaRBE                
#>    - peptide: log2_on_raw, medianNorm_on_log2_on_raw
#>    - protein: log2_on_raw, limmaRBE_on_log2_on_raw, medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#>   Steps logged: 9 (see get_operation_log())

Alternatively, batch effect correction can be performed using ComBat:

# sva::ComBat wrapped via pb_transform()
pbf <- pb_transform(
    pbf,
    from = "protein::log2_on_raw",
    steps = "combat",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        sample_id_col = "Run",
        par.prior = TRUE,
        fill_the_missing = "remove"
    ))
)

pbf <- pb_transform(
    pbf,
    from = "protein::medianNorm_on_log2_on_raw",
    steps = "combat",
    params_list = list(list(
        sample_annotation = sa,
        batch_col = "Lab",
        sample_id_col = "Run",
        par.prior = TRUE,
        fill_the_missing = "remove"
    ))
)

The ComBat method from the sva package cannot be applied to data containing missing values.
Therefore, features with any missing values are removed by default.
Alternatively, missing values can be imputed with a user-defined constant using the fill_the_missing parameter (default: -1).

print(pbf)
#> An instance of class ProBatchFeatures (type: bulk) with 6 sets:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns 
#>  [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns 
#>  [3] protein::limmaRBE_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns 
#>  [4] protein::limmaRBE_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns 
#>  [5] protein::combat_on_log2_on_raw: SummarizedExperiment with 244 rows and 118 columns 
#>  [6] protein::combat_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 244 rows and 118 columns 
#>   Processing chain:
#>    [1] add_level(protein)_byVar filterNA                 filterNA                ;  [4] log2                     log2                     medianNorm              ;  [7] medianNorm               limmaRBE                 limmaRBE                ; [10] combat                   combat                  
#>    - peptide: log2_on_raw, medianNorm_on_log2_on_raw
#>    - protein: log2_on_raw, combat_on_log2_on_raw, limmaRBE_on_log2_on_raw, medianNorm_on_log2_on_raw, combat_on_medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#>   Steps logged: 11 (see get_operation_log())

Similarly, two new protein-level assays were added to pdf: - [5] protein::combat_on_log2_on_raw: … with 244 rows - [6] protein::combat_on_medianNorm_on_log2_on_raw: … with 244 rows.

And as we removed features with missing values, these assays contain less protein groups than the raw filtered data (331 rows).

All steps can be applied to peptide-level data in the same way:

pbf <- pb_transform(
    pbf,
    from = "peptide::medianNorm_on_log2_on_raw",
    steps = "limmaRBE",
    params_list = list(
        list(
            sample_annotation = sa,
            batch_col = "Lab",
            covariates_cols = c("Condition"),
            fill_the_missing = FALSE
        )
    )
)

pbf <- pb_transform(
    pbf,
    from = "peptide::medianNorm_on_log2_on_raw",
    steps = "combat",
    params_list = list(
        list(
            sample_annotation = sa,
            batch_col = "Lab",
            sample_id_col = "Run",
            par.prior = TRUE,
            fill_the_missing = "remove"
        )
    )
)

pbf
#> An instance of class ProBatchFeatures (type: bulk) with 8 sets:
#> 
#>  [1] peptide::raw: SummarizedExperiment with 867 rows and 118 columns 
#>  [2] protein::raw: SummarizedExperiment with 331 rows and 118 columns 
#>  [3] protein::limmaRBE_on_log2_on_raw: SummarizedExperiment with 331 rows and 118 columns 
#>  ...
#>  [6] protein::combat_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 244 rows and 118 columns 
#>  [7] peptide::limmaRBE_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 867 rows and 118 columns 
#>  [8] peptide::combat_on_medianNorm_on_log2_on_raw: SummarizedExperiment with 152 rows and 118 columns 
#>   Processing chain:
#>    [1] add_level(protein)_byVar filterNA                 filterNA                ;  [4] log2                     log2                     medianNorm              ;  [7] medianNorm               limmaRBE                 limmaRBE                ; [10] combat                   combat                   limmaRBE                ; [13] combat                  
#>    - peptide: log2_on_raw, medianNorm_on_log2_on_raw, combat_on_medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#>    - protein: log2_on_raw, combat_on_log2_on_raw, limmaRBE_on_log2_on_raw, medianNorm_on_log2_on_raw, combat_on_medianNorm_on_log2_on_raw, limmaRBE_on_medianNorm_on_log2_on_raw
#>   Steps logged: 13 (see get_operation_log())

Two new peptide-level assays were added to pdf: - [7] peptide::limmaRBE_on_medianNorm_on_log2_on_raw: … with 867 rows - [8] peptide::combat_on_medianNorm_on_log2_on_raw: … with 152 rows (as NA-containing features were removed before ComBat).

Step 5 - assess processed assays

The four new assays (protein::limmaRBE_on_… and protein::combat_on_…) are now stored alongside the previous processing steps. We can explore these assays using PCA to verify that the batch effect has been successfully removed and compare them to the data before batch effect correction.

Principal component analysis

The progression from log\(_2\)-transformed to batch-corrected assays demonstrates the expected reduction in lab-specific variance and a clearer separation by biological condition in the final panel.

plot_PCA(
    pbf,
    pbf_name = c(
        "protein::log2_on_raw",
        "protein::medianNorm_on_log2_on_raw",
        "protein::limmaRBE_on_medianNorm_on_log2_on_raw",
        "protein::combat_on_medianNorm_on_log2_on_raw"
    ),
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = NULL,
    base_size = 8, point_size = 3, point_alpha = 0.5
)

Similarly, for peptide-level data:

plot_PCA(
    pbf,
    pbf_name = c(
        "peptide::log2_on_raw",
        "peptide::medianNorm_on_log2_on_raw",
        "peptide::limmaRBE_on_medianNorm_on_log2_on_raw",
        "peptide::combat_on_medianNorm_on_log2_on_raw"
    ),
    sample_id_col = "Run",
    color_scheme = color_scheme,
    color_by = "Lab",
    shape_by = "Condition",
    fill_the_missing = "remove",
    base_size = 8, point_size = 3, point_alpha = 0.5
)

Hierarchical clustering

Hierarchical clustering provides another way to explore sample similarities. Here we display clustering results for protein::combat_on_medianNorm_on_log2_on_raw and protein::limmaRBE_on_medianNorm_on_log2_on_raw assays and colour samples by lab and condition:

plot_hierarchical_clustering(
    pbf,
    pbf_name = "protein::combat_on_medianNorm_on_log2_on_raw",
    sample_id_col = "Run",
    label_font = 0.6,
    color_list = color_scheme
)

plot_hierarchical_clustering(
    pbf,
    pbf_name = "protein::limmaRBE_on_medianNorm_on_log2_on_raw",
    sample_id_col = "Run",
    label_font = 0.6,
    color_list = color_scheme,
    fill_the_missing = "remove"
)

Principal Variance Component Analysis (PVCA)

PVCA quantifies the contribution of known factors to the observed variance (5) which can be compared across raw, normalised, and ComBat-corrected assays:

plot_PVCA(
    pbf,
    pbf_name = c(
        "protein::medianNorm_on_log2_on_raw",
        "protein::combat_on_medianNorm_on_log2_on_raw"
    ),
    sample_id_col = "Run",
    technical_factors = c("Lab"),
    biological_factors = c("Condition"),
    fill_the_missing = NULL,
    base_size = 7,
    plot_ncol = 2, # the number of plots in a row
    variance_threshold = 0 # the the percentile value of weight each of the
    # covariates needs to explain
    # (the rest will be lumped together)
)

Inspecting operation log

Each call to pb_transform() or pb_filterNA() creates a log entry that records the source assay, the applied operation, and the resulting assay name. Reviewing this log ensures that complex processing pipelines remain transparent and reproducible.

get_operation_log(pbf) %>%
    as_tibble()
#> # A tibble: 13 × 7
#>    step                 fun   from  to    params       timestamp           pkg  
#>    <chr>                <chr> <chr> <chr> <list>       <dttm>              <chr>
#>  1 add_level(protein)_… addA… pept… prot… <named list> 2026-01-29 04:17:10 proB…
#>  2 filterNA             filt… pept… pept… <named list> 2026-01-29 04:17:12 proB…
#>  3 filterNA             filt… prot… prot… <named list> 2026-01-29 04:17:12 proB…
#>  4 log2                 log2  prot… prot… <named list> 2026-01-29 04:17:13 proB…
#>  5 log2                 log2  pept… pept… <named list> 2026-01-29 04:17:13 proB…
#>  6 medianNorm           medi… pept… pept… <list [0]>   2026-01-29 04:17:18 proB…
#>  7 medianNorm           medi… prot… prot… <list [0]>   2026-01-29 04:17:18 proB…
#>  8 limmaRBE             limm… prot… prot… <named list> 2026-01-29 04:17:22 proB…
#>  9 limmaRBE             limm… prot… prot… <named list> 2026-01-29 04:17:22 proB…
#> 10 combat               comb… prot… prot… <named list> 2026-01-29 04:17:23 proB…
#> 11 combat               comb… prot… prot… <named list> 2026-01-29 04:17:23 proB…
#> 12 limmaRBE             limm… pept… pept… <named list> 2026-01-29 04:17:23 proB…
#> 13 combat               comb… pept… pept… <named list> 2026-01-29 04:17:23 proB…

If you need a compact name for the current assay chain (for example, to label benchmark results), use pb_pipeline_name(pbf, "protein::combat_on_medianNorm_on_log2_on_raw").

pb_pipeline_name(pbf, "protein::combat_on_medianNorm_on_log2_on_raw")
#> [1] "combat_on_medianNorm_on_log2_on_raw"

Extracting data matrices from pbf object

It is possible to extract a specific data matrix from a ProBatchFeatures object using the assay name:

extracted_matrix <- pb_assay_matrix(
    pbf,
    assay = "peptide::raw"
)
# show
extracted_matrix[1:5, 1:5]
#>                 lab_A_S1 lab_A_S2 lab_A_S3 lab_A_S4 lab_A_S5
#> ELGNWKDFIEVMLR3       NA       NA       NA       NA       NA
#> DFIEVMLR2             NA       NA       NA       NA       NA
#> AAADLISR2       14647900 13463600  8518130 12172600 11612300
#> AGIGINAGR2      11785300 11025600  9713860  9612410 13795300
#> AEQYLLENETTK2   30835300 29444100 19451100 20090000 22758800
rm(extracted_matrix)

For consistency with the main ProBatch vignette, assay can also be extracted in the long format. In this case, the sample_id_col needs to be specified:

extracted_long <- pb_as_long(
    pbf,
    sample_id_col = "Run",
    pbf_name = "peptide::raw"
)

# show
extracted_long[1:3, ]
#>     feature_label      Run Intensity   Lab Condition
#> 1 ELGNWKDFIEVMLR3 lab_A_S1        NA lab_A       Pyr
#> 2       DFIEVMLR2 lab_A_S1        NA lab_A       Pyr
#> 3       AAADLISR2 lab_A_S1  14647900 lab_A       Pyr
rm(extracted_long)

Session info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] QFeatures_1.21.0            MultiAssayExperiment_1.37.2
#>  [3] SummarizedExperiment_1.41.0 Biobase_2.71.0             
#>  [5] GenomicRanges_1.63.1        Seqinfo_1.1.0              
#>  [7] IRanges_2.45.0              S4Vectors_0.49.0           
#>  [9] BiocGenerics_0.57.0         generics_0.1.4             
#> [11] MatrixGenerics_1.23.0       matrixStats_1.5.0          
#> [13] gridExtra_2.3               knitr_1.51                 
#> [15] proBatch_1.99.4             ggplot2_4.0.1              
#> [17] tibble_3.3.1                dplyr_1.1.4                
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.5.2           ggplotify_0.1.3         preprocessCore_1.73.0  
#>   [4] XML_3.99-0.20           rpart_4.1.24            lifecycle_1.0.5        
#>   [7] Rdpack_2.6.5            fastcluster_1.3.0       edgeR_4.9.2            
#>  [10] doParallel_1.0.17       lattice_0.22-7          MASS_7.3-65            
#>  [13] backports_1.5.0         magrittr_2.0.4          limma_3.67.0           
#>  [16] Hmisc_5.2-5             sass_0.4.10             rmarkdown_2.30         
#>  [19] jquerylib_0.1.4         yaml_2.3.12             wesanderson_0.3.7      
#>  [22] otel_0.2.0              MsCoreUtils_1.23.2      DBI_1.2.3              
#>  [25] buildtools_1.0.0        minqa_1.2.8             RColorBrewer_1.1-3     
#>  [28] lubridate_1.9.4         abind_1.4-8             purrr_1.2.1            
#>  [31] AnnotationFilter_1.35.0 yulab.utils_0.2.3       nnet_7.3-20            
#>  [34] rappdirs_0.3.4          sva_3.59.0              maketools_1.3.2        
#>  [37] genefilter_1.93.0       pheatmap_1.0.13         annotate_1.89.0        
#>  [40] codetools_0.2-20        DelayedArray_0.37.0     tidyselect_1.2.1       
#>  [43] farver_2.1.2            lme4_1.1-38             viridis_0.6.5          
#>  [46] dynamicTreeCut_1.63-1   base64enc_0.1-3         jsonlite_2.0.0         
#>  [49] Formula_1.2-5           survival_3.8-6          iterators_1.0.14       
#>  [52] foreach_1.5.2           tools_4.5.2             Rcpp_1.1.1             
#>  [55] glue_1.8.0              SparseArray_1.11.10     BiocBaseUtils_1.13.0   
#>  [58] xfun_0.56               mgcv_1.9-4              ggfortify_0.4.19       
#>  [61] HDF5Array_1.39.0        withr_3.0.2             BiocManager_1.30.27    
#>  [64] fastmap_1.2.0           pvca_1.51.0             boot_1.3-32            
#>  [67] rhdf5filters_1.23.3     digest_0.6.39           timechange_0.3.0       
#>  [70] R6_2.6.1                gridGraphics_0.5-1      colorspace_2.1-2       
#>  [73] GO.db_3.22.0            RSQLite_2.4.5           h5mread_1.3.1          
#>  [76] utf8_1.2.6              tidyr_1.3.2             data.table_1.18.2.1    
#>  [79] httr_1.4.7              htmlwidgets_1.6.4       S4Arrays_1.11.1        
#>  [82] pkgconfig_2.0.3         gtable_0.3.6            blob_1.3.0             
#>  [85] S7_0.2.1                impute_1.85.0           XVector_0.51.0         
#>  [88] sys_3.4.3               htmltools_0.5.9         ProtGenerics_1.43.0    
#>  [91] clue_0.3-66             scales_1.4.0            png_0.1-8              
#>  [94] reformulas_0.4.3.1      corrplot_0.95           rstudioapi_0.18.0      
#>  [97] reshape2_1.4.5          checkmate_2.3.3         nlme_3.1-168           
#> [100] nloptr_2.2.1            cachem_1.1.0            rhdf5_2.55.12          
#> [103] stringr_1.6.0           parallel_4.5.2          foreign_0.8-90         
#> [106] AnnotationDbi_1.73.0    vsn_3.79.5              pillar_1.11.1          
#> [109] grid_4.5.2              vctrs_0.7.1             xtable_1.8-4           
#> [112] cluster_2.1.8.1         htmlTable_2.4.3         evaluate_1.0.5         
#> [115] cli_3.6.5               locfit_1.5-9.12         compiler_4.5.2         
#> [118] rlang_1.1.7             crayon_1.5.3            labeling_0.4.3         
#> [121] affy_1.89.0             plyr_1.8.9              fs_1.6.6               
#> [124] stringi_1.8.7           viridisLite_0.4.2       WGCNA_1.73             
#> [127] BiocParallel_1.45.0     Biostrings_2.79.4       lazyeval_0.2.2         
#> [130] Matrix_1.7-4            bit64_4.6.0-1           Rhdf5lib_1.33.0        
#> [133] KEGGREST_1.51.1         statmod_1.5.1           rbibutils_2.4.1        
#> [136] igraph_2.2.1            memoise_2.0.1           affyio_1.81.0          
#> [139] bslib_0.10.0            bit_4.6.0

Citation

To cite this package, please use:

citation("proBatch")
#> To cite proBatch in publications use:
#> 
#>   Cuklina J., Lee C.H., Williams E.G., Sajic T., Collins B.,
#>   Rodriguez-Martinez M., Aebersold R., Pedrioli P. Diagnostics and
#>   correction of batch effects in large-scale proteomic studies: a
#>   tutorial Molecular Systems Biology 17, e10240 (2021)
#>   https://doi.org/10.15252/msb.202110240
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {Diagnostics and correction of batch effects in large-scale proteomic studies: a tutorial},
#>     author = {Jelena Cuklina and Chloe H. Lee and Evan G. Willams and Tatjana Sajic and Ben Collins and Maria Rodriguez-Martinez and Patrick Pedrioli and Ruedi Aebersold},
#>     journal = {Molecular Systems Biology},
#>     year = {2021},
#>     url = {https://doi.org/10.15252/msb.202110240},
#>     doi = {10.15252/msb.202110240},
#>   }

References

1.
Gatto, L. & Vanderaa, C. QFeatures: Quantitative features for mass spectrometry data. (2025). at <https://github.com/RforMassSpectrometry/QFeatures>
2.
Burankova, Y. et al. Privacy-preserving multicenter differential protein abundance analysis with FedProt. Nature Computational Science 1–14 (2025).
3.
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics 8, 118–127 (2006).
4.
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47–e47 (2015).
5.
Bushel, P. pvca: Principal variance component analysis (PVCA). (2025). doi:10.18129/B9.bioc.pvca
  • true