Accurate cell type annotation is essential but challenging in single-cell RNA sequencing (scRNA-seq). Manual approaches are time-consuming, subjective, and inconsistent, while automated classifiers often fail to generalize across tissues, conditions, or closely related cell types. A key limitation for both is the lack of true ground truth for benchmarking.
scTypeEval provides a ground-truth-agnostic framework to
systematically assess annotation quality using internal validation metrics. It
enables:
Load the package:
library(scTypeEval)
library(Matrix)
For this vignette, we’ll create a synthetic dataset representing single-cell RNA-seq data from multiple samples with distinct cell types. This simulates a typical scenario where we want to evaluate the consistency of cell type annotations across biological replicates.
set.seed(22)
# Number of genes and cells
n_genes <- 5000
n_cells_per_sample <- 500
n_samples <- 6
cell_types <- c("CD4_T", "CD8_T", "B_cells", "NK_cells")
# Generate gene names
gene_names <- paste0("Gene_", seq_len(n_genes))
# Total cells
total_cells <- n_cells_per_sample * n_samples
# Generate metadata: samples and cell types
samples <- rep(paste0("Sample", seq_len(n_samples)), each = n_cells_per_sample)
# Assign cell types evenly within each sample
n_cell_types <- length(cell_types)
cells_per_type <- n_cells_per_sample %/% n_cell_types
celltypes <- rep(rep(cell_types, each = cells_per_type), times = n_samples)
# Initialize count matrix
count_matrix <- matrix(0, nrow = n_genes, ncol = total_cells)
rownames(count_matrix) <- gene_names
colnames(count_matrix) <- paste0("Cell_", seq_len(total_cells))
# Add cell-type-specific expression patterns
for (i in seq_along(cell_types)) {
ct <- cell_types[i]
ct_indices <- which(celltypes == ct)
# Marker genes for this cell type (50 genes per type)
marker_start <- ((i - 1) * 50 + 1)
marker_end <- min(marker_start + 49, n_genes)
marker_genes <- marker_start:marker_end
# High expression in marker genes (lambda = 50)
count_matrix[marker_genes, ct_indices] <- matrix(
rpois(length(marker_genes) * length(ct_indices), lambda = 50),
nrow = length(marker_genes)
)
# Background expression in other genes (lambda = 5)
other_genes <- setdiff(seq_len(n_genes), marker_genes)
count_matrix[other_genes, ct_indices] <- matrix(
rpois(length(other_genes) * length(ct_indices), lambda = 5),
nrow = length(other_genes)
)
}
# Add sample-specific batch effects
for (s in seq_len(n_samples)) {
sample_cells <- which(samples == paste0("Sample", s))
batch_effect <- rnorm(1, mean = 1, sd = 0.1)
count_matrix[, sample_cells] <- round(
count_matrix[, sample_cells] * batch_effect
)
}
# Convert to sparse matrix
count_matrix <- Matrix(count_matrix, sparse = TRUE)
# Create metadata dataframe
metadata <- data.frame(
celltype = celltypes,
sample = samples,
batch = rep(
c("Batch1", "Batch2"),
each = n_cells_per_sample * (n_samples/2)
),
row.names = colnames(count_matrix),
stringsAsFactors = FALSE
)
# Preview the data
head(metadata)
#> celltype sample batch
#> Cell_1 CD4_T Sample1 Batch1
#> Cell_2 CD4_T Sample1 Batch1
#> Cell_3 CD4_T Sample1 Batch1
#> Cell_4 CD4_T Sample1 Batch1
#> Cell_5 CD4_T Sample1 Batch1
#> Cell_6 CD4_T Sample1 Batch1
dim(count_matrix)
#> [1] 5000 3000
scTypeEval objects can be created from:
# Create scTypeEval object from matrix and metadata
sceval <- create_scTypeEval(
matrix = count_matrix,
metadata = metadata
)
The scTypeEval object stores the raw count data and metadata, which will be
processed in subsequent steps.
Process and normalize the data, creating both single-cell and pseudobulk representations. This step:
# Process data with filtering criteria
sceval <- run_processing_data(
scTypeEval = sceval,
ident = "celltype", # Column defining cell type identities
sample = "sample", # Column defining sample IDs
min_samples = 3, # Minimum samples required per cell type
min_cells = 10 # Minimum cells per sample-celltype combination
)
The processed data is stored as data_assay objects within the scTypeEval
object, accessible for downstream analysis.
Identify biologically relevant features for evaluating consistency.
scTypeEval supports:
# Identify HVGs using the basic method
sceval <- run_hvg(
scTypeEval = sceval,
var_method = "basic", # Method for variance modeling: "basic" or "scran"
ngenes = 1000, # Number of HVGs to retain
sample = TRUE # Perform sample-level blocking
)
# View stored gene lists
gene_lists <- sceval@gene_lists
names(gene_lists)
#> [1] "HVG"
if ("HVG" %in% names(gene_lists)) {
length(gene_lists$HVG)
}
#> [1] 1000
# Identify marker genes per cell type
sceval <- run_gene_markers(
scTypeEval = sceval,
method = "scran.findMarkers", # Method for finding markers
ngenes_celltype = 50 # Maximum markers per cell type
)
# View marker genes
gene_lists <- sceval@gene_lists
if (!is.null(gene_lists$scran.findMarkers)) {
head(gene_lists$scran.findMarkers, 20)
}
#> [1] "Gene_119" "Gene_131" "Gene_125" "Gene_104" "Gene_149" "Gene_129"
#> [7] "Gene_147" "Gene_138" "Gene_133" "Gene_103" "Gene_112" "Gene_128"
#> [13] "Gene_121" "Gene_115" "Gene_148" "Gene_107" "Gene_116" "Gene_124"
#> [19] "Gene_110" "Gene_143"
You can also add custom gene lists using add_gene_list():
# Example: Add a custom list of immune response genes
custom_genes <- c("Gene_1", "Gene_50", "Gene_100", "Gene_150")
sceval <- add_gene_list(
scTypeEval = sceval,
gene_list = list("custom_genes" = custom_genes)
)
These different selected features are stored within the scTypeEval object and can be used in the subsequent steps.
gene_lists <- sceval@gene_lists
names(gene_lists)
#> [1] "HVG" "scran.findMarkers" "custom_genes"
Consistency metrics can be computed directly on the selected features. However, computing them in a low-dimensional space (e.g., PCA) significantly speeds up the analysis while yielding similar results.
# Run PCA on processed data
sceval <- run_pca(
scTypeEval = sceval,
ndim = 20, # Number of principal components
gene_list = "HVG" # specify gene list to subset if multiple were added
)
# View available reductions
reductions <- sceval@reductions
if (length(reductions) > 0) {
names(reductions)
}
#> [1] "single-cell" "pseudobulk"
Alternatively, pre-computed embeddings (PCA, UMAP, t-SNE) can be inserted using
add_dim_reduction().
The core of scTypeEval is computing pairwise dissimilarities between cell
types across samples. Multiple dissimilarity methods are supported:
Compute distances between pseudobulk gene expression profiles:
# Euclidean distance on pseudobulk profiles
sceval <- run_dissimilarity(
sceval,
method = "Pseudobulk:Euclidean",
reduction = FALSE # Use processed expression data of selected features
)
# Cosine distance on PCA embeddings
sceval <- run_dissimilarity(
sceval,
method = "Pseudobulk:Cosine",
reduction = TRUE # Use PCA space
)
Compute Wasserstein distances between single-cell distributions:
# Wasserstein distance on PCA embeddings (faster)
sceval <- run_dissimilarity(
sceval,
method = "WasserStein",
reduction = TRUE
)
Match cells across samples using a classifier:
# Reciprocal classification with SingleR
sceval <- run_dissimilarity(
sceval,
method = "recip_classif:Match",
reciprocal_classifier = "SingleR",
# usage of dimensional reduction not supported
# for this dissimilarity approach
reduction = FALSE
)
# List all computed dissimilarity matrices
dissim <- sceval@dissimilarity
if (length(dissim) > 0) {
names(dissim)
}
#> [1] "Pseudobulk:Euclidean" "Pseudobulk:Cosine" "WasserStein"
#> [4] "recip_classif:Match"
Evaluate inter-sample consistency for each cell type using internal validation
metrics. scTypeEval supports multiple consistency metrics:
# Compute silhouette consistency on Euclidean dissimilarity
consistency_sil <- get_consistency(
sceval,
dissimilarity_slot = "Pseudobulk:Euclidean",
consistency_metric = "silhouette"
)
# View results
print(consistency_sil)
#> celltype measure consistency_metric dissimilarity_method ident
#> CD4.T CD4.T 0.9392103 silhouette Pseudobulk:Euclidean celltype
#> CD8.T CD8.T 0.9372690 silhouette Pseudobulk:Euclidean celltype
#> B.cells B.cells 0.9382785 silhouette Pseudobulk:Euclidean celltype
#> NK.cells NK.cells 0.9370830 silhouette Pseudobulk:Euclidean celltype
Higher values indicate better consistency (samples with the same cell type annotation are more similar to each other than to samples with different annotations).
# Compute neighborhood purity on Wasserstein dissimilarity
consistency_nbhd <- get_consistency(
sceval,
dissimilarity_slot = "WasserStein",
consistency_metric = "NeighborhoodPurity"
)
# View results
print(consistency_nbhd)
#> celltype measure consistency_metric dissimilarity_method ident
#> CD4.T CD4.T 1 NeighborhoodPurity WasserStein celltype
#> CD8.T CD8.T 1 NeighborhoodPurity WasserStein celltype
#> B.cells B.cells 1 NeighborhoodPurity WasserStein celltype
#> NK.cells NK.cells 1 NeighborhoodPurity WasserStein celltype
consistency_all <-
get_consistency(
sceval,
dissimilarity_slot = c("Pseudobulk:Euclidean",
"WasserStein"), # allow multiple arguments
consistency_metric = c("silhouette",
"NeighborhoodPurity",
"Average_similarity") # allow multiple arguments
)
print(consistency_all)
#> celltype measure consistency_metric dissimilarity_method ident
#> CD4.T CD4.T 0.9392103 silhouette Pseudobulk:Euclidean celltype
#> CD8.T CD8.T 0.9372690 silhouette Pseudobulk:Euclidean celltype
#> B.cells B.cells 0.9382785 silhouette Pseudobulk:Euclidean celltype
#> NK.cells NK.cells 0.9370830 silhouette Pseudobulk:Euclidean celltype
#> CD4.T1 CD4.T 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> CD8.T1 CD8.T 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> B.cells1 B.cells 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> NK.cells1 NK.cells 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> CD4.T2 CD4.T 0.9427627 Average_similarity Pseudobulk:Euclidean celltype
#> CD8.T2 CD8.T 0.9410225 Average_similarity Pseudobulk:Euclidean celltype
#> B.cells2 B.cells 0.9419125 Average_similarity Pseudobulk:Euclidean celltype
#> NK.cells2 NK.cells 0.9408865 Average_similarity Pseudobulk:Euclidean celltype
#> CD4.T3 CD4.T 0.9008518 silhouette WasserStein celltype
#> CD8.T3 CD8.T 0.9009151 silhouette WasserStein celltype
#> B.cells3 B.cells 0.8998458 silhouette WasserStein celltype
#> NK.cells3 NK.cells 0.8996690 silhouette WasserStein celltype
#> CD4.T11 CD4.T 1.0000000 NeighborhoodPurity WasserStein celltype
#> CD8.T11 CD8.T 1.0000000 NeighborhoodPurity WasserStein celltype
#> B.cells11 B.cells 1.0000000 NeighborhoodPurity WasserStein celltype
#> NK.cells11 NK.cells 1.0000000 NeighborhoodPurity WasserStein celltype
#> CD4.T21 CD4.T 0.9099102 Average_similarity WasserStein celltype
#> CD8.T21 CD8.T 0.9099267 Average_similarity WasserStein celltype
#> B.cells21 B.cells 0.9090413 Average_similarity WasserStein celltype
#> NK.cells21 NK.cells 0.9089435 Average_similarity WasserStein celltype
Visualize dissimilarity matrices as annotated heatmaps. Cell types can be ordered by similarity or consistency scores:
# Plot dissimilarity heatmap with consistency-based ordering
plot_heatmap(
sceval,
dissimilarity_slot = "Pseudobulk:Euclidean",
sort_consistency = "silhouette", # Order by silhouette scores
sort_similarity = NULL # Or order by similarity
)
The heatmap shows pairwise dissimilarities between samples, grouped by cell type. Cell types with higher consistency (e.g., more coherent annotations) show tighter clustering of their samples.
Pseudobulk PCA per sample & cell type
plot_pca(
sceval,
reduction_slot = "pseudobulk"
)
Cell types with low consistency scores may indicate:
sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
#> [3] Biobase_2.71.0 GenomicRanges_1.63.2
#> [5] Seqinfo_1.1.0 IRanges_2.45.0
#> [7] S4Vectors_0.49.1-1 BiocGenerics_0.57.0
#> [9] generics_0.1.4 MatrixGenerics_1.23.0
#> [11] matrixStats_1.5.0 Seurat_5.4.0
#> [13] SeuratObject_5.4.0 sp_2.2-1
#> [15] Matrix_1.7-5 scTypeEval_0.99.31
#> [17] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.23 splines_4.6.0 later_1.4.8
#> [4] tibble_3.3.1 polyclip_1.10-7 fastDummies_1.7.5
#> [7] lifecycle_1.0.5 edgeR_4.9.7 globals_0.19.1
#> [10] lattice_0.22-9 MASS_7.3-65 magrittr_2.0.5
#> [13] limma_3.67.1 plotly_4.12.0 sass_0.4.10
#> [16] rmarkdown_2.31 jquerylib_0.1.4 yaml_2.3.12
#> [19] metapod_1.19.2 httpuv_1.6.17 otel_0.2.0
#> [22] sctransform_0.4.3 spam_2.11-3 spatstat.sparse_3.1-0
#> [25] reticulate_1.46.0 cowplot_1.2.0 pbapply_1.7-4
#> [28] RColorBrewer_1.1-3 abind_1.4-8 Rtsne_0.17
#> [31] purrr_1.2.2 SingleR_2.13.5 transport_0.15-4
#> [34] ggrepel_0.9.8 irlba_2.3.7 listenv_0.10.1
#> [37] spatstat.utils_3.2-2 goftest_1.2-3 RSpectra_0.16-2
#> [40] spatstat.random_3.4-5 dqrng_0.4.1 fitdistrplus_1.2-6
#> [43] parallelly_1.46.1 codetools_0.2-20 DelayedArray_0.37.1
#> [46] scuttle_1.21.6 tidyselect_1.2.1 farver_2.1.2
#> [49] ScaledMatrix_1.19.0 spatstat.explore_3.8-0 jsonlite_2.0.0
#> [52] BiocNeighbors_2.5.4 progressr_0.19.0 ggridges_0.5.7
#> [55] survival_3.8-6 tools_4.6.0 ica_1.0-3
#> [58] Rcpp_1.1.1-1 glue_1.8.0 gridExtra_2.3
#> [61] SparseArray_1.11.13 xfun_0.57 dplyr_1.2.1
#> [64] withr_3.0.2 BiocManager_1.30.27 fastmap_1.2.0
#> [67] bluster_1.21.1 digest_0.6.39 rsvd_1.0.5
#> [70] R6_2.6.1 mime_0.13 scattermore_1.2
#> [73] tensor_1.5.1 dichromat_2.0-0.1 spatstat.data_3.1-9
#> [76] tidyr_1.3.2 data.table_1.18.2.1 httr_1.4.8
#> [79] htmlwidgets_1.6.4 S4Arrays_1.11.1 uwot_0.2.4
#> [82] pkgconfig_2.0.3 gtable_0.3.6 lmtest_0.9-40
#> [85] S7_0.2.1-1 XVector_0.51.0 htmltools_0.5.9
#> [88] dotCall64_1.2 bookdown_0.46 scales_1.4.0
#> [91] png_0.1-9 spatstat.univar_3.1-7 scran_1.39.2
#> [94] knitr_1.51 reshape2_1.4.5 nlme_3.1-169
#> [97] cachem_1.1.0 zoo_1.8-15 stringr_1.6.0
#> [100] KernSmooth_2.23-26 parallel_4.6.0 miniUI_0.1.2
#> [103] pillar_1.11.1 grid_4.6.0 vctrs_0.7.3
#> [106] RANN_2.6.2 promises_1.5.0 BiocSingular_1.27.1
#> [109] beachmat_2.27.5 xtable_1.8-8 cluster_2.1.8.2
#> [112] evaluate_1.0.5 tinytex_0.59 magick_2.9.1
#> [115] cli_3.6.6 locfit_1.5-9.12 compiler_4.6.0
#> [118] rlang_1.2.0 future.apply_1.20.2 labeling_0.4.3
#> [121] plyr_1.8.9 stringi_1.8.7 viridisLite_0.4.3
#> [124] deldir_2.0-4 BiocParallel_1.45.0 lazyeval_0.2.3
#> [127] spatstat.geom_3.7-3 RcppHNSW_0.6.0 patchwork_1.3.2
#> [130] future_1.70.0 ggplot2_4.0.2 statmod_1.5.1
#> [133] shiny_1.13.0 ROCR_1.0-12 igraph_2.2.3
#> [136] bslib_0.10.0
The manuscript describing these methods is currently in preparation. For questions or issues, please visit: