R version: R Under development (unstable) (2025-12-22 r89219)
Bioconductor version: 3.23
Package version: 0.99.0
CellMentor is a supervised, cell type-aware non-negative matrix factorization (NMF) method designed to improve cell type resolution in single-cell RNA sequencing analysis. By integrating cell type annotations into the NMF framework, CellMentor learns latent patterns from annotated reference datasets and uses them to improve cell type separation, clustering, and annotation.
Traditional dimensionality reduction methods for single-cell RNA sequencing data often fail to capture cell type-specific variation effectively. CellMentor addresses this limitation by incorporating supervised learning into the NMF framework, resulting in improved separation of cell types in the reduced dimensional space. This approach is particularly valuable when working with heterogeneous cell populations where subtle differences between cell types need to be preserved during dimensionality reduction.
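For readers new to NMF, the following base R sketch shows the plain, unsupervised factorization X ≈ WH using the classic multiplicative updates. It is illustrative only: the function name `nmf_sketch` and the toy data are hypothetical, and the code does not implement CellMentor's supervised objective, which additionally uses cell type labels.

```r
# Plain (unsupervised) NMF via multiplicative updates.
# Illustrative sketch only; this is NOT CellMentor's supervised objective.
nmf_sketch <- function(X, k, n_iter = 200, eps = 1e-9) {
    stopifnot(is.matrix(X), all(X >= 0), k >= 1)
    W <- matrix(runif(nrow(X) * k), nrow(X), k)  # genes x k loadings
    H <- matrix(runif(k * ncol(X)), k, ncol(X))  # k x cells embedding
    for (i in seq_len(n_iter)) {
        # Update H with W fixed, then W with H fixed; eps avoids division by zero
        H <- H * (crossprod(W, X) / (crossprod(W) %*% H + eps))
        W <- W * (tcrossprod(X, H) / (W %*% tcrossprod(H) + eps))
    }
    list(W = W, H = H)
}

set.seed(1)
X <- matrix(rpois(50 * 20, lambda = 3), nrow = 50)  # toy genes x cells counts
fit <- nmf_sketch(X, k = 3)
dim(fit$W)  # genes x k
dim(fit$H)  # k x cells
```

Both factors stay non-negative because each update only multiplies by non-negative ratios, which is what makes the resulting factors interpretable as additive gene programs.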
CellMentor distinguishes itself from other dimensionality reduction methods in the Bioconductor ecosystem by:
To install CellMentor from Bioconductor:
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("CellMentor")
This section demonstrates the core functionality of CellMentor using a practical example with pancreatic cell data.
set.seed(100)
library(CellMentor)
library(Matrix)
library(ggplot2)
library(scRNAseq)
library(SingleCellExperiment)
library(scater)
We’ll use the Baron pancreas dataset as our reference and the Muraro pancreas dataset as our query data:
# Loading reference dataset (Baron)
baron <- h.baron_dataset()
reference_matrix <- baron$data
reference_celltypes <- baron$celltypes
# Loading query dataset (Muraro)
muraro <- muraro_dataset()
query_matrix <- muraro$data
# In a real application the query labels would be unknown;
# we keep them here only for evaluation
query_celltypes <- muraro$celltypes
# Function to create balanced subsets
create_subset <- function(matrix, celltypes, cells_per_type = 30) {
    # Get unique cell types
    unique_types <- unique(celltypes)
    # Select cells for each type
    selected_cells <- c()
    for (cell_type in unique_types) {
        # Get cells of this type
        type_cells <- names(celltypes)[celltypes == cell_type]
        # If fewer cells than requested, take all of them
        n_to_select <- min(cells_per_type, length(type_cells))
        # Randomly select cells
        selected <- sample(type_cells, n_to_select)
        selected_cells <- c(selected_cells, selected)
    }
    # Return subset
    list(
        matrix = matrix[, selected_cells],
        celltypes = celltypes[selected_cells]
    )
}
# Create balanced subsets with 30 cells per type
baron_subset <- create_subset(reference_matrix, reference_celltypes, 30)
muraro_subset <- create_subset(query_matrix, query_celltypes, 30)
# Replace the full datasets with the balanced subsets
reference_matrix <- baron_subset$matrix
reference_celltypes <- baron_subset$celltypes
query_matrix <- muraro_subset$matrix
# In a real application the query labels would be unknown;
# we keep them here only for evaluation
query_celltypes <- muraro_subset$celltypes
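The balanced subsetting above can be checked on a small synthetic example. The sketch below is a self-contained variant that uses `split()` and `lapply()` instead of a loop. The toy labels and counts are made up for illustration and are not part of the pancreas datasets.

```r
set.seed(42)
# Hypothetical toy labels: two abundant types and one rare type
toy_types <- c(rep("alpha", 50), rep("beta", 40), rep("delta", 5))
names(toy_types) <- paste0("cell", seq_along(toy_types))
toy_counts <- matrix(
    rpois(10 * length(toy_types), lambda = 2),
    nrow = 10,
    dimnames = list(paste0("gene", 1:10), names(toy_types))
)

cells_per_type <- 30
# Group cell IDs by type, then draw at most cells_per_type IDs per group
by_type <- split(names(toy_types), toy_types)
picked <- unlist(lapply(by_type, function(ids) {
    ids[sample.int(length(ids), min(cells_per_type, length(ids)))]
}), use.names = FALSE)

sub_counts <- toy_counts[, picked, drop = FALSE]
sub_types <- toy_types[picked]
table(sub_types)
```

Types with fewer cells than the cap (here the rare "delta" type) are kept in full, so the subset is balanced only up to the number of cells actually available.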
The CSFNMF object is the core data structure that holds both reference and query data:
# Create the CSFNMF object
csfnmf_obj <- CreateCSFNMFobject(
    ref_matrix = reference_matrix,
    ref_celltype = reference_celltypes,
    data_matrix = query_matrix,
    norm = TRUE,
    most.variable = TRUE,
    scale = TRUE,
    scale_by = "cells",
    verbose = TRUE,
    num_cores = 1
)
CellMentor automatically searches for optimal hyperparameters:
# Run CellMentor with hyperparameter optimization
optimal_params <- CellMentor(
    csfnmf_obj,
    alpha_range = c(1, 5),  # Limited alpha range
    beta_range = c(5),      # use only one beta for speed
    gamma_range = c(0.1),   # use only one gamma for speed
    delta_range = c(1),     # use only one delta for speed
    num_cores = 1,
    verbose = TRUE
)
# Get best model
best_model <- optimal_params$best_model
K_VALUE <- cm_rank(best_model)
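To see why the ranges above are kept short, it can help to count candidate settings. The sketch below assumes the search evaluates every combination of the supplied values. That is an assumption for illustration, since the package may prune or sample the grid differently, and the wider ranges are hypothetical.

```r
# Grid implied by the ranges used in this demo (assuming a full grid search)
demo_grid <- expand.grid(alpha = c(1, 5), beta = 5, gamma = 0.1, delta = 1)
nrow(demo_grid)  # 2 candidate settings

# A hypothetical wider search grows multiplicatively: 3 * 3 * 2 * 2
wide_grid <- expand.grid(
    alpha = c(1, 5, 10),
    beta  = c(1, 5, 10),
    gamma = c(0.1, 1),
    delta = c(0.1, 1)
)
nrow(wide_grid)  # 36 candidate settings
```

Under that assumption, each extra value in any range multiplies the number of models to fit, so restricting three of the four ranges to a single value keeps the demo fast.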
Project the query data onto the learned space:
# Project query data onto the learned space
h_project <- project_data(
    W = W(best_model),                      # Learned gene weights
    X = data_matrix(matrices(best_model)),  # Query data matrix
    num_cores = 5,
    verbose = TRUE
)
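Conceptually, projection keeps the learned gene weights W fixed and solves for a non-negative embedding H such that X ≈ WH. The sketch below illustrates that idea with multiplicative updates in base R. The function `project_sketch` and the toy matrices are hypothetical, and the internals of `project_data()` may differ.

```r
# Solve for H >= 0 with W held fixed (illustrative sketch,
# not the implementation used by project_data()).
project_sketch <- function(W, X, n_iter = 200, eps = 1e-9) {
    stopifnot(nrow(W) == nrow(X), all(W >= 0), all(X >= 0))
    H <- matrix(1, ncol(W), ncol(X))  # K x cells, positive start
    WtX <- crossprod(W, X)            # K x cells, computed once
    WtW <- crossprod(W)               # K x K, computed once
    for (i in seq_len(n_iter)) {
        H <- H * (WtX / (WtW %*% H + eps))
    }
    H
}

set.seed(2)
W_toy <- matrix(runif(40 * 3), 40, 3)  # toy genes x K weights
H_true <- matrix(runif(3 * 15), 3, 15) # toy K x cells embedding
X_toy <- W_toy %*% H_true              # toy query data (genes x cells)

H_hat <- project_sketch(W_toy, X_toy)
dim(H_hat)                             # K x cells
cell_embedding <- t(H_hat)             # cells x K, as used for reducedDim()
```

Because W is not updated, query cells are expressed in the same factor space that was learned from the reference, which is what makes the embeddings comparable.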
Integrate the CellMentor results with SingleCellExperiment for visualization:
# Ensure unique rownames for genes
rownames(query_matrix) <- make.unique(rownames(query_matrix))
sce <- SingleCellExperiment(
    assays = list(counts = query_matrix)
)
# Store any per-cell annotations you have (here, the held-out Muraro labels)
colData(sce)$celltype <- query_celltypes
# Cell embeddings (cells x K)
H_cell <- t(as.matrix(h_project)) # ensure cells x K
colnames(H_cell) <- paste0("CM", seq_len(ncol(H_cell)))
reducedDim(sce, "CellMentor") <- H_cell
# Gene loadings (genes x K), aligned to SCE row order
W_mat <- as.matrix(W(best_model))  # genes x K
# Make sure rownames are set and alignable
if (!is.null(rownames(W_mat))) {
    # Match to SCE rows; missing become NA
    W_mat <- W_mat[match(rownames(sce), rownames(W_mat)), , drop = FALSE]
    # Add each factor as a rowData column for convenience
    for (j in seq_len(ncol(W_mat))) {
        rowData(sce)[[paste0("CM_loading_", j)]] <- W_mat[, j]
    }
}
# UMAP using the precomputed CellMentor embedding
sce <- runUMAP(
    sce,
    dimred = "CellMentor",
    name = "UMAP_CellMentor",
    ncomponents = 2
)
# Quick plots
plotReducedDim(sce, dimred = "UMAP_CellMentor", colour_by = "celltype")
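Since the query labels were retained for evaluation, one possible check is to cluster the embedding and compare the clusters with those labels. The sketch below computes the adjusted Rand index in base R on a synthetic embedding. The helper `adjusted_rand`, the toy data, and the choice of k-means are assumptions for illustration, not part of CellMentor.

```r
# Adjusted Rand index between two label vectors (base R only).
adjusted_rand <- function(a, b) {
    stopifnot(length(a) == length(b), length(a) > 1)
    tab <- table(a, b)
    pairs <- function(n) sum(choose(n, 2))
    index <- pairs(tab)                  # pairs grouped together in both
    row_pairs <- pairs(rowSums(tab))
    col_pairs <- pairs(colSums(tab))
    expected <- row_pairs * col_pairs / choose(length(a), 2)
    max_index <- (row_pairs + col_pairs) / 2
    if (max_index == expected) return(1) # degenerate: both partitions trivial
    (index - expected) / (max_index - expected)
}

# Toy embedding (cells x K) with two hypothetical, well-separated cell types
set.seed(3)
emb <- rbind(
    matrix(rnorm(40, mean = 0), ncol = 2),
    matrix(rnorm(40, mean = 6), ncol = 2)
)
truth <- rep(c("typeA", "typeB"), each = 20)
clusters <- kmeans(emb, centers = 2, nstart = 10)$cluster
adjusted_rand(truth, clusters)
```

With real data, `emb` would be replaced by the cells x K matrix stored in `reducedDim(sce, "CellMentor")` and `truth` by the retained query labels.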
This section provides guidance for applying CellMentor to your own datasets.
To use CellMentor, you need:
A reference count matrix (ref_counts); a vector of cell type labels of length ncol(ref_counts) (ref_celltypes), whose names must match colnames(ref_counts); and a query count matrix (qry_counts), with overlapping gene IDs.
Tip: Rows should be genes, columns should be cells. Use sparse matrices (Matrix::dgCMatrix) for improved speed and memory efficiency.
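Before building the object, it may be worth verifying these requirements explicitly. The helper below is a hypothetical sketch (`check_inputs` is not a CellMentor function) that checks the label vector against the reference columns and reports the gene IDs shared by reference and query.

```r
# Hypothetical pre-flight check for the inputs described above
check_inputs <- function(ref_counts, ref_celltypes, qry_counts) {
    stopifnot(
        !is.null(rownames(ref_counts)),
        !is.null(rownames(qry_counts)),
        !is.null(colnames(ref_counts)),
        length(ref_celltypes) == ncol(ref_counts),
        identical(names(ref_celltypes), colnames(ref_counts))
    )
    shared <- intersect(rownames(ref_counts), rownames(qry_counts))
    if (length(shared) == 0L) stop("reference and query share no gene IDs")
    shared
}

# Toy inputs: genes in rows, cells in columns
ref_counts <- matrix(
    1:12, nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("ref", 1:3))
)
qry_counts <- matrix(
    1:6, nrow = 3,
    dimnames = list(paste0("gene", 2:4), paste0("qry", 1:2))
)
ref_celltypes <- c(ref1 = "alpha", ref2 = "beta", ref3 = "alpha")

shared_genes <- check_inputs(ref_counts, ref_celltypes, qry_counts)
shared_genes
```

Whether CellMentor restricts the matrices to shared genes internally is not stated here, so a check like this is only a guard against obviously mismatched inputs.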
library(Matrix)
library(CellMentor)
# 1) Build CSFNMF object
csfnmf_obj <- CreateCSFNMFobject(
    ref_matrix = ref_counts,
    ref_celltype = ref_celltypes,  # names(ref_celltypes) == colnames(ref_counts)
    data_matrix = qry_counts,
    norm = TRUE,
    most.variable = TRUE,
    scale = TRUE,
    scale_by = "cells",
    num_cores = 1,
    verbose = TRUE
)
# 2) Hyperparameter search & training
optimal <- CellMentor(csfnmf_obj)
# 3) Get best model
best_model <- optimal$best_model
# 4) Project data
h_project <- project_data(
    W = W(best_model),
    X = data_matrix(matrices(best_model))
)
# 5) Optional: SingleCellExperiment integration & UMAP
# (Follow the same steps as in the demo section above)
sessionInfo()
#> R Under development (unstable) (2025-12-22 r89219)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] future_1.68.0 scater_1.39.2
#> [3] scuttle_1.21.0 scRNAseq_2.25.0
#> [5] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
#> [7] Biobase_2.71.0 GenomicRanges_1.63.1
#> [9] Seqinfo_1.1.0 IRanges_2.45.0
#> [11] S4Vectors_0.49.0 BiocGenerics_0.57.0
#> [13] generics_0.1.4 MatrixGenerics_1.23.0
#> [15] matrixStats_1.5.0 ggplot2_4.0.1
#> [17] Matrix_1.7-4 CellMentor_0.99.0
#> [19] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] ProtGenerics_1.43.0 spatstat.sparse_3.1-0
#> [3] bitops_1.0-9 httr_1.4.7
#> [5] RColorBrewer_1.1-3 tools_4.6.0
#> [7] sctransform_0.4.3 alabaster.base_1.11.1
#> [9] R6_2.6.1 HDF5Array_1.39.0
#> [11] lazyeval_0.2.2 uwot_0.2.4
#> [13] rhdf5filters_1.23.3 withr_3.0.2
#> [15] sp_2.2-0 prettyunits_1.2.0
#> [17] gridExtra_2.3 progressr_0.18.0
#> [19] cli_3.6.5 spatstat.explore_3.6-0
#> [21] fastDummies_1.7.5 labeling_0.4.3
#> [23] slam_0.1-55 alabaster.se_1.11.0
#> [25] entropy_1.3.2 sass_0.4.10
#> [27] Seurat_5.4.0 nnls_1.6
#> [29] S7_0.2.1 spatstat.data_3.1-9
#> [31] ggridges_0.5.7 pbapply_1.7-4
#> [33] Rsamtools_2.27.0 aricode_1.0.3
#> [35] dichromat_2.0-0.1 parallelly_1.46.1
#> [37] MLmetrics_1.1.3 RSQLite_2.4.5
#> [39] FNN_1.1.4.1 BiocIO_1.21.0
#> [41] ica_1.0-3 spatstat.random_3.4-3
#> [43] dplyr_1.1.4 ggbeeswarm_0.7.3
#> [45] abind_1.4-8 lifecycle_1.0.5
#> [47] yaml_2.3.12 rhdf5_2.55.12
#> [49] SparseArray_1.11.10 BiocFileCache_3.1.0
#> [51] Rtsne_0.17 grid_4.6.0
#> [53] blob_1.2.4 promises_1.5.0
#> [55] ExperimentHub_3.1.0 crayon_1.5.3
#> [57] miniUI_0.1.2 lattice_0.22-7
#> [59] beachmat_2.27.2 cowplot_1.2.0
#> [61] GenomicFeatures_1.63.1 cigarillo_1.1.0
#> [63] KEGGREST_1.51.1 magick_2.9.0
#> [65] pillar_1.11.1 knitr_1.51
#> [67] rjson_0.2.23 future.apply_1.20.1
#> [69] codetools_0.2-20 glue_1.8.0
#> [71] spatstat.univar_3.1-5 data.table_1.18.0
#> [73] vctrs_0.6.5 png_0.1-8
#> [75] gypsum_1.7.0 spam_2.11-3
#> [77] gtable_0.3.6 cachem_1.1.0
#> [79] xfun_0.55 S4Arrays_1.11.1
#> [81] mime_0.13 skmeans_0.2-18
#> [83] survival_3.8-3 tinytex_0.58
#> [85] fitdistrplus_1.2-4 ROCR_1.0-11
#> [87] lsa_0.73.4 nlme_3.1-168
#> [89] bit64_4.6.0-1 alabaster.ranges_1.11.0
#> [91] progress_1.2.3 filelock_1.0.3
#> [93] RcppAnnoy_0.0.23 GenomeInfoDb_1.47.2
#> [95] SnowballC_0.7.1 bslib_0.9.0
#> [97] irlba_2.3.5.1 vipor_0.4.7
#> [99] KernSmooth_2.23-26 otel_0.2.0
#> [101] DBI_1.2.3 tidyselect_1.2.1
#> [103] bit_4.6.0 compiler_4.6.0
#> [105] curl_7.0.0 httr2_1.2.2
#> [107] BiocNeighbors_2.5.0 h5mread_1.3.1
#> [109] DelayedArray_0.37.0 plotly_4.11.0
#> [111] bookdown_0.46 rtracklayer_1.71.3
#> [113] scales_1.4.0 lmtest_0.9-40
#> [115] rappdirs_0.3.3 stringr_1.6.0
#> [117] digest_0.6.39 goftest_1.2-3
#> [119] spatstat.utils_3.2-1 sparsesvd_0.2-3
#> [121] alabaster.matrix_1.11.0 rmarkdown_2.30
#> [123] XVector_0.51.0 htmltools_0.5.9
#> [125] pkgconfig_2.0.3 SingleR_2.13.1
#> [127] sparseMatrixStats_1.23.0 dbplyr_2.5.1
#> [129] fastmap_1.2.0 ensembldb_2.35.0
#> [131] rlang_1.1.7 htmlwidgets_1.6.4
#> [133] UCSC.utils_1.7.1 shiny_1.12.1
#> [135] DelayedMatrixStats_1.33.0 farver_2.1.2
#> [137] jquerylib_0.1.4 zoo_1.8-15
#> [139] jsonlite_2.0.0 BiocParallel_1.45.0
#> [141] BiocSingular_1.27.1 RCurl_1.98-1.17
#> [143] magrittr_2.0.4 dotCall64_1.2
#> [145] patchwork_1.3.2 Rhdf5lib_1.33.0
#> [147] Rcpp_1.1.1 viridis_0.6.5
#> [149] reticulate_1.44.1 stringi_1.8.7
#> [151] alabaster.schemas_1.11.0 MASS_7.3-65
#> [153] AnnotationHub_4.1.0 plyr_1.8.9
#> [155] parallel_4.6.0 listenv_0.10.0
#> [157] ggrepel_0.9.6 deldir_2.0-4
#> [159] Biostrings_2.79.4 splines_4.6.0
#> [161] tensor_1.5.1 hms_1.1.4
#> [163] igraph_2.2.1 spatstat.geom_3.6-1
#> [165] RcppHNSW_0.6.0 ScaledMatrix_1.19.0
#> [167] reshape2_1.4.5 BiocVersion_3.23.1
#> [169] XML_3.99-0.20 evaluate_1.0.5
#> [171] SeuratObject_5.3.0 BiocManager_1.30.27
#> [173] httpuv_1.6.16 RANN_2.6.2
#> [175] tidyr_1.3.2 RMTstat_0.3.1
#> [177] purrr_1.2.1 polyclip_1.10-7
#> [179] clue_0.3-66 alabaster.sce_1.11.0
#> [181] scattermore_1.2 rsvd_1.0.5
#> [183] xtable_1.8-4 restfulr_0.0.16
#> [185] AnnotationFilter_1.35.0 RSpectra_0.16-2
#> [187] later_1.4.5 viridisLite_0.4.2
#> [189] tibble_3.3.1 beeswarm_0.4.0
#> [191] memoise_2.0.1 AnnotationDbi_1.73.0
#> [193] GenomicAlignments_1.47.0 cluster_2.1.8.1
#> [195] globals_0.18.0
CellMentor: Cell-Type Aware Dimensionality Reduction for Single-cell RNA-Sequencing Data
Or Hevdeli†, Ekaterina Petrenko†, Dvir Aran
bioRxiv 2025.06.17.660094
doi: https://doi.org/10.1101/2025.06.17.660094
† These authors contributed equally to this work.