R version: R Under development (unstable) (2025-12-22 r89219)
Bioconductor version: 3.23
Package version: 0.99.0

1 Introduction

CellMentor is a supervised, cell-type-aware non-negative matrix factorization (NMF) method designed to improve cell type resolution in single-cell RNA sequencing (scRNA-seq) analysis. By integrating cell type annotations into the NMF framework, CellMentor learns latent patterns from reference datasets and uses them to improve cell type separation, clustering, and annotation.

Traditional dimensionality reduction methods for single-cell RNA sequencing data often fail to capture cell type-specific variation effectively. CellMentor addresses this limitation by incorporating supervised learning into the NMF framework, resulting in improved separation of cell types in the reduced dimensional space. This approach is particularly valuable when working with heterogeneous cell populations where subtle differences between cell types need to be preserved during dimensionality reduction.
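
For orientation, the unsupervised baseline that CellMentor builds on can be sketched in a few lines of base R. The block below is standard NMF with Lee-Seung multiplicative updates, not the CSFNMF algorithm itself; the function name nmf_sketch and the toy matrix are illustrative assumptions.

```r
# Plain (unsupervised) NMF: approximate a non-negative matrix X (genes x cells)
# by W %*% H, with W (genes x k) and H (k x cells) both non-negative.
# This is NOT CellMentor's supervised factorization, only the baseline idea.
nmf_sketch <- function(X, k, n_iter = 200, eps = 1e-9) {
  W <- matrix(runif(nrow(X) * k), nrow(X), k)   # gene loadings
  H <- matrix(runif(k * ncol(X)), k, ncol(X))   # cell embeddings
  for (i in seq_len(n_iter)) {
    # Multiplicative updates keep every entry non-negative
    H <- H * crossprod(W, X) / (crossprod(W) %*% H + eps)
    W <- W * tcrossprod(X, H) / (W %*% tcrossprod(H) + eps)
  }
  list(W = W, H = H)
}

set.seed(1)
X <- matrix(rpois(50 * 20, lambda = 3), nrow = 50)  # toy count matrix
fit <- nmf_sketch(X, k = 3)
dim(fit$W)  # genes x k
dim(fit$H)  # k x cells
```

Nothing in this baseline uses cell type labels, which is the gap the supervised formulation is meant to close.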

1.1 Key Features

  • Improved cell type separation through constrained supervised factorization (CSFNMF)
  • Automated hyperparameter search over user-supplied ranges
  • Efficient projection of query datasets onto learned cell type spaces
  • Seamless integration with Seurat for visualization and downstream analysis

1.2 Comparison to Existing Methods

CellMentor distinguishes itself from other dimensionality reduction methods in the Bioconductor ecosystem by:

  • Incorporating cell type annotations directly into the factorization process, unlike unsupervised methods such as PCA or standard NMF
  • Providing interpretable gene loadings that reflect cell type-specific expression patterns
  • Enabling transfer learning from reference to query datasets while maintaining cell type resolution
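
As a rough illustration of what interpretable gene loadings can look like in practice, the sketch below ranks genes by their loading on each factor. W_toy is a made-up stand-in for a genes × K loading matrix such as the one returned by W(best_model) later in this vignette, and top_genes is a hypothetical helper, not a CellMentor function.

```r
# Toy loading matrix (hypothetical data): 20 genes x 3 factors
set.seed(1)
W_toy <- matrix(runif(20 * 3), nrow = 20,
                dimnames = list(paste0("gene", 1:20), paste0("CM", 1:3)))

# For each factor (column), return the n genes with the largest loadings
top_genes <- function(W, n = 5) {
  apply(W, 2, function(w) rownames(W)[order(w, decreasing = TRUE)[seq_len(n)]])
}

top <- top_genes(W_toy, n = 5)
top  # character matrix: rows are ranks, columns are factors
```

With a real model, the top-ranked genes per factor can be compared against known marker genes for each cell type.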

2 Installation

To install CellMentor from Bioconductor:

if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("CellMentor")

3 Basic Usage

This section demonstrates the core functionality of CellMentor using a practical example with pancreatic cell data.

3.1 Load Required Packages

set.seed(100)

library(CellMentor)
library(Matrix)
library(ggplot2)
library(scRNAseq)
library(SingleCellExperiment)
library(scater) 

3.2 1. Load Example Data

We’ll use the Baron pancreas dataset as our reference and the Muraro pancreas dataset as our query data:

# Loading reference dataset (Baron)
baron <- h.baron_dataset()
reference_matrix <- baron$data
reference_celltypes <- baron$celltypes

# Loading query dataset (Muraro)
muraro <- muraro_dataset()
query_matrix <- muraro$data
query_celltypes <- muraro$celltypes # This would be unknown in a real application
# We keep it here for evaluation

4 (Optional) Create smaller subsets for faster demonstration

# Function to create balanced subsets
create_subset <- function(matrix, celltypes, cells_per_type = 30) {
  # Get unique cell types
  unique_types <- unique(celltypes)

  # Select cells for each type
  selected_cells <- c()
  for (cell_type in unique_types) {
    # Get cells of this type
    type_cells <- names(celltypes)[celltypes == cell_type]

    # If fewer cells than requested, take all of them
    n_to_select <- min(cells_per_type, length(type_cells))

    # Randomly select cells
    selected <- sample(type_cells, n_to_select)
    selected_cells <- c(selected_cells, selected)
  }

  # Return subset
  list(
    matrix = matrix[, selected_cells],
    celltypes = celltypes[selected_cells]
  )
}

# Create balanced subsets with 30 cells per type
baron_subset <- create_subset(reference_matrix, reference_celltypes, 30)
muraro_subset <- create_subset(query_matrix, query_celltypes, 30)

# Replace the full datasets with the subsets for the rest of the demo
reference_matrix <- baron_subset$matrix
reference_celltypes <- baron_subset$celltypes
query_matrix <- muraro_subset$matrix
query_celltypes <- muraro_subset$celltypes # This would be unknown in a real application
# We keep it here for evaluation
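
It can be worth checking that a subset is balanced and that labels still line up with matrix columns. The following self-contained sketch repeats the sampling idea on simulated data; toy_counts and toy_types are hypothetical stand-ins for the reference matrix and its named label vector.

```r
set.seed(100)
# Toy stand-ins (hypothetical): named labels and a genes x cells count matrix
toy_types <- setNames(rep(c("alpha", "beta", "delta"), times = c(50, 40, 10)),
                      paste0("cell", 1:100))
toy_counts <- matrix(rpois(5 * 100, 2), nrow = 5,
                     dimnames = list(paste0("gene", 1:5), names(toy_types)))

# Same idea as create_subset(): sample up to n cells per type
n <- 30
keep <- unlist(lapply(split(names(toy_types), toy_types),
                      function(ids) sample(ids, min(n, length(ids)))),
               use.names = FALSE)
sub_counts <- toy_counts[, keep]
sub_types <- toy_types[keep]

# At most n cells per type; types with fewer than n cells are kept whole
table(sub_types)

# Labels must stay aligned with matrix columns
stopifnot(identical(colnames(sub_counts), names(sub_types)))
```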

4.1 2. Create CSFNMF Object

The CSFNMF object is the core data structure that holds both reference and query data:

# Create the CSFNMF object
csfnmf_obj <- CreateCSFNMFobject(
  ref_matrix = reference_matrix,
  ref_celltype = reference_celltypes,
  data_matrix = query_matrix,
  norm = TRUE,
  most.variable = TRUE,
  scale = TRUE,
  scale_by = "cells",
  verbose = TRUE,
  num_cores = 1
)

4.2 3. Run CellMentor with Hyperparameter Optimization

CellMentor searches the supplied hyperparameter ranges (alpha, beta, gamma, and delta) and returns the best-performing model:

# Run CellMentor with hyperparameter optimization
optimal_params <- CellMentor(
  csfnmf_obj,
  alpha_range = c(1, 5), # Limited alpha range
  beta_range = c(5), # use only one beta for speed
  gamma_range = c(0.1), # use only one gamma for speed
  delta_range = c(1), # use only one delta for speed
  num_cores = 1,
  verbose = TRUE
)

# Get best model
best_model <- optimal_params$best_model
K_VALUE <- cm_rank(best_model)
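
To get a feel for the cost of a search, it helps to count the candidate settings. How CellMentor combines the ranges internally is not described here; the sketch below assumes a full grid over all four ranges, which is an assumption rather than documented behavior, and uses only base R.

```r
# Assumption: candidate values are crossed as a full grid
grid <- expand.grid(alpha = c(1, 5), beta = 5, gamma = 0.1, delta = 1)
grid
nrow(grid)  # number of candidate combinations under this assumption
```

Under that assumption, widening any one range multiplies the number of models to fit, which is why the demo fixes beta, gamma, and delta to single values.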

4.3 4. Project Data

Project the query data onto the learned space:

# Project query data onto the learned space
h_project <- project_data(
  W = W(best_model), # Learned gene weights
  X = data_matrix(matrices(best_model)), # Query data matrix
  num_cores = 5,
  verbose = TRUE
)
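
Conceptually, projection means holding the learned gene weights W fixed and estimating a non-negative embedding H for the query cells. The sketch below shows one generic way to do that with multiplicative updates in base R; it is an illustration of the idea, not necessarily how project_data() is implemented, and project_sketch and the toy matrices are hypothetical.

```r
# Estimate non-negative H (K x cells) so that X is approximately W %*% H,
# keeping W (genes x K) fixed throughout.
project_sketch <- function(W, X, n_iter = 500, eps = 1e-9) {
  H <- matrix(runif(ncol(W) * ncol(X)), ncol(W), ncol(X))
  WtX <- crossprod(W, X)   # K x cells, computed once
  WtW <- crossprod(W)      # K x K, computed once
  for (i in seq_len(n_iter)) {
    H <- H * WtX / (WtW %*% H + eps)
  }
  H
}

set.seed(1)
W_toy <- matrix(runif(40 * 3), nrow = 40)   # genes x K (toy weights)
H_true <- matrix(runif(3 * 15), nrow = 3)   # K x cells (toy embedding)
X_toy <- W_toy %*% H_true                   # noiseless toy query data
H_hat <- project_sketch(W_toy, X_toy)

dim(H_hat)                                  # K x cells
norm(W_toy %*% H_hat - X_toy, "F")          # reconstruction error
```

The result has one column per query cell, which is why the next step transposes h_project to obtain a cells × K embedding.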

4.4 5. Integration with SingleCellExperiment

Integrate the CellMentor results with SingleCellExperiment for visualization:

# Ensure unique rownames for genes
rownames(query_matrix) <- make.unique(rownames(query_matrix))

sce <- SingleCellExperiment(
  assays = list(counts = query_matrix)
)

# Store per-cell annotations (here, the known query labels kept for evaluation)
colData(sce)$celltype <- query_celltypes

# Cell embeddings (cells x K)
H_cell <- t(as.matrix(h_project))      # ensure cells x K
colnames(H_cell) <- paste0("CM", seq_len(ncol(H_cell)))
reducedDim(sce, "CellMentor") <- H_cell

# Gene loadings (genes x K) — align to SCE row order
W_mat <- as.matrix(W(best_model))      # genes x K
# Make sure rownames are set and alignable
if (!is.null(rownames(W_mat))) {
  # Match to SCE rows; missing become NA
  W_mat <- W_mat[match(rownames(sce), rownames(W_mat)), , drop = FALSE]
  # Add each factor as a rowData column for convenience
  for (j in seq_len(ncol(W_mat))) {
    rowData(sce)[[paste0("CM_loading_", j)]] <- W_mat[, j]
  }
}

# UMAP using the precomputed CellMentor embedding
sce <- runUMAP(
  sce,
  dimred = "CellMentor",
  name   = "UMAP_CellMentor",
  ncomponents = 2
)

# Quick plots
plotReducedDim(sce, dimred = "UMAP_CellMentor", colour_by = "celltype")
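
Because the query labels were kept for evaluation, the embedding can also be assessed numerically. One simple option, sketched here with base R only, is to cluster the embedding and cross-tabulate clusters against the known labels. H_toy and labels are simulated stand-ins for H_cell and query_celltypes; k-means and the purity score are illustrative choices, not part of CellMentor.

```r
set.seed(1)
# Toy stand-ins (hypothetical): a cells x K embedding and known labels
labels <- rep(c("alpha", "beta", "delta"), each = 20)
centers <- rbind(c(5, 0, 0), c(0, 5, 0), c(0, 0, 5))
H_toy <- centers[as.integer(factor(labels)), ] +
  matrix(rnorm(60 * 3, sd = 0.3), ncol = 3)

# Cluster the embedding and cross-tabulate clusters against labels
km <- kmeans(H_toy, centers = 3, nstart = 10)
tab <- table(cluster = km$cluster, celltype = labels)
tab

# Purity: fraction of cells that belong to the majority type of their cluster
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

Purity close to 1 indicates that clusters in the embedding correspond closely to annotated cell types.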

5 Running CellMentor on Your Own Data

This section provides guidance for applying CellMentor to your own datasets.

5.1 Required Inputs

To use CellMentor, you need:

  • Reference counts matrix: genes × cells (ref_counts)
  • Reference annotations: vector of length ncol(ref_counts) (ref_celltypes); its names must match colnames(ref_counts)
  • Query counts matrix: genes × cells (qry_counts), with overlapping gene IDs

Tip: Rows should be genes and columns should be cells. Use sparse matrices (class dgCMatrix from the Matrix package) for improved speed and memory efficiency.
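
Before building the CSFNMF object, the requirements above can be checked up front. The sketch below uses small simulated matrices in place of real data; the object names follow this section, while the data and the specific checks are illustrative assumptions.

```r
library(Matrix)

# Toy stand-ins for ref_counts, ref_celltypes and qry_counts (hypothetical data)
set.seed(1)
ref_counts <- matrix(rpois(6 * 4, 2), nrow = 6,
                     dimnames = list(paste0("gene", 1:6), paste0("ref", 1:4)))
ref_celltypes <- setNames(c("alpha", "alpha", "beta", "beta"),
                          colnames(ref_counts))
qry_counts <- matrix(rpois(5 * 3, 2), nrow = 5,
                     dimnames = list(paste0("gene", 2:6), paste0("qry", 1:3)))

# Convert dense matrices to sparse storage
ref_counts <- Matrix(ref_counts, sparse = TRUE)
qry_counts <- Matrix(qry_counts, sparse = TRUE)

# Checks mirroring the requirements listed above
shared_genes <- intersect(rownames(ref_counts), rownames(qry_counts))
stopifnot(
  length(ref_celltypes) == ncol(ref_counts),
  identical(names(ref_celltypes), colnames(ref_counts)),
  length(shared_genes) > 0
)
length(shared_genes)  # number of genes present in both datasets
```

If any check fails, stopifnot() reports the failing condition, which is usually easier to debug than an error raised deep inside model fitting.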

5.2 Workflow

library(Matrix)
library(CellMentor)

# 1) Build CSFNMF object
csfnmf_obj <- CreateCSFNMFobject(
  ref_matrix = ref_counts,
  ref_celltype = ref_celltypes, # names(ref_celltypes) == colnames(ref_counts)
  data_matrix = qry_counts,
  norm = TRUE,
  most.variable = TRUE,
  scale = TRUE,
  scale_by = "cells",
  num_cores = 1,
  verbose = TRUE
)

# 2) Hyperparameter search & training
optimal <- CellMentor(csfnmf_obj)

# 3) Get best model
best_model <- optimal$best_model

# 4) Project data
h_project <- project_data(
  W = W(best_model),
  X = data_matrix(matrices(best_model))
)

# 5) Optional: SingleCellExperiment integration & UMAP
# (Follow the same steps as in the demo section above)

6 Session Information

sessionInfo()
#> R Under development (unstable) (2025-12-22 r89219)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] future_1.68.0               scater_1.39.2              
#>  [3] scuttle_1.21.0              scRNAseq_2.25.0            
#>  [5] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
#>  [7] Biobase_2.71.0              GenomicRanges_1.63.1       
#>  [9] Seqinfo_1.1.0               IRanges_2.45.0             
#> [11] S4Vectors_0.49.0            BiocGenerics_0.57.0        
#> [13] generics_0.1.4              MatrixGenerics_1.23.0      
#> [15] matrixStats_1.5.0           ggplot2_4.0.1              
#> [17] Matrix_1.7-4                CellMentor_0.99.0          
#> [19] BiocStyle_2.39.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] ProtGenerics_1.43.0       spatstat.sparse_3.1-0    
#>   [3] bitops_1.0-9              httr_1.4.7               
#>   [5] RColorBrewer_1.1-3        tools_4.6.0              
#>   [7] sctransform_0.4.3         alabaster.base_1.11.1    
#>   [9] R6_2.6.1                  HDF5Array_1.39.0         
#>  [11] lazyeval_0.2.2            uwot_0.2.4               
#>  [13] rhdf5filters_1.23.3       withr_3.0.2              
#>  [15] sp_2.2-0                  prettyunits_1.2.0        
#>  [17] gridExtra_2.3             progressr_0.18.0         
#>  [19] cli_3.6.5                 spatstat.explore_3.6-0   
#>  [21] fastDummies_1.7.5         labeling_0.4.3           
#>  [23] slam_0.1-55               alabaster.se_1.11.0      
#>  [25] entropy_1.3.2             sass_0.4.10              
#>  [27] Seurat_5.4.0              nnls_1.6                 
#>  [29] S7_0.2.1                  spatstat.data_3.1-9      
#>  [31] ggridges_0.5.7            pbapply_1.7-4            
#>  [33] Rsamtools_2.27.0          aricode_1.0.3            
#>  [35] dichromat_2.0-0.1         parallelly_1.46.1        
#>  [37] MLmetrics_1.1.3           RSQLite_2.4.5            
#>  [39] FNN_1.1.4.1               BiocIO_1.21.0            
#>  [41] ica_1.0-3                 spatstat.random_3.4-3    
#>  [43] dplyr_1.1.4               ggbeeswarm_0.7.3         
#>  [45] abind_1.4-8               lifecycle_1.0.5          
#>  [47] yaml_2.3.12               rhdf5_2.55.12            
#>  [49] SparseArray_1.11.10       BiocFileCache_3.1.0      
#>  [51] Rtsne_0.17                grid_4.6.0               
#>  [53] blob_1.2.4                promises_1.5.0           
#>  [55] ExperimentHub_3.1.0       crayon_1.5.3             
#>  [57] miniUI_0.1.2              lattice_0.22-7           
#>  [59] beachmat_2.27.2           cowplot_1.2.0            
#>  [61] GenomicFeatures_1.63.1    cigarillo_1.1.0          
#>  [63] KEGGREST_1.51.1           magick_2.9.0             
#>  [65] pillar_1.11.1             knitr_1.51               
#>  [67] rjson_0.2.23              future.apply_1.20.1      
#>  [69] codetools_0.2-20          glue_1.8.0               
#>  [71] spatstat.univar_3.1-5     data.table_1.18.0        
#>  [73] vctrs_0.6.5               png_0.1-8                
#>  [75] gypsum_1.7.0              spam_2.11-3              
#>  [77] gtable_0.3.6              cachem_1.1.0             
#>  [79] xfun_0.55                 S4Arrays_1.11.1          
#>  [81] mime_0.13                 skmeans_0.2-18           
#>  [83] survival_3.8-3            tinytex_0.58             
#>  [85] fitdistrplus_1.2-4        ROCR_1.0-11              
#>  [87] lsa_0.73.4                nlme_3.1-168             
#>  [89] bit64_4.6.0-1             alabaster.ranges_1.11.0  
#>  [91] progress_1.2.3            filelock_1.0.3           
#>  [93] RcppAnnoy_0.0.23          GenomeInfoDb_1.47.2      
#>  [95] SnowballC_0.7.1           bslib_0.9.0              
#>  [97] irlba_2.3.5.1             vipor_0.4.7              
#>  [99] KernSmooth_2.23-26        otel_0.2.0               
#> [101] DBI_1.2.3                 tidyselect_1.2.1         
#> [103] bit_4.6.0                 compiler_4.6.0           
#> [105] curl_7.0.0                httr2_1.2.2              
#> [107] BiocNeighbors_2.5.0       h5mread_1.3.1            
#> [109] DelayedArray_0.37.0       plotly_4.11.0            
#> [111] bookdown_0.46             rtracklayer_1.71.3       
#> [113] scales_1.4.0              lmtest_0.9-40            
#> [115] rappdirs_0.3.3            stringr_1.6.0            
#> [117] digest_0.6.39             goftest_1.2-3            
#> [119] spatstat.utils_3.2-1      sparsesvd_0.2-3          
#> [121] alabaster.matrix_1.11.0   rmarkdown_2.30           
#> [123] XVector_0.51.0            htmltools_0.5.9          
#> [125] pkgconfig_2.0.3           SingleR_2.13.1           
#> [127] sparseMatrixStats_1.23.0  dbplyr_2.5.1             
#> [129] fastmap_1.2.0             ensembldb_2.35.0         
#> [131] rlang_1.1.7               htmlwidgets_1.6.4        
#> [133] UCSC.utils_1.7.1          shiny_1.12.1             
#> [135] DelayedMatrixStats_1.33.0 farver_2.1.2             
#> [137] jquerylib_0.1.4           zoo_1.8-15               
#> [139] jsonlite_2.0.0            BiocParallel_1.45.0      
#> [141] BiocSingular_1.27.1       RCurl_1.98-1.17          
#> [143] magrittr_2.0.4            dotCall64_1.2            
#> [145] patchwork_1.3.2           Rhdf5lib_1.33.0          
#> [147] Rcpp_1.1.1                viridis_0.6.5            
#> [149] reticulate_1.44.1         stringi_1.8.7            
#> [151] alabaster.schemas_1.11.0  MASS_7.3-65              
#> [153] AnnotationHub_4.1.0       plyr_1.8.9               
#> [155] parallel_4.6.0            listenv_0.10.0           
#> [157] ggrepel_0.9.6             deldir_2.0-4             
#> [159] Biostrings_2.79.4         splines_4.6.0            
#> [161] tensor_1.5.1              hms_1.1.4                
#> [163] igraph_2.2.1              spatstat.geom_3.6-1      
#> [165] RcppHNSW_0.6.0            ScaledMatrix_1.19.0      
#> [167] reshape2_1.4.5            BiocVersion_3.23.1       
#> [169] XML_3.99-0.20             evaluate_1.0.5           
#> [171] SeuratObject_5.3.0        BiocManager_1.30.27      
#> [173] httpuv_1.6.16             RANN_2.6.2               
#> [175] tidyr_1.3.2               RMTstat_0.3.1            
#> [177] purrr_1.2.1               polyclip_1.10-7          
#> [179] clue_0.3-66               alabaster.sce_1.11.0     
#> [181] scattermore_1.2           rsvd_1.0.5               
#> [183] xtable_1.8-4              restfulr_0.0.16          
#> [185] AnnotationFilter_1.35.0   RSpectra_0.16-2          
#> [187] later_1.4.5               viridisLite_0.4.2        
#> [189] tibble_3.3.1              beeswarm_0.4.0           
#> [191] memoise_2.0.1             AnnotationDbi_1.73.0     
#> [193] GenomicAlignments_1.47.0  cluster_2.1.8.1          
#> [195] globals_0.18.0

7 References

Appendix

CellMentor: Cell-Type Aware Dimensionality Reduction for Single-cell RNA-Sequencing Data
Or Hevdeli†, Ekaterina Petrenko†, Dvir Aran
bioRxiv 2025.06.17.660094
doi: https://doi.org/10.1101/2025.06.17.660094

† These authors contributed equally to this work.