The goal of lemur is to simplify analysis the of
multi-condition single-cell data. If you have collected a single-cell
RNA-seq dataset with more than one condition, lemur
predicts for each cell and gene how much the expression would change if
the cell had been in the other condition. Furthermore,
lemur finds neighborhoods of cells that show consistent
differential expression. The results are statistically validated using a
pseudo-bulk differential expression test on hold-out data using glmGamPoi
or edgeR.
lemur implements a novel framework to disentangle the
effects of known covariates, latent cell states, and their interactions.
At the core, is a combination of matrix factorization and regression
analysis implemented as geodesic regression on Grassmann manifolds. We
call this latent embedding multivariate regression. For more
details see our preprint.
You can install lemur directly from Bioconductor
(available since version 3.18). Just paste the following snippet into
your R console:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("lemur")Alternatively, you can install the package from Github using
devtools:
lemur depends on recent features from glmGamPoi,
so make sure that packageVersion("glmGamPoi") is larger
than 1.12.0.
library("lemur")
library("SingleCellExperiment")
fit <- lemur(sce, design = ~ patient_id + condition, n_embedding = 15)
fit <- align_harmony(fit) # This step is optional
fit <- test_de(fit, contrast = cond(condition = "ctrl") - cond(condition = "panobinostat"))
nei <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))We will demonstrate lemur using a dataset published by
Zhao et
al. (2021). The data consist of tumor biopsies from five
glioblastomas which were treated with the drug panobinostat and with a
control. Accordingly, we will analyze ten samples (patient-treatment
combinations) using a paired experimental design.
We start by loading some required packages.
We use a reduced-size version of the glioblastoma data that ships
with the lemur package.
data(glioblastoma_example_data)
glioblastoma_example_data
#> class: SingleCellExperiment
#> dim: 300 5000
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468
#> ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):Initially, the data separates by the known covariates
patient_id and condition.
orig_umap <- uwot::umap(as.matrix(t(logcounts(glioblastoma_example_data))))
as_tibble(colData(glioblastoma_example_data)) %>%
mutate(umap = orig_umap) %>%
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = patient_id, shape = condition), size = 0.5) +
labs(title = "UMAP of logcounts")We fit the LEMUR model by calling lemur(). We provide
the experimental design using a formula. The elements of the formula can
refer to columns of the colData of the
SingleCellExperiment object.
We also set the number of latent dimensions
(n_embedding), which has a similar interpretation as the
number of dimensions in PCA.
The test_fraction argument sets the fraction of cells
which are exclusively used to test for differential expression and not
for inferring the LEMUR parameters. It balances the sensitivity to
detect subtle patterns in the latent space against the power to detect
differentially expressed genes.
fit <- lemur(glioblastoma_example_data, design = ~ patient_id + condition,
n_embedding = 15, test_fraction = 0.5)
#> Storing 50% of the data (2500 cells) as test data.
#> Regress out global effects using linear method.
#> Find base point for differential embedding
#> Fit differential embedding model
#> Initial error: 1.78e+06
#> ---Fit Grassmann linear model
#> Final error: 1.12e+06
fit
#> class: lemur_fit
#> dim: 300 5000
#> metadata(9): n_embedding design ... use_assay row_mask
#> assays(2): counts logcounts
#> rownames(300): ENSG00000210082 ENSG00000118785 ... ENSG00000167468
#> ENSG00000139289
#> rowData names(6): gene_id symbol ... strand. source
#> colnames(5000): CGCCAGAGCGCA AGCTTTACTGCG ... TGAACAGTGCGT TGACCGGAATGC
#> colData names(10): patient_id treatment_id ... sample_id id
#> reducedDimNames(2): linearFit embedding
#> mainExpName: NULL
#> altExpNames(0):The lemur() function returns a lemur_fit
object which extends SingleCellExperiment. It supports
subsetting and all the usual data accessor methods (e.g.,
nrow, assay, colData,
rowData). In addition, lemur overloads the
$ operator to allow easy access to additional fields
produced by the LEMUR model. For example, the low-dimensional embedding
can be accessed using fit$embedding:
Optionally, we can further align corresponding cells using manually
annotated cell types (align_by_grouping) or an automated
alignment procedure (e.g., align_harmony). This ensures
that corresponding cells are close to each other in the
fit$embedding.
I will make a UMAP of the fit$embedding. This is similar
to working on the integrated PCA space in a traditional single-cell
analysis.
umap <- uwot::umap(t(fit$embedding))
as_tibble(fit$colData) %>%
mutate(umap = umap) %>%
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = patient_id), size = 0.5) +
facet_wrap(vars(condition)) +
labs(title = "UMAP of latent space from LEMUR")Next, we will predict the effect of the panobinostat treatment for
each gene and cell. The test_de function takes a
lemur_fit object and returns the object with a new assay
"DE". This assay contains the predicted log fold change
between the conditions specified in contrast. Note that
lemur implements a special notation for contrasts. Instead
of providing a contrast vector or design matrix column names, you
provide for each condition the levels, and lemur
automatically forms the contrast vector. This makes the contrast more
readable.
We can pick any gene and show the differential expression pattern on the UMAP plot:
sel_gene <- "ENSG00000172020" # is GAP43
tibble(umap = umap) %>%
mutate(de = assay(fit, "DE")[sel_gene,]) %>%
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de)) +
scale_color_gradient2() +
labs(title = "Differential expression on UMAP plot")Alternatively, we can use the matrix of differential expression
values (assay(fit, "DE")) to guide the selection of cell
neighborhoods that show consistent differential expression.
find_de_neighborhoods validates the results with a
pseudobulked diferential expression test. For this it uses the
fit$test_data which was put aside in the first
lemur() call. In addition,
find_de_neighborhoods assess if the difference between the
conditions is significantly larger for the cells inside the neighborhood
than the cells outside the neighborhood (see columns starting with
did, short for difference-in-difference).
The group_by argument determines how the pseudobulk
samples are formed. It specifies the columns in the
fit$colData that are used to define a sample and is
inspired by the group_by function in dplyr.
Typically, you provide the covariates that were used for the
experimental design plus the sample id (in this case
patient_id).
neighborhoods <- find_de_neighborhoods(fit, group_by = vars(patient_id, condition))
#> Find optimal neighborhood using zscore.
#> Validate neighborhoods using test data
#> Form pseudobulk (summing counts)
#> Calculate size factors for each gene
#> Fit glmGamPoi model on pseudobulk data
#> Fit diff-in-diff effect
as_tibble(neighborhoods) %>%
left_join(as_tibble(rowData(fit)[,1:2]), by = c("name" = "gene_id")) %>%
relocate(symbol, .before = "name") %>%
arrange(pval) %>%
dplyr::select(symbol, neighborhood, name, n_cells, pval, adj_pval, lfc, did_lfc)
#> # A tibble: 300 × 8
#> symbol neighborhood name n_cells pval adj_pval lfc did_lfc
#> <chr> <I<list>> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 KNG1 <chr [4,005]> ENSG00000113889 4005 2.97e-5 0.00892 2.81 0.494
#> 2 POLR2L <chr [4,161]> ENSG00000177700 4161 1.31e-4 0.0139 1.41 -0.323
#> 3 PMP2 <chr [4,828]> ENSG00000147588 4828 1.61e-4 0.0139 -1.45 -1.17
#> 4 MT2A <chr [3,025]> ENSG00000125148 3025 2.12e-4 0.0139 1.81 -0.319
#> 5 NUCKS1 <chr [3,490]> ENSG00000069275 3490 2.32e-4 0.0139 -1.25 0.495
#> 6 DNAJB1 <chr [2,615]> ENSG00000132002 2615 3.43e-4 0.0172 0.848 -0.0641
#> 7 NEAT1 <chr [3,284]> ENSG00000245532 3284 5.75e-4 0.0246 1.85 -0.320
#> 8 SKP1 <chr [2,492]> ENSG00000113558 2492 7.73e-4 0.0251 0.869 -0.434
#> 9 MT1E <chr [3,223]> ENSG00000169715 3223 8.08e-4 0.0251 1.96 0.378
#> 10 A2M <chr [1,821]> ENSG00000175899 1821 8.37e-4 0.0251 -1.51 0.00693
#> # ℹ 290 more rowsTo continue, we investigate one gene for which the neighborhood shows
a significant differential expression pattern: here we choose a
CXCL8 (also known as interleukin 8), an important inflammation
signalling molecule. We see that it is upregulated by panobinostat in a
subset of cells (blue). We chose this gene because it (1) had a
significant change between panobinostat and negative control condition
(adj_pval column) and (2) showed much larger differential
expression for the cells inside the neighborhood than for the cells
outside (did_lfc column).
sel_gene <- "ENSG00000169429" # is CXCL8
tibble(umap = umap) %>%
mutate(de = assay(fit, "DE")[sel_gene,]) %>%
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de)) +
scale_color_gradient2() +
labs(title = "Differential expression on UMAP plot")To plot the boundaries of the differential expression neighborhood,
we create a helper dataframe and use the geom_density2d
function from ggplot2. To avoid the cutting of the boundary
to the extremes of the cell coordinates, add lims to the
plot with an appropriately large limit.
neighborhood_coordinates <- neighborhoods %>%
dplyr::filter(name == sel_gene) %>%
unnest(c(neighborhood)) %>%
dplyr::rename(cell_id = neighborhood) %>%
left_join(tibble(cell_id = rownames(umap), umap), by = "cell_id") %>%
dplyr::select(name, cell_id, umap)
tibble(umap = umap) %>%
mutate(de = assay(fit, "DE")[sel_gene,]) %>%
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de)) +
scale_color_gradient2() +
geom_density2d(data = neighborhood_coordinates, breaks = 0.5,
contour_var = "ndensity", color = "black") +
labs(title = "Differential expression with neighborhood boundary")To summarize the results, we make a volcano plot of the differential expression results to better understand the expression differences across all genes.
neighborhoods %>%
drop_na() %>%
ggplot(aes(x = lfc, y = -log10(pval))) +
geom_point(aes(color = adj_pval < 0.1)) +
labs(title = "Volcano plot of the neighborhoods")
neighborhoods %>%
drop_na() %>%
ggplot(aes(x = n_cells, y = -log10(pval))) +
geom_point(aes(color = adj_pval < 0.1)) +
labs(title = "Neighborhood size vs neighborhood significance")This analysis was conducted without using any cell type information. Often, additional cell type information is available or can be annotated manually. Here, we can for example distinguish the tumor cells from cells of the microenvironment, because the tumors had a chromosome 10 deletion and chromosome 7 duplication. We build a simple classifier to distinguish the cells accordingly. (This is just to illustrate the process; for a real analysis, we would use more sophisticated methods.)
tumor_label_df <- tibble(cell_id = colnames(fit),
chr7_total_expr = colMeans(logcounts(fit)[rowData(fit)$chromosome == "7",]),
chr10_total_expr = colMeans(logcounts(fit)[rowData(fit)$chromosome == "10",])) %>%
mutate(is_tumor = chr7_total_expr > 0.8 & chr10_total_expr < 2.5)
ggplot(tumor_label_df, aes(x = chr10_total_expr, y = chr7_total_expr)) +
geom_point(aes(color = is_tumor), size = 0.5) +
geom_hline(yintercept = 0.8) +
geom_vline(xintercept = 2.5) +
labs(title = "Simple gating strategy to find tumor cells")
tibble(umap = umap) %>%
mutate(is_tumor = tumor_label_df$is_tumor) %>%
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = is_tumor), size = 0.5) +
labs(title = "The tumor cells are enriched in parts of the big blob") +
facet_wrap(vars(is_tumor))We use the cell annotation, to focus our neighborhood finding on subpopulations of the tumor.
tumor_fit <- fit[, tumor_label_df$is_tumor]
tum_nei <- find_de_neighborhoods(tumor_fit, group_by = vars(patient_id, condition), verbose = FALSE)
as_tibble(tum_nei) %>%
left_join(as_tibble(rowData(fit)[,1:2]), by = c("name" = "gene_id")) %>%
dplyr::relocate(symbol, .before = "name") %>%
filter(adj_pval < 0.1) %>%
arrange(did_pval) %>%
dplyr::select(symbol, name, neighborhood, n_cells, adj_pval, lfc, did_pval, did_lfc) %>%
print(n = 10)
#> # A tibble: 31 × 8
#> symbol name neighborhood n_cells adj_pval lfc did_pval did_lfc
#> <chr> <chr> <I<list>> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 SERPINE1 ENSG00000106… <chr> 2838 0.0597 1.06 0.00627 -1.15
#> 2 CXCL8 ENSG00000169… <chr> 2187 0.0547 1.43 0.00757 -1.93
#> 3 CLU ENSG00000120… <chr> 2375 0.0793 0.808 0.0121 -1.18
#> 4 NCL ENSG00000115… <chr> 2513 0.0682 -0.538 0.0403 0.574
#> 5 MT2A ENSG00000125… <chr> 1812 0.00929 2.04 0.0506 -1.05
#> 6 TUBA1B ENSG00000123… <chr> 2899 0.0790 -0.529 0.0688 0.649
#> 7 NEAT1 ENSG00000245… <chr> 2247 0.0206 1.80 0.0785 -0.713
#> 8 TUBA1A ENSG00000167… <chr> 1521 0.0328 -0.809 0.108 0.290
#> 9 RPL10 ENSG00000147… <chr> 2831 0.0791 -0.456 0.154 0.319
#> 10 TNFRSF12A ENSG00000006… <chr> 1138 0.0328 -0.795 0.172 0.271
#> # ℹ 21 more rowsFocusing on RPS11, we see that panobinostat mostly has no effect on
its expression, except for a subpopulation of tumor cells where RPS11
was originally upregulated and panobinostat downregulates the
expression. A small caveat: this analysis is conducted on a subset of
all cells and should be interpreted carefully. Yet, this section
demonstrates how lemur can be used to find tumor
subpopulations which show differential responses to treatments.
sel_gene <- "ENSG00000142534" # is RPS11
as_tibble(colData(fit)) %>%
mutate(expr = assay(fit, "logcounts")[sel_gene,]) %>%
mutate(is_tumor = tumor_label_df$is_tumor) %>%
mutate(in_neighborhood = id %in% filter(tum_nei, name == sel_gene)$neighborhood[[1]]) %>%
ggplot(aes(x = condition, y = expr)) +
geom_jitter(size = 0.3, stroke = 0) +
geom_point(data = . %>% summarize(expr = mean(expr), .by = c(condition, patient_id, is_tumor, in_neighborhood)),
aes(color = patient_id), size = 2) +
stat_summary(fun.data = mean_se, geom = "crossbar", color = "red") +
facet_wrap(vars(is_tumor, in_neighborhood), labeller = label_both) lemur directly with the aligned data?No. You need to call lemur with the unaligned data so
that it can learn how much the expression of each gene changes between
conditions.
Yes. You can call lemur with any variance stabilized count matrix. Based on a previous project, I recommend to use log-transformation, but other methods will work just fine.
lemur()
than before. What is happening?!This is a known issue and can be caused if the data has large
compositional shifts (for example, if one cell type disappears). The
problem is that the initial linear regression step, which centers the
conditions relative to each other, overcorrects and introduces a
consistent shift in the latent space. You can either use
align_by_grouping / align_harmony to correct
for this effect or manually fix the regression coefficient to zero:
align_harmony /
align_neighbors. What should I do?You can try to increase n_embedding. If this still does
not help, there is little use in inferring differential expression
neighborhoods. But as I haven’t encountered such a dataset yet, I would
like to try it out myself. If you can share the data publicly, please
open an issue.
lemur faster?Several parameters influence the duration to fit the LEMUR model and find differentially expressed neighborhoods:
DelayedArray) either as a sparse dgCMatrix or dense
matrix.test_fraction means fewer cells are used to
fit the model (and more cells are used for the DE test), which speeds up
many steps.n_embedding reduces the latent dimensions of
the fit, which makes the model less flexible, but speeds up the
lemur() call.align_grouping is faster than
align_harmony.selection_procedure = "contrast" in
find_de_neighborhoods often produces better neighborhoods,
but is a lot slower than
selection_procedure = "zscore".size_factor_method = "ratio" in
find_de_neighborhoods makes the DE more powerful, but is a
lot slower than size_factor_method = "normed_sum".sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] lubridate_1.9.4 forcats_1.0.1
#> [3] stringr_1.6.0 dplyr_1.1.4
#> [5] purrr_1.2.1 readr_2.1.6
#> [7] tidyr_1.3.2 tibble_3.3.0
#> [9] ggplot2_4.0.1 tidyverse_2.0.0
#> [11] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
#> [13] Biobase_2.71.0 GenomicRanges_1.63.1
#> [15] Seqinfo_1.1.0 IRanges_2.45.0
#> [17] S4Vectors_0.49.0 BiocGenerics_0.57.0
#> [19] generics_0.1.4 MatrixGenerics_1.23.0
#> [21] matrixStats_1.5.0 lemur_1.9.0
#> [23] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 farver_2.1.2
#> [3] S7_0.2.1 fastmap_1.2.0
#> [5] digest_0.6.39 timechange_0.3.0
#> [7] lifecycle_1.0.5 magrittr_2.0.4
#> [9] compiler_4.5.2 rlang_1.1.7
#> [11] sass_0.4.10 tools_4.5.2
#> [13] utf8_1.2.6 yaml_2.3.12
#> [15] knitr_1.51 S4Arrays_1.11.1
#> [17] labeling_0.4.3 DelayedArray_0.37.0
#> [19] RColorBrewer_1.1-3 abind_1.4-8
#> [21] withr_3.0.2 sys_3.4.3
#> [23] grid_4.5.2 beachmat_2.27.1
#> [25] MASS_7.3-65 scales_1.4.0
#> [27] isoband_0.3.0 cli_3.6.5
#> [29] rmarkdown_2.30 RSpectra_0.16-2
#> [31] glmGamPoi_1.21.0 tzdb_0.5.0
#> [33] DelayedMatrixStats_1.33.0 cachem_1.1.0
#> [35] splines_4.5.2 BiocManager_1.30.27
#> [37] XVector_0.51.0 vctrs_0.6.5
#> [39] Matrix_1.7-4 jsonlite_2.0.0
#> [41] hms_1.1.4 irlba_2.3.5.1
#> [43] maketools_1.3.2 harmony_1.2.4
#> [45] jquerylib_0.1.4 glue_1.8.0
#> [47] codetools_0.2-20 cowplot_1.2.0
#> [49] uwot_0.2.4 RcppAnnoy_0.0.22
#> [51] stringi_1.8.7 gtable_0.3.6
#> [53] pillar_1.11.1 htmltools_0.5.9
#> [55] R6_2.6.1 sparseMatrixStats_1.23.0
#> [57] evaluate_1.0.5 lattice_0.22-7
#> [59] RhpcBLASctl_0.23-42 bslib_0.9.0
#> [61] Rcpp_1.1.0.8.2 SparseArray_1.11.10
#> [63] xfun_0.55 buildtools_1.0.0
#> [65] pkgconfig_2.0.3