LRDE 0.99.6
LRDE implements a Hurdle Negative Binomial (Hurdle-NB) generalized linear model (GLM) tailored for long-read RNA-seq data. Compared to short-read technologies, long-read sequencing often involves smaller sample sizes and substantial expression dropout (i.e., excess zeros), posing challenges for standard differential expression methods.
This package is designed to address these issues by combining a flexible two-part model with stabilized parameter estimation.
The core contribution of LRDE is the integration of a hurdle model with empirical Bayes (EB) shrinkage, enabling robust and stable inference across genes.
For each gene gene \(g\) and sample \(i\), the observed count \(y_{ig}\) is modeled using a two-component framework:
Zero component:
A Bernoulli distribution models the probability of observing a structural zero (dropout).
Count component:
Conditional on being non-zero, counts follow a zero-truncated Negative Binomial distribution, capturing overdispersed expression levels.
This formulation explicitly separates biological absence from technical dropout, improving interpretability and model fit.
Long-read RNA-seq data are often noisy, particularly for lowly expressed genes. To mitigate this, LRDE borrows strength across genes using an empirical Bayes strategy:
Binning:
Genes are grouped into bins according to similar mean expression levels.
Prior estimation:
Within each bin, empirical priors are estimated for:
Shrinkage:
Gene-specific (tag-wise) estimates are shrunk toward their bin-specific priors, resulting in more stable dispersion estimates and improved statistical power.
LRDE provides two complementary approaches for differential expression analysis:
Likelihood Ratio Test (hurdle_LRT())
Recommended for general hypothesis testing and comparison of nested models.
Wald Test (hurdle_Wald_Test())
A computationally efficient alternative for testing individual coefficients.
The main object returned by LRDE is a list containing the following components:
counts: normalized count matrixgroup: sample group labelssize.factors: normalization factorstagwise.disp: gene-wise dispersion estimateszero_prob_matrix: estimated zero probabilities for each grouplrt_stats: likelihood ratio test statisticswald_stats: Wald test statisticsp.values: p-values for differential expressionThese components can be accessed directly using standard list indexing.
LRDE also supports input from SummarizedExperiment objects, enabling integration with standard Bioconductor workflows.
This section demonstrates a minimal end-to-end workflow for differential expression analysis using LRDE.
# Load the package
suppressPackageStartupMessages(library(LRDE))
set.seed(123)
# 1. Simulate a count matrix (Negative Binomial)
mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50)
# Define sample groups
grp <- factor(c("A", "A", "A", "B", "B", "B"))
# 2. Prepare the data object
y <- prepareDGE(mat, grp)
# 3. Estimate size factors for normalization
y <- sizeFactorsEst(y)
# 4. Estimate tag-wise dispersions
y <- tagwiseEst(y)
# 5a. Differential expression testing: Wald test
y <- hurdle_Wald_Test(y)
# 5b. Differential expression testing: Likelihood Ratio Test
y <- hurdle_LRT(y)
# 6. Inspect results
head(y$lrt_stats) # LRT statistics per gene
#> [1] 3.454142872 0.899998178 0.041541545 0.261982772 6.372198462 0.005056271
head(y$wald_stats) # Wald statistics per gene
#> [1] -1.71589930 0.94594439 0.20390895 -0.51176389 -2.26213529 0.07112874
head(y$p.values) # p-values
#> [1] 0.06309344 0.34278220 0.83849617 0.60876122 0.01159219 0.94331223
head(y$tagwise.disp) # estimated dispersions
#> [1] 0.02735738 0.04630069 0.03428276 0.44069007 0.05461926 0.03239385
str(y, max.level = 1) # Data structure visualization
#> List of 8
#> $ counts : int [1:50, 1:6] 6 8 7 10 8 6 8 13 3 10 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ samples :'data.frame': 6 obs. of 3 variables:
#> $ baseMean : num [1:50] 3.81 7.36 6.62 7.19 4.88 ...
#> $ tagwise.disp : num [1:50] 0.0274 0.0463 0.0343 0.4407 0.0546 ...
#> $ zero_prob_matrix: num [1:50, 1:2] 0.0196 0.0215 0.0215 0.0215 0.0215 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ wald_stats : num [1:50] -1.716 0.946 0.204 -0.512 -2.262 ...
#> $ p.values : num [1:50] 0.0631 0.3428 0.8385 0.6088 0.0116 ...
#> $ lrt_stats : num [1:50] 3.4541 0.9 0.0415 0.262 6.3722 ...
LRDE returns gene-level statistics for differential expression analysis, including test statistics and p-values.
Genes with small p-values provide evidence against the null hypothesis of no differential expression between groups.
Since thousands of genes are typically tested simultaneously, it is important to control for multiple comparisons. A common approach is to control the false discovery rate (FDR) using the Benjamini–Hochberg method:
padj <- p.adjust(y$p.values, method = "BH")
head(padj)
#> [1] 0.3943340 0.6801222 0.9691057 0.8226503 0.2898047 0.9691057
Genes with adjusted p-values below a chosen threshold (e.g., FDR < 0.05) are typically considered significantly differentially expressed.
In addition to standard matrix and data.frame, prepareDGE supports SummarizedExperiment objects, allowing LRDE to integrate seamlessly into existing Bioconductor workflows.
# 1. Create a SummarizedExperiment object
suppressPackageStartupMessages(library(SummarizedExperiment))
set.seed(123)
# Simulate data
counts_mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50)
colnames(counts_mat) <- paste0("Sample", 1:6)
group_info <- factor(c("A", "A", "A", "B", "B", "B"))
# Construct SE with colData
se <- SummarizedExperiment(
assays = list(counts = counts_mat),
colData = data.frame(condition = group_info)
)
# 2. Start the LRDE pipeline using the SE object
# Extract the condition directly from colData
y_se <- prepareDGE(se, group = colData(se)$condition)
# 3. Standard downstream workflow
y_se <- sizeFactorsEst(y_se)
y_se <- tagwiseEst(y_se)
y_se <- hurdle_LRT(y_se)
# View results
head(y_se$lrt_stats)
#> [1] 3.454142872 0.899998178 0.041541545 0.261982772 6.372198462 0.005056271
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.41.1 Biobase_2.71.0
#> [3] GenomicRanges_1.63.2 Seqinfo_1.1.0
#> [5] IRanges_2.45.0 S4Vectors_0.49.1
#> [7] BiocGenerics_0.57.0 generics_0.1.4
#> [9] MatrixGenerics_1.23.0 matrixStats_1.5.0
#> [11] LRDE_0.99.6 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.6 knitr_1.51 rlang_1.2.0
#> [4] xfun_0.57 otel_0.2.0 DelayedArray_0.37.1
#> [7] jsonlite_2.0.0 htmltools_0.5.9 sass_0.4.10
#> [10] rmarkdown_2.31 grid_4.6.0 abind_1.4-8
#> [13] evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [16] yaml_2.3.12 lifecycle_1.0.5 bookdown_0.46
#> [19] BiocManager_1.30.27 compiler_4.6.0 XVector_0.51.0
#> [22] lattice_0.22-9 digest_0.6.39 R6_2.6.1
#> [25] SparseArray_1.11.13 Matrix_1.7-5 bslib_0.10.0
#> [28] tools_4.6.0 S4Arrays_1.11.1 cachem_1.1.0