1 Introduction

LRDE implements a Hurdle Negative Binomial (Hurdle-NB) generalized linear model (GLM) tailored for long-read RNA-seq data. Compared to short-read technologies, long-read sequencing often involves smaller sample sizes and substantial expression dropout (i.e., excess zeros), posing challenges for standard differential expression methods.

This package is designed to address these issues by combining a flexible two-part model with stabilized parameter estimation.

2 Statistical Methodology

The core contribution of LRDE is the integration of a hurdle model with empirical Bayes (EB) shrinkage, enabling robust and stable inference across genes.

2.1 1. Hurdle Model

For each gene gene \(g\) and sample \(i\), the observed count \(y_{ig}\) is modeled using a two-component framework:

Zero component:
A Bernoulli distribution models the probability of observing a structural zero (dropout).
Count component:
Conditional on being non-zero, counts follow a zero-truncated Negative Binomial distribution, capturing overdispersed expression levels.

This formulation explicitly separates biological absence from technical dropout, improving interpretability and model fit.

2.2 2. Information Borrowing via Empirical Bayes

Long-read RNA-seq data are often noisy, particularly for lowly expressed genes. To mitigate this, LRDE borrows strength across genes using an empirical Bayes strategy:

Binning:
Genes are grouped into bins according to similar mean expression levels.
Prior estimation:
Within each bin, empirical priors are estimated for:
- the dispersion parameter (\(\phi\))
- the zero-component probability (treated as fixed during tagwise estimation and differential testing)
Shrinkage:
Gene-specific (tag-wise) estimates are shrunk toward their bin-specific priors, resulting in more stable dispersion estimates and improved statistical power.

2.3 3. Differential Expression Testing

LRDE provides two complementary approaches for differential expression analysis:

Likelihood Ratio Test (hurdle_LRT())
Recommended for general hypothesis testing and comparison of nested models.
Wald Test (hurdle_Wald_Test())
A computationally efficient alternative for testing individual coefficients.

3 Data Structure

The main object returned by LRDE is a list containing the following components:

counts: normalized count matrix
group: sample group labels
size.factors: normalization factors
tagwise.disp: gene-wise dispersion estimates
zero_prob_matrix: estimated zero probabilities for each group
lrt_stats: likelihood ratio test statistics
wald_stats: Wald test statistics
p.values: p-values for differential expression

These components can be accessed directly using standard list indexing. LRDE also supports input from SummarizedExperiment objects, enabling integration with standard Bioconductor workflows.

4 Example Workflow

This section demonstrates a minimal end-to-end workflow for differential expression analysis using LRDE.

# Load the package
suppressPackageStartupMessages(library(LRDE))
set.seed(123)

# 1. Simulate a count matrix (Negative Binomial)
mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50)

# Define sample groups
grp <- factor(c("A", "A", "A", "B", "B", "B"))

# 2. Prepare the data object
y <- prepareDGE(mat, grp)

# 3. Estimate size factors for normalization
y <- sizeFactorsEst(y)

# 4. Estimate tag-wise dispersions
y <- tagwiseEst(y)

# 5a. Differential expression testing: Wald test
y <- hurdle_Wald_Test(y)

# 5b. Differential expression testing: Likelihood Ratio Test
y <- hurdle_LRT(y)

# 6. Inspect results
head(y$lrt_stats)        # LRT statistics per gene
#> [1] 3.454142872 0.899998178 0.041541545 0.261982772 6.372198462 0.005056271
head(y$wald_stats)       # Wald statistics per gene
#> [1] -1.71589930  0.94594439  0.20390895 -0.51176389 -2.26213529  0.07112874
head(y$p.values)         # p-values
#> [1] 0.06309344 0.34278220 0.83849617 0.60876122 0.01159219 0.94331223
head(y$tagwise.disp)     # estimated dispersions
#> [1] 0.02735738 0.04630069 0.03428276 0.44069007 0.05461926 0.03239385
str(y, max.level = 1)    # Data structure visualization
#> List of 8
#>  $ counts          : int [1:50, 1:6] 6 8 7 10 8 6 8 13 3 10 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ samples         :'data.frame':    6 obs. of  3 variables:
#>  $ baseMean        : num [1:50] 3.81 7.36 6.62 7.19 4.88 ...
#>  $ tagwise.disp    : num [1:50] 0.0274 0.0463 0.0343 0.4407 0.0546 ...
#>  $ zero_prob_matrix: num [1:50, 1:2] 0.0196 0.0215 0.0215 0.0215 0.0215 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ wald_stats      : num [1:50] -1.716 0.946 0.204 -0.512 -2.262 ...
#>  $ p.values        : num [1:50] 0.0631 0.3428 0.8385 0.6088 0.0116 ...
#>  $ lrt_stats       : num [1:50] 3.4541 0.9 0.0415 0.262 6.3722 ...

5 Interpreting Results

LRDE returns gene-level statistics for differential expression analysis, including test statistics and p-values.

Genes with small p-values provide evidence against the null hypothesis of no differential expression between groups.

5.1 Multiple Testing Correction

Since thousands of genes are typically tested simultaneously, it is important to control for multiple comparisons. A common approach is to control the false discovery rate (FDR) using the Benjamini–Hochberg method:

padj <- p.adjust(y$p.values, method = "BH")
head(padj)
#> [1] 0.3943340 0.6801222 0.9691057 0.8226503 0.2898047 0.9691057

Genes with adjusted p-values below a chosen threshold (e.g., FDR < 0.05) are typically considered significantly differentially expressed.

6 Supported Input Format

In addition to standard matrix and data.frame, prepareDGE supports SummarizedExperiment objects, allowing LRDE to integrate seamlessly into existing Bioconductor workflows.

# 1. Create a SummarizedExperiment object
suppressPackageStartupMessages(library(SummarizedExperiment))
set.seed(123)

# Simulate data
counts_mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50)
colnames(counts_mat) <- paste0("Sample", 1:6)
group_info <- factor(c("A", "A", "A", "B", "B", "B"))

# Construct SE with colData
se <- SummarizedExperiment(
    assays = list(counts = counts_mat),
    colData = data.frame(condition = group_info)
)

# 2. Start the LRDE pipeline using the SE object
# Extract the condition directly from colData
y_se <- prepareDGE(se, group = colData(se)$condition)

# 3. Standard downstream workflow
y_se <- sizeFactorsEst(y_se)
y_se <- tagwiseEst(y_se)
y_se <- hurdle_LRT(y_se)

# View results
head(y_se$lrt_stats)
#> [1] 3.454142872 0.899998178 0.041541545 0.261982772 6.372198462 0.005056271

7 Session Information

#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.41.1 Biobase_2.71.0             
#>  [3] GenomicRanges_1.63.2        Seqinfo_1.1.0              
#>  [5] IRanges_2.45.0              S4Vectors_0.49.1           
#>  [7] BiocGenerics_0.57.0         generics_0.1.4             
#>  [9] MatrixGenerics_1.23.0       matrixStats_1.5.0          
#> [11] LRDE_0.99.6                 BiocStyle_2.39.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.6           knitr_1.51          rlang_1.2.0        
#>  [4] xfun_0.57           otel_0.2.0          DelayedArray_0.37.1
#>  [7] jsonlite_2.0.0      htmltools_0.5.9     sass_0.4.10        
#> [10] rmarkdown_2.31      grid_4.6.0          abind_1.4-8        
#> [13] evaluate_1.0.5      jquerylib_0.1.4     fastmap_1.2.0      
#> [16] yaml_2.3.12         lifecycle_1.0.5     bookdown_0.46      
#> [19] BiocManager_1.30.27 compiler_4.6.0      XVector_0.51.0     
#> [22] lattice_0.22-9      digest_0.6.39       R6_2.6.1           
#> [25] SparseArray_1.11.13 Matrix_1.7-5        bslib_0.10.0       
#> [28] tools_4.6.0         S4Arrays_1.11.1     cachem_1.1.0

Differential Expression Analysis with Long Read RNA-Seq Data

10 April 2026

Package