DOTSeq is an R package for detecting differential
translation of open reading frames (ORFs) from ribosome profiling
(Ribo-seq)1,2 and matched
RNA-seq datasets3. Translation
is a tightly regulated process, where multiple ORFs within the same gene
can modulate protein synthesis.
While useful, traditional gene-level approaches (e.g., annota2seq
and terapadog
available at Bioconductor4,5)
may overlook these regulatory layers. DOTSeq addresses this
limitation by modelling translation at the ORF level.
DOTSeq provides two complementary modules:
Differential ORF Usage (DOU): Detects cis-translational regulation by testing whether the relative usage of ORFs within a gene changes across biological conditions. Implemented via a beta-binomial generalised linear model (GLM) (glmmTMB)6.
Differential Translation Efficiency (DTE): Tests for changes in ribosome occupancy relative to RNA abundance across conditions using a negative-binomial GLM (DESeq2)7.
Together, DOU and DTE modules allow global identification of translation-specific effects, revealing regulatory events such as uORF-mediated repression.
We start by loading the DOTSeq package and its
dependencies. DOTSeq builds on the Bioconductor ecosystem,
e.g., rtracklayer
and txdbmaker
for parsing GTF files, and the SummarizedExperiment
container for storing count matrices and metadata8–11.
We demonstrate DOTSeq functionalities using a HeLa cell
cycle dataset12. This dataset
includes the snapshots of translation landscapes in HeLa cells
synchronised at Mitotic Cycling, Mitotic Arrest, and Interphase. Each
condition contains two biological replicates.
Here, we use only a small portion of the dataset to speed up the vignette execution while still demonstrating all key functionalities.
## [1] "featureCounts.cell_cycle_subset.txt.gz"
## [2] "gencode.v47.orf_flattened_subset.bed.gz"
## [3] "gencode.v47.orf_flattened_subset.gtf.gz"
## [4] "metadata.txt.gz"
Ribo-seq and RNA-seq reads are preprocessed using standard pipelines. Reads were adapter-trimmed using Cutadapt13, aligned to the reference genome using STAR14, and summarised at the ORF level using featureCounts15.
Each row of the input count matrix corresponds to an ORF, while columns contain read counts for individual samples. The first six columns represent ORF-level annotation fields such as gene ID, transcript ID, chromosome, and coordinates. This ensures accurate mapping between read counts and annotated ORFs, which is critical for our analysis.
cnt <- read.table(
file.path(dir, "featureCounts.cell_cycle_subset.txt.gz"),
header = TRUE,
comment.char = "#"
)
names(cnt) <- gsub(".*(SRR[0-9]+).*", "\\1", names(cnt))
head(cnt)## Geneid Chr Start End Strand Length SRR24230465 SRR24230467
## 1 ENSG00000003056.8 chr12 8940362 8940426 - 65 0 1
## 2 ENSG00000003056.8 chr12 8940428 8940634 - 207 0 1
## 3 ENSG00000003056.8 chr12 8940636 8940698 - 63 0 0
## 4 ENSG00000003056.8 chr12 8941261 8941269 - 9 0 0
## 5 ENSG00000003056.8 chr12 8941316 8941360 - 45 0 0
## 6 ENSG00000003056.8 chr12 8941369 8941392 - 24 0 0
## SRR24230469 SRR24230471 SRR24230480 SRR24230482 SRR24230462 SRR24230466
## 1 0 0 0 0 4 15
## 2 0 0 0 0 2 1
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 1 2
## 6 0 0 0 0 5 5
## SRR24230472 SRR24230474 SRR24230477 SRR24230479
## 1 7 6 5 40
## 2 0 0 3 10
## 3 0 0 0 0
## 4 1 1 0 0
## 5 2 0 1 4
## 6 2 7 4 30
Flattened ORF-level annotations were generated using the
orf_to_gtf.py wrapper for RIBOSS16. Alternatively, annotations can be
prepared within R using getORFs() (see Prepare annotations). This pre-processing step
ensures that each ORF occupies a distinct, non-overlapping genomic
interval, avoiding ambiguity when assigning reads. This design follows
the same logic as DEXSeq’s exon flattening, but at the
level of ORFs rather than exons17.
The condition table defines the experimental design for both Ribo-seq
and RNA-seq samples. It allows DOTSeq to match Ribo/RNA-seq
pairs and to model translation-specific interactions.
Each row corresponds to one sequencing run. The table must include four essential columns: - run: unique identifier for each sequencing run
strategy: either "ribo" or "rna" (or
"0" or "1"), indicating the sequencing
strategy
replicate: biological replicate identifier
condition: experimental condition (e.g., control, treatment)
Additional columns (e.g., batch) can also be included. In the example below, we load the metadata and restrict the analysis to samples pre-treated with cycloheximide (chx) prior to library preparation:
cond <- read.table(file.path(dir, "metadata.txt.gz"))
# Ensure required column names for DOTSeq
names(cond) <- c("run", "strategy", "replicate", "treatment", "condition")
# Filter to include only chx-treated samples
cond <- cond[cond$treatment == "chx", ]
head(cond)## run strategy replicate treatment condition
## 1 SRR24230462 rna 2 chx Mitotic_Cycling
## 4 SRR24230465 ribo 2 chx Mitotic_Cycling
## 5 SRR24230466 rna 1 chx Mitotic_Cycling
## 6 SRR24230467 ribo 1 chx Mitotic_Cycling
## 8 SRR24230469 ribo 2 chx Mitotic_Arrest
## 10 SRR24230471 ribo 1 chx Mitotic_Arrest
The main function DOTSeq() performs the differential
analysis workflow, starting from parsing input files to post hoc
inference. It sequentially runs three key steps:
Construction of two SummarizedExperiment objects
(DOTSeqDataSet()).
model fitting for DOU via beta-binomial GLM
(fitDOU() using glmmTMB) and DTE via
negative-binomial GLM (DESeq2).
Post hoc inference using emmeans18 for contrasts and
ashr19 for
adaptive shrinkage of effect sizes (testDOU()).
While each step can be run individually, we construct
DOTSeqDataSets with
DOTSeqDataSetsFromFeatureCounts(), followed by the
DOTSeq() main function here for simplicity. To capture the
translation-specific interaction effect (\(\beta_{\text{int}}\)),
DOTSeq() uses formula = ~ condition * strategy
with dispersion_modeling = "auto" by default. The design
and dispersion formulas can be customised to include batch and random
effects.
# Create a DOTSeqDataSets object
datasets <- DOTSeqDataSetsFromFeatureCounts(
count_table = cnt,
condition_table = cond,
flattened_gtf = gtf,
flattened_bed = bed,
verbose = FALSE
)
# Run the DOTSeq workflow
d <- DOTSeq(datasets = datasets)## starting Differential ORF Usage (DOU) analysis
## starting post hoc analysis
## effect sizes for interaction_specific contrasts calculated and shrunk succesfully
## effect sizes for strategy_specific contrasts calculated and shrunk successfully
## DOU runtime: 2 mins 38.614 secs
## DOU model fitting summary: Mitotic_Arrest - Interphase
## models fitted: 701
## non-convergence or invalid Hessian: 1428
## not fitted (filtered out): 4513
## DOU model fitting summary: Mitotic_Cycling - Interphase
## models fitted: 701
## non-convergence or invalid Hessian: 1428
## not fitted (filtered out): 4513
## DOU model fitting summary: Mitotic_Cycling - Mitotic_Arrest
## models fitted: 701
## non-convergence or invalid Hessian: 1428
## not fitted (filtered out): 4513
## starting Differential Translation Efficiency (DTE) analysis
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## starting post hoc analysis
## DTE runtime: 4.546 secs
## DTE model fitting summary: Mitotic_Cycling - Mitotic_Arrest
## models fitted: 1137
## padj = NA (filtered or flagged): 5505
## DTE model fitting summary: Mitotic_Arrest - Interphase
## models fitted: 1941
## padj = NA (filtered or flagged): 4701
## DTE model fitting summary: Mitotic_Cycling - Interphase
## models fitted: 1712
## padj = NA (filtered or flagged): 4930
In this case Interphase was automatically assigned as the baseline.
The target and baseline conditions can explicitly defined as
target = "Mitotic_Cycling" and
baseline = "Interphase".
The output is a DOTSeqDataSets object that encapsulates
the post hoc contrast results for DOU and DTE analyses.
## DOTSeqDataSets
## DOU: DOUData
## DTE: DTEData
## Use getDOU() or getDTE() to access contents.
## Use getContrasts() if post hoc inference has been performed.
As indicated, we will use getContrasts() to extract the
results.
## $DOU
## DataFrame with 2103 rows and 10 columns
## betahat sebetahat waldpval waldpadj posterior qvalue
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 -0.315677 0.291932 2.79546e-01 0.342101731 -0.0465988 0.224022321
## 2 1.868553 0.457288 4.38572e-05 0.000106229 1.8372843 0.000122752
## 3 0.896892 0.305651 3.34228e-03 0.005848987 0.7912075 0.010162569
## 4 -1.420373 0.538238 8.31677e-03 0.013782688 -1.2013225 0.012757058
## 5 -1.303596 0.440822 3.10453e-03 0.005460222 -1.1988370 0.006550372
## ... ... ... ... ... ... ...
## 2099 -21.21977 1.86395e+04 0.99909167 0.9999930 -9.06581e-09 0.4952516
## 2100 -1.38197 9.02273e-01 0.12560757 0.3051780 -1.90961e-01 0.3711788
## 2101 2.09437 7.46692e-01 0.00503382 0.0301168 8.35207e-01 0.0842961
## 2102 -1.09952 4.58713e-01 0.01653135 0.0761715 -5.25870e-01 0.1282433
## 2103 1.04066 3.59667e-01 0.00381092 0.0249313 7.78644e-01 0.0490608
## lfdr lfsr orf_id contrast
## <numeric> <numeric> <character> <Rle>
## 1 0.85238475 0.87301738 ENSG00000003056.8:O008 Mitotic_Arrest - Int..
## 2 0.00160708 0.00162958 ENSG00000003056.8:O013 Mitotic_Arrest - Int..
## 3 0.11783390 0.11930812 ENSG00000004478.8:O001 Mitotic_Arrest - Int..
## 4 0.14819051 0.15175690 ENSG00000004478.8:O002 Mitotic_Arrest - Int..
## 5 0.07973440 0.08116372 ENSG00000004478.8:O005 Mitotic_Arrest - Int..
## ... ... ... ... ...
## 2099 0.832511 0.916255 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2100 0.784519 0.805045 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2101 0.357765 0.362696 ENSG00000133872.14:O.. Mitotic_Cycling - Mi..
## 2102 0.484903 0.489571 ENSG00000134153.10:O.. Mitotic_Cycling - Mi..
## 2103 0.233079 0.234587 ENSG00000134153.10:O.. Mitotic_Cycling - Mi..
##
## $DTE
## log2 fold change (MMSE): 0,0,0,0,-1,+1
## Wald test p-value: 0,0,0,0,-1,+1
## DataFrame with 19926 rows and 7 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 6.612288 0.0517657 0.782074 0.819230 NA
## 2 1.566572 -0.0695095 0.829717 0.638754 NA
## 3 0.000000 0.0000000 0.854322 NA NA
## 4 0.224656 0.0000000 0.843721 1.000000 NA
## 5 0.967834 0.0000000 0.833729 1.000000 NA
## ... ... ... ... ... ...
## 19922 11.4769 -0.19809388 0.891331 7.36964e-01 8.91883e-01
## 19923 10.0378 0.00327321 1.050203 9.94842e-01 9.95424e-01
## 19924 30.6207 -1.41437812 0.723772 1.37107e-02 4.97302e-02
## 19925 10.2243 -0.13656856 1.028994 7.95281e-01 9.17725e-01
## 19926 1918.8799 1.42139618 0.194002 1.07081e-13 2.90989e-12
## orf_id contrast
## <character> <character>
## 1 ENSG00000003056.8:O001 Mitotic_Cycling - Mi..
## 2 ENSG00000003056.8:O002 Mitotic_Cycling - Mi..
## 3 ENSG00000003056.8:O003 Mitotic_Cycling - Mi..
## 4 ENSG00000003056.8:O004 Mitotic_Cycling - Mi..
## 5 ENSG00000003056.8:O005 Mitotic_Cycling - Mi..
## ... ... ...
## 19922 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19923 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19924 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19925 ENSG00000134153.10:O.. Mitotic_Cycling - In..
## 19926 ENSG00000134153.10:O.. Mitotic_Cycling - In..
These data frames summarise differential translation results for each ORF. Key statistics include:
posterior: The shrunken log-odds fold change in ORF
usage (DOU), reflecting the estimated effect size after empirical Bayes
shrinkage.
lfsr (Local False Sign Rate): The probability that
the estimated effect has the wrong sign, providing a measure of
directional uncertainty.
log2FoldChange: The shrunken log2 fold change in ORF
translation efficiency (DTE), indicating relative changes in ribosome
loading.
padj: The adjusted p-value for DTE, accounting for
multiple testing.
The DTE effect size (log2FoldChange) was derived from
DESeq2::lfcShrink() within the DOTSeq() main
function. We specify type = ashr to apply adaptive
shrinkage and compute all pairwise contrasts between conditions.
Additional statistics for DOU results include:
betahat: The raw (unshrunken) effect size
estimate.
sebetahat: The standard error of the raw
estimate.
qvalue and lfdr (Local False Discovery
Rate): Alternative measures of significance and false discovery, also
computed by ashr.
waldpval and waldpadj: Wald test
results are included for completeness, providing frequentist estimates
of significance.
We can also inspect the rowData and
rowRanges of the SummarizedExperiment objects
to explore ORF level annotation. This includes mappings from gene to
transcript to ORF IDs, as well as orf_type, which currently
supports uORF, mORF, and dORF. Filtering criteria
count_filter and singlet_filter determine
whether an ORF is retained (is_kept) for model fitting.
DOUResults stores the model fit results and the
emmeans objects for post hoc contrasts.
## DataFrame with 2129 rows and 7 columns
## gene_id orf_number orf_type count_filter
## <character> <character> <character> <logical>
## ENSG00000003056.8:O007 ENSG00000003056.8 007 dORF TRUE
## ENSG00000003056.8:O008 ENSG00000003056.8 008 dORF TRUE
## ENSG00000003056.8:O009 ENSG00000003056.8 009 mORF TRUE
## ENSG00000003056.8:O013 ENSG00000003056.8 013 uORF TRUE
## ENSG00000004478.8:O001 ENSG00000004478.8 001 mORF TRUE
## ... ... ... ... ...
## ENSG00000134153.10:O001 ENSG00000134153.10 001 dORF TRUE
## ENSG00000134153.10:O002 ENSG00000134153.10 002 dORF TRUE
## ENSG00000134153.10:O003 ENSG00000134153.10 003 dORF TRUE
## ENSG00000134153.10:O004 ENSG00000134153.10 004 dORF TRUE
## ENSG00000134153.10:O005 ENSG00000134153.10 005 mORF TRUE
## singlet_filter is_kept DOUResults
## <logical> <logical> <list>
## ENSG00000003056.8:O007 TRUE TRUE <PostHoc>
## ENSG00000003056.8:O008 TRUE TRUE <PostHoc>
## ENSG00000003056.8:O009 TRUE TRUE <PostHoc>
## ENSG00000003056.8:O013 TRUE TRUE <PostHoc>
## ENSG00000004478.8:O001 TRUE TRUE <PostHoc>
## ... ... ... ...
## ENSG00000134153.10:O001 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O002 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O003 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O004 TRUE TRUE <PostHoc>
## ENSG00000134153.10:O005 TRUE TRUE <PostHoc>
The rowRanges slot retains the genomic coordinates
corresponding to ORFs, extending from the translational start (AUG) to
stop codons. These ranges may include intronic regions, depending on the
gene structures.
## GRanges object with 2129 ranges and 7 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSG00000003056.8:O007 chr12 8941510-8941518 - |
## ENSG00000003056.8:O008 chr12 8941590-8941805 - |
## ENSG00000003056.8:O009 chr12 8941818-8946404 - |
## ENSG00000003056.8:O013 chr12 8949621-8949713 - |
## ENSG00000004478.8:O001 chr12 2795140-2803258 + |
## ... ... ... ... .
## ENSG00000134153.10:O001 chr15 34084030-34084071 - |
## ENSG00000134153.10:O002 chr15 34084109-34084123 - |
## ENSG00000134153.10:O003 chr15 34084135-34084188 - |
## ENSG00000134153.10:O004 chr15 34084192-34084206 - |
## ENSG00000134153.10:O005 chr15 34084334-34101839 - |
## gene_id orf_number orf_type
## <character> <character> <character>
## ENSG00000003056.8:O007 ENSG00000003056.8 007 dORF
## ENSG00000003056.8:O008 ENSG00000003056.8 008 dORF
## ENSG00000003056.8:O009 ENSG00000003056.8 009 mORF
## ENSG00000003056.8:O013 ENSG00000003056.8 013 uORF
## ENSG00000004478.8:O001 ENSG00000004478.8 001 mORF
## ... ... ... ...
## ENSG00000134153.10:O001 ENSG00000134153.10 001 dORF
## ENSG00000134153.10:O002 ENSG00000134153.10 002 dORF
## ENSG00000134153.10:O003 ENSG00000134153.10 003 dORF
## ENSG00000134153.10:O004 ENSG00000134153.10 004 dORF
## ENSG00000134153.10:O005 ENSG00000134153.10 005 mORF
## count_filter singlet_filter is_kept DOUResults
## <logical> <logical> <logical> <list>
## ENSG00000003056.8:O007 TRUE TRUE TRUE <PostHoc>
## ENSG00000003056.8:O008 TRUE TRUE TRUE <PostHoc>
## ENSG00000003056.8:O009 TRUE TRUE TRUE <PostHoc>
## ENSG00000003056.8:O013 TRUE TRUE TRUE <PostHoc>
## ENSG00000004478.8:O001 TRUE TRUE TRUE <PostHoc>
## ... ... ... ... ...
## ENSG00000134153.10:O001 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O002 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O003 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O004 TRUE TRUE TRUE <PostHoc>
## ENSG00000134153.10:O005 TRUE TRUE TRUE <PostHoc>
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
DOTSeq provides the plotDOT() function for
visualising DOU and DTE results. The following plots can be
generated:
ou <- results$DOU[results$DOU$contrast == "Mitotic_Cycling - Interphase", ]
te <- results$DTE[results$DTE$contrast == "Mitotic_Cycling - Interphase", ]
results <- merge(ou, te, by = c("orf_id", "contrast"), all = TRUE)To explore the relationship between DOU and DTE, we can visualise the overlap between significant ORFs identified by each module. ORFs that are significant in DOU and DTE analyses suggest changes in both usage and translation efficiency, potentially indicating strong translational control.
## Venn diagram plotted
To compare DOU and DTE, we generate a composite scatter plot of their respective effect sizes. Marginal density plots along each axis illustrate the distribution of effect sizes for each metric. Spearman correlation is calculated and reported.
For this scatter plot, points are colour-coded by
significance, allowing us to identify ORFs with strong
effects in one analysis but weak in the other. This visualisation helps
highlight patterns of differential translation across ORFs.
plotDOT(
plot_type = "composite",
results = results,
plot_params = list(color_by = "significance", legend_position = "bottomright"),
force_new_device = FALSE
)## Spearman correlation between DOU and DTE estimates: 0.823, p-value: <2e-16
## composite plot colored by significance plotted
If DOUData is given as an input, points can also be
colour-coded by orf_type. This enables quick identification
of patterns in differential translation across different ORF types.
plotDOT(
plot_type = "composite",
results = results,
data = getDOU(d),
plot_params = list(color_by = "orf_type", legend_position = "bottomright"),
force_new_device = FALSE
)## Spearman correlation between DOU and DTE estimates: 0.823, p-value: <2e-16
## composite plot colored by orf_type plotted
To highlight ORFs with strong differential usage, we construct a volcano plot comparing significance versus DOU effect size.
By default, gene identifiers are mapped from Ensembl gene IDs to HGNC
gene symbols using biomaRt This mapping improves plot
readability by displaying gene symbols. The resulting ID mapping
dataframe can be stored as a dataframe and reused across plots to avoid
repeated downloads.
mapping <- plotDOT(
plot_type = "volcano",
results = results,
id_mapping = TRUE,
plot_params = list(color_by = "significance", top_hits = 3, legend_position = "topright"),
force_new_device = FALSE
)## retrieving gene annotation from BioMart
## retrieved 275 gene symbols
## capping volcano plot y-axis at -log10(lfsr) = 20
## volcano plot colored by significance plotted
Similarly, volcano plots can be colour-coded by significance or ORF types, depending on whether rowdata is given as an input.
plotDOT(
plot_type = "volcano",
results = results,
data = getDOU(d),
id_mapping = mapping,
plot_params = list(color_by = "orf_type", top_hits = 3, legend_position = "topright"),
force_new_device = FALSE
)## capping volcano plot y-axis at -log10(lfsr) = 20
## volcano plot colored by orf_type plotted
To visualise the top-ranked genes, we generate a heatmap based on DOU metrics for the top 20 uORF-regulated genes.
plotDOT(
plot_type = "heatmap",
results = results,
data = getDOU(d),
id_mapping = mapping,
plot_params = list(rank_by = "score", top_hits = 20),
force_new_device = FALSE
)## plotting heatmap for the top 20 DOU genes
## heatmap plotted
We can visualise the estimated ORF usage within a gene as inferred by the DOU module. This plot combines the probabilistic estimates of ORF usage with post hoc statistical significance annotations, allowing interpretation of DOU across conditions. If the ID mapping dataframe provided, the gene symbol is displayed as the plot title.
orderby <- c("Mitotic_Cycling", "Mitotic_Arrest", "Interphase")
id <- "TAF2"
plotDOT(
plot_type = "usage",
data = getDOU(d),
gene_id = id,
id_mapping = mapping,
plot_params = list(order_by = orderby),
force_new_device = FALSE
)## ORF usage plot for TAF2 plotted
The plotDOT() function supports querying according to
the gene IDs or gene symbols, as listed in the rowData, rowRanges and/or
the ID mapping dataframe. Stable gene IDs (e.g., ENSG00000060339) can be
used without their version suffix (e.g., .14).
id <- "ENSG00000060339"
plotDOT(
plot_type = "usage",
data = getDOU(d),
gene_id = id,
id_mapping = mapping,
plot_params = list(order_by = orderby),
force_new_device = FALSE
)## ORF usage plot for ENSG00000060339 plotted
DOTSeq offers a unified framework for modelling
translational control at the ORF level. DOTSeq has two
modules capturing distinct biological signals: DOU
tests for shifts in relative usage within genes
(cis-regulation), whereas DTE tests for
changes in per-ORF translation efficiency relative to transcript
abundance. They can be used together to provide a fuller picture of
translational regulation.
Besides using the orf_to_gtf.py wrapper16, we can generate ORF-level
annotation natively in R using the getORFs() function,
which is based on the ORFik’s
engine20. This approach creates
a GRanges object.
library(withr)
falink <- "https://ftp.ensembl.org/pub/release-78/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.cdna.all.fa.gz"
gtflink <- "https://ftp.ensembl.org/pub/release-78/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.78.gtf.gz"
annotation <- basename(gtflink)
sequences <- basename(falink)
download.file(gtflink, destfile=annotation)
download.file(falink, destfile=sequences)
# This will generate flattened ORF-level annotation
gr <- getORFs(
sequences = sequences,
annotation = annotation,
organism = "Drosophila melanogaster"
)
withr::defer(unlink(c(annotation, sequences)))We perform read counting using the RNA-seq BAM files from the pasillaBamSubset package. We can optionally clean up the BAM files before read counting by discarding reads mapped to introns, noncoding genes, and intergenic regions. Note that this dataset is not derived from a Ribo-seq experiment and is used solely for illustrative purposes. It should not be used for biological interpretation.
library(pasillaBamSubset)
library(GenomeInfoDb)
bam_list <- c(untreated1_chr4(), untreated3_chr4())
# Keep only reads mapped to the exons of coding genes
temp_dir <- tempdir()
getExonicReads(gr = gr, seqlevels_style = "UCSC", bam_files = bam_list, bam_output_dir = temp_dir, coding_genes_only = TRUE)
# Get the list of filtered BAM files
bam_list <- list.files(
path = temp_dir,
pattern = "*exonic.*",
full.names = TRUE
)
cnt <- countReads(gr = gr, bam_files = bam_list)
withr::defer(unlink(bam_list))Here we make use the ORF-level annotation and create count and
condition tables as input for
DOTSeqDataSetsFromSummarizeOverlaps(). This is alternative
to using DOTSeqDataSetsFromFeatureCounts() as demonstrated
at the beginning.
set.seed(42)
# Create count_table
# Create two replicates for each condition with random scaling
rna_treated_reps <- do.call(cbind, replicate(2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.5, max = 2), simplify = FALSE))
rna_control_reps <- do.call(cbind, replicate(2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.1, max = 0.5), simplify = FALSE))
ribo_treated_reps <- do.call(cbind, replicate(2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.5, max = 2), simplify = FALSE))
ribo_control_reps <- do.call(cbind, replicate(2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(nrow(cnt), min = 0.1, max = 0.5), simplify = FALSE))
# Combine and name columns
colnames(rna_treated_reps) <- paste0("rna_treated", 1:2)
colnames(rna_control_reps) <- paste0("rna_control", 1:2)
colnames(ribo_treated_reps) <- paste0("ribo_treated", 1:2)
colnames(ribo_control_reps) <- paste0("ribo_control", 1:2)
cnt_expanded <- cbind(
rna_treated_reps,
rna_control_reps,
ribo_treated_reps,
ribo_control_reps
)
# Convert numbers to integer
cnt_expanded <- round(cnt_expanded)
storage.mode(cnt_expanded) <- "integer"
rownames(cnt_expanded) <- rownames(cnt)
cnt_expanded <- as.data.frame(cnt_expanded)
# Create condition_table
# Sample names from cnt_expanded
sample_names <- colnames(cnt_expanded)
# Define condition and strategy for each sample
condition <- c(
rep("treated", 2),
rep("control", 2),
rep("treated", 2),
rep("control", 2)
)
strategy <- c(rep("RNA", 4), rep("Ribo", 4))
cond <- data.frame(
run = sample_names,
replicate = c(1,2),
condition = factor(condition, levels = c("control", "treated")),
strategy = factor(strategy, levels = c("RNA", "Ribo"))
)
# Create a DOTSeqDataSets object
d <- DOTSeqDataSetsFromSummarizeOverlaps(
count_table = cnt_expanded,
condition_table = cond,
annotation = gr
)
invisible(file.remove(bam_list))
invisible(file.remove(sub("\\.gz$", ".sqlite", annotation)))DOTSeq includes the function simDOT() for
generating simulated Ribo-seq and RNA-seq count matrices. It enables
benchmarking under controlled scenarios with configurable effect sizes,
batch effects, and ORF types.
Simulations in our manuscript3 show that:
The DOU module performs well at low-to-moderate effect sizes on DOU-type scenarios. This module is specifically designed to detect condition-dependent shifts in the expected proportion of Ribo-seq to RNA-seq counts for individual ORFs relative to other ORFs within the same gene (i.e., changes in ORF usage). This module targets cis-regulatory usage changes that alter relative ORF proportions within genes.
The DTE module performs better at large-magnitude changes in translation efficiency. This module captures ORF-level changes in ribosome occupancy relative to RNA abundance.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.41.0 Biobase_2.71.0
## [3] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [5] IRanges_2.45.0 S4Vectors_0.49.0
## [7] BiocGenerics_0.57.0 generics_0.1.4
## [9] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [11] DOTSeq_0.99.4 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] eulerr_7.0.4 RColorBrewer_1.1-3 sys_3.4.3
## [4] jsonlite_2.0.0 magrittr_2.0.4 estimability_1.5.1
## [7] GenomicFeatures_1.63.1 farver_2.1.2 nloptr_2.2.1
## [10] rmarkdown_2.30 BiocIO_1.21.0 vctrs_0.6.5
## [13] memoise_2.0.1 minqa_1.2.8 Rsamtools_2.27.0
## [16] RCurl_1.98-1.17 SQUAREM_2021.1 mixsqp_0.3-54
## [19] htmltools_0.5.9 S4Arrays_1.11.1 progress_1.2.3
## [22] curl_7.0.0 truncnorm_1.0-9 SparseArray_1.11.10
## [25] sass_0.4.10 bslib_0.9.0 sandwich_3.1-1
## [28] httr2_1.2.2 emmeans_2.0.1 zoo_1.8-15
## [31] cachem_1.1.0 TMB_1.9.19 buildtools_1.0.0
## [34] GenomicAlignments_1.47.0 lifecycle_1.0.5 pkgconfig_2.0.3
## [37] Matrix_1.7-4 R6_2.6.1 fastmap_1.2.0
## [40] GenomeInfoDbData_1.2.15 rbibutils_2.4 digest_0.6.39
## [43] numDeriv_2016.8-1.1 AnnotationDbi_1.73.0 DESeq2_1.51.6
## [46] irlba_2.3.5.1 RSQLite_2.4.5 labeling_0.4.3
## [49] invgamma_1.2 filelock_1.0.3 polyclip_1.10-7
## [52] httr_1.4.7 abind_1.4-8 mgcv_1.9-4
## [55] compiler_4.5.2 withr_3.0.2 bit64_4.6.0-1
## [58] S7_0.2.1 BiocParallel_1.45.0 DBI_1.2.3
## [61] ggsignif_0.6.4 biomaRt_2.67.1 MASS_7.3-65
## [64] rappdirs_0.3.3 DelayedArray_0.37.0 rjson_0.2.23
## [67] tools_4.5.2 otel_0.2.0 glue_1.8.0
## [70] restfulr_0.0.16 nlme_3.1-168 polylabelr_0.3.0
## [73] grid_4.5.2 glmmTMB_1.1.13 gtable_0.3.6
## [76] BSgenome_1.79.1 data.table_1.18.0 hms_1.1.4
## [79] xml2_1.5.1 XVector_0.51.0 pillar_1.11.1
## [82] stringr_1.6.0 splines_4.5.2 dplyr_1.1.4
## [85] BiocFileCache_3.1.0 lattice_0.22-7 rtracklayer_1.71.3
## [88] bit_4.6.0 tidyselect_1.2.1 locfit_1.5-9.12
## [91] maketools_1.3.2 Biostrings_2.79.4 pbapply_1.7-4
## [94] knitr_1.51 reformulas_0.4.3.1 xfun_0.55
## [97] stringi_1.8.7 UCSC.utils_1.7.1 yaml_2.3.12
## [100] boot_1.3-32 evaluate_1.0.5 codetools_0.2-20
## [103] cigarillo_1.1.0 tibble_3.3.1 BiocManager_1.30.27
## [106] cli_3.6.5 xtable_1.8-4 Rdpack_2.6.4
## [109] jquerylib_0.1.4 Rcpp_1.1.1 GenomeInfoDb_1.47.2
## [112] dbplyr_2.5.1 png_0.1-8 XML_3.99-0.20
## [115] parallel_4.5.2 ggplot2_4.0.1 blob_1.3.0
## [118] prettyunits_1.2.0 bitops_1.0-9 lme4_1.1-38
## [121] txdbmaker_1.7.3 mvtnorm_1.3-3 scales_1.4.0
## [124] purrr_1.2.1 crayon_1.5.3 rlang_1.1.7
## [127] ashr_2.2-63 KEGGREST_1.51.1