SpliceImpactR 0.99.4
hits_final, data, res)SpliceImpactR links differential splicing to protein-level outcomes,
including sequence changes, domain gain/loss, and potential PPI rewiring.
SpliceImpactR is a self-contained and efficient pipeline designed to identify how alternative RNA processing events affect resulting proteins. After initial alternative RNA processing quantification, the method runs entirely within R and does not require additional external software.
The pipeline begins by importing alternative RNA processing events that differ across conditions. It is built to support highly flexible input, requiring only inclusion and exclusion coordinates for each event isoform. This makes it compatible with both user-defined events and outputs from existing tools.
Custom input functions detect multiple classes of alternative RNA processing, including alternative first exons (AFE), hybrid first exons (HFE), retained introns (RI), skipped exons (SE), alternative 5′ and 3′ splice sites (A5SS and A3SS), mutually exclusive exons (MXE), hybrid last exons (HLE), and alternative last exons (ALE).
SpliceImpactR then performs differential inclusion analysis to identify condition-specific transcript pairs that differ by at least one alternative RNA processing event. These events are mapped to transcript and protein annotations to evaluate their effects on coding potential, sequence similarity, and frameshifts.
The pipeline next uses curated and user-provided protein feature sets to annotate domain-level changes, assess shifts in domain usage across conditions, and integrate domain-domain interaction data, domain-motif interaction data, and protein-protein interaction information to predict downstream effects on protein interaction networks.
Finally, SpliceImpactR performs a global analysis to detect co-regulated splicing events and quantify the relative contribution of each alternative RNA processing type. The pipeline is also available through an R Shiny interface to improve accessibility and ease of use.
SpliceImpactR is built to handle both S4 SpliceImpactResult object with
custom accession functions and standard data.table input / workflow.
SpliceImpactResult objects contain all relevant data along with detailed
sequence information for each event. This object allows for maximal
interoperability with any level of data from this pipeline.
This vignette uses bundled test data so every core step is runnable.
Install with one of the following methods.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("SpliceImpactR")
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
devtools::install_github("fiszbein-lab/SpliceImpactR")
rMATS (replicate Multivariate Analysis of Transcript Splicing) provides
event-level splicing tables.HITindex quantifies first/last exon usage and labels hybrid exons.PPIDM Protein-Protein Interaction Domain Miner for domain-domain
Interaction derived from PPI and 3did’s domain-domain interactionsELM Eukaryotic Linear Motif Database for accessing Short Linear Motif
occurences and domain-motif interactionsBiomaRt accesses data from InterPro, PFAM, SignalP, TMHMM, CDD, Mobidb-litesample_frame).get_protein_features() supports these feature sources:
- interpro: integrated domain/family/superfamily signatures.
- pfam: protein domain HMM families.
- cdd: NCBI Conserved Domain Database annotations.
- gene3d: CATH/Gene3D structural domain annotations.
- signalp: signal peptide calls.
- tmhmm: transmembrane helix calls.
- ncoils: coiled-coil region predictions.
- seg: low-complexity segments.
- mobidblite: intrinsically disordered region calls.
- elm: short linear motifs (SLiMs; retrieved from ELM, not BioMart).
For BioMart-backed sources, each feature must provide three attributes:
{feature}, {feature}_start, and {feature}_end (for example
pfam, pfam_start, pfam_end). You can add non-BioMart custom features
with get_manual_features() (explained further down in the
Custom Input Entry Points section) and merge everything through
get_comprehensive_annotations().
We load test annotations and three representative feature sources
(interpro, signalp, elm) and combine them. Loading multiple databases in
one call is possible, but not recommended due to timeouts. These are
automatically cached and loaded from cache if the correct signature (organism,
version, etc) is found. BiomaRt can give difficulties in loading these, so this
procedure will sometimes take a few tries. If this is unsuccessful, restart R
and/or try suggested methods related to this issue (eg: httr::reset_config())
library(SpliceImpactR)
annotation_df <- get_annotation(load = "test")
#> [INFO] Loading bundled test annotation data
interpro_features <- get_protein_features(
biomaRt_databases = c("interpro"),
gtf_df = annotation_df$annotations,
test = TRUE
)
signalp_features <- get_protein_features(
biomaRt_databases = c("signalp"),
gtf_df = annotation_df$annotations,
test = TRUE
)
elm_features <- get_protein_features(
biomaRt_databases = c("elm"),
gtf_df = annotation_df$annotations,
test = TRUE
)
protein_feature_total <- get_comprehensive_annotations(
list(interpro_features, signalp_features, elm_features)
)
exon_features <- get_exon_features(
annotation_df$annotations,
protein_feature_total
)
table(protein_feature_total$database)
#>
#> elm interpro signalp
#> 18 1449 13
Each sample directory contains event files exported by rMATS and/or
HITindex. The manifest below points to the bundled test directories. Using
this function, SpliceImpactR will search each directory for the outputs provided
by rMATS and HIT Index output files ending in .AFEPSI, .ALEPSI, and .exon.
Label one condition case and one condition control.
Required sample manifest columns:
- path: per-sample directory containing splicing output files.
- sample_name: unique sample identifier.
- condition: case/control label used in DI and downstream pairing.
Manifest expectations:
- One row per sample.
- path values must point to readable sample directories.
- Use exactly two conditions for this standard workflow (case, control).
- Include replicate samples per condition for robust differential modeling.
sample_frame <- data.frame(
path = c(
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/control_S5/"),
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/control_S6/"),
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/control_S7/"),
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/control_S8/"),
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/case_S1/"),
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/case_S2/"),
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/case_S3/"),
file.path(system.file("extdata", package = "SpliceImpactR"),
"rawData/case_S4/")
),
sample_name = c("S5", "S6", "S7", "S8", "S1", "S2", "S3", "S4"),
condition = c(
"control", "control", "control", "control",
"case", "case", "case", "case"
),
stringsAsFactors = FALSE
)
get_splicing_impact)If your inputs already follow the standard layout, run the full workflow with the top-level wrapper and choose table or S4 return mode.
# data.table-first return
out_dt <- get_splicing_impact(
sample_frame = sample_frame,
source_data = "both",
annotation_df = annotation_df,
protein_feature_total = protein_feature_total,
exon_features = exon_features,
return_class = "data.table"
)
# S4-first return (single object carries all slots)
out_s4 <- get_splicing_impact(
sample_frame = sample_frame,
source_data = "both",
annotation_df = annotation_df,
protein_feature_total = protein_feature_total,
exon_features = exon_features,
return_class = "S4"
)
For the stepwise workflow in this vignette, load event data explicitly:
data <- get_rmats_hit(
sample_frame,
event_types = c("ALE", "AFE", "MXE", "SE", "A3SS", "A5SS", "RI")
)
#> [INFO] Filtered 0 unannotated ALE/AFE events (465 -> 465)
DT <- data.table::as.data.table(data)
DT[, .(
n_rows = .N,
n_events = data.table::uniqueN(event_id),
n_genes = data.table::uniqueN(gene_id)
)]
#> n_rows n_events n_genes
#> <int> <int> <int>
#> 1: 5863 605 41
If rMATS and HITindex outputs are stored in separate per-sample
directories, load them independently and combine on shared columns.
In this test-data example, both manifests point to the same bundled paths;
in real analyses, sample_frame_rmats$path and sample_frame_hit$path
can be different.
sample_frame_rmats <- sample_frame
sample_frame_hit <- sample_frame
rmats_only <- get_rmats(
load_rmats(
sample_frame_rmats,
use = "JCEC",
event_types = c("MXE", "SE", "A3SS", "A5SS", "RI")
)
)
hit_only <- get_hitindex(
sample_frame_hit,
keep_annotated_first_last = TRUE
)
#> [INFO] Filtered 0 unannotated ALE/AFE events (465 -> 465)
shared_cols <- intersect(names(rmats_only), names(hit_only))
data_separate <- data.table::rbindlist(
list(
rmats_only[, ..shared_cols],
hit_only[, ..shared_cols]
),
use.names = TRUE,
fill = TRUE
)
data_separate[, .(
n_rows = .N,
n_events = data.table::uniqueN(event_id),
n_genes = data.table::uniqueN(gene_id)
)]
#> n_rows n_events n_genes
#> <int> <int> <int>
#> 1: 5863 605 41
This summary compares event-level mean HIT values across conditions.
hit_compare <- compare_hit_index(
sample_frame,
condition_map = c(control = "control", test = "case")
)
hit_compare$plot
Figure 1: HITindex condition comparison
head(hit_compare$results[order(fdr)], 6)
#> event_key control test diff_HIT
#> <char> <num> <num> <num>
#> 1: ENSG00000158286|chr1:6210370-6211118 -0.1890000 0.12766667 0.3166667
#> 2: ENSG00000158286|chr1:6211867-6212053 -0.1905000 -0.23400000 -0.0435000
#> 3: ENSG00000158286|chr1:6212218-6212416 -0.1783333 -0.14533333 0.0330000
#> 4: ENSG00000158286|chr1:6212682-6213183 0.4350000 0.04666667 -0.3883333
#> 5: ENSG00000158286|chr1:6218289-6218369 -0.2993333 -0.05766667 0.2416667
#> 6: ENSG00000142599|chr1:8355419-8355599 0.1010000 0.10800000 0.0070000
#> delta_HIT n_control n_test p_value fdr
#> <num> <int> <int> <num> <num>
#> 1: 0.3166667 2 3 1.0000000 1
#> 2: 0.0435000 2 3 1.0000000 1
#> 3: 0.0330000 3 3 0.8337307 1
#> 4: 0.3883333 2 3 1.0000000 1
#> 5: 0.2416667 3 3 0.2567179 1
#> 6: 0.0070000 4 4 0.9338296 1
This panel summarizes event count depth-normalized metrics and PSI eCDF.
overview_plot <- overview_spicing_comparison(
events = data,
sample_df = sample_frame,
depth_norm = "exon_files",
event_type = "AFE"
)
overview_plot
Figure 2: Depth-normalized event counts and PSI overview
We run quasi-binomial GLMs and filter significant events.
Verbose filtering messages are line-wrapped so progress logs remain readable
without horizontal scrolling. For efficient processing, set parallel_glm to
TRUE and provide a BiocParallel object to BPPARAM.
Default thresholds in keep_sig_pairs is FDR < 0.05 and |DELTA PSI| > 0.1.
res <- get_differential_inclusion(
data,
min_total_reads = 10,
parallel_glm = TRUE,
BPPARAM = BiocParallel::SerialParam()
)
#> [INFO] Input contains 41 genes, 605 events, and 2932 event instances.
#> [PROCESSING/INFO] Filtering out low-coverage rows removed 892 event instances; 4971
#> remain.
#> [PROCESSING/INFO] Completing AFE/ALE with zeros per sample (total strategy: event_max)
#> filled 407 AFE rows and 160 ALE rows (totals: AFE=608, ALE=424).
#> [PROCESSING/INFO] Filtering genes with no nonzero values per sample/event_type removed
#> 37 genes from specific sample groups; 40 genes remain overall.
#> [PROCESSING/INFO] Filtering by minimum condition presence removed 211 events; 351
#> events remain.
#> [PROCESSING/INFO] Removed 62 non-changing events; 289 events remain.
#> [PROCESSING/INFO] Filtering events not represented in both conditions removed 30
#> events; 259 events remain.
#> [PROCESSING] Fitting quasi-binomial GLMs per site...
#>
|
| | 0%
|
|======================================================================| 100%
#>
#> [DONE] Fitted 531 sites in 1 chunks.
#> [INFO] Done.
res_di <- keep_sig_pairs(res)
res[, .(
n_rows = .N,
n_sig = sum(padj <= 0.05 & abs(delta_psi) >= 0.1, na.rm = TRUE)
)]
#> n_rows n_sig
#> <int> <int>
#> 1: 533 73
plot_di_volcano_dt(res)
Figure 3: Differential inclusion volcano plot
Significant DI events are mapped to annotation and sequence data, then collapsed into case/control transcript pairs. This matching is done through a strict hierarchy:
chr, strand, and gene_id to keep only compatible
annotation intervals.inc) coordinates to exons and require all inclusion
parts to be covered for a transcript candidate.exc) coordinates.first, internal, last), then reciprocal overlap and intersection
width; protein-linked transcripts are preferred when available.get_pairs(source = "multi") by joining all
positive delta_psi rows (case) with all negative rows (control) for each
event_id, then ordering by strongest |delta_psi|.matched <- get_matched_events_chunked(
res_di,
annotation_df$annotations,
chunk_size = 2000
)
#> Chunk 1/1: rows 1..94
hits_sequences <- attach_sequences(matched, annotation_df$sequences)
pairs <- get_pairs(hits_sequences, source = "multi")
pairs[, .(
n_pairs = .N,
n_events = data.table::uniqueN(event_id)
)]
#> n_pairs n_events
#> <int> <int>
#> 1: 36 28
Once pairing is done, we can probe whether there is a proximal/distal shift in AFE/ALE usage.
proximal_output <- get_proximal_shift_from_hits(pairs)
proximal_output$plot
Figure 4: Proximal versus distal terminal exon summary
compare_sequence_frame() computes protein-coding class,
alignment identity, frameshift/rescue status, and length deltas.
Key labels used downstream:
- protein_coding: both isoforms have protein IDs (both sides coding).
- onePC: only one side has a protein ID.
- noPC: neither side has a protein ID.
- Match: protein sequences are identical.
- FrameShift: frame is disrupted between paired isoforms.
seq_compare <- compare_sequence_frame(pairs, annotation_df$annotations)
#> [1] "[INFO] Processing 36 transcript and protein sequence alignments, this may take a little..."
#> [1] "[Processing] Identifying frame shifts and rescues"
#> [1] "[INFO] 1 frameshifts (0 rescues) and 27 non-frameshifts were identified, "
alignment_plot <- plot_alignment_summary(seq_compare)
alignment_plot
Figure 5: Protein alignment summary by event type
length_plot <- plot_length_comparison(seq_compare)
length_plot
Figure 6: Case versus control isoform length comparison
Background transcript pairs are derived from annotations and compared against observed event-linked domain changes.
bg <- get_background(
source = "annotated",
annotations = annotation_df$annotations,
protein_features = protein_feature_total,
BPPARAM = BiocParallel::SerialParam()
)
hits_domain <- get_domains(seq_compare, exon_features)
hits_domain[, .(
n_hits = .N,
n_domain_changed = sum(diff_n > 0, na.rm = TRUE)
)]
#> n_hits n_domain_changed
#> <int> <int>
#> 1: 36 12
enriched_domains <- enrich_domains_hypergeo(
hits = hits_domain,
background = bg,
db_filter = "interpro"
)
domain_plot <- plot_enriched_domains_counts(enriched_domains, top_n = 20)
domain_plot
Figure 7: Top enriched InterPro domain differences
You can also stratify enrichment by event type or by feature database.
enriched_afe_interpro <- enrich_by_event(
hits = hits_domain,
background = bg,
events = "AFE",
db_filter = "interpro"
)
enriched_interpro_only <- enrich_by_db(
hits = hits_domain,
background = bg,
dbs = "interpro"
)
list(
by_event_rows = nrow(enriched_afe_interpro),
by_db_rows = nrow(enriched_interpro_only)
)
#> $by_event_rows
#> [1] 1
#>
#> $by_db_rows
#> [1] 1
PPI effects are inferred from domain and motif changes. Domain-domain and domain-motif interactions are sourced from 3did/PPDIM and ELM. PPI rewiring currently only works for human inputs.
ppi <- get_ppi_interactions()
hits_final <- get_ppi_switches(
hits_domain,
ppi,
protein_feature_total
)
#> Registered S3 method overwritten by 'bit64':
#> method from
#> print.bitstring tools
#>
ppi_plot <- plot_ppi_summary(hits_final)
ppi_plot
Figure 8: Summary of predicted case/control PPI changes
get_gene_enrichment() extracts foreground genes for enrichment.
mode = "di" expects a DI-like table (res).mode = "ppi"/"domain" expects a hits-like table (hits_final or
hits_domain).fg_di <- get_gene_enrichment(
mode = "di",
res = res,
padj_threshold = 0.1,
delta_psi_threshold = 0.1
)
fg_domain <- get_gene_enrichment(mode = "domain", hits = hits_domain)
fg_ppi <- get_gene_enrichment(mode = "ppi", hits = hits_final)
lengths(list(di = fg_di, domain = fg_domain, ppi = fg_ppi))
#> di domain ppi
#> 40 11 6
intersect(fg_di, fg_ppi)
#> [1] "ENSG00000116171" "ENSG00000142599" "ENSG00000140983" "ENSG00000101391"
#> [5] "ENSG00000116406"
The optional get_enrichment() step depends on additional packages and may
return no significant terms on small toy datasets. If you see all NA
statistics, foreground genes often map too sparsely into the selected
background and ontology. p_adjust_cutoff and top_n_plot are set
to allow for example visualization, but should be adjusted to standard
(eg: 0.05, 10) for non-toy data uses.
needed <- c("clusterProfiler", "msigdbr", "org.Hs.eg.db")
has_needed <- all(vapply(
needed,
requireNamespace,
FUN.VALUE = logical(1),
quietly = TRUE
))
#>
#>
if (has_needed) {
enrichment_di <- get_enrichment(
foreground = fg_domain,
background = bg$gene_id,
species = "human",
gene_id_type = "ensembl",
sources = "GO:BP",
p_adjust_cutoff = 1, ## CHANGE FOR NON-TOY DATA USAGE, eg: 0.05
top_n_plot = 1 ## CHANGE FOR NON-TOY DATA USAGE, eg: NULL or 10
)
enrichment_di$plot
} else {
message(
"Skipping get_enrichment(): install ",
paste(needed, collapse = ", ")
)
}
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> Warning: semData is not provided. It will be calculated automatically.
plot_two_transcripts_with_domains_unified() shows matched isoforms in either
transcript (genome-coordinate) or protein (amino-acid) view.
viz_pair <- hits_final[
!is.na(transcript_id_case) &
!is.na(transcript_id_control) &
transcript_id_case != "" &
transcript_id_control != ""
][1]
if (nrow(viz_pair) == 0) {
stop("No transcript pairs available for visualization.")
}
tx_pair <- c(viz_pair$transcript_id_case, viz_pair$transcript_id_control)
transcript_centric <- plot_two_transcripts_with_domains_unified(
transcripts = tx_pair,
gtf_df = annotation_df$annotations,
protein_features = protein_feature_total,
feature_db = c("interpro"),
combine_domains = TRUE,
view = "transcript"
)
protein_centric <- plot_two_transcripts_with_domains_unified(
transcripts = tx_pair,
gtf_df = annotation_df$annotations,
protein_features = protein_feature_total,
feature_db = c("interpro"),
combine_domains = TRUE,
view = "protein"
)
transcript_centric
Figure 9: Transcript and protein views for one paired event
protein_centric
Figure 10: Transcript and protein views for one paired event
probe_individual_event() plots PSI for one event across samples.
event_to_probe <- res$event_id[1]
probe <- probe_individual_event(data, event = event_to_probe)
probe$plot
Figure 11: PSI distribution for one selected event
head(probe$data)
#> event_id sample condition psi i.condition form
#> <char> <char> <char> <num> <char> <char>
#> 1: A3SS:26 S1 case 0.483 case INC
#> 2: A3SS:26 S2 case 0.483 case INC
#> 3: A3SS:26 S3 case 0.525 case INC
#> 4: A3SS:26 S4 case 0.345 case INC
#> 5: A3SS:26 S5 control 0.208 control INC
#> 6: A3SS:26 S6 control 0.085 control INC
integrated_event_summary() combines sequence, domain, and PPI summaries.
Panel guide:
- Top-left: event classification composition (Match, FrameShift, etc.) by
event type.
- Top-right: alignment score distributions by event type.
- Middle-left: domain-change prevalence (Any, case-only, control-only, both).
- Middle-center/right: fraction and counts of predicted PPI rewiring.
- Bottom-left: relative event retention from pre-filter DI to final hits.
- Bottom-right: gene-level event-type coordination (Jaccard heatmap).
int_summary <- integrated_event_summary(hits_final, res)
int_summary$plot
Figure 12: Integrated multi-layer summary
int_summary$summaries$relative_use
#> event_type n_pre n_post relative_use
#> <fctr> <int> <int> <num>
#> 1: AFE 22 17 0.77272727
#> 2: A3SS 35 5 0.14285714
#> 3: RI 35 2 0.05714286
#> 4: SE 135 4 0.02962963
#> 5: A5SS 11 0 0.00000000
#> 6: ALE 10 0 0.00000000
#> 7: MXE 11 0 0.00000000
hits_final, data, res)The pipeline returns three core tables in data.table mode:
- data: sample-level event measurements before differential modeling.
- res: differential inclusion results per tested event/site.
- hits_final: paired case/control isoform effects with sequence, domain, and
PPI annotations.
Suffix convention used throughout:
- _case = case-preferred isoform values.
- _control = control-preferred isoform values.
hits_final (integrated event-level output)Use this table for biological interpretation and downstream plotting.
event_id, event_type, gene_id, chr, strand,
transcript_id_case, transcript_id_control,
protein_id_case, protein_id_control, form_case, form_control,
exons_case, exons_control.inc_case, inc_control, exc_case, exc_control,
delta_psi_case, delta_psi_control,
p.value_case, p.value_control, padj_case, padj_control,
n_samples_case, n_samples_control,
n_case_case, n_case_control, n_control_case, n_control_control.transcript_seq_case, transcript_seq_control,
protein_seq_case, protein_seq_control, pc_class,
length columns (prot_len_*, tx_len_*, exon_cds_len_*, exon_len_*)
and their *_diff / *_diff_abs derivatives.dna_pid, dna_score, dna_width),
protein alignment (prot_pid, prot_score, prot_width),
frame diagnostics (frame_call, rescue, frame_check_exon1,
frame_check_exon2), and summary_classification.domains_exons_case, domains_exons_control,
case_only_domains, control_only_domains,
case_only_domains_list, control_only_domains_list,
either_domains_list, case_only_n, control_only_n, diff_n.case_ppi, control_ppi, n_case_ppi, n_control_ppi, n_ppi,
case_ppi_drivers, control_ppi_drivers.data (raw sample-level input table)Use data to inspect per-sample evidence feeding DI.
Core columns:
event_id, event_type, form, gene_id, chr, strand, inc, exc,
inclusion_reads, exclusion_reads, psi, sample, condition,
source_file.
Often present depending on import path:
HITindex metadata such as HITindex, class, nFE, nLE, nUP, nDOWN,
nTXPT, psi_original, total_reads, source.
res (differential inclusion output)Use res to rank significant events before pairing/domain/PPI steps.
Core columns:
site_id, event_id, event_type, gene_id, chr, strand, inc, exc,
form, n_samples, n_control, n_case, mean_psi_ctrl, mean_psi_case,
delta_psi, p.value, padj, cooks_max, n, n_used.
Use SpliceImpactResult when you want all pipeline pieces in one object,
with consistent filtering across event-linked slots.
SpliceImpactResult is a custom S4 container that keeps all major pipeline
parts synchronized:
- raw_events (SummarizedExperiment): sample-level table + ranges/assays.
- di_events / res_di (GRanges): differential inclusion rows.
- matched (DFrame): annotation-matched rows (and sequence-attached rows).
- paired_hits (GRanges) + segments (GRangesList): final case/control
event impacts.
- sample_frame (DFrame): sample manifest metadata.
obj <- as_splice_impact_result(
data = data,
res = res,
res_di = res_di,
matched = hits_sequences,
sample_frame = sample_frame,
hits_final = hits_final
)
get_hits_*() wrappers provide compact views for common downstream tasks:
- get_hits_core(): event IDs, transcript/protein IDs, and key DI metadata.
- get_hits_domain(): domain gain/loss columns (case_only_*, control_only_*,
diff_n, mapped domain lists).
- get_hits_ppi(): PPI partner changes (n_ppi, case/control partner lists,
PPI drivers).
- get_hits_sequence(): sequence/alignment/frame fields (pc_class,
frame_call, protein/transcript length deltas, identity metrics).
raw_dt <- as_dt_from_s4(obj, "raw_events")
di_dt <- as_dt_from_s4(obj, "di_events")
hits_dt <- as_dt_from_s4(obj, "paired_hits")
hits_core <- get_hits_core(obj)
hits_domain_view <- get_hits_domain(obj)
hits_ppi_view <- get_hits_ppi(obj)
hits_sequence_view <- get_hits_sequence(obj)
c(
raw_rows = nrow(raw_dt),
di_rows = nrow(di_dt),
hits_rows = nrow(hits_dt),
hits_core_rows = nrow(hits_core),
hits_domain_rows = nrow(hits_domain_view),
hits_ppi_rows = nrow(hits_ppi_view),
hits_sequence_rows = nrow(hits_sequence_view)
)
#> raw_rows di_rows hits_rows hits_core_rows
#> 5863 533 36 36
#> hits_domain_rows hits_ppi_rows hits_sequence_rows
#> 36 36 36
obj_focus <- filter_spliceimpact_hits(
obj,
diff_n > 0,
n_ppi > 0
)
focus_hits <- as_dt_from_s4(obj_focus, "paired_hits")
focus_hits[, .N, by = event_type][order(-N)]
#> event_type N
#> <char> <int>
#> 1: AFE 2
#> 2: RI 2
#> 3: SE 2
fg_di_s4 <- get_gene_enrichment(
mode = "di",
x = obj,
padj_threshold = 0.05,
delta_psi_threshold = 0.1
)
fg_ppi_s4 <- get_gene_enrichment(mode = "ppi", x = obj)
fg_ppi_focus <- get_gene_enrichment(mode = "ppi", x = obj_focus)
list(
di_equal_table = identical(sort(fg_di), sort(fg_di_s4)),
ppi_equal_table = identical(sort(fg_ppi), sort(fg_ppi_s4)),
focused_ppi_genes = head(fg_ppi_focus)
)
#> $di_equal_table
#> [1] TRUE
#>
#> $ppi_equal_table
#> [1] TRUE
#>
#> $focused_ppi_genes
#> [1] "ENSG00000116171" "ENSG00000142599" "ENSG00000140983" "ENSG00000164692"
#> [5] "ENSG00000116406" "ENSG00000101391"
spliceimpact_s4_schema()
#> $slots
#> $slots$raw_events
#> [1] "SummarizedExperiment: sample/form rows with rowRanges + rowData"
#>
#> $slots$di_events
#> [1] "GRanges: differential inclusion rows with mcols"
#>
#> $slots$res_di
#> [1] "GRanges: threshold-filtered differential rows with mcols"
#>
#> $slots$matched
#> [1] "S4Vectors::DataFrame: annotation-matched DI rows (and sequence-attached rows when present)"
#>
#> $slots$sample_frame
#> [1] "S4Vectors::DataFrame: sample manifest (`path`, `sample_name`, `condition`)"
#>
#> $slots$paired_hits
#> [1] "GRanges: paired case/control rows with mcols"
#>
#> $slots$segments
#> [1] "GRangesList: per-pair genomic segments (inc_case/inc_control/exc_case/exc_control)"
#>
#> $slots$metadata
#> [1] "list: provenance and pipeline metadata"
#>
#>
#> $keys
#> $keys$raw_events
#> [1] "raw_key (rowData)"
#>
#> $keys$di_events
#> [1] "di_key (mcols)"
#>
#> $keys$res_di
#> [1] "di_key (mcols)"
#>
#> $keys$paired_hits
#> [1] "pair_key (mcols)"
#>
#> $keys$segments
#> [1] "names(segments) == pair_key"
#>
#>
#> $assays
#> [1] "psi" "inclusion_reads" "exclusion_reads"
This mirrors the table workflow, but keeps everything in a single
SpliceImpactResult object.
obj_flow <- as_splice_impact_result(
data = data,
sample_frame = sample_frame
)
obj_flow <- get_differential_inclusion(
obj_flow,
min_total_reads = 10,
parallel_glm = TRUE,
BPPARAM = BiocParallel::SerialParam(),
return_class = "S4"
)
obj_flow <- keep_sig_pairs(obj_flow, return_class = "S4")
obj_flow <- get_matched_events_chunked(
obj_flow,
annotation_df$annotations,
chunk_size = 2000,
return_class = "S4"
)
obj_flow <- attach_sequences(
obj_flow,
annotation_df$sequences,
return_class = "S4"
)
obj_flow <- get_pairs(obj_flow, source = "multi", return_class = "S4")
obj_flow <- compare_sequence_frame(
obj_flow,
annotation_df$annotations,
return_class = "S4"
)
obj_flow <- get_domains(obj_flow, exon_features, return_class = "S4")
obj_flow <- get_ppi_switches(
obj_flow,
ppi,
protein_feature_total,
return_class = "S4"
)
hits_core_flow <- get_hits_core(obj_flow)
hits_domain_flow <- get_hits_domain(obj_flow)
hits_ppi_flow <- get_hits_ppi(obj_flow)
hits_sequence_flow <- get_hits_sequence(obj_flow)
These helpers support non-standard inputs at multiple stages of the workflow.
get_manual_features() lets you add user-defined peptide-level features.
ann_dt <- data.table::as.data.table(annotation_df$annotations)
coding_tx <- unique(ann_dt[type == "exon" & cds_has == TRUE, transcript_id])
n_manual <- min(3L, length(coding_tx))
if (n_manual < 1L) {
stop("No coding transcripts found for manual feature example.")
}
manual_df <- data.frame(
ensembl_transcript_id = coding_tx[seq_len(n_manual)],
ensembl_peptide_id = rep(NA_character_, n_manual),
name = paste0("demo_feature_", seq_len(n_manual)),
start = c(20L, 45L, 80L)[seq_len(n_manual)],
stop = c(35L, 58L, 92L)[seq_len(n_manual)],
database = rep("manual", n_manual),
alt_name = rep(NA_character_, n_manual),
feature_id = rep(NA_character_, n_manual),
stringsAsFactors = FALSE
)
manual_features <- get_manual_features(
manual_features = manual_df,
gtf_df = annotation_df$annotations
)
#> 0 domain(s) not matched to genomic coords
head(manual_features)
#> ensembl_transcript_id feature_id start stop chr strand clean_name
#> <char> <char> <num> <num> <char> <char> <char>
#> 1: ENST00000337907 <NA> 20 35 chr1 - demo_feature_1
#> 2: ENST00000400908 <NA> 45 58 chr1 - demo_feature_2
#> 3: ENST00000476556 <NA> 80 92 chr1 - demo_feature_3
#> alt_name database ensembl_peptide_id method
#> <char> <char> <char> <char>
#> 1: <NA> manual <NA> manual
#> 2: <NA> manual <NA> manual
#> 3: <NA> manual <NA> manual
#> name
#> <char>
#> 1: demo_feature_1;chr1:8656030-8656077
#> 2: demo_feature_2;chr1:8656105-8656146
#> 3: demo_feature_3;chr1:8362842-8361798
Use get_user_data() when you already have sample-level reads/PSI.
example_df <- data.frame(
event_id = rep("A3SS:1", 8),
event_type = "A3SS",
form = rep(c("INC", "EXC"), each = 4),
gene_id = "ENSG00000158286",
chr = "chrX",
strand = "-",
inc = c(rep("149608626-149608834", 4), rep("149608626-149608829", 4)),
exc = c(rep("", 4), rep("149608830-149608834", 4)),
inclusion_reads = c(30, 32, 29, 31, 2, 3, 4, 3),
exclusion_reads = c(1, 1, 2, 1, 28, 27, 26, 30),
sample = c("S1", "S2", "S3", "S4", "S1", "S2", "S3", "S4"),
condition = rep(c("case", "case", "control", "control"), 2),
stringsAsFactors = FALSE
)
example_df$psi <- example_df$inclusion_reads /
(example_df$inclusion_reads + example_df$exclusion_reads)
user_data <- get_user_data(example_df)
head(user_data)
#> event_id event_type form gene_id chr strand inc
#> <char> <char> <char> <char> <char> <char> <char>
#> 1: A3SS:1 A3SS INC ENSG00000158286 chrX - 149608626-149608834
#> 2: A3SS:1 A3SS INC ENSG00000158286 chrX - 149608626-149608834
#> 3: A3SS:1 A3SS INC ENSG00000158286 chrX - 149608626-149608834
#> 4: A3SS:1 A3SS INC ENSG00000158286 chrX - 149608626-149608834
#> 5: A3SS:1 A3SS EXC ENSG00000158286 chrX - 149608626-149608829
#> 6: A3SS:1 A3SS EXC ENSG00000158286 chrX - 149608626-149608829
#> exc inclusion_reads exclusion_reads psi sample
#> <char> <num> <num> <num> <char>
#> 1: 30 1 0.96774194 S1
#> 2: 32 1 0.96969697 S2
#> 3: 29 2 0.93548387 S3
#> 4: 31 1 0.96875000 S4
#> 5: 149608830-149608834 2 28 0.06666667 S1
#> 6: 149608830-149608834 3 27 0.10000000 S2
#> condition source_file
#> <char> <char>
#> 1: case
#> 2: case
#> 3: control
#> 4: control
#> 5: case
#> 6: case
Use get_user_data_post_di() when you already have event-level DI summaries.
example_user_data <- data.frame(
event_id = rep("A3SS:1", 8),
event_type = "A3SS",
gene_id = "ENSG00000158286",
chr = "chrX",
strand = "-",
form = rep(c("INC", "EXC"), each = 4),
inc = c(rep("149608626-149608834", 4), rep("149608626-149608829", 4)),
exc = c(rep("", 4), rep("149608830-149608834", 4)),
stringsAsFactors = FALSE
)
user_res <- get_user_data_post_di(example_user_data)
head(user_res)
#> site_id
#> <char>
#> 1: A3SS|ENSG00000158286|chrX|149608626-149608834||INC
#> 2: A3SS|ENSG00000158286|chrX|149608626-149608834||INC
#> 3: A3SS|ENSG00000158286|chrX|149608626-149608834||INC
#> 4: A3SS|ENSG00000158286|chrX|149608626-149608834||INC
#> 5: A3SS|ENSG00000158286|chrX|149608626-149608829|149608830-149608834|EXC
#> 6: A3SS|ENSG00000158286|chrX|149608626-149608829|149608830-149608834|EXC
#> event_type event_id gene_id chr strand inc
#> <char> <char> <char> <char> <char> <char>
#> 1: A3SS A3SS:1 ENSG00000158286 chrX - 149608626-149608834
#> 2: A3SS A3SS:1 ENSG00000158286 chrX - 149608626-149608834
#> 3: A3SS A3SS:1 ENSG00000158286 chrX - 149608626-149608834
#> 4: A3SS A3SS:1 ENSG00000158286 chrX - 149608626-149608834
#> 5: A3SS A3SS:1 ENSG00000158286 chrX - 149608626-149608829
#> 6: A3SS A3SS:1 ENSG00000158286 chrX - 149608626-149608829
#> exc n_samples n_control n_case mean_psi_ctrl mean_psi_case
#> <char> <num> <num> <num> <num> <num>
#> 1: -1 -1 -1 -1 -1
#> 2: -1 -1 -1 -1 -1
#> 3: -1 -1 -1 -1 -1
#> 4: -1 -1 -1 -1 -1
#> 5: 149608830-149608834 -1 -1 -1 -1 -1
#> 6: 149608830-149608834 -1 -1 -1 -1 -1
#> delta_psi p.value padj cooks_max form n n_used
#> <num> <num> <num> <num> <char> <num> <num>
#> 1: 1 0 0 -1 INC -1 -1
#> 2: 1 0 0 -1 INC -1 -1
#> 3: 1 0 0 -1 INC -1 -1
#> 4: 1 0 0 -1 INC -1 -1
#> 5: -1 0 0 -1 EXC -1 -1
#> 6: -1 0 0 -1 EXC -1 -1
Use get_rmats_post_di() to ingest already-computed rMATS result tables.
rmats_df <- data.frame(
ID = 1L,
GeneID = "ENSG00000182871",
geneSymbol = "COL18A1",
chr = "chr21",
strand = "+",
longExonStart_0base = 45505834L,
longExonEnd = 45505966L,
shortES = 45505837L,
shortEE = 45505966L,
flankingES = 45505357L,
flankingEE = 45505431L,
ID.2 = 2L,
IJC_SAMPLE_1 = "1,1,1",
SJC_SAMPLE_1 = "1,1,1",
IJC_SAMPLE_2 = "1,1,1",
SJC_SAMPLE_2 = "1,1,1",
IncFormLen = 52L,
SkipFormLen = 49L,
PValue = 0.6967562,
FDR = 1,
IncLevel1 = "0.0,0.0,0.0",
IncLevel2 = "1.0,1.0,1.0",
IncLevelDifference = 1.0,
stringsAsFactors = FALSE
)
user_res_rmats <- get_rmats_post_di(rmats_df, event_type = "A3SS")
head(user_res_rmats)
#> event_id event_type form gene_id chr strand inc
#> <char> <char> <char> <char> <char> <char> <char>
#> 1: A3SS:1 A3SS INC ENSG00000182871 chr21 + 45505834-45505966
#> 2: A3SS:1 A3SS EXC ENSG00000182871 chr21 + 45505837-45505966
#> exc p.value delta_psi padj
#> <char> <num> <num> <num>
#> 1: 0.6967562 1 1
#> 2: 45505834-45505836 0.6967562 -1 1
Use compare_transcript_pairs() when you want to start directly from chosen
transcript pairs instead of event matching.
tx_ids <- unique(annotation_df$annotations$transcript_id)
tx_ids <- tx_ids[!is.na(tx_ids) & tx_ids != ""]
if (length(tx_ids) < 4L) {
stop("Need at least four transcripts for transcript-pair example.")
}
transcript_pairs <- data.frame(
transcript1 = tx_ids[1:2],
transcript2 = tx_ids[3:4],
stringsAsFactors = FALSE
)
user_matched <- compare_transcript_pairs(
transcript_pairs = transcript_pairs,
annotations = annotation_df$annotations
)
#> 2 out of 2 transcript pairs contained within annotations
head(user_matched)
#> event_id event_type form gene_id chr strand
#> <char> <char> <char> <char> <char> <char>
#> 1: TXCMP:1 TXCMP EXC ENSG00000158286 chr1 +
#> 2: TXCMP:1 TXCMP INC ENSG00000158286 chr1 +
#> 3: TXCMP:2 TXCMP EXC ENSG00000158286 chr1 +
#> 4: TXCMP:2 TXCMP INC ENSG00000158286 chr1 +
#> inc
#> <char>
#> 1: 6205475-6206101;6206196-6206302;6206536-6206726;6207379-6208307
#> 2: 6205475-6206101;6206196-6206302;6206536-6206726;6207379-6208307
#> 3: 6206129-6206302;6206536-6206726;6207379-6207564
#> 4: 6206129-6206302;6206536-6206726;6207379-6207564
#> exc
#> <char>
#> 1: 6208618-6209025;6209115-6209196;6209268-6209539;6209924-6209970;6210223-6210295;6210370-6210900
#> 2: 6208618-6209025;6209115-6209196;6209268-6209539;6209924-6209970;6210223-6210295;6210370-6210900
#> 3: 6208860-6209025;6209414-6209539;6209924-6209970;6210223-6210295;6210370-6210438;6211867-6212053;6212218-6212416;6212682-6213183;6218289-6218369;6219236-6220805
#> 4: 6208860-6209025;6209414-6209539;6209924-6209970;6210223-6210295;6210370-6210438;6211867-6212053;6212218-6212416;6212682-6213183;6218289-6218369;6219236-6220805
#> delta_psi p.value padj n_samples n_control n_case transcript_id
#> <num> <num> <num> <num> <num> <num> <char>
#> 1: -1 0 0 0 0 0 ENST00000485539
#> 2: 1 0 0 0 0 0 ENST00000466994
#> 3: -1 0 0 0 0 0 ENST00000496676
#> 4: 1 0 0 0 0 0 ENST00000484435
#> exons
#> <char>
#> 1: ENSE00001934791;ENSE00003528881;ENSE00001948578;ENSE00003560800;ENSE00003516232;ENSE00001921263
#> 2: ENSE00001843998;ENSE00001829482;ENSE00003686018;ENSE00001935169
#> 3: ENSE00001866172;ENSE00003662600;ENSE00003560800;ENSE00003516232;ENSE00003531305;ENSE00003488296;ENSE00001883823;ENSE00001858590;ENSE00003562079;ENSE00001844921
#> 4: ENSE00001551004;ENSE00003686018;ENSE00001828693
sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BiocParallel_1.45.0 SpliceImpactR_0.99.4 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.6.0 BiocIO_1.21.0
#> [3] bitops_1.0-9 ggplotify_0.1.3
#> [5] tibble_3.3.1 polyclip_1.10-7
#> [7] enrichit_0.1.3 XML_3.99-0.23
#> [9] lifecycle_1.0.5 httr2_1.2.2
#> [11] pwalign_1.7.0 rstatix_0.7.3
#> [13] processx_3.8.7 lattice_0.22-9
#> [15] MASS_7.3-65 backports_1.5.1
#> [17] magrittr_2.0.5 sass_0.4.10
#> [19] rmarkdown_2.31 jquerylib_0.1.4
#> [21] yaml_2.3.12 otel_0.2.0
#> [23] ggtangle_0.1.1 cowplot_1.2.0
#> [25] DBI_1.3.0 RColorBrewer_1.1-3
#> [27] abind_1.4-8 GenomicRanges_1.63.2
#> [29] purrr_1.2.1 BiocGenerics_0.57.0
#> [31] msigdbr_26.1.0 RCurl_1.98-1.18
#> [33] yulab.utils_0.2.4 tweenr_2.0.3
#> [35] rappdirs_0.3.4 aisdk_1.1.0
#> [37] gdtools_0.5.0 IRanges_2.45.0
#> [39] S4Vectors_0.49.1 enrichplot_1.31.5
#> [41] ggrepel_0.9.8 tidytree_0.4.7
#> [43] codetools_0.2-20 DelayedArray_0.37.1
#> [45] DOSE_4.5.1 ggforce_0.5.0
#> [47] tidyselect_1.2.1 aplot_0.2.9
#> [49] farver_2.1.2 matrixStats_1.5.0
#> [51] stats4_4.6.0 Seqinfo_1.1.0
#> [53] GenomicAlignments_1.47.0 jsonlite_2.0.0
#> [55] Formula_1.2-5 systemfonts_1.3.2
#> [57] tools_4.6.0 ggnewscale_0.5.2
#> [59] treeio_1.35.0 PFAM.db_3.22.0
#> [61] Rcpp_1.1.1 glue_1.8.0
#> [63] gridExtra_2.3 SparseArray_1.11.13
#> [65] xfun_0.57 qvalue_2.43.0
#> [67] MatrixGenerics_1.23.0 dplyr_1.2.1
#> [69] withr_3.0.2 BiocManager_1.30.27
#> [71] fastmap_1.2.0 callr_3.7.6
#> [73] digest_0.6.39 R6_2.6.1
#> [75] gridGraphics_0.5-1 GO.db_3.23.1
#> [77] dichromat_2.0-0.1 RSQLite_2.4.6
#> [79] cigarillo_1.1.0 tidyr_1.3.2
#> [81] generics_0.1.4 fontLiberation_0.1.0
#> [83] data.table_1.18.2.1 rtracklayer_1.71.3
#> [85] httr_1.4.8 htmlwidgets_1.6.4
#> [87] S4Arrays_1.11.1 scatterpie_0.2.6
#> [89] pkgconfig_2.0.3 gtable_0.3.6
#> [91] blob_1.3.0 S7_0.2.1
#> [93] XVector_0.51.0 clusterProfiler_4.19.7
#> [95] htmltools_0.5.9 fontBitstreamVera_0.1.1
#> [97] carData_3.0-6 bookdown_0.46
#> [99] scales_1.4.0 Biobase_2.71.0
#> [101] png_0.1-9 ggfun_0.2.0
#> [103] knitr_1.51 reshape2_1.4.5
#> [105] rjson_0.2.23 nlme_3.1-169
#> [107] curl_7.0.0 org.Hs.eg.db_3.23.1
#> [109] cachem_1.1.0 stringr_1.6.0
#> [111] parallel_4.6.0 AnnotationDbi_1.73.1
#> [113] restfulr_0.0.16 pillar_1.11.1
#> [115] grid_4.6.0 vctrs_0.7.2
#> [117] ggpubr_0.6.3 car_3.1-5
#> [119] tidydr_0.0.6 cluster_2.1.8.2
#> [121] evaluate_1.0.5 tinytex_0.59
#> [123] magick_2.9.1 cli_3.6.5
#> [125] compiler_4.6.0 Rsamtools_2.27.1
#> [127] rlang_1.2.0 crayon_1.5.3
#> [129] ggsignif_0.6.4 labeling_0.4.3
#> [131] ps_1.9.2 plyr_1.8.9
#> [133] fs_2.0.1 ggiraph_0.9.6
#> [135] stringi_1.8.7 babelgene_22.9
#> [137] assertthat_0.2.1 Biostrings_2.79.5
#> [139] lazyeval_0.2.3 GOSemSim_2.37.2
#> [141] fontquiver_0.2.1 Matrix_1.7-5
#> [143] patchwork_1.3.2 bit64_4.6.0-1
#> [145] ggplot2_4.0.2 KEGGREST_1.51.1
#> [147] SummarizedExperiment_1.41.1 igraph_2.2.3
#> [149] broom_1.0.12 memoise_2.0.1
#> [151] bslib_0.10.0 ggtree_4.1.1
#> [153] bit_4.6.0 ape_5.8-1
#> [155] gson_0.1.0