% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fdr_calculations.R
\name{calculate_and_add_occupancy_pvals}
\alias{calculate_and_add_occupancy_pvals}
\title{Calculate and add gene occupancy FDR}
\usage{
calculate_and_add_occupancy_pvals(
  binding_data,
  occupancy_df,
  null_model_iterations = 1e+05,
  return_per_replicate_fdr = FALSE,
  plot_diagnostics = FALSE,
  BPPARAM = BiocParallel::bpparam(),
  seed = NULL
)
}
\arguments{
\item{binding_data}{A `GRanges` object of binding profiles, where metadata
columns represent samples. This is typically the `binding_profiles_data`
element from the list returned by `load_data_genes`.}

\item{occupancy_df}{A data frame of gene occupancies, typically the `occupancy`
element from the list returned by `load_data_genes`. It must contain
sample columns and a `nfrags` column.}

\item{null_model_iterations}{Integer. The number of sampling iterations to build the
FDR null model. Higher values are more accurate but slower. Default is 100000.}

\item{return_per_replicate_fdr}{Logical. Returns individual FDR scores by applying BH
adjustment on each individual sample.  Use this option if not intending
to apply downstream condition-level p-value aggregation via
`differential_binding()`.  (Default: FALSE)}

\item{plot_diagnostics}{Logical.  If TRUE, will plot the Tier 2 regression fits
for the p-value prediction slope, intercept and mean squared error (MSE).
(Default: FALSE)}

\item{BPPARAM}{A `BiocParallelParam` instance specifying the parallel backend
to use for computation. See `BiocParallel::bpparam()`.}

\item{seed}{An optional integer. If provided, it is used to set the random
seed before the sampling process begins, ensuring that the FDR calculations
are fully reproducible. If `NULL` (default), results may vary between runs.}
}
\value{
An updated version of the input `occupancy_df` data frame, with new columns
appended. For each sample column (e.g., `SampleA`), a corresponding p-value
column (e.g., `SampleA_pval`) is added.
}
\description{
This function calculates a p-value for gene occupancy scores.  When adjusted
for multiple-hypothesis testing, the resulting FDR may used as a proxy for
determining whether a gene is actively expressed in RNA Polymerase TaDa
experiments. The function applies a permutation-based null model to each
sample's binding profile to determine empirical p-values, and returns these
as new columns in the input occupancy dataframe.

These p-values are then aggregated and
}
\details{
The algorithm used here is substantially revised and adapted from the
method described in Southall et al. (2013), Dev Cell, 26(1), 101-12.  The
broad principle is as follows:
\enumerate{
  \item For each sample, a null distribution of mean occupancy scores is
    simulated by repeatedly sampling random fragments from the genome-wide
    binding profile. This is done for various numbers of fragments.
  \item A two-tiered regression model is fitted to the simulation results. This
    creates a statistical model that can predict the empirical p-value for any given
    occupancy score and fragment count.
  \item This predictive model is then applied to the actual observed mean
    occupancy and fragment count for each gene in the `occupancy_df`.
  \item The final set of p-values is adjusted using BH correction to generate
    FDR scores for each gene.
  \item The FDR value for each gene in each sample is appended to the
    `occupancy_df` in a new column.
}

Key differences from the original Southall et al. algorithm include fitting
natural spline models, using WLS to account for heterscedasticity of models,
and sampling with replacement to generate the underlying null distribution.

This function is typically not called directly by the user, but is instead
invoked via `load_data_genes(calculate_occupancy_pvals = TRUE)`.
}
\examples{

# Prepare sample binding data (GRanges object)
# Here, we'll load one of the sample files included with the package
# (this is a TF binding profile, which would not normally be used for
# occupancy calculations)
data_dir <- system.file("extdata", package = "damidBind")
bgraph_file <- file.path(data_dir, "Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm.gatc.2L.bedgraph.gz")

# The internal function `import_bedgraph_as_df` is used here for convenience.
# The column name 'score' will be replaced with the sample name.
binding_df <- damidBind:::import_bedgraph_as_df(bgraph_file, colname = "Sample_1")

binding_gr <- GenomicRanges::GRanges(
    seqnames = binding_df$chr,
    ranges = IRanges::IRanges(start = binding_df$start, end = binding_df$end),
    Sample_1 = binding_df$Sample_1
)

# Create a mock occupancy data frame.
# In a real analysis, this would be generated by `calculate_occupancy()`.
mock_occupancy <- data.frame(
    name = c("geneA", "geneB", "geneC"),
    nfrags = c(5, 10, 2),
    Sample_1 = c(1.5, 0.8, -0.2),
    row.names = c("geneA", "geneB", "geneC")
)

# Calculate pvals.
# We use a low null_model_iterations for speed, and a seed for reproducibility.
occupancy_with_pvals <- calculate_and_add_occupancy_pvals(
    binding_data = binding_gr,
    occupancy_df = mock_occupancy,
    null_model_iterations = 100,
    BPPARAM = BiocParallel::SerialParam(),
    seed = 42
)

# View the result, which now includes a _pval column for Sample_1.
print(occupancy_with_pvals)


}
\seealso{
[load_data_genes()] which uses this function internally.
}
