% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/filterRegions.R
\name{filterRegions}
\alias{filterRegions}
\title{Apply user-defined filtering options to genomic regions.}
\usage{
filterRegions(
  data,
  includeByChromosomeName = NULL,
  excludeByBlacklist = NULL,
  includeAboveScoreCutoff = NULL,
  includeTopNScoring = NULL,
  outputFormat = "GenomicRanges",
  showMessages = TRUE
)
}
\arguments{
\item{data}{PeakCombiner data frame structure with required columns
named \code{chrom}, \code{start}, \code{end}, \code{name},
\code{score}, \code{strand}, \code{center}, \code{sample_name}. Additional
columns will be maintained.}

\item{includeByChromosomeName}{\itemize{
\item 'NULL' (default) - No chromosome name filtering will be done.
\item Character vector that contains the chromosome names to be retained.
}}

\item{excludeByBlacklist}{\itemize{
\item 'NULL' (default) - No blacklist filtering will be done.
\item GenomicRanges object (default setup) or data frame/tibble with
columns \code{chrom}, \code{start}, and \code{end}.
}}

\item{includeAboveScoreCutoff}{\itemize{
\item 'NULL' (default) - No score filtering will be done.
\item Single numeric value that defines the \code{score} threshold above
which all genomic regions will be retained. This results in a
variable number of sites per sample.
}}

\item{includeTopNScoring}{\itemize{
\item 'NULL' (default) - No top-N filtering will be done.
\item Single numeric value representing the number of genomic regions
per sample to be retained. The genomic regions are selected from
highest to lowest score. If \code{includeTopNScoring} is greater than the
number of regions, no filtering is done.
}}

\item{outputFormat}{Character value to define format of output object.
Accepted values are "GenomicRanges" (default), "tibble"
or "data.frame".}

\item{showMessages}{Logical value of TRUE (default) or FALSE. Defines if
info messages are displayed or not.}
}
\value{
The filtered regions, returned in the format set by \code{outputFormat}
(a GenomicRanges object by default, or a tibble or data frame), with the
columns \code{chrom}, \code{start}, \code{end}, \code{name}, \code{score},
\code{strand}, \code{center}, \code{sample_name}. These columns are
described in full in the Details section of \link{prepareInputRegions}.
The output can be used as input for the functions \link{centerExpandRegions}
and \link{combineRegions}.
}
\description{
\link{filterRegions} is an optional step that allows
inclusion or exclusion of genomic regions based on four different criteria:
\itemize{
\item Include regions by their chromosome names (optional).
\item Exclude blacklisted regions (optional).
\item Include regions above a given score (optional).
\item Include top n regions per sample, ranked from highest to lowest score
(optional).
}

The accepted input is the PeakCombiner data frame created by the
function \link{prepareInputRegions}.
Please see \link{prepareInputRegions} for more details.

\link{filterRegions} can be used multiple times on the same
data set, which allows a user to optimize the selection criteria for
regions of interest step by step.
}
\details{
This optional step provides commonly needed filters that help to
focus on the key genomic regions of interest. This can be useful
when many genomic regions are identified in your peak-caller output or
input BED files.

\link{filterRegions} can be used multiple times on the same data
set, allowing a user to select regions of interest using a step-wise
optimization approach.
\itemize{
\item \code{includeByChromosomeName} - Retains only genomic regions on
chromosomes that are in the provided vector. By not including
mitochondrial, sex, or non-classical
chromosomes, genomic regions found on
these chromosomes can be removed. If set
to 'NULL' (default), this step will be
skipped (optional).
\item \code{excludeByBlacklist} - A GenomicRanges object, data frame, or tibble
can be provided listing the genomic
regions to remove (with the column names \code{chrom}
(\code{seqnames} for GenomicRanges), \code{start},
and \code{end}). If set to 'NULL'
(default), this step will be skipped
(optional).
Please note that if there are no matching
entries in the \code{chrom} columns of input
and blacklist, an information message is
displayed. This can happen and does not
cause any problems with the script.
\item \code{includeAboveScoreCutoff} -   Single numeric value that defines the
\code{score} threshold above which all genomic
regions will be retained. The \code{score}
column in the peakCombiner input data
should be non-zero for this parameter to
be used. It is populated by
\link{prepareInputRegions}, and
by default takes the value of -log10(FDR)
if possible (e.g., using a .narrowPeak
file from MACS2 as input). Importantly,
applying this filter retains a variable
number of genomic regions per sample, all
having a score greater than the
\code{includeAboveScoreCutoff} parameter. If
set to 'NULL' (default), this step will
be skipped (optional).
\item \code{includeTopNScoring} -        Single numeric value that defines how many
of the top scoring genomic regions (using
the column \code{score}) are retained. All
other genomic regions are discarded.
Importantly, applying this filter retains
\code{includeTopNScoring} regions per
sample, which means that the minimum
enrichment levels may vary between
samples. Note that if multiple genomic
regions have the same \code{score} cutoff
value, then all of those genomic regions
are included. In this case, the number of
resulting regions retained may be a bit
higher than the input parameter. If set to
'NULL' (default), this step will be
skipped (optional).
}
}
\examples{

# Load the example data and prepare an accepted tibble
utils::data(syn_data_bed)

data_prepared <- prepareInputRegions(
  data = syn_data_bed,
  outputFormat = "tibble",
  showMessages = TRUE
)

# Here we set three of the four filtering options; blacklist filtering is
# skipped because excludeByBlacklist = NULL.

filterRegions(
  data = data_prepared,
  includeByChromosomeName = c("chr1", "chr2", "chr4"),
  excludeByBlacklist = NULL,
  includeAboveScoreCutoff = 10,
  includeTopNScoring = 100,
  outputFormat = "tibble",
  showMessages = TRUE
)
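
# A minimal sketch of step-wise filtering, as described above: the output of
# one call is passed to the next call. Using outputFormat = "tibble" for the
# intermediate result is an assumption made here so that it keeps the same
# column structure as the prepared input.

data_chr <- filterRegions(
  data = data_prepared,
  includeByChromosomeName = c("chr1", "chr2", "chr4"),
  outputFormat = "tibble",
  showMessages = FALSE
)

data_top <- filterRegions(
  data = data_chr,
  includeTopNScoring = 100,
  outputFormat = "tibble",
  showMessages = FALSE
)

# Sketch of blacklist filtering with a data frame that has the columns
# chrom, start, and end. The coordinates below are hypothetical and chosen
# only for illustration; they are not a real blacklist.

blacklist_example <- data.frame(
  chrom = c("chr1", "chr2"),
  start = c(1, 1),
  end = c(1000, 1000)
)

filterRegions(
  data = data_prepared,
  excludeByBlacklist = blacklist_example,
  outputFormat = "tibble",
  showMessages = FALSE
)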

}
