\name{enrichedPairs}
\alias{enrichedPairs}

\title{Collect local enrichment statistics for bin pairs}
\description{Compute the counts for a variety of local neighborhoods around each bin pair, for use in computing peak enrichment statistics.}

\usage{
enrichedPairs(data, flank=5, exclude=0, assay.in=1, assay.out=NULL)
}

\arguments{
\item{data}{an InteractionSet object containing bin pair counts, generated by \code{\link{squareCounts}}}
\item{flank}{an integer scalar, specifying the number of bins to consider as the local neighborhood}
\item{exclude}{an integer scalar, specifying the number of bins to exclude from the neighborhood}
\item{assay.in}{a string or integer scalar, specifying the assay containing bin pair counts in \code{data}}
\item{assay.out}{a character vector containing 4 unique names for the neighborhood regions A-D (see below); if \code{NULL}, the region names listed below are used}
}

\section{Definition of the neighborhoods}{
Consider the coordinates of the interaction space in terms of bins, and focus on any particular bin pair (referred to here as the target bin pair).
Four neighborhood regions, A to D, are defined around this target bin pair.
Region A (named \code{"quadrant"}) is a square with side lengths equal to \code{flank}, positioned such that the target bin pair lies at the corner furthest from the diagonal (only used for intra-chromosomal targets).
Region B (named \code{"vertical"}) is a vertical rectangle with dimensions \code{(1, flank*2+1)}, containing the target bin pair at the center.
Region C (named \code{"horizontal"}) is the horizontal counterpart to B.
Region D (named \code{"surrounding"}) is a square with side lengths equal to \code{flank*2+1}, where the target bin pair is positioned in the center.

% Segments C and D might be better than separate up/down/left/right neighborhoods, as those run the risk of
% being too aggressive if the peak spills over into those neighborhoods. Spillover is diluted a bit more
% when you consider up/down together or left/right together, which reduces the problem.
%
% Similarly, segment A is probably better than looking at each quadrant separately, as it dilutes out any
% other peaks in the neighborhood. It fails to account for off-diagonal structural features, but there
% don't seem to be that many of them anyway, so maybe that's not an issue.

The target bin pair itself is excluded from each neighborhood.
If \code{exclude} is positive, additional bin pairs closest to the target will also be excluded.
For example, a smaller region A* is constructed in the same manner as region A but with \code{exclude} in place of \code{flank}, and the bin pairs in A* are removed from region A (and similarly for all other regions).
This avoids problems where diffuse interactions are imperfectly captured by the target bin pair, such that genuine interactions spill over into the neighborhood.
Spill-over is undesirable as it will inflate the neighborhood counts for genuine interactions.
Setting a larger \code{exclude} reduces the risk of this occurring.
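
As a rough worked example of the definitions above, consider region D with \code{flank=5} and \code{exclude=1}.
This is only an illustrative sketch: it assumes that the neighborhood is not truncated by the edges of the interaction space or by the diagonal.
The full square contains \eqn{(2 \times 5 + 1)^2 = 121}{(2*5+1)^2 = 121} bin pairs, while the excluded square D* contains \eqn{(2 \times 1 + 1)^2 = 9}{(2*1+1)^2 = 9} bin pairs, including the target itself.
This leaves \eqn{121 - 9 = 112}{121 - 9 = 112} bin pairs in the neighborhood.
With \code{exclude=0}, only the target bin pair is removed, leaving \eqn{121 - 1 = 120}{121 - 1 = 120} bin pairs.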

The size of \code{flank} requires consideration, as it defines the size of each neighborhood region.
If the value is too large, other peaks may be included in the background such that the neighborhood count is inflated.
On the other hand, if \code{flank} is too small, there will not be enough neighborhood bin pairs to dilute the increase in counts from spill-over.
Both scenarios result in a decrease in enrichment values and loss of power to detect punctate events.
The default value of 5 seems to work well, though users may wish to test several values for themselves.

For each bin pair, the other bin pairs in \code{data} that belong to its neighborhood are identified.
The sum of counts across these bin pairs is computed for each library and stored in a matrix.
This is repeated for each type of neighborhood (A-D), and the matrices are named based on \code{assay.out}.
The area of each neighborhood is also computed, defined as the number of bin pairs contained in the neighborhood.
Note that the neighborhood area includes bin pairs that are missing from \code{data}, as these are assumed to have a count of zero.
See \code{\link{filterPeaks}} for how these neighborhood counts are used to assess the \dQuote{peak-ness} of each bin pair.
}

\value{
An object of the same type as \code{data} is returned, containing additional matrices in the \code{assays} slot.
Each matrix contains the counts for one neighborhood for each bin pair in each library.
The area of each neighborhood is also stored in the \code{mcols} of the output, in fields named with the \code{"N."} prefix.
}

\author{
Aaron Lun
}

\seealso{
\code{\link{squareCounts}},
\code{\link{neighborCounts}},
\code{\link{filterPeaks}}
}

\examples{
# Setting up the object.
a <- 10
b <- 20
regions <- GRanges(rep(c("chrA", "chrB"), c(a, b)), IRanges(c(1:a, 1:b), c(1:a, 1:b)))

set.seed(23943)
all.anchor1 <- sample(length(regions), 50, replace=TRUE)
all.anchor2 <- as.integer(runif(50, 1, all.anchor1+1))
data <- InteractionSet(matrix(as.integer(rnbinom(200, mu=10, size=10)), 50, 4), 
    GInteractions(anchor1=all.anchor1, anchor2=all.anchor2, 
        regions=regions, mode="reverse"), 
    colData=DataFrame(lib.size=1:4*1000), metadata=List(width=1))
data$totals <- colSums(assay(data))

# Collecting neighborhood counts.
head(enrichedPairs(data))
head(enrichedPairs(data, flank=3))
head(enrichedPairs(data, flank=1))
head(enrichedPairs(data, exclude=1))
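
# Inspecting the output. This is only a sketch that continues from the
# 'data' object above; 'out' is a name introduced here for illustration,
# and the exact names printed depend on 'assay.out'.
out <- enrichedPairs(data)
assayNames(out)      # the input assay plus one matrix per neighborhood
colnames(mcols(out)) # neighborhood areas are stored under the "N." prefix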
}

\references{
Rao S et al. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. \emph{Cell}. 159, 1665-1680.
}

\keyword{filtering}
