% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/prepareInputRegions.R
\name{prepareInputRegions}
\alias{prepareInputRegions}
\title{Prepare input data for peakCombiner package}
\usage{
prepareInputRegions(
  data,
  outputFormat = "GenomicRanges",
  genome = NA,
  startsAreBased = 1,
  showMessages = TRUE
)
}
\arguments{
\item{data}{Data frame or GRanges object with the input data. Several
formats are accepted, which are described in full in the Details below.
\itemize{
\item in memory data frame listing each sample's peak file location,
\item in memory data frame listing the peaks themselves that are found in each
sample, or
\item in memory GRanges object listing the peaks themselves that are found in
each sample.
}}

\item{outputFormat}{Character value to define format of output object.
Accepted values are "GenomicRanges" (default), "tibble"
or "data.frame".}

\item{genome}{Character value to define the matching genome reference to
the input data. Default value is NA. Allowed values are
based on GenomicRanges supported genomes like "GRCh38",
"GRCh38.p13", "Amel_HAv3.1", "WBcel235", "TAIR10.1",
"hg38", "mm10", "rn6", "bosTau9", "canFam3", "musFur1",
"galGal6", "dm6", "ce11", and "sacCer3". Please see also the
help for \code{\link[Seqinfo:seqinfo]{Seqinfo::seqinfo()}} for more details.}

\item{startsAreBased}{Either 0, 1 (default), or NA. Defines whether the
provided input data is 0-based or 1-based. If the parameter
is NA, GenomicRanges objects, tibbles and data frames are
considered 1-based, while data loaded from a
sample sheet is considered 0-based (a BED file is
expected).}

\item{showMessages}{Logical value of TRUE (default) or FALSE. Defines whether
info messages are displayed.}
}
\value{
A tibble with the columns \code{chrom}, \code{start}, \code{end}, \code{name}, \code{score},
\code{strand}, \code{center}, \code{sample_name}. The definitions of these columns are
described in full in the Details below. Use as input for functions
\code{\link[=centerExpandRegions]{centerExpandRegions()}}, \code{\link[=filterRegions]{filterRegions()}} and
\code{\link[=combineRegions]{combineRegions()}}.
}
\description{
\link{prepareInputRegions} prepares the input data in the format
needed for all of the following steps within peakCombiner. It accepts the
following formats:
\itemize{
\item in memory data frame listing each sample's peak file location,
\item in memory data frame listing the peaks themselves that are found in each
sample, or
\item in memory GRanges object listing the peaks themselves that are found in
each sample.
}
}
\details{
Accepted inputs are one of the three following options:
\enumerate{
\item In memory data frame listing each sample's peak file location
\itemize{
\item \code{sample_name} -  Unique name for each sample
(required).
\item \code{file_path} -   Path to the file in which the genomic regions are
stored. For example, the path to a bed file or
\code{.narrowPeak} file (required).
\item \code{file_format} -  The expected file format. Needed to correctly label the
columns of the input. Acceptable values are:
\code{bed}, \code{narrowPeak}, and \code{broadPeak} (required).
\item \code{score_colname} - Either the name or the number of the column holding
the metric used to rank peak importance, where bigger
values are more important. Entries have to be identical in all rows;
multiple different entries are not supported. If not provided,
column 9 will be used for \code{.narrowPeak} or
\code{.broadPeak} file formats. Column 9 corresponds to
the \code{qValue} as described in the UCSC documentation
\href{https://genome.ucsc.edu/FAQ/FAQformat.html#format12}{here}.
Other alternatives for \code{narrowPeak} or \code{broadPeak}
could be columns 7 or 8, which correspond to
\code{signalValue} or \code{pValue} (optional).
}
\item In memory data frame listing the peaks themselves that are found in each
sample. The columns can be provided in any order and have the following
names. Note that additional columns will be dropped.
\itemize{
\item \code{chrom} - chromosome name (required).
\item \code{start} - start coordinate of range (1-based coordinate system,
NOT like bed files which are 0-based) (required).
\item \code{end} -   end coordinate of range (required).
\item \code{sample_name} - unique identifier for a sample. No restrictions on
characters (required).
\item \code{score} - the metric used to rank peak importance, where bigger values
are more important. For example, qValue from MACS2,
-log10FDR from another method, or fold enrichment over
background computed from your favorite method. If not
provided, defaults to 0 (optional).
\item \code{strand} - values are '+', '-', or '.'. If not provided, defaults to '.'
(optional).
\item \code{summit} - distance of the strongest signal ("summit") of the peak
region from the start coordinate (optional).
}
\item In memory GRanges object listing the peaks themselves that are found in
each sample. This object is very similar to the data frame above,
except that \code{chrom}, \code{start}, and \code{end} are instead described using
the \code{GRanges} nomenclature. Note that additional columns will be dropped.
}
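As a minimal sketch of the first option, a sample sheet could be built as
follows. The sample names and file paths are hypothetical placeholders, and
the numeric \code{score_colname} simply restates the documented default
(column 9) for \code{narrowPeak} files.

\preformatted{
# Hypothetical sample sheet: one row per sample (paths are placeholders)
sample_sheet <- data.frame(
  sample_name = c("control-rep1", "treatment-rep1"),
  file_path = c(
    "path/to/control_rep1.narrowPeak",
    "path/to/treatment_rep1.narrowPeak"
  ),
  file_format = c("narrowPeak", "narrowPeak"),
  # The same score column is used in every row
  score_colname = c(9, 9)
)
}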

This function parses the inputs provided and returns a data frame having the
columns listed below.
\itemize{
\item \code{chrom} -        chromosome name
\item \code{start} -        start coordinate of range (1-based coordinate system,
NOT like bed files which are 0-based)
\item \code{end} -          end coordinate of range
\item \code{name} -         unique identifier for a region, auto-generated by this
function
\item \code{score} -        the metric used to rank peak importance, where bigger
values are more important. For example, qValue from MACS2,
-log10FDR from another method, or fold enrichment over
background computed from your favorite method
\item \code{strand} -       values are '+', '-', or '.'. Chromatin data are typically
non-stranded, so will have a '.'.
\item \code{center} -       absolute genomic coordinate of the nucleotide at the
center of the peak region, or alternatively the strongest
signal ("summit") of the peak region. If no value is
provided by the user, \code{center} defaults to the arithmetic
center of the peak region.
\item \code{sample_name} -  unique identifier for a sample. No restrictions on
characters
}
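The relationship between \code{summit} and \code{center} described above can
be sketched in base R as follows. This is an illustration under stated
assumptions, not the package's implementation: the values are made up, and the
exact offset and rounding conventions used internally are not specified here.

\preformatted{
# Hypothetical regions; the second one has no summit information
start <- c(101, 501)
end <- c(201, 651)
summit <- c(30, NA)

# With a summit: offset from the start coordinate (assumed convention).
# Without a summit: arithmetic center of the region.
center <- ifelse(is.na(summit), (start + end) / 2, start + summit)
}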

In addition, the input data is checked for multiple entries of the same genomic
region. This can occur when using called peak files, as multiple summits can
be annotated within the same genomic region (defined by \code{chrom}, \code{start}
and \code{end}). To avoid multiple entries, this function checks the input for
multiple summits within the same region and keeps only the most strongly
enriched one (based on the values in the column \code{score}). This step is
mandatory to guarantee an optimal result.
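
The idea of keeping only the highest-scoring entry per region can be sketched
in base R as follows. This is an illustration with made-up values, not the
package's internal code; grouping by \code{sample_name} in addition to the
coordinates is an assumption.

\preformatted{
# Hypothetical peaks: the first two rows describe the same region
peaks <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(100, 100, 500),
  end = c(200, 200, 650),
  score = c(5, 12, 7),
  sample_name = c("s1", "s1", "s1")
)

# Sort by decreasing score, then keep the first row seen for each region
ordered <- peaks[order(-peaks$score), ]
key_cols <- c("chrom", "start", "end", "sample_name")
kept <- ordered[!duplicated(ordered[, key_cols]), ]
}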

Optionally, a genome can already be provided at this step (see the
\code{genome} argument) so that this information is retained for the function
\code{\link[=centerExpandRegions]{centerExpandRegions()}}.
}
\examples{
# Load and prepare an accepted tibble
utils::data(syn_data_tibble)

data_prepared <- prepareInputRegions(
  data = syn_data_tibble,
  outputFormat = "tibble",
  showMessages = TRUE
)
data_prepared

# Or a pre-loaded tibble with genomic regions and named columns.

utils::data(syn_data_control01)
utils::data(syn_data_treatment01)

combined_input <- syn_data_control01 |>
  dplyr::mutate(sample_name = "control-rep1") |>
  rbind(syn_data_treatment01 |>
    dplyr::mutate(sample_name = "treatment-rep1"))

prepareInputRegions(
  data = combined_input,
  outputFormat = "tibble",
  showMessages = FALSE
)

}
