% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/defineCompounds.R, R/runHypergeom.R,
%   R/runDiffusion.R, R/runPagerank.R, R/enrich.R
\name{enrich-funs}
\alias{enrich-funs}
\alias{defineCompounds}
\alias{runHypergeom}
\alias{runDiffusion}
\alias{runPagerank}
\alias{enrich}
\title{Functions to map and enrich a list of metabolites}
\usage{
defineCompounds(compounds = NULL, compoundsBackground = NULL,
    data = NULL)

runHypergeom(object = NULL, data = NULL, p.adjust = "fdr")

runDiffusion(object = NULL, data = NULL, approx = "normality",
    t.df = 10, niter = 1000)

runPagerank(object = NULL, data = NULL, approx = "normality",
    dampingFactor = 0.85, t.df = 10, niter = 1000)

enrich(compounds = NULL, compoundsBackground = NULL,
    methods = listMethods(), loadMatrix = "none", approx = "normality",
    t.df = 10, niter = 1000, databaseDir = NULL, internalDir = TRUE,
    data = NULL, ...)
}
\arguments{
\item{compounds}{Character vector containing the 
KEGG IDs of the compounds considered as affected}

\item{compoundsBackground}{Character vector containing the KEGG IDs of 
the compounds that belong to the background. Can be \code{NULL} for the 
default background (all compounds)}

\item{data}{FELLA.DATA object}

\item{object}{FELLA.USER object}

\item{p.adjust}{Character passed to the 
\code{\link[stats]{p.adjust}} method}

\item{approx}{Character: "simulation" for Monte Carlo, "normality", 
"gamma" or "t" for parametric approaches}

\item{t.df}{Numeric value; number of degrees of freedom 
of the t distribution 
if the approximation \code{approx = "t"} is used}

\item{niter}{Number of iterations (permutations) 
for Monte Carlo (\code{approx = "simulation"}); 
must be a numeric value between 1e2 and 1e5}

\item{dampingFactor}{Numeric value strictly between 0 and 1, 
damping factor \code{d} for 
PageRank (\code{\link[igraph:page_rank]{page.rank}})}

\item{methods}{Character vector, containing some of: 
\code{"hypergeom"}, \code{"diffusion"}, \code{"pagerank"}}

\item{loadMatrix}{Character vector to choose if 
heavy matrices should be loaded. 
Can contain: \code{"diffusion"}, \code{"pagerank"}}

\item{databaseDir}{Character, path to load the 
\code{\link{FELLA.DATA}} object if
it is not already passed through the argument \code{data}}

\item{internalDir}{Logical, is the directory located 
in the package directory?}

\item{...}{Further arguments for the enrichment function(s) 
\code{runDiffusion}, \code{runPagerank}}
}
\value{
\code{defineCompounds} returns 
the \code{\link{FELLA.USER}} object 
with the mapped metabolites, ready to be enriched.

\code{runHypergeom} returns a 
\code{\link{FELLA.USER}} object 
updated with the hypergeometric test results.

\code{runDiffusion} returns a 
\code{\link{FELLA.USER}} object 
updated with the diffusion enrichment results.

\code{runPagerank} returns a 
\code{\link[FELLA]{FELLA.USER}} object 
updated with the PageRank enrichment results.

\code{enrich} returns a 
\code{\link{FELLA.USER}} object 
updated with the desired enrichment results if 
the \code{\link{FELLA.DATA}} was supplied. 
Otherwise, a list with the freshly loaded  
\code{\link{FELLA.DATA}} object and the 
corresponding enrichment in the 
\code{\link{FELLA.USER}} object.
}
\description{
Function \code{defineCompounds} creates a 
\code{\link{FELLA.USER}} object from a list of 
compounds and a \code{\link{FELLA.DATA}} object.

Functions \code{runHypergeom}, 
\code{runDiffusion} and \code{runPagerank} 
perform an enrichment on a \code{\link{FELLA.USER}} with 
the mapped input metabolites 
(through \code{defineCompounds}) 
and a \code{\link{FELLA.DATA}} object. 
They are based on the hypergeometric test, the heat diffusion model 
and the PageRank algorithm, respectively. 

Function \code{enrich} is a wrapper that calls, 
in this order: 
\code{loadKEGGdata} (optional), 
\code{defineCompounds} and one or more of 
\code{runHypergeom}, \code{runDiffusion} 
and \code{runPagerank}.
}
\details{
Function \code{defineCompounds} maps the 
specified list of KEGG compounds [Kanehisa, 2017], usually from an 
experimental metabolomics study, to the graph contained in the
\code{\link{FELLA.DATA}} object. 
Importantly, the names must be KEGG ids, so other formats 
(common names, HMDB ids, etc.) must be mapped to KEGG first, 
for example through the "Compound ID Conversion" 
tool in MetaboAnalyst [Xia, 2015].
The user can also define a personalised background as a 
list of KEGG compound ids, which should be more extensive than 
the list of input metabolites. 
Once the compounds are mapped, the enrichment 
can be performed through \code{runHypergeom}, 
\code{runDiffusion} and \code{runPagerank}.

Function \code{runHypergeom} performs an over-representation analysis 
through the hypergeometric test [Fisher, 1935] on a 
\code{\link{FELLA.USER}} object with mapped metabolites 
and a \code{\link{FELLA.DATA}} object. 
If a custom background was specified, it will be used. 
This approach is included for completeness and it is not the 
main purpose behind the \code{\link{FELLA}} package. 
Importantly, \code{runHypergeom} is not a hypergeometric test using the 
original KEGG pathways. 
Instead, a compound "belongs" to a "pathway" if 
it can reach the original pathway in the 
upwards-directed KEGG graph. 
This is a way to evaluate enrichment including indirect connections 
to a pathway, e.g. through an enzymatic family. 
The new "pathways" are expected to be larger than the original pathways
in this analysis, and therefore the results can differ from those of a 
standard over-representation analysis.
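
As an illustration of the underlying test only, the following 
base R sketch computes an over-representation p-value with 
\code{\link[stats:Hypergeometric]{phyper}}. 
All the counts and the two extra p-values are hypothetical, 
and the code is not meant to reproduce the internals of 
\code{runHypergeom}.

\preformatted{
## Hypothetical counts, for illustration only
n.background <- 1000  # compounds in the background
n.pathway <- 50       # background compounds reaching the "pathway"
n.input <- 20         # input metabolites
n.overlap <- 5        # input metabolites reaching the "pathway"

## P(X >= n.overlap) under the hypergeometric null
p.value <- phyper(
    q = n.overlap - 1, 
    m = n.pathway, 
    n = n.background - n.pathway, 
    k = n.input, 
    lower.tail = FALSE)

## Multiple testing correction over several "pathways", 
## here with two made-up additional p-values
p.adjust(c(p.value, 0.2, 0.8), method = "fdr")
}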

Function \code{runDiffusion} performs 
the diffusion-based enrichment on a 
\code{\link{FELLA.USER}} object with mapped metabolites 
and a \code{\link{FELLA.DATA}} object [Picart-Armada, 2017]. 
If a custom background was specified, it will be used. 
Heat diffusion uses the 
finite difference formulation of the heat equation to 
propagate labels from the metabolites to the rest of the graph.

Following the notation in [Picart-Armada, 2017], 
the temperatures (diffusion scores) 
are computed as:

\deqn{
T = -KI^{-1} \cdot G 
}{
T = -KI^(-1)*G 
}

\code{G} is an indicator vector of the input metabolites 
(\code{1} if input metabolite, \code{0} otherwise).
\code{KI} is the matrix defined by \code{-KI = L + B}, where 
\code{L} is the unnormalised graph Laplacian and 
\code{B} is the diagonal matrix with \code{B[i,i] = 1} if 
node \code{i} is a pathway and \code{B[i,i] = 0} otherwise.

Equivalently, with the notation in the HotNet approach [Vandin, 2011], 
the stationary temperature is named \code{fs}:

\deqn{
f^s = L_{\gamma}^{-1} \cdot b^s 
}{
fs = Lgamma^(-1)*bs 
}

\code{bs} is the indicator vector \code{G} from above. 
\code{Lgamma}, on the other hand, is found as 
\code{Lgamma = L + gamma*I}, where \code{L} is the unnormalised 
graph Laplacian, \code{gamma} is the first order leaking rate 
and \code{I} is the identity matrix. 
In our formulation, only the pathway nodes are allowed to leak, 
therefore \code{I} is switched to \code{B}. 
The parameter \code{gamma} is set to \code{gamma = 1}.
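
The formula can be tried on a toy graph with base R. 
The sketch below assumes a hypothetical undirected path with 
three nodes (compound, reaction, pathway) and is not 
taken from the package internals. 
Since \code{-KI = L + B}, the temperatures are the solution of 
the linear system \code{(L + B)*T = G}.

\preformatted{
## Hypothetical toy graph: compound (1) - reaction (2) - pathway (3)
A <- matrix(0, 3, 3)
A[1, 2] <- A[2, 1] <- 1
A[2, 3] <- A[3, 2] <- 1

L <- diag(rowSums(A)) - A   # unnormalised graph Laplacian
B <- diag(c(0, 0, 1))       # only the pathway node leaks
G <- c(1, 0, 0)             # the compound is the only input

## Solve (L + B)*T = G instead of inverting the matrix
temp <- solve(L + B, G)
temp
}

In this toy case the temperatures are 3, 2 and 1: 
they decrease from the input compound towards the pathway, 
which is the only node that evacuates flow.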

The input metabolites are forced to stay warm, 
propagating flow to all the nodes in the network. 
However, only pathway nodes are allowed to evacuate 
this flow, so that its directionality is bottom-up. 
Further details on the setup of the diffusion process can be 
found in the supplementary file S2 from [Picart-Armada, 2017].

Finally, the warmest nodes in the graph are reported as 
the relevant sub-network. 
This will probably include some input metabolites and 
also reactions, enzymes, modules and pathways. 
Other metabolites can be suggested as well.

Function \code{runPagerank} performs the random walk 
based enrichment on a 
\code{\link{FELLA.USER}} object with mapped metabolites 
and a \code{\link{FELLA.DATA}} object.
If a custom background was specified, it will be used. 
PageRank was originally conceived as a scoring system for websites 
[Page, 1999]. 
Intuitively, PageRank favours nodes that 
(1) have a large number of nodes pointing 
at them, and (2) are pointed at by nodes that also have high scores. 
Classical PageRank is formulated in terms of a random walker: 
the PageRank of a given node is the stationary probability 
of the walker visiting it. 

The walker chooses, in each step, 
whether to continue the random walk with probability 
\code{dampingFactor} or to restart it with probability 
\code{1 - dampingFactor}. 
In the original publication, \code{dampingFactor = 0.85}, 
which is the value used in \code{FELLA} by default. 
If the walker continues, an edge is picked from the outgoing edges 
of the current node with a probability proportional to its weight. 
If the walker restarts, a node is uniformly picked from the 
whole graph. 
The "personalised PageRank" variant allows a user-defined 
distribution as the source of new random walks. 
The R package \code{igraph} contains such a variant in its 
\code{\link[igraph:page_rank]{page.rank}} function [Csardi, 2006].

As described in the supplement S3 from [Picart-Armada, 2017], 
the PageRank \code{PR} can be computed as 
a column vector by imposing a stationary 
state in the probability.
With a damping factor \code{d} and the user-defined 
distribution \code{p} as a column vector:

\deqn{
\textrm{PR} = d\cdot M\cdot \textrm{PR} + (1 - d)\cdot p
}{
PR = d*M*PR + (1 - d)*p
}

\code{M} is the matrix whose element \code{M[i,j]} is the 
probability of transitioning from \code{j} to \code{i}. 
If node \code{j} has outgoing edges, their probability is proportional 
to their weight (all weights must be positive). 
If node \code{j} has no outgoing edges, the probability is 
uniform over all the nodes, i.e. \code{M[i,j] = 1/nrow(M)} 
for every \code{i}. 
Note that all the columns of \code{M} sum up to exactly \code{1}.
This leads to an expression to compute PageRank:

\deqn{
\textrm{PR} = (1 - d)\cdot(I - dM)^{-1}\cdot p
}{
PR = (1 - d)*(I - d*M)^(-1)*p
}
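
As a rough illustration, the stationary equation can be solved 
on a toy graph with base R. 
Rearranging \code{PR = d*M*PR + (1 - d)*p} gives the linear system 
\code{(I - d*M)*PR = (1 - d)*p}. 
The three-node graph and the choice of \code{p} below are 
hypothetical, and the sketch does not reproduce the package internals.

\preformatted{
## Hypothetical directed graph: 1 -> 2, 1 -> 3, 2 -> 3
## A[i, j] > 0 encodes an edge from j to i
## Node 3 has no outgoing edges
A <- matrix(0, 3, 3)
A[2, 1] <- 1
A[3, 1] <- 1
A[3, 2] <- 1

## Column-normalise; nodes without outgoing edges jump uniformly
out <- colSums(A)
M <- sweep(A, 2, ifelse(out > 0, out, 1), "/")
M[, out == 0] <- 1/nrow(M)
colSums(M)   # every column sums up to 1

d <- 0.85
p <- c(1, 0, 0)   # new random walks always start at node 1

## Solve (I - d*M)*PR = (1 - d)*p
PR <- solve(diag(3) - d*M, (1 - d)*p)
PR
sum(PR)      # PageRank is a probability distribution
}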

The idea behind the method \code{"pagerank"} is closely related 
to \code{"diffusion"}. 
Relevant metabolites are the sources of new random walks and 
nodes are scored through their PageRank. 
Specifically, \code{p} is set to a uniform probability on the 
input metabolites. 
More details on the setup can be found in 
the supplementary file S3 from [Picart-Armada, 2017].

There is an important detail for \code{"diffusion"} 
and \code{"pagerank"}: the scores are statistically normalised. 
Omitting this normalisation leads to a systematic bias, 
especially in pathway nodes, as described in [Picart-Armada, 2017]. 

Therefore, in both cases, scores undergo a normalisation 
through permutation analysis. 
The score of a node \code{i} is compared to its null distribution 
under input permutation, leading to its p-score. 
As described in [Picart-Armada, 2017], two alternatives are offered: 
a parametric and deterministic approach 
and a non-parametric, stochastic one.

Stochastic Monte Carlo trials (\code{"simulation"}) imply 
randomly permuting the input \code{niter} times and counting, 
for each node \code{i}, how many trials 
led to an equally or more extreme value than the original score. 
An empirical p-value is returned [North, 2002].
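
The procedure can be sketched in base R as follows. 
The toy graph, the background and the helper \code{scoreNodes} 
are hypothetical, and the estimator \code{(r + 1)/(n + 1)} 
is assumed here because it is the one recommended in [North, 2002]; 
the package internals may differ in the details.

\preformatted{
set.seed(1)

## Hypothetical toy graph: a path with 5 nodes
## Node 5 plays the role of a pathway
n <- 5
A <- matrix(0, n, n)
A[cbind(1:4, 2:5)] <- 1
A <- A + t(A)
minusKI <- diag(rowSums(A)) - A + diag(c(0, 0, 0, 0, 1))

## Hypothetical scoring function: diffusion temperatures
scoreNodes <- function(input) {
    G <- numeric(n)
    G[input] <- 1
    solve(minusKI, G)
}

background <- 1:3   # nodes that can be part of the input
observed <- scoreNodes(c(1, 2))

## Null scores: one column per permuted input
niter <- 1000
null <- replicate(niter, scoreNodes(sample(background, 2)))

## Trials with an equally or more extreme score, per node
r <- rowSums(null >= observed)
p.emp <- (r + 1)/(niter + 1)
p.emp
}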

On the other hand, the parametric 
scores (\code{approx = "normality"}) 
give a z-score for such permutation analysis. 
The expected value and variance of such null distributions 
are known quantities, see supplementary 
file S4 from [Picart-Armada, 2017].
To work in the same range \code{[0,1]}, z-scores are 
transformed using the routine \code{\link[stats:Normal]{pnorm}}. 
The user can also choose the Student's t distribution by setting 
\code{approx = "t"} and choosing a number of degrees of freedom 
through \code{t.df}. 
This uses the function \code{\link[stats:TDist]{pt}} instead.
Alternatively, a gamma distribution can be used by setting 
\code{approx = "gamma"}. 
The theoretical mean (E) and variance (V) 
are used to define the shape 
(E^2/V) and scale (V/E) of the gamma distribution, and 
\code{\link[stats:GammaDist]{pgamma}} to map to [0,1].
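
The three parametric options can be compared in a short 
base R sketch. 
The score, mean and variance are made-up numbers, and the use of 
the upper tail (so that high scores map to low p-scores) is an 
assumption of this sketch rather than a statement about 
the package internals.

\preformatted{
## Hypothetical values, for illustration only
score <- 3    # observed score of a node
E <- 1        # mean of its null distribution
V <- 0.5      # variance of its null distribution
t.df <- 10

z <- (score - E)/sqrt(V)

## Upper tails: assumption so that high scores give low p-scores
p.norm <- pnorm(z, lower.tail = FALSE)
p.t <- pt(z, df = t.df, lower.tail = FALSE)
p.gamma <- pgamma(
    score, 
    shape = E^2/V, 
    scale = V/E, 
    lower.tail = FALSE)

c(normality = p.norm, t = p.t, gamma = p.gamma)
}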

Any sub-network prioritised by \code{"diffusion"} 
and \code{"pagerank"} is selected by applying 
a threshold on the p-scores.

Finally, the function \code{enrich} 
is a wrapper to perform the enrichment analysis. 
If no \code{\link{FELLA.DATA}} object is supplied, 
it loads it, maps the affected compounds and performs 
the desired enrichment(s) with a single call.
In that case, it returns a list with the loaded 
\code{\link{FELLA.DATA}} object 
and the results in a \code{\link{FELLA.USER}} object. 
Conversely, the user can supply the 
\code{\link{FELLA.DATA}} object and the wrapper 
will map the metabolites and run the desired enrichment 
method(s). 
In this case, only the \code{\link{FELLA.USER}} 
will be returned.
}
\examples{
## Load the internal database. 
## This one is a toy example!
## Do not use as a regular database
data(FELLA.sample)
## Load a list of compounds to enrich
data(input.sample)

######################
## Example, step by step

## First, map the compounds
obj <- defineCompounds(
compounds = c(input.sample, "I_dont_map", "me_neither"), 
data = FELLA.sample)
obj
## See the mapped and unmapped compounds
getInput(obj)
getExcluded(obj)
## Compounds are already mapped 
## We can enrich using any method now

## If no compounds are mapped an error is thrown. Example:
\dontrun{
data(FELLA.sample)
obj <- defineCompounds(
compounds = c("C00049", "C00050"), 
data = FELLA.sample)}

## Enrich using hypergeometric test
obj <- runHypergeom(
object = obj, 
data = FELLA.sample)
obj

## Enrich using diffusion
## Note how the results are added;  
## the hypergeometric results are not overwritten
obj <- runDiffusion(
object = obj, 
approx = "normality", 
data = FELLA.sample)
obj

## Enrich using PageRank
## Again, this does not overwrite other methods 
obj <- runPagerank(
object = obj, 
approx = "simulation", 
data = FELLA.sample)
obj
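
## Sketch: Student's t approximation instead of "normality"
## Only arguments documented above are used;  
## the result is stored apart so that obj is left untouched
obj.t <- runDiffusion(
object = obj, 
approx = "t", 
t.df = 10, 
data = FELLA.sample)
obj.t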

######################
## Example using the "enrich" wrapper

## Only diffusion
obj.wrap <- enrich(
compounds = input.sample, 
methods = "diffusion", 
data = FELLA.sample)
obj.wrap

## All the methods
obj.wrap <- enrich(
compounds = input.sample, 
methods = FELLA::listMethods(), 
data = FELLA.sample)
obj.wrap

}
\references{
Kanehisa, M., Furumichi, M., Tanabe, 
M., Sato, Y., & Morishima, K. (2017). 
KEGG: new perspectives on genomes, pathways, diseases and drugs. 
Nucleic acids research, 45(D1), D353-D361.

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). 
MetaboAnalyst 3.0 - making metabolomics more meaningful. 
Nucleic acids research, 43(W1), W251-W257.

Fisher, R. A. (1935). The logic of inductive inference.
Journal of the Royal Statistical Society, 98(1), 39-82. 

Picart-Armada, S., Fernandez-Albert, F., Vinaixa, 
M., Rodriguez, M. A., Aivio, S., Stracker, 
T. H., Yanes, O., & Perera-Lluna, A. (2017). 
Null diffusion-based enrichment for metabolomics data. 
PLOS ONE, 12(12), e0189012.

Vandin, F., Upfal, E., & Raphael, B. J. (2011). 
Algorithms for detecting significantly mutated pathways 
in cancer. Journal of Computational Biology, 18(3), 507-522.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). 
The PageRank citation ranking: Bringing order to the web. 
Stanford InfoLab.

Csardi, G., & Nepusz, T. (2006). The igraph software package 
for complex network research. 
InterJournal, Complex Systems, 1695(5), 1-9.

North, B. V., Curtis, D., & Sham, P. C. (2002). 
A note on the calculation of empirical P values 
from Monte Carlo procedures. 
American journal of human genetics, 71(2), 439.
}
