% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/MapAssessmentData.R
\name{MapAssessmentData}
\alias{MapAssessmentData}
\title{Map Evidence to a Genome}
\usage{
MapAssessmentData(genomes_DBFile,
                  tblName = "Seqs",
                  central_ID,
                  related_IDs,
                  protHits_Seqs,
                  protHits_Scores = rep.int(1, length(protHits_Seqs)),
                  strainID = "",
                  speciesName = "",
                  protHits_Threshold = 0,
                  protHits_IsNTerm = FALSE,
                  related_KMerLen = 8,
                  related_MinDist = 0.01,
                  related_MaxDistantN = 1000,
                  startCodons = c("ATG", "GTG", "TTG"),
                  ema_AlphaVal = 0.1,
                  ema_MinVal = 0.6,
                  useProt = TRUE,
                  useCons = TRUE,
                  processors = 1,
                  verbose = TRUE)
}
\arguments{
\item{genomes_DBFile}{A SQLite connection object or a character string specifying the path to the database file.}

\item{tblName}{Character string specifying the table where the genome sequences are located.}

\item{central_ID}{Character string specifying which identifier corresponds to the central genome, the genome
to which the proteomics data and evolutionary conservation data will be mapped.}

\item{related_IDs}{Character vector of strings specifying identifiers that correspond to related genomes, the genomes
that will be used to determine which start codons (ATG, GTG, and TTG) are evolutionarily conserved.}

\item{protHits_Seqs}{Character vector of amino acid strings that correspond to the sequences for the proteomics hits.}

\item{protHits_Scores}{Numeric vector of (confidence) scores for the proteomics hits. Scores cannot be negative.
The default option assigns a score of one to each proteomics hit.}

\item{strainID}{Optional character string that specifies the strain identifier that the central genome corresponds to.}

\item{speciesName}{Optional character string that specifies the name of the species that the central genome corresponds to.}

\item{protHits_Threshold}{Optional number that specifies what percent of the lowest scoring proteomics hits should be dropped.
Must be a non-negative integer less than 100.}

\item{protHits_IsNTerm}{Logical describing whether or not the proteomics hits come from N-terminal proteomics.
Default value is false.}

\item{related_KMerLen}{The k-mer length to be used when measuring distances between the central genome and related genomes.
Default value is 8. Recommended to use the default value.}

\item{related_MinDist}{The minimum fractional distance required for a related genome to be used in finding
evolutionary conservation. Used to prevent the inclusion of related genomes that are too similar to the central genome.
Default value is 0.01. Recommended to use the default value.}

\item{related_MaxDistantN}{The maximum number of related genomes to use in finding evolutionary conservation after the
related genomes have been sorted from most distantly related to most closely related in relation to the central genome.
Default value is 1000.}

\item{startCodons}{A character vector consisting of three-letter DNA strings to use as the start codons when finding
evolutionarily conserved starts.}

\item{ema_AlphaVal}{The alpha value to use when calculating the exponential moving average over an alignment derived
from a synteny map. Default value is 0.1. Recommended to use the default value.}

\item{ema_MinVal}{The minimum exponential moving average value required for an alignment position to be incorporated
into the conservation vectors. Default value is 0.6. Recommended to use the default value.}

\item{useProt}{Logical indicating whether or not proteomics evidence should be mapped to the genome.
Default value is true. Cannot be false if \code{useCons} is false.}

\item{useCons}{Logical indicating whether or not evolutionary conservation evidence should be mapped to the genome.
Default value is true. Cannot be false if \code{useProt} is false.}

\item{processors}{Number describing how many processors to use with DECIPHER functions. Should be either a
positive integer that describes the number of processors to use or NULL to detect and use all available processors.}

\item{verbose}{Logical indicating whether or not to display progress and status messages.}
}
\value{
An object of class \code{Assessment} and subclass \code{DataMap}
}
\description{
Maps proteomics hits and evolutionarily conserved starts to a central genome
}
\details{
\code{MapAssessmentData} maps the given data (either proteomics data, evolutionary conservation data, or both) to the
given central genome and stores those mappings in the object outputted by the function. The object that is outputted can
then be used to assess the quality of genes predicted for that same central genome.

All genomes used inside this function, including the central genome, must be inside the specified table of the specified
database. If the central genome is not found, the function stops with an error. Please see the "Using AssessORF" vignette
for details on how to populate a database with genomic sequences.

Information on the proteomics hits is primarily given by \code{protHits_Seqs} and \code{protHits_Scores}. The sequences
(\code{protHits_Seqs}) are mapped to the six-frame translations of the central genome, and the scores (\code{protHits_Scores})
are used in thresholding and plotting the proteomics hits.

\code{protHits_Scores} can be a single number. In that case, that number is used as the score for all proteomics hits.
Otherwise, \code{protHits_Scores} must be of the same length as \code{protHits_Seqs}.

Only proteomics hits with a score greater than the percentile given by \code{protHits_Threshold} are kept; the rest of the
hits are dropped. If all the proteomics hits have the same score or if \code{protHits_Threshold} is zero, no thresholding
occurs and no hits are dropped.
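
As a rough illustration of the thresholding rule above, the following base R sketch expands a single score to all hits and
drops the hits at or below a percentile cutoff. The variable names and the use of \code{quantile} with its default
interpolation are assumptions made for illustration; the function may compute the cutoff differently internally.
\preformatted{
## Hypothetical inputs (not part of the package)
hitSeqs <- c("MKT", "AVLS", "GHR", "PLW", "QEN")
hitScores <- c(10, 20, 30, 40, 50)
threshold <- 40  # plays the role of protHits_Threshold

## A single score would be recycled to every hit
if (length(hitScores) == 1) hitScores <- rep.int(hitScores, length(hitSeqs))

## Skip thresholding when the threshold is zero or all scores are equal
if (threshold > 0 && length(unique(hitScores)) > 1) {
  cutoff <- quantile(hitScores, probs = threshold / 100, names = FALSE)
  keep <- hitScores > cutoff
} else {
  keep <- rep(TRUE, length(hitScores))
}
keptSeqs <- hitSeqs[keep]
}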

Please note that the logical parameter \code{protHits_IsNTerm} has no effect on how the proteomics evidence is mapped to the central
genome but it can be used to affect how genes are assessed and categorized in \code{AssessGenes}. The \code{NTermProteomics} item in
the outputted object is set to the value of \code{protHits_IsNTerm} (TRUE or FALSE). Users then have the option of requiring that
\code{AssessGenes} specifically perform N-terminal proteomics assessment when categorizing genes via the \code{useNTermProt} parameter
to the \code{AssessGenes} function. To summarize, the \code{protHits_IsNTerm} parameter in the \code{MapAssessmentData} function and
the \code{useNTermProt} parameter in the \code{AssessGenes} function must both be set to TRUE in order to perform N-terminal
proteomics assessment. See \code{\link{AssessGenes}} for more details.

Evolutionarily conserved starts and conserved stops are found by first measuring how far the related genomes are from the central
genome using k-mer frequencies. Next, synteny is mapped between the central genome and each of the most distant related genomes, and
alignments are built from those synteny maps. An exponential moving average (EMA) is calculated over the alignment (based on whether
the central genome is identical to the related genome at that position) to filter out areas of poor alignment. The synteny maps and
filtered alignments provide information on how often each position in the central genome is covered by syntenic matches to related
genomes (coverage), how often those positions correspond to the start codons (start codon conservation) in both genomes, and how
often those positions correspond to stop codons in related genomes (stop codon conservation). A ratio of conservation to coverage is
used in downstream functions to measure the strength of both conserved starts and conserved stops.
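
The EMA filter and the conservation-to-coverage ratio described above can be sketched in base R. This is only an
illustration: the identity vector, the starting value of the average, the single left-to-right pass, and the count vectors
are all assumptions, and the package's actual calculation may differ.
\preformatted{
## Hypothetical per-position identity between the central and a related genome
isIdentical <- c(rep(TRUE, 3), rep(FALSE, 6), rep(TRUE, 3))
alphaVal <- 0.1  # plays the role of ema_AlphaVal
minVal <- 0.6    # plays the role of ema_MinVal

## Exponential moving average, started at 1 (an assumed starting value)
ema <- numeric(length(isIdentical))
prev <- 1
for (i in seq_along(isIdentical)) {
  prev <- alphaVal * isIdentical[i] + (1 - alphaVal) * prev
  ema[i] <- prev
}
usable <- ema >= minVal  # positions that would pass the filter

## Conservation-to-coverage ratio for hypothetical per-position counts
coverage <- c(12, 12, 10, 0)
startCons <- c(12, 6, 1, 0)
consRatio <- ifelse(coverage > 0, startCons / coverage, 0)
}
In this sketch the average falls below the minimum only after a run of mismatches and needs a few matching positions to
recover, so short stretches around a poorly aligned region are also excluded.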

Related genomes should be from species that are closely related to the given strain. \code{related_IDs} specifies the identifiers
for the sequences of the related genomes inside the database. A related genome identifier (each element of \code{related_IDs}) is
considered invalid and not used when finding evolutionary conservation if it is not found in the database. Please note that the function
will only error when none of the related genomes are found.

If there are fewer valid related genomes in the sequence database than the value of \code{related_MaxDistantN}, all valid related genomes
will be used in finding evolutionary conservation.
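
The selection of related genomes can be pictured with the following base R sketch. The distance values are invented, and
the exact comparison used against \code{related_MinDist} (here, greater than or equal to) is an assumption.
\preformatted{
## Hypothetical k-mer distances from the central genome, named by identifier
relDist <- c("2" = 0.004, "3" = 0.35, "4" = 0.12, "5" = 0.60, "6" = 0.02)
minDist <- 0.01   # plays the role of related_MinDist
maxDistantN <- 3  # plays the role of related_MaxDistantN

## Drop genomes that are too similar, then order from most to least distant
eligible <- relDist[relDist >= minDist]
ordered <- sort(eligible, decreasing = TRUE)

## Keep at most maxDistantN genomes; head() returns all of them if fewer exist
selectedIDs <- names(head(ordered, maxDistantN))
}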

The logical flag \code{useProt} is used to indicate whether or not proteomics evidence has been provided and should be mapped to
the genome. Error checking will not occur for any arguments that involve proteomics if it is false.

The logical flag \code{useCons} is used to indicate whether or not evolutionary conservation evidence has been provided and should be
mapped to the genome. Error checking will not occur for any arguments that involve evolutionary conservation if it is false.
}
\examples{

## Example showing the minimum number of arguments that need to be specified
## to map both proteomics and evolutionary conservation data:

\dontrun{
myMapObj <- MapAssessmentData(myDBFile, central_ID = "1",
                              related_IDs = as.character(2:1001),
                              protHits_Seqs = myProtSeqs)
}


## Runnable example that uses evolutionary conservation data only:
## Human adenovirus 1 is the strain of interest, and the set of Adenoviridae
## genomes will serve as the set of related genomes. The central genome, also known as
## the genome of human adenovirus 1, is at identifier 1. The related genomes
## are at identifiers 2 - 13.

myMapObj <- MapAssessmentData(system.file("extdata",
                                          "Adenoviridae.sqlite",
                                          package = "AssessORF"),
                              central_ID = "1",
                              related_IDs = as.character(2:13),
                              speciesName = "Human adenovirus 1",
                              useProt = FALSE)


}
\seealso{
\code{\link{Assessment-class}}
}
