% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/converters.R
\name{MSstatsPreprocessBig}
\alias{MSstatsPreprocessBig}
\title{General converter for larger-than-memory CSV files in 10-column MSstats format}
\usage{
MSstatsPreprocessBig(
  input_file,
  output_file_name,
  backend,
  max_feature_count = 20,
  filter_unique_peptides = FALSE,
  aggregate_psms = FALSE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  connection = NULL
)
}
\arguments{
\item{input_file}{name of the input text file in 10-column MSstats format.}

\item{output_file_name}{name of the output file that will be saved after pre-processing.}

\item{backend}{"arrow" or "sparklyr". Option "sparklyr" requires a Spark installation
and a connection to a Spark instance, provided in the `connection` parameter.}

\item{max_feature_count}{maximum number of features per protein. Features will
be selected based on the highest average intensity.}

\item{filter_unique_peptides}{If TRUE, shared peptides will be removed.
Please refer to the `Details` section for additional information.}

\item{aggregate_psms}{If TRUE, multiple measurements per PSM in a Run will
be aggregated (by taking the maximum value). Please refer to the `Details` section for additional information.}

\item{filter_few_obs}{If TRUE, features with fewer than 3 observations across runs will be removed.
Please refer to the `Details` section for additional information.}

\item{remove_annotation}{If TRUE, columns BioReplicate and Condition will be removed
to reduce output file size. These will need to be added manually later, before
using the dataProcess function. Only applicable to the sparklyr backend.}

\item{connection}{Connection to a Spark instance created with the
`spark_connect` function from the `sparklyr` package.}
}
\value{
Either an arrow object or a sparklyr table that can optionally be collected
into memory with the dplyr::collect function.
}
\description{
General converter for larger-than-memory CSV files in 10-column MSstats format
}
\details{
Filtering and aggregation may be very time-consuming, and the ability
to perform them in a given R session depends on available memory, settings of
external packages, etc. Hence, the related parameters (`filter_unique_peptides`,
`aggregate_psms`, `filter_few_obs`) are all set to FALSE by default and only feature
selection is performed, which saves both computation time and memory.
An appropriately configured Spark backend provides the most consistent way to
perform these operations.
}
\examples{
converted_data <- bigFragPipetoMSstatsFormat(
  system.file("extdata", "fgexample.csv", package = "MSstatsBig"),
  "tencol_format.csv",
  backend="arrow")
procd <- MSstatsPreprocessBig("tencol_format.csv", "proc_out.csv", backend = "arrow")
head(dplyr::collect(procd))
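
# A minimal sketch of enabling the optional filtering and aggregation steps
# described in Details. It reuses "tencol_format.csv" created above. The output
# file name and the max_feature_count value are illustrative assumptions, and
# whether these steps fit in memory depends on the session, so the call is
# wrapped in \dontrun{}.
\dontrun{
procd_filtered <- MSstatsPreprocessBig(
  "tencol_format.csv",
  "proc_out_filtered.csv",        # hypothetical output file name
  backend = "arrow",
  max_feature_count = 10,         # illustrative value; the default is 20
  filter_unique_peptides = TRUE,  # remove shared peptides
  aggregate_psms = TRUE,          # keep the maximum value per PSM in a Run
  filter_few_obs = TRUE)          # drop features with fewer than 3 observations
head(dplyr::collect(procd_filtered))
}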

}
