% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/clean_data.R
\name{clean_data}
\alias{clean_data}
\title{Retrieve, Clean, and Format Input Data}
\usage{
clean_data(data_file, if_aa, organism)
}
\arguments{
\item{data_file}{Path to the input file}

\item{if_aa}{Logical value indicating whether the input file contains
amino acid sequences: TRUE if sequences are present, FALSE if only IDs are
present}

\item{organism}{String indicating whether the transcripts are from human or
mouse}
}
\value{
A data frame containing gene names, transcript IDs, and APPRIS
annotations for the given data. If sequences were provided, the data frame
also contains the amino acid sequences. If only IDs were provided, the data
frame also contains the UniProt Swiss-Prot ID, UniProt Swiss-Prot
isoform ID, and UniProt TrEMBL ID.
}
\description{
This function cleans and formats input data. Cleaning and formatting
involve removing any non-protein-coding transcripts, removing any
principal transcripts, and standardizing all column names.
The function also retrieves the APPRIS annotation of each transcript from
Ensembl and, if only IDs are provided, its UniProt IDs. Provided data can
follow one of two formats: the first contains only transcript IDs and gene
names, and the second contains a unique transcript identifier, gene
names, and amino acid sequences. The function returns a data frame
containing the transcript IDs, gene names, and APPRIS annotation for each
input transcript. If amino acid sequences are included in the input
data, they are also included in the data frame. If only gene names and
transcript IDs are provided, UniProt IDs are included in the data frame.
}
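\examples{
\dontrun{
# A minimal usage sketch, not taken from the package sources. It illustrates
# the two input formats described above. The file paths are hypothetical
# placeholders, and the organism value "human" is an assumption: check the
# function source for the exact strings it accepts.

# Format 1: the input holds only transcript IDs and gene names, so if_aa is
# FALSE and the result should include the UniProt ID columns.
ids_only <- clean_data("path/to/transcript_ids_file", if_aa = FALSE,
                       organism = "human")

# Format 2: the input also holds amino acid sequences, so if_aa is TRUE and
# the result should include the sequences instead.
with_seqs <- clean_data("path/to/sequences_file", if_aa = TRUE,
                        organism = "human")

# Inspect the standardized column names of each result.
names(ids_only)
names(with_seqs)
}
}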
