% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/genome_download_helper.R
\name{get_genome_gtf}
\alias{get_genome_gtf}
\title{Download genome (fasta), annotation (GTF) and contaminants}
\usage{
get_genome_gtf(
  GTF,
  output.dir,
  organism,
  assembly,
  db,
  gunzip,
  genome,
  optimize = FALSE,
  uniprot_id = FALSE,
  gene_symbols = FALSE,
  pseudo_5UTRS_if_needed = NULL,
  remove_annotation_outliers = TRUE,
  refseq_genbank_format = c("gtf", "gff3")[1]
)
}
\arguments{
\item{GTF}{logical, default: TRUE, download the gtf of the organism specified
in the "organism" argument. If FALSE, check if the downloaded
file already exists. If you want to use a custom gtf from your hard drive,
set GTF = FALSE,
and assign: \cr annotation <- getGenomeAndAnnotation(GTF = FALSE)\cr
annotation["gtf"] = "path/to/gtf.gtf".\cr
If db is not "ensembl", you will instead get a gff file.}

\item{output.dir}{directory to save downloaded data}

\item{organism}{scientific name of organism, e.g. Homo sapiens,
Danio rerio or Mus musculus. See \code{biomartr:::get.ensembl.info()}
for the full list of supported organisms.}

\item{assembly}{character, default is assembly = organism, which means getting
the first assembly in the list. Otherwise the name of the wanted assembly, e.g.
"GCA_000005845" will get the E. coli substrain K-12, which is the one most used
as reference. Can usually be ignored for non-bacterial species.}

\item{db}{database to use for genome and GTF,
default and advised: "ensembl" (remember to set assembly_type to "primary_assembly",
else it will contain haplotypes, which gives a very large file!).
Alternatives: "refseq" (reference assemblies) and "genbank" (all assemblies)}

\item{gunzip}{logical, default TRUE, uncompress downloaded files
that are zipped when downloaded, should be TRUE!}

\item{genome}{character path, default NULL.
Path to the fasta genome corresponding to the gtf. Must be indexed
(.fai file must exist there).
Set this if you want to make sure the chromosome naming of the GTF matches
the genome and that the seqlengths are correct. If value is NULL or FALSE, it will be ignored.}

\item{optimize}{logical, default FALSE. Create a folder
within the output folder (defined by txdb_file_out_path),
that includes optimized objects
to speed up loading of annotation regions from up to 15 seconds
on human genome down to 0.1 second. ORFik will then load these optimized
objects instead. Currently optimizes the filterTranscript() function and
the loadRegion() function for 5' UTRs, 3' UTRs, CDS,
mRNA (all transcripts with CDS) and tx (all transcripts).}

\item{uniprot_id}{logical, default FALSE. If TRUE, will download
and store all uniprot ids for all transcripts (coding and noncoding)
in a file called: "gene_symbol_tx_table.fst" in the same folder as the txdb.}

\item{gene_symbols}{logical, default FALSE. If TRUE, will download
and store all gene symbols for all transcripts (coding and noncoding)
in a file called: "gene_symbol_tx_table.fst" in the same folder as the txdb.
hgnc for human, mouse symbols for mouse and rat, more to be added.}

\item{pseudo_5UTRS_if_needed}{integer, default NULL. If defined > 0,
will add pseudo 5' UTRs of maximum this length if 'minimum_5UTR_percentage' (default 30\%) of
mRNAs (coding transcripts) do not have a leader. (NULL and 0 both mean ignore)}

\item{remove_annotation_outliers}{logical, default TRUE. Only for refseq.
Should outlier lines be removed from the input annotation_file?
If yes, then the initial annotation_file will be overwritten and
the removed outlier lines will be stored at tempdir for further
exploration. Among others, the Arabidopsis refseq annotation contains malformed lines,
where this is needed.}

\item{refseq_genbank_format}{character, default c("gtf", "gff3")[1], i.e. "gtf".
GTF files are usually less prone to downstream bugs, so we highly advise using them.
GFF3 files can sometimes include information you might not find in the gtf, so sometimes
it makes sense to use them.}
}
\value{
a named character vector of paths to the downloaded genome and gtf,
 and additional contaminants if used. If merge_contaminants is TRUE, will not
 give individual fasta files for the contaminants, but only the merged one.
}
\description{
This function automatically downloads (if the files do not already exist)
genomes and contaminants specified for genome alignment.
By default, it will use the ensembl reference.
Upon completion, the function will store
a file called \code{file.path(output.dir, "outputs.rds")} with
the output paths of your completed genome/annotation downloads.
For most non-model, non-vertebrate organisms, you need
my fork of biomartr for it to work:
\code{remotes::install_github("Roleren/biomartr")}\cr
If you misspelled something or crashed, delete the wrong files and
run again.\cr
Set remake = TRUE to do it all over again.\cr
}
\details{
Some files that are made after download:\cr
- A fasta index for the genome\cr
- A TxDb to speed up GTF/GFF reading\cr
- Separate or merged contaminant files\cr
Files that can be made:\cr
- Gene symbols (hgnc, etc)\cr
- Uniprot ids (For name of protein structures)\cr
If you want to use a custom genome or gtf from your hard drive, assign existing
paths like this: \cr
annotation <- getGenomeAndAnnotation(GTF = "path/to/gtf.gtf",
genome = "path/to/genome.fasta")\cr
}
\examples{

## Get Saccharomyces cerevisiae genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel")
## Download and add pseudo 5' UTRs
#getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel",
#  pseudo_5UTRS_if_needed = 100)
## Get Danio rerio genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Danio rerio", tempdir())

output.dir <- "/Bio_data/references/zebrafish"
## Get Danio rerio and PhiX contaminants to deplete during alignment
#getGenomeAndAnnotation("Danio rerio", output.dir, phix = TRUE)

## Optimize for ORFik (speed up for large annotations like human or zebrafish)
#getGenomeAndAnnotation("Danio rerio", tempdir(), optimize = TRUE)

# Drosophila melanogaster (toplevel exists only)
#getGenomeAndAnnotation("drosophila melanogaster", output.dir = file.path(config["ref"],
# "Drosophila_melanogaster_BDGP6"), assembly_type = "toplevel")
## How to save malformed refseq gffs:
## First run function and let it crash:
#annotation <- getGenomeAndAnnotation(organism = "Arabidopsis thaliana",
#  output.dir = "~/Desktop/test_plant/",
#  assembly_type = "primary_assembly", db = "refseq")
## Then apply a fix (example for linux, too long rows):
# fixed_gff <- fix_malformed_gff("~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.gff")
## Then update the arguments:
# annotation <- c(fixed_gff, "~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.fna")
# names(annotation) <- c("gtf", "genome")
# Then make the txdb (for faster R use)
# makeTxdbFromGenome(annotation["gtf"], annotation["genome"], organism = "Arabidopsis thaliana")
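
## A minimal sketch, not part of the original examples: after a completed run,
## the output paths are stored in file.path(output.dir, "outputs.rds")
## (see description). Assuming such a run has finished in output.dir,
## the named path vector can be reloaded without downloading again.
## The variable names outputs_file and stored_paths are illustrative only.
outputs_file <- file.path(output.dir, "outputs.rds")
if (file.exists(outputs_file)) {
  stored_paths <- readRDS(outputs_file)
  stored_paths["gtf"]
}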
}
\references{
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4919035/
}
\seealso{
Other STAR: 
\code{\link{STAR.align.folder}()},
\code{\link{STAR.align.single}()},
\code{\link{STAR.allsteps.multiQC}()},
\code{\link{STAR.index}()},
\code{\link{STAR.install}()},
\code{\link{STAR.multiQC}()},
\code{\link{STAR.remove.crashed.genome}()},
\code{\link{install.fastp}()}
}
\keyword{internal}
