This vignette is a plain-English walkthrough of what the amplican code does, step by step, for readers who do not read code. Each section describes what happens inside the pipeline, which settings matter most, and which behaviours may be surprising. File and function names are mentioned so you can locate the relevant code, but no code is shown.
amplican takes raw sequencing reads (FASTQ files) and a
configuration file describing your experiments, and produces a set of
HTML reports with plots and tables summarising CRISPR editing outcomes.
The entry point is a single function called
amplicanPipeline, defined in amplican.R.
Internally it calls a sequence of specialised functions, each living in
its own file:
helpers_warnings.R).helpers_alignment.R, helpers_filters.R).helpers_alignment.R).helpers_alignment.R).helpers_alignment.R).AlignmentsExperimentSet-class.R,
helpers_general.R).amplicanFilter.R,
helpers_filters.R).amplicanSummarize.R).amplicanNormalize.R).amplicanSummarize.R).amplicanReport.R).A conceptual map of this flow is shown in the overview vignette.
Throughout the pipeline, intermediate results are saved to disk using
a temporary-file-and-rename pattern: each file is first written with a
.temp extension and then atomically renamed to its final
name. This means that if the process crashes midway, you will not see
half-written files – only .temp files that are cleaned up
on the next run.
File: amplican.R
(amplicanPipe function) Helper files:
helpers_directory.R, helpers_warnings.R
When you call amplicanPipeline, the first thing it does
is resolve all file paths to absolute paths and check that the output
folder is writable (via checkFileWriteAccess in
helpers_directory.R). If it is not, the pipeline stops
immediately with an error.
Next, it handles the continue setting. When
continue is TRUE (the default), the pipeline
looks for any leftover .temp files from a previous crashed
run and deletes them, but otherwise tries to reuse existing results to
avoid redundant computation. When continue is
FALSE, the entire results folder is deleted and recreated
from scratch.
The configuration file is then read and validated by three functions
in helpers_warnings.R:
checkConfigFile verifies that there
are no duplicate experiment IDs, that every combination of barcode,
forward primer, and reverse primer is unique, that no primer contains
numeric characters (a common copy-paste error), and that all referenced
FASTQ files actually exist and are readable.checkPrimers searches for the forward
primer and the reverse complement of the reverse primer inside each
amplicon sequence, and records their positions. If a primer cannot be
found in its amplicon, the pipeline stops.checkTarget checks whether each
guideRNA sequence can be found inside its amplicon. If it cannot, a
warning message is printed, but the pipeline does not
stop – it simply marks that guide as not found and continues. This is
intentional, because you might be working with guides that have
mismatches or unusual architectures.continue (default: TRUE): Set to FALSE
when you want a completely fresh run and are sure you do not need any
previous intermediate files.fastqfiles (default: 0.5): Controls
how forward and reverse reads are used. A value of 0 means both forward
and reverse primers must be found in each read pair. A value of 0.5 (the
default) means only one of the two primers needs to match – this is
useful when, for example, reverse reads have been primer-trimmed by the
sequencing facility. A value of 1 uses only forward reads; 2 uses only
reverse reads. This setting flows through the entire pipeline and
affects primer matching, consensus, and plotting.sample and
seed (defaults: 0): When
sample is set above zero, only that many reads are randomly
sampled from each FASTQ file, rather than reading the full file. This is
useful for quick testing or parameter tuning. The seed
makes the sampling reproducible.continue is TRUE and a previous run completed
alignment but not reporting, the pipeline will skip alignment entirely
and jump ahead, which can be confusing if you changed alignment
parameters expecting them to take effect.Files: helpers_alignment.R
(makeAlignment function), helpers_filters.R
(goodBaseQuality, goodAvgQuality,
alphabetQuality)
For each barcode (a group of experiments sharing the same sequencing
pool), the makeAlignment function opens the corresponding
FASTQ file(s) using a streaming reader. This means reads are loaded in
manageable chunks rather than all at once, which keeps memory usage low
even for very large files.
Each chunk of reads passes through three quality filters:
goodBaseQuality checks that every
single nucleotide in a read has a quality score at or above the
min_quality threshold. If even one base is below the
threshold, the entire read is rejected.goodAvgQuality checks that the average
quality score across all nucleotides in a read meets the
average_quality threshold. A read might pass the per-base
filter but still have an overall low average.alphabetQuality (only active when
filter_n is TRUE) rejects any read containing the letter
“N”, which represents a nucleotide that the sequencer could not
confidently identify.The function keeps a running tally of how many reads were rejected
for each reason, and these counts end up in the
barcode_reads_filters.csv output file.
average_quality (default: 30 in the
pipeline): Lowering this value keeps more reads but may include
low-quality data that produces spurious mismatches. Raising it makes
filtering stricter. The scale depends on your sequencing platform
(Phred+33 for Illumina).min_quality (default:
0 in amplicanPipeline, but
20 in the standalone amplicanAlign
function): This is the single most important quality parameter. The
pipeline default of 0 is very permissive – it effectively disables
per-base quality filtering. If you are seeing a lot of background
mismatches in your reports, raising this to 15-20 can make a large
difference by removing reads with individual low-quality bases. See the
quirks section below for why this default differs.filter_n (default: FALSE): Turn this
on if your sequencing tends to produce many N calls. Reads with N bases
cannot be meaningfully aligned.batch_size (default: 10 million in the
pipeline): This controls how many reads are loaded into memory at once.
It has no effect on results – it is purely a memory management
parameter. Lower it if you are running on a machine with limited
RAM.min_quality differs between
amplicanPipeline (0) and the standalone
amplicanAlign (20). This can be confusing. The pipeline
defaults were chosen to be permissive so that fewer reads are lost, but
for many datasets, a value of 15-20 for min_quality gives
cleaner results.File: helpers_alignment.R
(makeAlignment function)
After quality filtering, each chunk of surviving reads is examined for duplicates. The function creates a table where each row is a unique combination of forward and reverse read sequences, along with a count of how many times that exact pair appeared. When all chunks have been processed, the per-chunk tables are merged and duplicate pairs across chunks are summed.
This step is critical for performance: if the same read sequence appears ten thousand times, it is aligned only once, and the count is carried forward. All downstream statistics (editing rates, frameshift counts, etc.) weight each unique read by this count.
This step has no adjustable parameters. It is always performed.
Files: helpers_alignment.R
(locate_pr_start and makeAlignment),
helpers_general.R (comb_along)
Each unique read must be assigned to one of the experiments defined in the configuration file. Assignment is done by searching for the experiment’s primers within the read.
The locate_pr_start function searches for each primer
inside each read using an overlap alignment (a variant of sequence
alignment that allows the primer to be partially present at the
beginning of the read). This handles the common situation where reads
are slightly truncated at the 5’ end and do not start exactly at the
primer. The function accepts a certain number of mismatches (controlled
by primer_mismatch) and requires a minimum overlap of
approximately half the primer length.
For each experiment (identified by its ID), the function checks
whether the forward primer is found in the forward read and/or the
reverse primer is found in the reverse read. Which primers must match
depends on the fastqfiles setting:
fastqfiles = 0: both forward and reverse primers must
be found.fastqfiles = 0.5 (default): either the forward or the
reverse primer is sufficient.fastqfiles = 1: only the forward primer is checked
(reverse reads not used).fastqfiles = 2: only the reverse primer is checked
(forward reads not used).Reads whose primers match a given experiment are assigned to it.
Reads that do not match any experiment’s primers are labelled
“unassigned” and saved to unassigned_reads.csv. Their
forward and reverse sequences can later be aligned against each other in
the barcode report to diagnose what went wrong.
primer_mismatch (default:
2 in the pipeline, 0 in standalone
amplicanAlign): This controls how many mismatches are
tolerated when matching primers to reads. With the default of 2, the
pipeline allows up to 2 positions in the primer to differ from the read,
which helps recover reads with sequencing errors in the primer region.
Setting this to 0 is stricter and will increase the number of unassigned
reads, but eliminates any risk of mis-assigning reads to the wrong
experiment due to similar primers. If you see a high unassigned rate,
try increasing this value; if you suspect cross-contamination between
experiments with similar primers, decrease it.fastqfiles: See Step 1. The most
common reason to change this is when your reverse reads have been
trimmed of primers by the sequencing facility (use 0.5 or 1), or when
you only have forward reads (use 1, and leave Reverse_Primer and
Reverse_Reads_File empty in the config).min_overlap parameter inside locate_pr_start,
which defaults to half the primer length.primer_mismatch differ between
amplicanPipeline
amplicanAlign (0), for the same reason as
min_quality – the pipeline is designed to be permissive by
default.File: helpers_alignment.R
(makeAlignment function)
Once reads are assigned to an experiment, each read is aligned against the amplicon sequence for that experiment. Before alignment, the read is trimmed so that it starts at the position where the forward primer was found (or, for reverse reads, the reverse primer position). The amplicon is likewise trimmed to the region between the two primers. This ensures that both the read and the amplicon start and end at comparable positions, which improves alignment quality.
The alignment itself uses the Needleman-Wunsch global alignment
algorithm, as implemented in the pwalign (formerly
Biostrings) package. This algorithm finds the optimal way
to line up two sequences by inserting gaps where needed. It uses a
scoring matrix that rewards matching bases and penalises mismatches,
plus penalties for opening and extending gaps.
The alignment results – the aligned read, the aligned amplicon, and a
score representing alignment quality – are stored in the
AlignmentsExperimentSet object (described in Step 8).
scoring_matrix (default: match = +5,
mismatch = -4): This matrix determines how the alignment algorithm
scores each position. Higher match rewards and more negative mismatch
penalties make the alignment stricter about calling mismatches. The
default works well for most CRISPR amplicon data. You would rarely
change this unless you have a specific reason, such as aligning reads
with unusual error profiles.gap_opening (default: 25): This is the
penalty for starting a new gap in the alignment. A high value (like 25)
means the algorithm will strongly prefer to call mismatches rather than
open a gap. This is the single most impactful alignment
parameter: lowering it will cause the aligner to call more
deletions and insertions (since gaps become cheaper), while raising it
will suppress indels in favour of mismatches. For CRISPR analysis where
indels are the primary signal, the default of 25 is a good balance.gap_extension (default: 0): This is
the additional penalty for each position added to an existing gap. A
value of 0 means that a long gap costs the same as a short gap (only the
opening penalty matters). This is appropriate for CRISPR data where
deletions can be large but should not be penalised more for being
longer.File: helpers_alignment.R
(is_hdr and is_hdr_strict functions)
If you have filled in the Donor column in your configuration file (indicating a homology-directed repair template), amplican attempts to identify which reads represent successful HDR events. Two different algorithms are available:
Permissive mode (is_hdr, used when
donor_strict = FALSE): The donor template is first aligned
against the amplicon to identify the positions where they differ. Each
read is then aligned against the donor. A read is classified as HDR if
two conditions are met: (1) it aligns to the donor at least as well as
it aligns to the amplicon, and (2) the number of small events
(single-base mismatches or indels) overlapping the donor-specific
positions does not exceed the donor_mismatch threshold.
This mode is fast and tolerant of sequencing noise.
Strict mode (is_hdr_strict, used when
donor_strict = TRUE): Instead of comparing alignment
scores, the strict algorithm checks whether every single event that
distinguishes the donor from the amplicon is present in the read,
matched exactly by position, type, and sequence. A read must contain all
donor-specific events to be counted as HDR. Additional events elsewhere
in the read (noise) do not disqualify it, unless there are more than
donor_mismatch extra events within the cut-site window.
donor_strict (default: FALSE): Switch
to TRUE if you need very precise HDR calling. The strict algorithm is
more conservative – it will not count a read as HDR unless it carries
every expected donor change. This reduces false positives but may miss
reads with partial HDR or sequencing errors in the donor region. The
strict algorithm is also significantly slower because it requires
event-by-event matching.donor_mismatch (default: 3): In
permissive mode, this is the maximum number of single-base discrepancies
allowed within the donor region. Lower values are stricter; setting to 0
requires a perfect match to the donor in that region (not recommended
due to typical sequencing error rates of ~0.1-1%). In strict mode with a
finite value, this controls how many extra consensus events (beyond the
required donor events) are tolerated within the cut-site window before a
read is rejected.cut_buffer (default: 5): In strict
mode, this defines the window around the cut site within which extra
events are counted toward the donor_mismatch
threshold.donor_mismatch only counts events
that fall within the donor-vs-amplicon event coordinate window.
Mismatches or indels elsewhere in the read are invisible to this
threshold and never disqualify a read. This is by design: the goal is to
assess whether the donor-specific region looks like HDR, not whether the
entire read is perfect.File:
AlignmentsExperimentSet-class.R
All alignment results are stored in a structured data container
called AlignmentsExperimentSet. This is an S4 class (a
formal R object type with strict validation). It holds:
The class includes a validation method that enforces consistency: every experiment ID must be unique, the number of forward alignments must match the number of reverse alignments, read counts must not contain zeros, and so on. If any of these conditions are violated, an error is reported.
The object supports subsetting (selecting individual experiments by
index or name), combining multiple objects together, and a
lookupAlignment method that prints a single alignment in a
human-readable, BLAST-like format. The writeAlignments
method can output all alignments to a file in FASTA or plain-text
format.
write_alignments_format (default:
“None”): Set to “fasta”, “txt”, or both (by passing a vector like
c("fasta", "txt")) to save human-readable alignments to
disk. This is useful for debugging or for sharing alignments with
collaborators. The default “None” skips this output to save disk space
and time.AlignmentsExperimentSet (for
example, by selecting one experiment), the barcode-level and
unassigned-read data are dropped. This is because there is no reliable
way to know which barcode-level statistics belong to which experiment
after subsetting.temp subfolder. These are later merged. If you delete
these files, the pipeline will need to redo the alignments for those
barcodes.extractEvents method (described next) is noted in
the code comments as being slow “despite vectorization.” If you have a
multi-core machine, pass use_parallel = TRUE to speed it
up.Files: AlignmentsExperimentSet-class.R
(extractEvents method), helpers_general.R
(getEventInfo, getEvents,
defGR)
After alignment, each read is represented as a pair of aligned
strings (the read and the amplicon, with gap characters inserted). The
extractEvents method converts these aligned strings into a
table of individual events.
For each alignment, the getEvents function scans the
aligned strings character by character and identifies three types of
events:
Each event is recorded with its start position, end position, width, the original amplicon base(s), the replacement read base(s), the type (deletion, insertion, or mismatch), a read identifier, the alignment score, and the strand (plus for forward reads, minus for reverse reads). The function also adds “artificial” deletion events at the ends of reads that do not span the full amplicon – these represent the fact that the read ended early, and they are later used by the filtering step.
The getEventInfo function is a wrapper that handles the
coordinate shifts needed because the alignment may not start at position
1 of the amplicon (the amplicon was trimmed to the primer region). It
also handles the asymmetry between forward and reverse strands: for
forward reads, it checks whether the read extends to the end of the
amplicon; for reverse reads, it checks whether the read starts at the
beginning.
use_parallel (default: FALSE): Set to
TRUE if you have registered a multi-core back-end (via
BiocParallel::register). This can significantly speed up
event extraction for large datasets, as each experiment is processed
independently.Direction = 1 (where the guide
is reverse-complemented), because flipRanges (called later)
handles the coordinate reversal.Files: amplicanFilter.R,
helpers_filters.R (findEOP,
findPD, findLQR)
Not all events extracted from alignments represent real biological mutations. Three types of artifacts are filtered out:
Events Overlapping Primers (EOP) – detected by
findEOP: When a read is shorter than the amplicon and ends
prematurely, the alignment may show what looks like a large deletion
covering the end of the amplicon. But this is not a real deletion – it
is simply that the read ran out of sequence. The findEOP
function identifies any deletion that touches or extends past the primer
regions and marks it for removal. This also catches deletions that start
before the forward primer ends. This filter operates in both absolute
coordinates (before shifting) and relative coordinates (after shifting),
and automatically adjusts based on whether the events have been shifted
yet.
PRIMER DIMERS – detected by findPD:
Primer dimers are PCR artifacts where the primers themselves amplify
each other without any amplicon template, producing very short products.
In sequencing, these appear as reads where the alignment shows a gap
covering most of the amplicon. The findPD function
calculates a cutoff: amplicon length minus the combined primer lengths
minus a buffer. Any deletion wider than this cutoff is flagged as a
primer dimer. All events from reads containing such a deletion are then
removed.
Low-quality / off-target reads – detected by
findLQR: Some reads may align to the amplicon but carry an
unusually high number of events and a low alignment score, suggesting
they originated from a different genomic locus (off-target) or are
otherwise poor quality. The findLQR function uses a
clustering approach (CLARA, a fast variant of k-medoids) on two
dimensions: the normalized number of events per read and the normalized
alignment score. It tries dividing reads into 2 or 3 clusters and uses
the silhouette criterion to decide which is better. If 3 clusters fit
better, the cluster with high event counts and low scores (top-left in a
score-vs-events plot) is marked for removal. This filter only activates
when there are at least 1000 events for the experiment – below that
threshold, it does nothing.
After filtering, the Reads_Filtered column in the config
summary is calculated as total reads minus PRIMER_DIMER reads minus
low-score reads.
PRIMER_DIMER (default: 30): This is
the buffer size added to the primer lengths when computing the primer
dimer cutoff. A larger value makes the filter more lenient (fewer reads
classified as primer dimers); a smaller value makes it more aggressive.
The default of 30 means that a read must have a deletion covering more
than amplicon_length - primer_lengths - 30 bases to be
called a primer dimer. If you are seeing too many reads classified as
primer dimers, increase this value.event_filter (default: TRUE): Set to
FALSE to disable the off-target detection (findLQR). This
might be useful if you have a very unusual editing pattern that is being
incorrectly flagged, or if you want to see all reads regardless of
quality. In most cases, leave it on.findLQR off-target filter requires at least 1000
events per experiment to activate. For small datasets or pilot
experiments, it silently does nothing. This means that the same dataset
analyzed with different amounts of data could produce different
filtering behavior at the boundary.findLQR uses CLARA with a sample size
capped at 1000. This means it is an approximation – results may vary
slightly between runs if the random sampling differs. However, the
silhouette-based decision between 2 and 3 clusters tends to be
stable.File: amplicanSummarize.R
(amplicanConsensus function)
When you use both forward and reverse reads (i.e.,
fastqfiles is 0 or 0.5), each read pair provides two
independent observations of the same DNA molecule. Ideally, both reads
should agree on what mutations are present. The
amplicanConsensus function compares events from the forward
and reverse reads and decides which events to trust.
The consensus logic follows these rules:
Full agreement: If a forward event and a reverse event have the same position, length, type, and replacement sequence, the event is confirmed. The forward read’s version is kept as the representative.
Partial disagreement: If both reads have events in the same region but they differ in details (e.g., same start but different end), the event from whichever read has the higher alignment score is kept, but only if both reads overlap the expected cut site. If neither overlaps the cut site, both events are rejected.
One strand broken: If one read has an event overlapping a primer (indicating it is truncated or unreliable in that region) but the other read covers that region cleanly, the clean read’s event is trusted.
Single-strand-only events (controlled by
promiscuous_consensus): When an event appears on only one
strand and there is no conflicting information from the other strand,
the promiscuous_consensus setting determines what happens.
When TRUE (the default), these single-strand events are accepted. When
FALSE, they are rejected unless the other strand is broken in that
region (in which case the intact strand is trusted).
The output is a boolean vector (TRUE/FALSE for each event) indicating which events represent the consensus.
promiscuous_consensus (default: TRUE):
This is the key parameter for consensus behavior. When TRUE, indels
detected on only one strand are kept. This is more sensitive but may
include some artifacts. When FALSE (strict mode), single-strand indels
are rejected unless the other strand is broken. Use FALSE when you want
high-confidence events only, and TRUE when you do not want to miss any
real editing events. The trade-off is sensitivity versus precision.fastqfiles <= 0.5 (i.e.,
both forward and reverse reads are used). When using only forward or
only reverse reads, all events are automatically marked as consensus
(TRUE), because there is no second strand to compare against.amplicanConsensus function requires that
amplicanOverlap has already been run (it needs to know
which events overlap the cut site to apply rule 2). In the pipeline,
overlap is always computed before consensus.File: amplicanSummarize.R
(amplicanOverlap function)
The amplicanOverlap function determines which events are
likely caused by CRISPR editing, as opposed to being random sequencing
errors or natural variants. It does this by checking whether each event
overlaps the expected cut site.
The expected cut site is specified by you in the configuration file:
any UPPER CASE letters in the amplicon sequence are treated as the
target region (typically the PAM sequence or the guide RNA binding
site). The function detects these UPPER CASE groups using the
upperGroups helper, which finds runs of capital letters in
the amplicon string.
Each detected cut-site region is then expanded by
cut_buffer bases in both directions to create a window. Any
event whose position overlaps this window is marked as overlapping the
cut site (TRUE); all others are marked as not overlapping (FALSE).
If an amplicon has no UPPER CASE letters at all, the function issues a warning and uses the entire amplicon as the cut-site region. This means all events will be marked as overlapping, which effectively disables the spatial filtering.
cut_buffer (default: 5): This controls
how forgiving the spatial filter is. With a buffer of 5, events up to 5
bases outside the uppercase region are still counted as overlapping.
Increasing the buffer captures events from slightly imprecise cutting;
decreasing it makes the filter stricter. This parameter also flows
through to reporting, where it determines the highlighted region in
plots. Note that CRISPR cut sites are typically a few bases upstream of
the PAM, so the buffer helps account for this biological
variability.File: helpers_general.R
(amplicanMap, flipRanges,
upperGroups)
After filtering, events are re-encoded so that their coordinates are
relative to the expected cut site rather than being absolute positions
within the amplicon. This is done by the amplicanMap
function.
The “zero point” is defined as the first UPPER CASE letter in the amplicon. All event positions are shifted so that this zero point becomes position 0. Events that were upstream of the cut site get negative coordinates; events downstream get positive coordinates. This relative coordinate system is what makes it possible to create metaplots that overlay data from different amplicons of different lengths.
For amplicons with Direction = 1 (meaning the guide RNA
needed to be reverse-complemented to match the amplicon), an additional
transformation is applied by flipRanges: the strand labels
are swapped (+ becomes -, and vice versa), coordinates are mirrored
(reversed end-to-end relative to the amplicon length), and the original
and replacement base sequences are reverse-complemented. This ensures
that all events are reported from the perspective of the forward strand,
regardless of the original guide direction.
This step has no adjustable parameters.
amplicanMap if you have no expected cut sites specified,
and do not use the metaplot functions (which rely on relative
coordinates) either.flipRanges function modifies the data in place for
efficiency. This means that after this step, the strand column no longer
reflects which read (forward or reverse) produced the event for
Direction = 1 amplicons – it has been flipped to maintain the
forward-strand perspective.File: amplicanNormalize.R
If your configuration file has any rows marked as controls
(Control = TRUE or Control = 1), the
amplicanNormalize function removes background events from
the treatment (non-control) samples.
The process works as follows:
Events from control samples are grouped by the columns specified
in the normalize parameter (by default, guideRNA and
Group). This means that a control is only used to normalize treatments
that share the same guideRNA and Group.
For each group, the frequency of each unique event (defined by its position, type, width, original base, and replacement base) is calculated relative to the total number of filtered reads in that control group.
Events in the control group whose frequency falls below
min_freq are treated as sequencing errors and discarded –
they are too rare to be considered real background.
For each treatment event, the function checks whether an exact match exists among the surviving control events (same position, type, width, original base, replacement base, guideRNA, and Group). If it does, the treatment event is removed, because it is indistinguishable from background.
All control events are kept (they are not removed).
This step handles the common situation where the sequenced strain has natural sequence differences from the reference amplicon (SNPs), or where there is a low level of index hopping (cross-contamination between sequencing pools).
Events containing the replacement base “N” are always removed before normalization, as they represent noise from poor sequencing.
min_freq (default: 0.01, i.e., 1%):
This is the single most important normalization parameter. It sets the
frequency threshold below which control events are considered noise and
not used for filtering. Lowering it (e.g., to 0.001) makes normalization
more aggressive – more events will be removed from treatments. Raising
it (e.g., to 0.1) makes normalization more conservative. The default of
1% works well for most experiments with good sequencing depth. If you
suspect index hopping (where reads from one sample contaminate another),
raising this to 0.03-0.15 can help, because index hopping typically
introduces low-frequency spurious events. You can inspect the mismatch
plots in the control reports to see what the background noise level
looks like and choose an appropriate threshold.normalize (default: c(“guideRNA”,
“Group”)): This specifies which columns to use for matching controls to
treatments. The default means that only events from the same guideRNA in
the same Group are normalized against each other. You can change this to
use fewer columns (e.g., c("guideRNA") to normalize by
guide only, ignoring Group) or more columns. Pass NULL to
skip normalization entirely even when controls are present. Pass an
empty vector c() to normalize globally with no
stratification (all controls averaged together for all treatments).cut_buffer affects only the strict HDR
detection path, not normalization directly.Reads_Filtered (total reads after filtering), not
Reads (total reads before filtering). This means that
primer dimers and low-quality reads are not counted in the
denominator.File: amplicanSummarize.R
(amplicanSummarize function)
The amplicanSummarize function computes per-experiment
statistics from the filtered, normalized, and consensus-confirmed
events. It adds five new columns to the configuration table:
The calculation proceeds in two passes. First, for each individual read, the function determines whether it has any deletions, insertions, HDR status, or frameshift. Then, these per-read results are aggregated to the experiment level by summing the read counts (remembering that each unique read carries a weight equal to how many times it was observed).
Only events that are both consensus (confirmed by forward-reverse agreement, or accepted via promiscuous mode) and overlapping the cut site are included in these counts. This ensures that the statistics reflect high-confidence CRISPR-induced editing.
This step has no direct parameters, but its results depend on all upstream choices: the quality filters (which reads survive), the alignment parameters (which events are called), the consensus mode (which events are confirmed), the cut-site buffer (which events overlap), and the normalization (which events are removed as background).
Reads_Edited can be less than
Reads_Del + Reads_In because a read with both a deletion
and an insertion is counted in both individual categories but only once
in the combined category. This is explained in the FAQ vignette.Files: amplicanReport.R,
helpers_rmd.R
The final step generates a set of reports in R Markdown format. R Markdown files are templates that combine text, R code, and plotting instructions. When “knitted” (rendered), they produce polished HTML pages with tables and figures.
The amplicanReport function creates six types of
report:
Each report is generated from a template stored inside the package
(in the rmarkdown/templates directory). The function
injects the paths to the result files and relevant parameters into each
template, then optionally knits them to HTML.
knit_reports (default: TRUE): When
TRUE, all reports are automatically rendered to HTML. This is the most
time-consuming step because the reports contain high-quality vector
figures. When FALSE, the Rmd template files are created but not rendered
– you can then open them in RStudio, customise the text, colours, or
layout, and knit them manually.cut_buffer (default: 5): Passed
through to the report templates and used to highlight the expected
cut-site region in plots.top (default: 5): Controls how many of
the most frequent unassigned reads are shown in the barcode report. Each
unassigned read is aligned forward-against-reverse to help diagnose why
it was not assigned (e.g., wrong primers, primer dimers,
contamination).knit_reports = FALSE, customise the Rmd files, and knit
them yourself.File: helpers_general.R
(cigarsToEvents, pairToEvents)
If you prefer to use an external alignment tool instead of amplican’s built-in aligner, two functions allow you to import external alignment results and convert them into the same event table format that the pipeline uses internally. Once imported, the events can be filtered, normalised, summarised, and plotted using all the same downstream functions.
cigarsToEvents accepts alignment
results in CIGAR string format – the standard way that alignment tools
like BWA, minimap2, and Bowtie2 report alignments in SAM/BAM files. You
provide the CIGAR strings, the alignment start positions, the query
(read) sequences, the reference (amplicon) sequences, read IDs, mapping
quality scores, sequence names, strands, and read counts. The function
parses the CIGAR operations (M, I, D, X, N) and converts them into
insertion, deletion, and mismatch events in the GRanges format. This
function requires the GenomicAlignments package to be
installed.
pairToEvents accepts alignment results
in the “pair” output format produced by EMBOSS needle (specifically
needleall). This is a human-readable format that shows
aligned sequences with match/mismatch/gap indicators. You provide the
path to a pair-format file, and the function parses it into the same
event table. It reads the alignment parameters (gap open, gap extend)
and scores from the file header. Only the needleall “pair”
format is supported – other EMBOSS output formats will produce an
error.
These functions have no tunable parameters beyond the input data. They are designed to faithfully convert external alignment representations into amplican’s internal format.
cigarsToEvents handles extended CIGAR operations
including “X” (mismatch) and “N” (skip/ref_skip). If your aligner does
not produce “X” operations (some older aligners use “M” for both matches
and mismatches), you will not get mismatch events – only insertions and
deletions. You would need to derive mismatches separately (e.g., from MD
tags in the BAM file).pairToEvents assumes that reads are not collapsed (each
read has a count of 1). If you have pre-collapsed your reads, you would
need to adjust the counts manually after import.pairToEvents explicitly checks that the input file was
produced by needleall with the -aformat3 pair
option. Files from needle or other programs, or with
different output formats, will be rejected with an error.amplicanPipeline wrapper.Files: helpers_plots.R,
ggforce_bezier.R, src/bezier.cpp
amplican provides a rich set of plotting functions, all built on top
of the ggplot2 graphics library. These functions read the
event tables produced by the pipeline and create publication-quality
figures.
The main plot types are:
plot_deletions: Draws each deletion as
a smooth arch (bezier curve) over the amplicon. The height of the arch
represents the deletion’s frequency. Arches are coloured by whether the
deletion overlaps the expected cut site (pink for overlapping, blue for
non-overlapping). This is the most distinctive amplican plot type.plot_cuts: Overlays deletion arches
from multiple experiments on top of each other, showing the distribution
of cut sites across the amplicon. Useful for comparing editing
outcomes.plot_insertions: Represents each
insertion as a triangle over the amplicon, with height proportional to
frequency.plot_mismatches: Shows mismatches as
stacked bars at each position along the amplicon, coloured by the
replacement nucleotide (red for A, blue for C, yellow for G, green for
T). The top panel shows forward reads; the bottom panel shows reverse
reads.plot_variants: Shows the most frequent
mutated sequences (variants) with their frequency, frameshift status,
and codon-level information. Each variant is annotated with the amino
acid changes it would cause.plot_heterogeneity: Shows the
cumulative share of reads accounted for by each unique sequence, sorted
from most to least frequent. A steep curve indicates low heterogeneity
(a few dominant sequences); a flat curve indicates high heterogeneity
(many different sequences at similar frequencies).waffle: Creates a grid-of-squares
chart (like a pie chart but with discrete squares), used in the index
report to show the proportion of good vs. filtered reads.There are also “meta” versions of some plots
(metaplot_deletions, metaplot_insertions,
metaplot_mismatches) that aggregate data across multiple
amplicons by using relative coordinates. These are used in the group,
guide, and amplicon reports.
The smooth arches in deletion and cut plots are rendered using custom
bezier curve mathematics implemented in C++ (in
src/bezier.cpp) and exposed to R through Rcpp (in
ggforce_bezier.R). The C++ code computes quadratic and
cubic bezier paths – the same mathematical curves used in vector
graphics software. The arch height is set to approximately 4/3 of the
deletion frequency, following a geometric formula that makes the arch
shape visually proportional.
cut_buffer parameter
that determines which region is highlighted as the expected cut
site.plot_height helper), so the HTML pages may be very
tall.| Parameter | Default | Where | Effect when increased | Effect when decreased |
|---|---|---|---|---|
average_quality |
30 | Quality filter | Fewer reads pass (stricter) | More reads pass (looser) |
min_quality |
0 | Quality filter | Fewer reads pass (stricter) | More reads pass (looser) |
filter_n |
FALSE | Quality filter | Reads with N bases removed | No N filtering |
primer_mismatch |
2 | Read assignment | More reads assigned (risk misassignment) | Fewer reads assigned (more unassigned) |
gap_opening |
25 | Alignment | Fewer gaps, more mismatches | More gaps, fewer mismatches |
PRIMER_DIMER |
30 | Artifact filter | Fewer reads called primer dimer | More reads called primer dimer |
cut_buffer |
5 | Overlap detection | More events count as cut-site | Fewer events count as cut-site |
promiscuous_consensus |
TRUE | Consensus | Single-strand indels kept | Single-strand indels rejected |
min_freq |
0.01 | Normalization | More control events used (more aggressive removal) | Fewer control events used (gentler removal) |
donor_mismatch |
3 | HDR detection | More reads called HDR (looser) | Fewer reads called HDR (stricter) |
event_filter |
TRUE | Artifact filter | Off-target detection active | Off-target detection disabled |
The standalone amplicanAlign function has different
defaults from amplicanPipeline for several parameters. This
is by design: the pipeline is tuned for convenience and permissiveness,
while the standalone function gives you more control.
| Parameter | amplicanPipeline |
amplicanAlign |
|---|---|---|
min_quality |
0 | 20 |
primer_mismatch |
2 | 0 |
batch_size |
10,000,000 | 1,000,000 |
If you call amplicanAlign directly and then feed its
results into the rest of the pipeline manually, be aware that these
stricter defaults may produce different filtering behavior than the
all-in-one amplicanPipeline.