amplican 1.35.4
This vignette is a practical guide for transforming your experiment spreadsheet (Excel/xlsx) into a valid amplican configuration file using code generated by a large language model (LLM). It covers what amplican expects, how to format the amplicon sequence correctly, how primers work, and includes copy-pasteable example functions and an LLM prompt template.
It is tempting to open your Excel file, shuffle columns around, and save as CSV. Do not do this. Manual editing is error-prone, untraceable, and impossible to reproduce. When your data changes (a new sequencing run, a corrected sample name), you would have to redo the entire process by hand.
Instead, write a script that reads your Excel file and produces the config CSV programmatically. This approach has several advantages:
An LLM (such as ChatGPT, Claude, or similar) can write most of this script for you, given the right instructions. This vignette provides those instructions.
The amplican config file is a CSV where column order matters, not column
names. When amplican reads your config, it renames the first 12 columns
internally based on their position (see amplicanAlign.R). Any additional
columns beyond the 12th are preserved by name and ignored by the pipeline.
The 12 columns, in order:
| # | Column | Type | Description |
|---|---|---|---|
| 1 | ID | character | Unique identifier for this experiment (e.g., ID_1, SampleA_L001) |
| 2 | Barcode | character | Groups experiments sharing the same FASTQ file pair (e.g., barcode_1, A1_L001) |
| 3 | Forward_Reads | character | Filename of the forward FASTQ (e.g., R1_001.fastq.gz). Not a full path – amplican prepends the fastq_folder you pass to amplicanPipeline. Leave empty if using only reverse reads. |
| 4 | Reverse_Reads | character | Filename of the reverse FASTQ. Leave empty if using only forward reads. |
| 5 | Group | character | User-defined grouping (e.g., experimenter name, treatment condition). Used in group-level reports. |
| 6 | Control | 0 or 1 | 1 if this row is a control sample (used for normalization). 0 otherwise. |
| 7 | guideRNA | character | Guide RNA sequence, 5’ to 3’. Does not need to be reverse-complemented – amplican handles that based on the Direction column. |
| 8 | Forward_Primer | character | Anchoring sequence on the left side of the amplicon (see “Primers” section below). |
| 9 | Reverse_Primer | character | Anchoring sequence on the right side of the amplicon. Written 5’ to 3’ as you would order it from a primer supplier. |
| 10 | Direction | 0 or 1 | 0 if the guide is found on the forward strand of the amplicon. 1 if the guide needs to be reverse-complemented to match. |
| 11 | Amplicon | character | Full amplicon sequence. Case matters: UPPER CASE letters mark the expected cut site; lowercase letters mark the rest. See the next section. |
| 12 | Donor | character | Donor template sequence for HDR experiments. Leave empty if no donor. The donor should span the same region as the amplicon (between primers). |
Important notes:
fastq_folder argument.0 = FALSE (treatment), 1 = TRUE
(control). Normalization uses control rows to remove background events from
treatment rows.0 or 1.The Amplicon column is the most critical and most error-prone part of the config. Amplican uses letter case to distinguish the region where CRISPR is expected to cut (UPPER CASE) from the rest of the amplicon (lowercase).
cut_buffer) are counted toward Reads_Del,
Reads_In, Reads_Edited, and Reads_Frameshifted.If your amplicon has no UPPER CASE letters at all, amplican falls back to using the entire amplicon as the cut-site region (with a warning). This effectively disables spatial filtering, so all events are counted regardless of position.
There are two common strategies for uppercasing, each with a different
recommended cut_buffer setting:
This is the most common approach for treatment samples. You start with the full amplicon in lowercase, find where the guide maps, and uppercase the guide region plus some extra bases on each side as a buffer.
Recommended buffer: 10 bp on each side of the guide match.
With this strategy, use the pipeline’s default cut_buffer = 5, which adds
another 5 bp of tolerance on each side. The total effective window is then
guide ± 15 bp.
This strategy is precise: only events near the actual cut site are counted as editing events. Background mismatches elsewhere in the amplicon are excluded from the summary statistics.
Instead of marking only the guide region, you uppercase everything between the primers and leave only the primer regions in lowercase. This means every event in the editable region counts as a cut-site event.
With this strategy, set cut_buffer = 0 in the pipeline, because the
UPPER CASE region already covers the entire interior – there is no need for
additional expansion.
This is particularly useful for control samples, where you want to detect all background variation regardless of position. It is also used when you care about events across the whole amplicon, not just at the cut site.
| Strategy | Config uppercase region | Pipeline cut_buffer |
Total effective window |
|---|---|---|---|
| Guide + buffer (typical) | guide ± 10 bp | 5 (default) | guide ± 15 bp |
| Guide + buffer (precise) | guide ± 10 bp | 0 | guide ± 10 bp |
| Whole interior (controls) | everything between primers | 0 | entire interior |
The Direction column tells amplican whether the guideRNA sequence is found on the forward strand of the amplicon (Direction = 0) or needs to be reverse-complemented to match (Direction = 1).
In practice:
To determine the direction automatically, try searching for the guide in the
amplicon. If it matches directly, Direction = 0. If it does not match,
reverse-complement the guide and try again. If the reverse complement matches,
Direction = 1. The detect_direction() function below shows how to do this.
Direction also affects plotting: for Direction = 1 amplicons, amplican flips all coordinates to the forward-strand perspective, so events from different amplicons can be compared on the same relative scale.
One of the most important concepts in amplican is that the Forward_Primer and Reverse_Primer columns do not need to contain your actual PCR primer sequences. They are anchoring sequences – two short sequences that flank the region of interest and help amplican position reads during alignment.
Read assignment: When amplican reads a FASTQ file, it searches each read for the forward primer (in the forward read) and the reverse primer (in the reverse read). Reads that contain the primers are assigned to that experiment. This is how amplican knows which reads belong to which experiment.
Alignment positioning: Before aligning a read to the amplicon, amplican trims everything before the primer in the read. This ensures the read and amplicon start at the same position, which improves alignment quality.
Coordinate system: The forward primer defines the left boundary of the amplicon coordinate system. The reverse primer defines the right boundary.
fastqfiles = 1 in the pipeline.If your reads are primer-trimmed (the sequencing facility removed the primer
sequences), use fastqfiles = 0.5 in the pipeline. This mode only requires
one of the two primers to be found, which handles the case where one read has
had its primer removed.
You can use any LLM (ChatGPT, Claude, Gemini, etc.) to generate a transformation script tailored to your Excel layout. Here is the recommended workflow:
Open your Excel file and note the column headers. Also note any special formatting: merged cells, multiple samples per row, blank rows, etc. Write this down as plain text.
Copy the prompt template below, fill in your column information, and paste it into your LLM of choice.
The LLM will produce an R script. Review it, make sure the column mappings are correct, install any required packages, and run it. Always inspect the output config before running the full pipeline.
Copy and paste the text below into your LLM. Replace the bracketed sections with your information.
I need an R script that reads my Excel experiment sheet and produces a CSV
config file for the amplican R package.
## My Excel file
File name: [my_experiments.xlsx]
Sheet name: [Sheet1]
My columns are (in order):
1. [Well] - 96-well plate position, e.g. A1, B3
2. [Index_i7] - i7 barcode sequence
3. [Index_i5] - i5 barcode sequence
4. [Sample] - sample name
5. [Forward_Primer] - forward primer sequence
6. [Reverse_Primer] - reverse primer sequence
7. [Guide] - guide RNA sequence
8. [Repair_Template] - donor template sequence (may be empty)
[Note: there may be multiple samples per row / other special formatting.
Describe it here.]
## What amplican needs (12 columns, in this exact order)
1. ID - unique experiment ID (use sample name + lane)
2. Barcode - group sharing same FASTQ pair (use well + lane)
3. Forward_Reads - FASTQ filename (NOT full path)
4. Reverse_Reads - FASTQ filename (NOT full path)
5. Group - user-defined grouping (use sample name)
6. Control - 0 or 1 (1 if this is a control/mock)
7. guideRNA - guide RNA sequence, 5' to 3'
8. Forward_Primer - left anchoring sequence (must be found in amplicon)
9. Reverse_Primer - right anchoring sequence, 5' to 3'
10. Direction - 0 or 1 (0 if guide on forward strand, 1 if on reverse)
11. Amplicon - amplicon sequence (lowercase, with guide region in UPPER CASE)
12. Donor - donor template, or empty
## Rules for the Amplicon column
The amplicon sequence must have the guide region in UPPER CASE and everything
else in lowercase. For each amplicon:
- Start with the full amplicon in lowercase.
- Find the guide (or its reverse complement) in the amplicon using fuzzy
matching (allow up to 3 mismatches).
- Uppercase the guide match plus 10 bp on each side.
- If the guide matches forward: Direction = 0.
- If only the reverse complement matches: Direction = 1.
For control/mock samples, uppercase the entire amplicon except the primer
regions at both ends. These samples will use cut_buffer = 0 in the pipeline.
## Rules for primers
The Forward_Primer must appear in the amplicon as-is.
The reverse complement of the Reverse_Primer must appear in the amplicon.
Trim the amplicon to span exactly from forward primer start to reverse primer
RC end.
## Output
Write the result as a CSV file named "config.csv" with no row names.
Column names in the CSV are ignored by amplican (only column order matters),
but use descriptive names anyway.
Please use the Biostrings package for reverse complement and the pwalign
package for sequence matching. Include a validation step that checks:
- All IDs are unique
- All primers are found in their amplicons
- All guides are found in their amplicons (with mismatches allowed)
- All amplicons contain at least some UPPER CASE letters
This function implements Strategy A: find the guide in the amplicon and uppercase the guide region plus a configurable buffer.
library(Biostrings)
uppercase_by_guide <- function(amplicon, guide, buffer = 10,
max_mismatch = 3) {
amp_seq <- tolower(amplicon)
guide_lc <- tolower(guide)
# Try forward strand first
match_fwd <- matchPattern(guide_lc, DNAString(amp_seq),
max.mismatch = max_mismatch)
if (length(match_fwd) > 0) {
direction <- 0
s <- start(match_fwd)[1]
e <- end(match_fwd)[1]
} else {
# Try reverse complement of guide
guide_rc <- tolower(as.character(reverseComplement(DNAString(guide_lc))))
match_rev <- matchPattern(guide_rc, DNAString(amp_seq),
max.mismatch = max_mismatch)
if (length(match_rev) > 0) {
direction <- 1
s <- start(match_rev)[1]
e <- end(match_rev)[1]
} else {
warning("Guide not found in amplicon. Returning all lowercase.")
return(list(amplicon = amp_seq, direction = NA))
}
}
# Uppercase guide match plus buffer, clamped to amplicon bounds
up_start <- max(1, s - buffer)
up_end <- min(nchar(amp_seq), e + buffer)
substr(amp_seq, up_start, up_end) <-
toupper(substr(amp_seq, up_start, up_end))
list(amplicon = amp_seq, direction = direction)
}
Usage:
result <- uppercase_by_guide(
amplicon = "aagctgacggctaaatgaaaaatgtcaaacatctgttccaggtgctgcgtatgccagggcagaggAGGTGGTCAGGGAACTGGtggaggtcactgggataccctttcttcccacaccaatggggaaaggagtcctgccagatgaccatcccaactgtgttgctgcagccagatccaggtgtgtttgcgcttgtgtaatt",
guide = "AGGTGGTCAGGGAACTGG",
buffer = 10
)
# result$amplicon has the guide region in UPPER CASE
# result$direction is 0 or 1
This function implements Strategy B: uppercase everything, then re-lowercase the primer regions at both ends.
library(Biostrings)
uppercase_whole_interior <- function(amplicon, forward_primer, reverse_primer) {
amp_seq <- toupper(amplicon)
amp_lc <- tolower(amplicon)
fwd_len <- nchar(forward_primer)
rev_rc <- tolower(as.character(reverseComplement(DNAString(reverse_primer))))
# Lowercase forward primer region at the start
substr(amp_seq, 1, fwd_len) <- tolower(substr(amp_seq, 1, fwd_len))
# Find and lowercase reverse primer RC near the end
rev_pos <- regexpr(rev_rc, amp_lc, fixed = TRUE)
if (rev_pos > 0) {
rev_end <- rev_pos + nchar(rev_rc) - 1
substr(amp_seq, rev_pos, rev_end) <-
tolower(substr(amp_seq, rev_pos, rev_end))
} else {
warning("Reverse primer RC not found in amplicon.")
}
amp_seq
}
With this strategy, set cut_buffer = 0 in the pipeline, since the entire
interior is already marked.
This helper determines whether the guide is on the forward or reverse strand.
library(Biostrings)
detect_direction <- function(amplicon, guide, max_mismatch = 3) {
amp_lc <- tolower(amplicon)
guide_lc <- tolower(guide)
match_fwd <- matchPattern(guide_lc, DNAString(amp_lc),
max.mismatch = max_mismatch)
if (length(match_fwd) > 0) return(0)
guide_rc <- tolower(as.character(reverseComplement(DNAString(guide_lc))))
match_rev <- matchPattern(guide_rc, DNAString(amp_lc),
max.mismatch = max_mismatch)
if (length(match_rev) > 0) return(1)
NA # guide not found on either strand
}
Run this after generating your config to catch common mistakes before feeding it to the pipeline.
library(Biostrings)
validate_config <- function(config) {
# 1. Unique IDs
dup_ids <- duplicated(config$ID)
if (any(dup_ids)) {
warning("Duplicate IDs: ", toString(unique(config$ID[dup_ids])))
}
# 2. No NA in required columns
required <- c("ID", "Barcode", "Forward_Primer", "Reverse_Primer",
"guideRNA", "Direction", "Amplicon")
for (col in required) {
if (any(is.na(config[[col]]) | config[[col]] == "")) {
warning("Empty values in column: ", col)
}
}
# 3. Forward primer found in amplicon
for (i in seq_len(nrow(config))) {
amp <- toupper(config$Amplicon[i])
fp <- toupper(config$Forward_Primer[i])
if (nzchar(fp) && !grepl(fp, amp, fixed = TRUE)) {
warning("Forward primer not in amplicon for ID: ", config$ID[i])
}
}
# 4. Reverse primer RC found in amplicon
for (i in seq_len(nrow(config))) {
amp <- toupper(config$Amplicon[i])
rp <- config$Reverse_Primer[i]
if (nzchar(rp)) {
rp_rc <- toupper(as.character(reverseComplement(DNAString(rp))))
if (!grepl(rp_rc, amp, fixed = TRUE)) {
warning("Reverse primer RC not in amplicon for ID: ", config$ID[i])
}
}
}
# 5. Guide found in amplicon (allowing mismatches)
for (i in seq_len(nrow(config))) {
guide <- config$guideRNA[i]
amp <- tolower(config$Amplicon[i])
if (nzchar(guide)) {
fwd <- matchPattern(tolower(guide), DNAString(amp), max.mismatch = 3)
if (length(fwd) == 0) {
guide_rc <- tolower(as.character(reverseComplement(DNAString(guide))))
rev <- matchPattern(guide_rc, DNAString(amp), max.mismatch = 3)
if (length(rev) == 0) {
warning("Guide not found in amplicon for ID: ", config$ID[i])
}
}
}
}
# 6. Amplicon has at least some UPPER CASE
for (i in seq_len(nrow(config))) {
if (!grepl("[A-Z]", config$Amplicon[i])) {
warning("No UPPER CASE in amplicon for ID: ", config$ID[i],
" -- whole amplicon will be treated as cut site")
}
}
# 7. Unique barcode-primer combinations
combo <- paste(config$Barcode, config$Forward_Primer, config$Reverse_Primer)
dup_combo <- duplicated(combo)
if (any(dup_combo)) {
warning("Non-unique barcode/primer combos for rows: ",
toString(which(dup_combo)))
}
message("Validation complete. Check warnings above for any issues.")
}
If you have HDR experiments with a donor template, the donor sequence in your spreadsheet may be shorter than the full amplicon. Amplican expects the donor to span the same region as the amplicon (between the primers). You can extend a short donor to full length by aligning it against the amplicon and embedding it into a copy of the amplicon sequence.
library(Biostrings)
library(pwalign)
extend_donor <- function(donor, amplicon,
scoring_matrix = nucleotideSubstitutionMatrix(
match = 5, mismatch = -4, baseOnly = FALSE,
type = "DNA"),
gap_opening = 25, gap_extension = 0) {
if (nchar(donor) == 0) return("")
amp_upper <- toupper(amplicon)
donor_seq <- toupper(donor)
# Align donor to amplicon
aln <- pairwiseAlignment(
DNAStringSet(donor_seq), DNAStringSet(amp_upper),
substitutionMatrix = scoring_matrix, type = "overlap",
gapOpening = gap_opening, gapExtension = gap_extension)
# Embed donor into a copy of the full amplicon
extended <- amp_upper
stringr::str_sub(extended, start(subject(aln)), end(subject(aln))) <- donor_seq
# Validate: re-align amplicon to extended donor, score should not drop
aln_check <- pairwiseAlignment(
DNAStringSet(amp_upper), DNAStringSet(extended),
substitutionMatrix = scoring_matrix, type = "overlap",
gapOpening = gap_opening, gapExtension = gap_extension)
if (score(aln_check) < score(aln)) {
warning("Donor validation score dropped: ",
round(score(aln_check)), " < ", round(score(aln)))
}
extended
}
This works because the overlap alignment finds where the donor matches within the amplicon, and the donor bases replace the corresponding amplicon bases. Positions outside the donor alignment retain the amplicon sequence, so the result spans the full primer-to-primer window.
Before running amplicanPipeline, verify:
fastq_foldercut_buffer = 0 to the pipelinecut_buffer = 5A complete workflow looks like this:
readxl or openxlsx package).config.csv.validate_config() on the result.amplicanPipeline(config, fastq_folder, results_folder).If you have both treatment and control samples in the same config (recommended
for normalization), use guide + 10 bp buffer for treatments and whole-interior
uppercase for controls. Pass cut_buffer = 0 if controls dominate, or keep the
default cut_buffer = 5 if treatments dominate and you want precise cut-site
detection. The cut_buffer is a single global setting – choose based on your
primary analysis goal.