Preparing amplican config from Excel with LLM assistance

This vignette is a practical guide for transforming your experiment spreadsheet (Excel/xlsx) into a valid amplican configuration file using code generated by a large language model (LLM). It covers what amplican expects, how to format the amplicon sequence correctly, how primers work, and includes copy-pasteable example functions and an LLM prompt template.

Why transform through code, not by hand

It is tempting to open your Excel file, shuffle columns around, and save as CSV. Do not do this. Manual editing is error-prone, untraceable, and impossible to reproduce. When your data changes (a new sequencing run, a corrected sample name), you would have to redo the entire process by hand.

Instead, write a script that reads your Excel file and produces the config CSV programmatically. This approach has several advantages:

  • Reproducible: anyone can re-run the script and get the same config.
  • Traceable: the script documents every transformation step.
  • Reusable: when new data arrives, just re-run the script.
  • Shareable: colleagues can inspect and modify the logic.
  • Auditable: you can see exactly how each column was derived.

An LLM (such as ChatGPT, Claude, or similar) can write most of this script for you, given the right instructions. This vignette provides those instructions.

The 12 required columns

The amplican config file is a CSV where column order matters, not column names. When amplican reads your config, it renames the first 12 columns internally based on their position (see amplicanAlign.R). Any additional columns beyond the 12th are preserved by name and ignored by the pipeline.

The 12 columns, in order:

# Column Type Description
1 ID character Unique identifier for this experiment (e.g., ID_1, SampleA_L001)
2 Barcode character Groups experiments sharing the same FASTQ file pair (e.g., barcode_1, A1_L001)
3 Forward_Reads character Filename of the forward FASTQ (e.g., R1_001.fastq.gz). Not a full path – amplican prepends the fastq_folder you pass to amplicanPipeline. Leave empty if using only reverse reads.
4 Reverse_Reads character Filename of the reverse FASTQ. Leave empty if using only forward reads.
5 Group character User-defined grouping (e.g., experimenter name, treatment condition). Used in group-level reports.
6 Control 0 or 1 1 if this row is a control sample (used for normalization). 0 otherwise.
7 guideRNA character Guide RNA sequence, 5’ to 3’. Does not need to be reverse-complemented – amplican handles that based on the Direction column.
8 Forward_Primer character Anchoring sequence on the left side of the amplicon (see “Primers” section below).
9 Reverse_Primer character Anchoring sequence on the right side of the amplicon. Written 5’ to 3’ as you would order it from a primer supplier.
10 Direction 0 or 1 0 if the guide is found on the forward strand of the amplicon. 1 if the guide needs to be reverse-complemented to match.
11 Amplicon character Full amplicon sequence. Case matters: UPPER CASE letters mark the expected cut site; lowercase letters mark the rest. See the next section.
12 Donor character Donor template sequence for HDR experiments. Leave empty if no donor. The donor should span the same region as the amplicon (between primers).

Important notes:

  • Column names in your CSV are ignored for the first 12 columns. Only the order matters. You can name them whatever you like, but the data must be in the order shown above.
  • Forward_Reads and Reverse_Reads are filenames, not full paths. Amplican combines them with the fastq_folder argument.
  • Control is treated as logical: 0 = FALSE (treatment), 1 = TRUE (control). Normalization uses control rows to remove background events from treatment rows.
  • Direction is also treated as logical: 0 or 1.

The Amplicon column: case is everything

The Amplicon column is the most critical and most error-prone part of the config. Amplican uses letter case to distinguish the region where CRISPR is expected to cut (UPPER CASE) from the rest of the amplicon (lowercase).

  • lowercase letters: background sequence (primers, introns, flanking regions). Events here are still detected but are treated as non-cut-site for summary statistics.
  • UPPER CASE letters: the expected cut-site region. Events overlapping this region (expanded by cut_buffer) are counted toward Reads_Del, Reads_In, Reads_Edited, and Reads_Frameshifted.

If your amplicon has no UPPER CASE letters at all, amplican falls back to using the entire amplicon as the cut-site region (with a warning). This effectively disables spatial filtering, so all events are counted regardless of position.

There are two common strategies for uppercasing, each with a different recommended cut_buffer setting:

Strategy A: uppercase the guide plus a buffer (typical)

This is the most common approach for treatment samples. You start with the full amplicon in lowercase, find where the guide maps, and uppercase the guide region plus some extra bases on each side as a buffer.

Recommended buffer: 10 bp on each side of the guide match.

With this strategy, use the pipeline’s default cut_buffer = 5, which adds another 5 bp of tolerance on each side. The total effective window is then guide ± 15 bp.

This strategy is precise: only events near the actual cut site are counted as editing events. Background mismatches elsewhere in the amplicon are excluded from the summary statistics.

Strategy B: uppercase the whole interior (for controls or broad analysis)

Instead of marking only the guide region, you uppercase everything between the primers and leave only the primer regions in lowercase. This means every event in the editable region counts as a cut-site event.

With this strategy, set cut_buffer = 0 in the pipeline, because the UPPER CASE region already covers the entire interior – there is no need for additional expansion.

This is particularly useful for control samples, where you want to detect all background variation regardless of position. It is also used when you care about events across the whole amplicon, not just at the cut site.

Summary of buffer interaction

Strategy Config uppercase region Pipeline cut_buffer Total effective window
Guide + buffer (typical) guide ± 10 bp 5 (default) guide ± 15 bp
Guide + buffer (precise) guide ± 10 bp 0 guide ± 10 bp
Whole interior (controls) everything between primers 0 entire interior

Direction: 0 or 1

The Direction column tells amplican whether the guideRNA sequence is found on the forward strand of the amplicon (Direction = 0) or needs to be reverse-complemented to match (Direction = 1).

In practice:

  • Direction = 0: The guide sequence, as written in the config, appears somewhere in the amplicon (reading left to right).
  • Direction = 1: The reverse complement of the guide appears in the amplicon. Amplican will reverse-complement the guide internally before searching.

To determine the direction automatically, try searching for the guide in the amplicon. If it matches directly, Direction = 0. If it does not match, reverse-complement the guide and try again. If the reverse complement matches, Direction = 1. The detect_direction() function below shows how to do this.

Direction also affects plotting: for Direction = 1 amplicons, amplican flips all coordinates to the forward-strand perspective, so events from different amplicons can be compared on the same relative scale.

Primers: anchoring sequences, not necessarily PCR primers

One of the most important concepts in amplican is that the Forward_Primer and Reverse_Primer columns do not need to contain your actual PCR primer sequences. They are anchoring sequences – two short sequences that flank the region of interest and help amplican position reads during alignment.

What primers do in amplican

  1. Read assignment: When amplican reads a FASTQ file, it searches each read for the forward primer (in the forward read) and the reverse primer (in the reverse read). Reads that contain the primers are assigned to that experiment. This is how amplican knows which reads belong to which experiment.

  2. Alignment positioning: Before aligning a read to the amplicon, amplican trims everything before the primer in the read. This ensures the read and amplicon start at the same position, which improves alignment quality.

  3. Coordinate system: The forward primer defines the left boundary of the amplicon coordinate system. The reverse primer defines the right boundary.

Rules for primers

  • Forward_Primer must be found inside the Amplicon sequence as-is (reading left to right on the forward strand).
  • Reverse_Primer is reverse-complemented internally before searching. So the reverse complement of your Reverse_Primer must be found in the Amplicon. Write the Reverse_Primer in 5’ to 3’ order (as you would order it from a synthesis service).
  • Both primers must be found, or the pipeline stops with an error.
  • The combination of Barcode + Forward_Primer + Reverse_Primer must be unique across all rows (no two experiments in the same barcode can share the same primer pair).
  • Primers can be shorter than the actual PCR primers. Using a sub-sequence of the PCR primer (e.g., the 3’ half) as the anchoring sequence is fine – it just needs to be unique within the amplicon.
  • If you only have forward reads, leave Reverse_Primer empty and set fastqfiles = 1 in the pipeline.

Practical tip

If your reads are primer-trimmed (the sequencing facility removed the primer sequences), use fastqfiles = 0.5 in the pipeline. This mode only requires one of the two primers to be found, which handles the case where one read has had its primer removed.

The LLM workflow

You can use any LLM (ChatGPT, Claude, Gemini, etc.) to generate a transformation script tailored to your Excel layout. Here is the recommended workflow:

Step 1: Export your column headers

Open your Excel file and note the column headers. Also note any special formatting: merged cells, multiple samples per row, blank rows, etc. Write this down as plain text.

Step 2: Prepare the prompt

Copy the prompt template below, fill in your column information, and paste it into your LLM of choice.

Step 3: Review and run the generated code

The LLM will produce an R script. Review it, make sure the column mappings are correct, install any required packages, and run it. Always inspect the output config before running the full pipeline.

Prompt template

Copy and paste the text below into your LLM. Replace the bracketed sections with your information.

I need an R script that reads my Excel experiment sheet and produces a CSV
config file for the amplican R package.

## My Excel file

File name: [my_experiments.xlsx]
Sheet name: [Sheet1]

My columns are (in order):
1. [Well]            - 96-well plate position, e.g. A1, B3
2. [Index_i7]        - i7 barcode sequence
3. [Index_i5]        - i5 barcode sequence
4. [Sample]          - sample name
5. [Forward_Primer]  - forward primer sequence
6. [Reverse_Primer]  - reverse primer sequence
7. [Guide]           - guide RNA sequence
8. [Repair_Template] - donor template sequence (may be empty)

[Note: there may be multiple samples per row / other special formatting.
Describe it here.]

## What amplican needs (12 columns, in this exact order)

1. ID              - unique experiment ID (use sample name + lane)
2. Barcode         - group sharing same FASTQ pair (use well + lane)
3. Forward_Reads   - FASTQ filename (NOT full path)
4. Reverse_Reads   - FASTQ filename (NOT full path)
5. Group           - user-defined grouping (use sample name)
6. Control         - 0 or 1 (1 if this is a control/mock)
7. guideRNA        - guide RNA sequence, 5' to 3'
8. Forward_Primer  - left anchoring sequence (must be found in amplicon)
9. Reverse_Primer  - right anchoring sequence, 5' to 3'
10. Direction      - 0 or 1 (0 if guide on forward strand, 1 if on reverse)
11. Amplicon       - amplicon sequence (lowercase, with guide region in UPPER CASE)
12. Donor          - donor template, or empty

## Rules for the Amplicon column

The amplicon sequence must have the guide region in UPPER CASE and everything
else in lowercase. For each amplicon:
- Start with the full amplicon in lowercase.
- Find the guide (or its reverse complement) in the amplicon using fuzzy
  matching (allow up to 3 mismatches).
- Uppercase the guide match plus 10 bp on each side.
- If the guide matches forward: Direction = 0.
- If only the reverse complement matches: Direction = 1.

For control/mock samples, uppercase the entire amplicon except the primer
regions at both ends. These samples will use cut_buffer = 0 in the pipeline.

## Rules for primers

The Forward_Primer must appear in the amplicon as-is.
The reverse complement of the Reverse_Primer must appear in the amplicon.
Trim the amplicon to span exactly from forward primer start to reverse primer
RC end.

## Output

Write the result as a CSV file named "config.csv" with no row names.
Column names in the CSV are ignored by amplican (only column order matters),
but use descriptive names anyway.

Please use the Biostrings package for reverse complement and the pwalign
package for sequence matching. Include a validation step that checks:
- All IDs are unique
- All primers are found in their amplicons
- All guides are found in their amplicons (with mismatches allowed)
- All amplicons contain at least some UPPER CASE letters

Example: uppercase by guide with buffer

This function implements Strategy A: find the guide in the amplicon and uppercase the guide region plus a configurable buffer.

library(Biostrings)

uppercase_by_guide <- function(amplicon, guide, buffer = 10,
                                max_mismatch = 3) {
  amp_seq <- tolower(amplicon)
  guide_lc <- tolower(guide)

  # Try forward strand first
  match_fwd <- matchPattern(guide_lc, DNAString(amp_seq),
                            max.mismatch = max_mismatch)
  if (length(match_fwd) > 0) {
    direction <- 0
    s <- start(match_fwd)[1]
    e <- end(match_fwd)[1]
  } else {
    # Try reverse complement of guide
    guide_rc <- tolower(as.character(reverseComplement(DNAString(guide_lc))))
    match_rev <- matchPattern(guide_rc, DNAString(amp_seq),
                              max.mismatch = max_mismatch)
    if (length(match_rev) > 0) {
      direction <- 1
      s <- start(match_rev)[1]
      e <- end(match_rev)[1]
    } else {
      warning("Guide not found in amplicon. Returning all lowercase.")
      return(list(amplicon = amp_seq, direction = NA))
    }
  }

  # Uppercase guide match plus buffer, clamped to amplicon bounds
  up_start <- max(1, s - buffer)
  up_end <- min(nchar(amp_seq), e + buffer)
  substr(amp_seq, up_start, up_end) <-
    toupper(substr(amp_seq, up_start, up_end))

  list(amplicon = amp_seq, direction = direction)
}

Usage:

result <- uppercase_by_guide(
  amplicon = "aagctgacggctaaatgaaaaatgtcaaacatctgttccaggtgctgcgtatgccagggcagaggAGGTGGTCAGGGAACTGGtggaggtcactgggataccctttcttcccacaccaatggggaaaggagtcctgccagatgaccatcccaactgtgttgctgcagccagatccaggtgtgtttgcgcttgtgtaatt",
  guide = "AGGTGGTCAGGGAACTGG",
  buffer = 10
)
# result$amplicon has the guide region in UPPER CASE
# result$direction is 0 or 1

Example: uppercase whole interior (for controls)

This function implements Strategy B: uppercase everything, then re-lowercase the primer regions at both ends.

library(Biostrings)

uppercase_whole_interior <- function(amplicon, forward_primer, reverse_primer) {
  amp_seq <- toupper(amplicon)
  amp_lc  <- tolower(amplicon)

  fwd_len <- nchar(forward_primer)
  rev_rc  <- tolower(as.character(reverseComplement(DNAString(reverse_primer))))

  # Lowercase forward primer region at the start
  substr(amp_seq, 1, fwd_len) <- tolower(substr(amp_seq, 1, fwd_len))

  # Find and lowercase reverse primer RC near the end
  rev_pos <- regexpr(rev_rc, amp_lc, fixed = TRUE)
  if (rev_pos > 0) {
    rev_end <- rev_pos + nchar(rev_rc) - 1
    substr(amp_seq, rev_pos, rev_end) <-
      tolower(substr(amp_seq, rev_pos, rev_end))
  } else {
    warning("Reverse primer RC not found in amplicon.")
  }

  amp_seq
}

With this strategy, set cut_buffer = 0 in the pipeline, since the entire interior is already marked.

Example: detect direction

This helper determines whether the guide is on the forward or reverse strand.

library(Biostrings)

detect_direction <- function(amplicon, guide, max_mismatch = 3) {
  amp_lc  <- tolower(amplicon)
  guide_lc <- tolower(guide)

  match_fwd <- matchPattern(guide_lc, DNAString(amp_lc),
                            max.mismatch = max_mismatch)
  if (length(match_fwd) > 0) return(0)

  guide_rc <- tolower(as.character(reverseComplement(DNAString(guide_lc))))
  match_rev <- matchPattern(guide_rc, DNAString(amp_lc),
                            max.mismatch = max_mismatch)
  if (length(match_rev) > 0) return(1)

  NA  # guide not found on either strand
}

Example: validate config

Run this after generating your config to catch common mistakes before feeding it to the pipeline.

library(Biostrings)

validate_config <- function(config) {

  # 1. Unique IDs
  dup_ids <- duplicated(config$ID)
  if (any(dup_ids)) {
    warning("Duplicate IDs: ", toString(unique(config$ID[dup_ids])))
  }

  # 2. No NA in required columns
  required <- c("ID", "Barcode", "Forward_Primer", "Reverse_Primer",
                "guideRNA", "Direction", "Amplicon")
  for (col in required) {
    if (any(is.na(config[[col]]) | config[[col]] == "")) {
      warning("Empty values in column: ", col)
    }
  }

  # 3. Forward primer found in amplicon
  for (i in seq_len(nrow(config))) {
    amp <- toupper(config$Amplicon[i])
    fp  <- toupper(config$Forward_Primer[i])
    if (nzchar(fp) && !grepl(fp, amp, fixed = TRUE)) {
      warning("Forward primer not in amplicon for ID: ", config$ID[i])
    }
  }

  # 4. Reverse primer RC found in amplicon
  for (i in seq_len(nrow(config))) {
    amp <- toupper(config$Amplicon[i])
    rp  <- config$Reverse_Primer[i]
    if (nzchar(rp)) {
      rp_rc <- toupper(as.character(reverseComplement(DNAString(rp))))
      if (!grepl(rp_rc, amp, fixed = TRUE)) {
        warning("Reverse primer RC not in amplicon for ID: ", config$ID[i])
      }
    }
  }

  # 5. Guide found in amplicon (allowing mismatches)
  for (i in seq_len(nrow(config))) {
    guide <- config$guideRNA[i]
    amp   <- tolower(config$Amplicon[i])
    if (nzchar(guide)) {
      fwd <- matchPattern(tolower(guide), DNAString(amp), max.mismatch = 3)
      if (length(fwd) == 0) {
        guide_rc <- tolower(as.character(reverseComplement(DNAString(guide))))
        rev <- matchPattern(guide_rc, DNAString(amp), max.mismatch = 3)
        if (length(rev) == 0) {
          warning("Guide not found in amplicon for ID: ", config$ID[i])
        }
      }
    }
  }

  # 6. Amplicon has at least some UPPER CASE
  for (i in seq_len(nrow(config))) {
    if (!grepl("[A-Z]", config$Amplicon[i])) {
      warning("No UPPER CASE in amplicon for ID: ", config$ID[i],
              " -- whole amplicon will be treated as cut site")
    }
  }

  # 7. Unique barcode-primer combinations
  combo <- paste(config$Barcode, config$Forward_Primer, config$Reverse_Primer)
  dup_combo <- duplicated(combo)
  if (any(dup_combo)) {
    warning("Non-unique barcode/primer combos for rows: ",
            toString(which(dup_combo)))
  }

  message("Validation complete. Check warnings above for any issues.")
}

Donor extension (advanced)

If you have HDR experiments with a donor template, the donor sequence in your spreadsheet may be shorter than the full amplicon. Amplican expects the donor to span the same region as the amplicon (between the primers). You can extend a short donor to full length by aligning it against the amplicon and embedding it into a copy of the amplicon sequence.

library(Biostrings)
library(pwalign)

extend_donor <- function(donor, amplicon,
                         scoring_matrix = nucleotideSubstitutionMatrix(
                           match = 5, mismatch = -4, baseOnly = FALSE,
                           type = "DNA"),
                         gap_opening = 25, gap_extension = 0) {
  if (nchar(donor) == 0) return("")

  amp_upper <- toupper(amplicon)
  donor_seq <- toupper(donor)

  # Align donor to amplicon
  aln <- pairwiseAlignment(
    DNAStringSet(donor_seq), DNAStringSet(amp_upper),
    substitutionMatrix = scoring_matrix, type = "overlap",
    gapOpening = gap_opening, gapExtension = gap_extension)

  # Embed donor into a copy of the full amplicon
  extended <- amp_upper
  stringr::str_sub(extended, start(subject(aln)), end(subject(aln))) <- donor_seq

  # Validate: re-align amplicon to extended donor, score should not drop
  aln_check <- pairwiseAlignment(
    DNAStringSet(amp_upper), DNAStringSet(extended),
    substitutionMatrix = scoring_matrix, type = "overlap",
    gapOpening = gap_opening, gapExtension = gap_extension)

  if (score(aln_check) < score(aln)) {
    warning("Donor validation score dropped: ",
            round(score(aln_check)), " < ", round(score(aln)))
  }

  extended
}

This works because the overlap alignment finds where the donor matches within the amplicon, and the donor bases replace the corresponding amplicon bases. Positions outside the donor alignment retain the amplicon sequence, so the result spans the full primer-to-primer window.

Validation checklist

Before running amplicanPipeline, verify:

Putting it all together

A complete workflow looks like this:

  1. Export your Excel sheet to CSV (or read the xlsx directly in R with the readxl or openxlsx package).
  2. Use the LLM prompt template to generate a transformation script tailored to your column layout.
  3. The script should: read the spreadsheet, map columns, look up amplicons by primer, uppercase by guide (or whole interior for controls), detect direction, extend donors if needed, and write config.csv.
  4. Run validate_config() on the result.
  5. Run amplicanPipeline(config, fastq_folder, results_folder).

If you have both treatment and control samples in the same config (recommended for normalization), use guide + 10 bp buffer for treatments and whole-interior uppercase for controls. Pass cut_buffer = 0 if controls dominate, or keep the default cut_buffer = 5 if treatments dominate and you want precise cut-site detection. The cut_buffer is a single global setting – choose based on your primary analysis goal.