Using the Anacapa Pipeline

Before starting, take a look at these files, which control the pipeline's default behavior:

File Description
Anacapa_db/scripts/anacapa_vars.sh File containing default variables used in the Anacapa pipeline. You can modify this file to change default settings.
Anacapa_db/metabarcode_loci_min_merge_length.txt Configuration file for dada2, used to set the minimum overlap (merge) length for each metabarcode locus.
Anacapa_db/forward_primers.txt Default forward primers used for trimming and sorting.
Anacapa_db/reverse_primers.txt Default reverse primers used for trimming and sorting.

You can edit these files using the terminal with the nano command. For example:

nano Anacapa_db/scripts/anacapa_vars.sh

Once you've made your changes, exit with ^X (Control + X), then press Y to save, and Enter to confirm the filename.
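
Since these files hold the pipeline defaults, it can help to keep a copy of the original before editing. The following is a minimal sketch, not part of Anacapa: the helper name backup_config and the .bak suffix are assumptions chosen for illustration.

```shell
# Hypothetical helper (not part of Anacapa): copy a config file to
# <file>.bak before editing, unless a backup already exists.
backup_config() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "No such file: $file" >&2
        return 1
    fi
    # Keep the first backup so it always reflects the original defaults.
    if [ ! -e "$file.bak" ]; then
        cp -p "$file" "$file.bak"
    fi
}

# Usage (path taken from the table above):
#   backup_config Anacapa_db/scripts/anacapa_vars.sh
#   nano Anacapa_db/scripts/anacapa_vars.sh
# To restore the defaults later:
#   cp Anacapa_db/scripts/anacapa_vars.sh.bak Anacapa_db/scripts/anacapa_vars.sh
```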

First Half - QC and DADA2

The first half of the Anacapa pipeline can be run with Anacapa_db/anacapa_QC_dada2.sh.

Example:

Anacapa_db/anacapa_QC_dada2.sh -i Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_test_data -o out -d Anacapa_db -f Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_test_data/forward.txt -r Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_test_data/reverse.txt -e Anacapa_db/metabarcode_loci_min_merge_length.txt -a nextera -t MiSeq -l

Breakdown of the example command:

Command Argument Description
-i [filepath] Path to folder with .fastq.gz files
-o out Path to output directory. If it doesn't exist, it will be created.
-d Anacapa_db Path to Anacapa_db. This is normally the same for every run.
-f [filepath] Path to file with forward primers
-r [filepath] Path to file with reverse primers
-e [filepath] File path to a list of minimum length(s) required for paired F and R reads to overlap
-a nextera Illumina adapter type
-t MiSeq Illumina platform
-l Indicates running locally. This is always needed, because the original Anacapa was designed for use on the UCLA HPC.
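
The example command is long mostly because the same paths are repeated. One way to keep it readable is to store the paths in variables and print the assembled command before running it. This is only a sketch: the variable names, the run and qc_dada2 helpers, and the DRY_RUN switch are assumptions for illustration, while the script path and arguments are the ones from the example above.

```shell
# Paths from the example above; the variable names are illustrative only.
DB=Anacapa_db
DATA=Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_test_data
OUT=out

# Hypothetical helper: print the command by default, run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Same arguments as the example command, one group per line.
qc_dada2() {
    run "$DB/anacapa_QC_dada2.sh" \
        -i "$DATA" \
        -o "$OUT" \
        -d "$DB" \
        -f "$DATA/forward.txt" \
        -r "$DATA/reverse.txt" \
        -e "$DB/metabarcode_loci_min_merge_length.txt" \
        -a nextera -t MiSeq -l
}

# Prints the command; set DRY_RUN=0 beforehand to execute it instead.
qc_dada2
```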

Other arguments not used in the example:

Command Argument Description
-g Use if the .fastq reads are not compressed
-c To modify the allowed cutadapt error for 3' adapter and 5' primer adapter trimming: 0.0 to 1.0 (default 0.3)
-p To modify the allowed cutadapt error for 3' primer sorting and trimming: 0.0 to 1.0 (default 0.3)
-q To modify the minimum quality score allowed: 0 - 40 (default 35)
-m To modify the minimum length after quality trimming: 0 - 300 (default 100)
-x To modify the additional 5' trimming of forward reads: 0 - 300 (default HiSeq 10, default MiSeq 20)
-y To modify the additional 5' trimming of reverse reads: 0 - 300 (default HiSeq 25, default MiSeq 50)
-b To modify the number of occurrences required to keep an ASV: 0 - any integer (default 0)
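
When overriding the integer options (-q, -m, -x, -y), it may be worth checking the value against the range listed above before starting a long run. The sketch below is a hypothetical pre-check, not something Anacapa provides; the in_range helper and the override values 30 and 80 are assumptions for illustration only.

```shell
# Hypothetical helper: succeed only if $1 is a non-negative integer
# between $2 and $3 inclusive.
in_range() {
    value="$1"; min="$2"; max="$3"
    case "$value" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$value" -ge "$min" ] && [ "$value" -le "$max" ]
}

# Illustrative overrides, checked against the ranges in the table above.
MIN_QUALITY=30   # -q, allowed 0 - 40
MIN_LENGTH=80    # -m, allowed 0 - 300

if in_range "$MIN_QUALITY" 0 40 && in_range "$MIN_LENGTH" 0 300; then
    # These would be appended to the anacapa_QC_dada2.sh command.
    echo "extra arguments: -q $MIN_QUALITY -m $MIN_LENGTH"
else
    echo "override out of range" >&2
fi
```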

Second Half - Classification

The second half of the Anacapa pipeline can be run with Anacapa_db/anacapa_classifier.sh.

Example:

Anacapa_db/anacapa_classifier.sh -d Anacapa_db -o out -l

Breakdown of the example command:

Command Argument Description
-d Anacapa_db Path to Anacapa_db. This is normally the same for every run.
-o out Path to the output directory generated by the first half (the Sequence QC and ASV Parsing script). The classifier reads its input from this directory and modifies it in place, so the same path serves as both input and output.
-l Indicates running locally. This is always needed, because the original Anacapa was designed for use on the UCLA HPC.
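
Because the classifier takes the first half's output directory as its input, a quick check that the directory exists and is not empty can catch a mistyped path early. This is a sketch under that assumption only: check_qc_output is a hypothetical helper, and it does not verify which files the first half actually produced.

```shell
# Hypothetical helper: fail if the given directory is missing or empty.
check_qc_output() {
    out_dir="$1"
    if [ ! -d "$out_dir" ]; then
        echo "Output directory not found: $out_dir" >&2
        return 1
    fi
    if [ -z "$(ls -A "$out_dir")" ]; then
        echo "Output directory is empty: $out_dir" >&2
        return 1
    fi
}

# Usage with the example command above:
#   check_qc_output out && Anacapa_db/anacapa_classifier.sh -d Anacapa_db -o out -l
```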

Other arguments not used in the example:

Command Argument Description
-b Percent of mismatch allowed between the query and subject for BLCA: 0.0 to 1.0 (default 0.8)
-p Minimum percent of length of the subject relative to the query for BLCA: 0.0 to 1.0 (default 0.8)
-c A list of BCC cut-off values used to report taxonomy: "0 to 100", quotes required (default "40 50 60 70 80 90 95"). The file must contain a line in the following format: PERCENT="40 50 60 70 80 90 95 100". The values may differ, but the PERCENT="values" format is required. See Anacapa_db/scripts/BCC_default_cut_off.sh for an example.
-n BLCA number of times to bootstrap: integer value (default 100)
-m Muscle alignment match score: default 1
-f Muscle alignment mismatch score: default -2.5
-g Muscle alignment gap penalty: default -2
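
The PERCENT="values" format described for -c can be generated rather than typed by hand. The sketch below writes such a file; the helper name write_cutoff_file, the file name my_BCC_cut_off.sh, and the chosen values are assumptions for illustration, and it assumes -c is given the path to the file, as the reference to BCC_default_cut_off.sh suggests.

```shell
# Hypothetical helper: write a cut-off file in the PERCENT="..." format.
# $1 is the file to create; the remaining arguments are the cut-off values.
write_cutoff_file() {
    file="$1"
    shift
    # "$*" joins the remaining arguments with single spaces.
    printf 'PERCENT="%s"\n' "$*" > "$file"
}

# Usage (file name and values are illustrative):
#   write_cutoff_file my_BCC_cut_off.sh 60 70 80 90
#   cat my_BCC_cut_off.sh
# Assuming -c accepts the file path:
#   Anacapa_db/anacapa_classifier.sh -d Anacapa_db -o out -c my_BCC_cut_off.sh -l
```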

Complete

Once both halves have been run, your output files will be in the output directory you specified in both commands. Copy or move it to the /data directory to access it from your host machine, if that's not where it already is.
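
Copying the results can be sketched as a small helper. The name export_results is hypothetical, and the destination defaults to the /data directory mentioned above.

```shell
# Hypothetical helper: copy an output directory into a destination
# directory (default /data), creating the destination if needed.
export_results() {
    src="$1"
    dest="${2:-/data}"
    if [ ! -d "$src" ]; then
        echo "No output directory: $src" >&2
        return 1
    fi
    # Pass src without a trailing slash so the directory itself is copied.
    mkdir -p "$dest" && cp -R "$src" "$dest/"
}

# Usage with the example output directory:
#   export_results out /data
```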