Contents of README:
The following can be accessed by running ./barcode_split_trim.pl --help:
USAGE
barcode_split_trim.pl [options] -b BARCODE IN.FASTQ
DESCRIPTION
Extracts fastq reads for specified barcode(s) from one or multiple FASTQ files.
Use wildcards ('*') to match multiple input FASTQ files.
OPTIONS
-h, --help Print this help message
-v, --version Print version number
--id Sample or Experiment ID
-b, --barcode BARCODE Specify barcode or file w/ list of barcodes to extract
-l, --list Indicate BARCODE is a list of barcodes in a file
--indexed Samples designated by index sequences
Alternate read FQ files and index FQ files
-m, --mismatches Maximum number of mismatches allowed in barcode sequence [0]
-n, --notrim Split without trimming barcodes
-st, --stats Output summary stats only (w/o creating fastq files)
-o, --outdir DIR Output file is saved in the specified directory
(or same directory as IN.FASTQ, if --outdir is not used)
NAMING OPTIONS
--autoprefix Append FASTQ file name onto output
--autosuffix Append barcode onto output
-p, --prefix PREFIX Add custom prefix to output
-su, --suffix SUFFIX Add custom suffix to output
OUTPUT
An output file in fastq format is written for each barcode to the directory
containing IN.FASTQ, unless an output directory is specified.
The default name of the output file is SAMPLE_ID.fq. The output names can be
customized using the Naming Options.
Log files and a summary plot are also written to aid in identifying problem libraries.
EXAMPLES
barcode_split_trim.pl -i Charlotte -b GACTG kitten_DNA.fq
barcode_split_trim.pl --id BigExperiment --barcode barcode.file --list *_DNA.fastq
barcode_split_trim.pl --help
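One way to picture the --mismatches option is as a Hamming-distance comparison between the first bases of a read and each expected barcode. The Python sketch below is an illustration only, not the script's actual Perl implementation. The function names hamming and match_barcode are hypothetical, and treating a read that is equally close to two barcodes as unmatched is an assumption.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(read_seq, barcodes, max_mismatches=0):
    """Return the barcode matching the start of read_seq, or None.

    Assumption: a read equally close to two barcodes is treated as
    unmatched rather than assigned arbitrarily.
    """
    best, best_dist, tie = None, None, False
    for bc in barcodes:
        prefix = read_seq[:len(bc)]
        if len(prefix) < len(bc):
            continue  # read is shorter than the barcode
        d = hamming(prefix, bc)
        if d > max_mismatches:
            continue
        if best_dist is None or d < best_dist:
            best, best_dist, tie = bc, d, False
        elif d == best_dist:
            tie = True
    return None if tie else best


print(match_barcode("GACTGTTTACG", ["GACTG", "TACGC"]))     # GACTG
print(match_barcode("GACTATTTACG", ["GACTG", "TACGC"]))     # None
print(match_barcode("GACTATTTACG", ["GACTG", "TACGC"], 1))  # GACTG
```

With the default of 0, only exact matches are accepted. Raising the limit recovers reads with sequencing errors in the barcode, at the cost of a higher risk of misassignment when barcodes are similar to each other.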
A pair of sample FASTQ and barcode files is provided in the sample_files folder. They have been used to generate the output FASTQ, log, and summary plot files (sample_files/output). In this example, I show what these files look like and how to generate them.
The sample FASTQ file (sample_files/sequences.fq) contains 25,000 sequence reads (100,000 lines) from a pool of 14 samples. The corresponding barcodes (sample_files/barcode.list) are:
TACGC marmot1
ATCGT marmot2
ATTCC marmot3
CCAGC marmot4
GATAC aardvark1
GGATG aardvark2
TCGAT tarsier1
AGCGC tarsier2
CCAAT tarsier3
CGCTG tarsier4
CTAGT puffin1
TAGAG puffin2
TAGTC puffin3
GGTCA puffin4
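As a rough illustration of how a barcode list in this two-column layout could be read, here is a minimal Python sketch. It assumes the columns are separated by whitespace and that each barcode appears once. The function name parse_barcode_list is hypothetical and is not part of barcode_split_trim.pl.

```python
def parse_barcode_list(lines):
    """Map barcode -> sample ID from 'BARCODE<whitespace>ID' lines."""
    table = {}
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {n}: expected 'BARCODE ID', got {line!r}")
        barcode, sample_id = fields[0].upper(), fields[1]
        if barcode in table:
            raise ValueError(f"line {n}: duplicate barcode {barcode}")
        table[barcode] = sample_id
    return table


example = ["TACGC marmot1", "ATCGT marmot2", "GATAC aardvark1"]
barcodes = parse_barcode_list(example)
print(barcodes["GATAC"])  # aardvark1
```

To read a real file, the same function can be passed an open file handle, for example parse_barcode_list(open("sample_files/barcode.list")).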
To split sample_files/sequences.fq with the barcodes in sample_files/barcode.list, we can run:
./barcode_split_trim.pl \
--id demo \
--barcode sample_files/barcode.list \
--list \
--outdir sample_files/output \
sample_files/sequences.fq
This results in a FASTQ file for each barcode (barcodes are trimmed), a single FASTQ file containing all reads with unmatched barcodes (barcodes are left intact), two log files, and a summary plot:
# FASTQ file for each barcode
aardvark1.fq
aardvark2.fq
marmot1.fq
marmot2.fq
marmot3.fq
marmot4.fq
puffin1.fq
puffin2.fq
puffin3.fq
puffin4.fq
tarsier1.fq
tarsier2.fq
tarsier3.fq
tarsier4.fq
# FASTQ file containing all reads with unmatched barcodes
unmatched.fq_sequences.bar_barcode.list.fq
# log files
log_barcode_counts.fq_sequences.bar_barcode.list
log_barcodes_observed.fq_sequences.bar_barcode.list
# plot summary
demo.barcodes.png
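The split-and-trim behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the script's own code: it assumes four-line FASTQ records, equal-length barcodes at the start of each read, and exact matching only. The names read_fastq and split_and_trim are hypothetical, and records are grouped in memory instead of being written to files.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def split_and_trim(handle, barcode_to_id, trim=True):
    """Group records by sample ID; exact barcode matches only."""
    bc_len = len(next(iter(barcode_to_id)))  # assumes equal-length barcodes
    out = defaultdict(list)
    for header, seq, plus, qual in read_fastq(handle):
        sample = barcode_to_id.get(seq[:bc_len])
        if sample is None:
            # Unmatched reads keep their barcode and full quality string.
            out["unmatched"].append((header, seq, plus, qual))
        elif trim:
            # Trim the barcode from both the sequence and the qualities.
            out[sample].append((header, seq[bc_len:], plus, qual[bc_len:]))
        else:
            out[sample].append((header, seq, plus, qual))
    return out


demo = io.StringIO(
    "@r1\nGATACAAAT\n+\nIIIIIHHHH\n"
    "@r2\nNNNNNCCCG\n+\nIIIIIHHHH\n"
)
groups = split_and_trim(demo, {"GATAC": "aardvark1", "GGATG": "aardvark2"})
print(groups["aardvark1"])  # [('@r1', 'AAAT', '+', 'HHHH')]
print(groups["unmatched"])  # [('@r2', 'NNNNNCCCG', '+', 'IIIIIHHHH')]
```

Trimming the quality string together with the sequence keeps the two the same length, which downstream FASTQ tools expect.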
The first log file is the barcode splitting summary (sample_files/output/log_barcode_counts.fq_sequences.bar_barcode.list) and returns some basic stats with an emphasis on expected barcodes:
Barcode splitting summary for:
sample_files/sequences.fq
---------------------------
matched 24,459 97.8%
unmatched 541 2.2%
---------------------------
barcodes 14
min 1,260 5.0%
max 2,200 8.8%
mean 1,747 7.0%
median 1,735.5 6.9%
---------------------------
id barcode count percent
aardvark1 GATAC 1,595 6.4%
aardvark2 GGATG 1,839 7.4%
marmot1 TACGC 1,758 7.0%
marmot2 ATCGT 1,504 6.0%
marmot3 ATTCC 1,614 6.5%
marmot4 CCAGC 1,468 5.9%
puffin1 CTAGT 1,476 5.9%
puffin2 TAGAG 1,260 5.0%
puffin3 TAGTC 2,102 8.4%
puffin4 GGTCA 1,964 7.9%
tarsier1 TCGAT 2,010 8.0%
tarsier2 AGCGC 2,200 8.8%
tarsier3 CCAAT 1,713 6.9%
tarsier4 CGCTG 1,956 7.8%
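The summary figures in this log can be reproduced from the per-barcode counts. The Python sketch below recomputes them as a cross-check. It takes the total of 25,000 reads from the matched and unmatched counts shown above and assumes that every percentage is relative to that total. The helper pct is hypothetical.

```python
from statistics import mean, median

# Per-barcode counts copied from the summary log.
counts = {
    "aardvark1": 1595, "aardvark2": 1839,
    "marmot1": 1758, "marmot2": 1504, "marmot3": 1614, "marmot4": 1468,
    "puffin1": 1476, "puffin2": 1260, "puffin3": 2102, "puffin4": 1964,
    "tarsier1": 2010, "tarsier2": 2200, "tarsier3": 1713, "tarsier4": 1956,
}
unmatched = 541
matched = sum(counts.values())
total_reads = matched + unmatched


def pct(n):
    """Format n as a percentage of all reads, to one decimal place."""
    return f"{100 * n / total_reads:.1f}%"


values = list(counts.values())
print("matched  ", matched, pct(matched))             # 24459 97.8%
print("unmatched", unmatched, pct(unmatched))         # 541 2.2%
print("min      ", min(values), pct(min(values)))     # 1260 5.0%
print("max      ", max(values), pct(max(values)))     # 2200 8.8%
print("mean     ", round(mean(values)), pct(mean(values)))  # 1747 7.0%
print("median   ", median(values), pct(median(values)))     # 1735.5 6.9%
```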
The second log file returns counts and percentages for all observed barcodes (both expected and unexpected). Below are the first 20 (of 215) observed barcodes from this log (sample_files/output/log_barcodes_observed.fq_sequences.bar_barcode.list):
barcode count percent id
AGCGC 2,200 8.8% tarsier2
TAGTC 2,102 8.4% puffin3
TCGAT 2,010 8.0% tarsier1
GGTCA 1,964 7.9% puffin4
CGCTG 1,956 7.8% tarsier4
GGATG 1,839 7.4% aardvark2
TACGC 1,758 7.0% marmot1
CCAAT 1,713 6.9% tarsier3
ATTCC 1,614 6.5% marmot3
GATAC 1,595 6.4% aardvark1
ATCGT 1,504 6.0% marmot2
CTAGT 1,476 5.9% puffin1
CCAGC 1,468 5.9% marmot4
TAGAG 1,260 5.0% puffin2
GGGCA 32 0.1%
GGTCC 27 0.1%
GGATT 15 0.1%
NAGAG 11 0.0%
NAGTC 11 0.0%
TCTAT 11 0.0%
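A table like this can be built by tallying the leading bases of every read, whether or not they match an expected barcode. The Python sketch below shows the idea on a handful of made-up reads. The function name tally_barcodes and the fixed barcode length of 5 are assumptions for illustration, not details taken from the script.

```python
from collections import Counter


def tally_barcodes(sequences, barcode_to_id, bc_len=5):
    """Return (barcode, count, percent, id) rows, most frequent first."""
    observed = Counter(seq[:bc_len] for seq in sequences)
    total = sum(observed.values())
    rows = []
    for barcode, count in observed.most_common():
        sample_id = barcode_to_id.get(barcode, "")  # blank if unexpected
        rows.append((barcode, count, f"{100 * count / total:.1f}%", sample_id))
    return rows


# Made-up reads for demonstration only.
reads = ["AGCGCAAA", "AGCGCTTT", "TAGTCGGG", "NAGTCGGG"]
expected = {"AGCGC": "tarsier2", "TAGTC": "puffin3"}
for row in tally_barcodes(reads, expected):
    print(*row)
```

Unexpected barcodes that differ from an expected one at a single position, such as NAGTC versus TAGTC in the log, are usually sequencing errors and are candidates for recovery with --mismatches.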
The logs are useful, but when many barcodes or experiments are analyzed at once, irregularities and problematic barcodes can be hard to spot. To address this, we can make a barcode frequency plot using R.
For this plot (saved to sample_files/output/demo.barcodes.png), barcodes are split into two groups: those that match an expected barcode and those that do not. Boxplots are then generated from the observed barcode frequencies, and each frequency is also jitter-plotted on top of its boxplot.
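To give a feel for what the two boxplots summarize, the Python sketch below groups barcode frequencies into matched and unmatched and computes the values a boxplot is drawn from. It is not the R code used to produce the plot. The unmatched group holds only the six unexpected barcodes shown in the log excerpt, and the quartiles come from Python's statistics.quantiles, which may differ slightly from the hinges R uses. The helper box_stats is hypothetical.

```python
from statistics import median, quantiles

TOTAL_READS = 25000  # matched + unmatched reads in the summary log

matched_counts = [2200, 2102, 2010, 1964, 1956, 1839, 1758,
                  1713, 1614, 1595, 1504, 1476, 1468, 1260]
unmatched_counts = [32, 27, 15, 11, 11, 11]  # excerpt only, not all 201


def box_stats(counts, total):
    """Five-number summary of barcode frequencies, in percent of reads."""
    freqs = sorted(100 * c / total for c in counts)
    q1, _, q3 = quantiles(freqs, n=4)
    return {"min": freqs[0], "q1": q1, "median": median(freqs),
            "q3": q3, "max": freqs[-1]}


summary = {
    "matched": box_stats(matched_counts, TOTAL_READS),
    "unmatched": box_stats(unmatched_counts, TOTAL_READS),
}
for group, stats in summary.items():
    print(group, {k: round(v, 3) for k, v in stats.items()})
```

In a healthy library the two groups are well separated: expected barcodes sit at several percent each, while unexpected ones stay near zero. An expected barcode that falls toward the unmatched group, or an unexpected one that rises toward the matched group, is worth investigating.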