Construction Benchmarks

Hardware and software

One Amazon EC2 i3.8xlarge instance:

32 logical / 16 physical cores of Xeon E5-2686 v4
244 gigabytes of memory
/: EBS General Purpose SSD (gp2), 160 GB
/large_disk: 4x 1.9 TB NVMe SSD in RAID 0
Ubuntu 16.04 with Linux kernel 4.4.0
GCC 5.4.0
A recent version of VG with GBWT v0.5

Directory structure

~/vg: VG
~/data: Downloaded data
~/graphs: VG graphs
~/indexes: GBWT/GCSA2 indexes
~/logs: Construction logs

Data

The benchmarks are based on 1000 Genomes Project phase 3 data. In order to download the data, run the following in ~/data:

#!/bin/bash

# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}

# Get the variants
VARIANTS=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
SHORTV=all.vcf.gz
rm -f ${SHORTV} ${SHORTV}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}.tbi
mv ${VARIANTS} ${SHORTV}
mv ${VARIANTS}.tbi ${SHORTV}.tbi

# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SPREFIX=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SSUFFIX=.vcf.gz
for i in $(seq 1 22; echo X; echo Y)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SPREFIX}${i}${SSUFFIX}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done

# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi

Graph construction

We build VG graphs with variants embedded as alt paths. Node size is limited to 32 bases for better GCSA2 query performance. In order to build the graphs, run the following in ~/graphs:

#!/bin/bash

VG=~/vg/bin/vg
DATA=~/data
REFERENCE=${DATA}/hs37d5.fa
MAPPING=node_mapping

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
    "$VG construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > chr{}.vg"

# Harmonize the node ids
$VG ids -j -m $MAPPING  $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)
cp $MAPPING ${MAPPING}.backup

This will take several hours. The vg ids command produces an empty node mapping that is useful for some GCSA2 construction options.

GBWT construction

We run 14 construction jobs in parallel (formerly 12), starting from the largest chromosomes. The last job to finish is the one for chromosome 2. Run the following in ~/indexes to build per-chromosome GBWT indexes.

#!/bin/bash

(echo X; seq 1 22; echo Y) | parallel -j 14 "./build_gbwt.sh {}"

Script build_gbwt.sh contains the following:

#!/bin/bash
# $1 - chromosome

VG=~/vg/bin/vg
DATA=~/data
GRAPHS=~/graphs
LOGS=~/logs
BASENAME=chr${1}

export TMPDIR=/large_disk/tmp

LOGFILE=${LOGS}/gbwt_${BASENAME}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

$VG index -G ${BASENAME}.gbwt -v ${DATA}/${BASENAME}.vcf.gz \
    -F ${BASENAME}.threads -p ${GRAPHS}/${BASENAME}.vg 2>> $LOGFILE

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

If we want to build chromosome-length haplotypes, we can add options -P -o -B 100 to the vg index command:

-P transforms unphased genotypes into randomly phased ones.
-o skips overlapping alternate alleles when the overlap cannot be resolved.
-B 100 saves memory by generating the haplotypes in batches of 100 samples (default 200).

We then merge the individual GBWT indexes into a single file:

#!/bin/bash

VG=~/vg/bin/vg
LOGS=~/logs
OUTPUT=all.gbwt

LOGFILE=${LOGS}/gbwt_merge.log
rm -f $LOGFILE

$VG gbwt -f -o $OUTPUT -p \
    $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.gbwt; done) 2>> $LOGFILE

Results (default parameters)

Chromosome	Sequences	Time (h)	Memory (GB)	Disk (GB)	GBWT (MB)	Identifiers (MB)	Total (MB)
1	3681462	8.85	20.82	1.06	574	519	1093
2	3994646	10.20	22.66	1.13	611	570	1180
3	3353940	8.20	18.30	0.97	498	469	967
4	2932492	8.66	17.98	0.99	489	463	952
5	3118314	7.74	16.57	0.86	448	427	875
6	2965410	7.40	15.82	0.89	434	407	841
7	2591106	6.95	14.90	0.80	412	372	784
8	1989534	6.96	14.54	0.75	391	364	755
9	1878266	5.22	11.42	0.59	330	279	608
10	2425746	5.94	12.69	0.68	374	318	692
11	2170912	5.67	12.86	0.67	370	321	692
12	2375914	5.65	12.32	0.65	361	309	671
13	1794956	4.00	8.99	0.49	263	225	487
14	1751932	3.58	8.37	0.44	249	203	452
15	1187578	3.58	7.68	0.40	233	192	425
16	1357316	3.88	8.50	0.43	257	211	469
17	1674504	3.19	7.37	0.38	227	178	406
18	1210872	3.21	7.18	0.38	212	174	386
19	1375318	2.76	5.85	0.32	183	143	326
20	783244	2.78	5.80	0.30	170	137	307
21	526148	1.67	3.53	0.20	112	86	198
22	581150	1.68	3.53	0.19	111	86	197
X	4871814	3.89	11.09	5.25	331	220	551
Y	4270	0.02	0.89	0.00	6	1	7
Merging		0.24	29.38
Total	50596844	10.44	< 244	18.82	7668	7570	15238

Results (chromosome-length haplotypes)

Chromosome	Sequences	Time (h)	Memory (GB)	Disk (GB)	GBWT (MB)	Identifiers (MB)	Total (MB)
1	10016	13.11	20.82	1.26	559	427	986
2	10016	15.02	22.66	1.36	595	466	1060
3	10016	11.85	18.30	1.15	484	386	870
4	10016	11.59	17.98	1.17	478	378	855
5	10016	10.71	16.57	1.03	435	335	771
6	10016	9.97	15.83	1.05	422	321	743
7	10016	9.25	14.90	0.95	402	302	703
8	10016	8.84	14.54	0.90	383	294	677
9	10016	6.48	11.42	0.70	322	232	554
10	10016	7.47	12.70	0.81	364	257	621
11	10016	7.51	12.86	0.80	362	260	621
12	10016	7.24	12.32	0.77	352	249	601
13	10016	4.84	8.98	0.58	256	177	433
14	10016	4.59	8.38	0.53	242	166	408
15	10016	4.15	7.68	0.48	228	152	381
16	10016	4.70	8.50	0.51	252	169	421
17	10016	3.97	7.37	0.45	221	147	368
18	10016	3.75	7.19	0.46	207	136	343
19	10016	3.05	5.85	0.38	178	112	290
20	10016	2.89	5.80	0.36	167	111	278
21	10016	1.72	3.84	0.23	110	66	176
22	10016	1.69	3.87	0.22	109	66	175
X	17414	5.86	11.09	5.32	311	187	498
Y	2466	0.02	0.87	0.00	6	1	7
Merging		0.32	26.23
Total	240232	15.35	< 244	21.46	7425	6073	13498

Provide feedback

Saved searches

Use saved searches to filter your results more quickly