Skip to content

Construction Benchmarks

Jouni Siren edited this page Jul 20, 2018 · 35 revisions

Hardware and software

One Amazon EC2 i3.8xlarge instance:

  • 32 logical / 16 physical cores of Xeon E5-2686 v4
  • 244 gigabytes of memory
  • /: EBS General Purpose SSD (gp2), 160 GB
  • /large_disk: 4x 1.9 TB NVMe SSD in RAID 0
  • Ubuntu 16.04 with Linux kernel 4.4.0
  • GCC 5.4.0
  • A recent version of VG with GBWT v0.5

Directory structure

  • ~/vg: VG
  • ~/data: Downloaded data
  • ~/graphs: VG graphs
  • ~/indexes: GBWT/GCSA2 indexes
  • ~/logs: Construction logs

Data

The benchmarks are based on 1000 Genomes Project phase 3 data. In order to download the data, run the following in ~/data:

#!/bin/bash

# Get the reference
REFERENCE=hs37d5.fa
rm -f ${REFERENCE} ${REFERENCE}.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/${REFERENCE}.gz
gunzip ${REFERENCE}

# Get the variants
VARIANTS=ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
SHORTV=all.vcf.gz
rm -f ${SHORTV} ${SHORTV}.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${VARIANTS}.tbi
mv ${VARIANTS} ${SHORTV}
mv ${VARIANTS}.tbi ${SHORTV}.tbi

# Get the phasings for chromosomes 1-22
PREFIX=ALL.chr
SPREFIX=chr
SUFFIX=.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
SSUFFIX=.vcf.gz
for i in $(seq 1 22; echo X; echo Y)
do
  NAME=${PREFIX}${i}${SUFFIX}
  SHORT=${SPREFIX}${i}${SSUFFIX}
  rm -f ${SHORT} ${SHORT}.tbi
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/${NAME}.tbi
  mv ${NAME} ${SHORT}
  mv ${NAME}.tbi ${SHORT}.tbi
done

# Get the phasings for chromosomes X and Y
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz chrX.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz.tbi chrX.vcf.gz.tbi
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz chrY.vcf.gz
mv ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz.tbi chrY.vcf.gz.tbi

Graph construction

We build VG graphs with variants embedded as alt paths. Node size is limited to 32 bases for better GCSA2 query performance. In order to build the graphs, run the following in ~/graphs:

#!/bin/bash

VG=~/vg/bin/vg
DATA=~/data
REFERENCE=${DATA}/hs37d5.fa
MAPPING=node_mapping

# Build the VG graphs
(seq 1 22; echo X; echo Y) | parallel -j 24 \
    "$VG construct -r $REFERENCE -v ${DATA}/chr{}.vcf.gz -R {} -C -a > chr{}.vg"

# Harmonize the node ids
$VG ids -j -m $MAPPING  $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.vg; done)
cp $MAPPING ${MAPPING}.backup

This will take several hours. The vg ids command produces an empty node mapping that is useful for some GCSA2 construction options.

GBWT construction

We run 14 construction jobs in parallel (formerly 12), starting from the largest chromosomes. The last job to finish is the one for chromosome 2. Run the following in ~/indexes to build per-chromosome GBWT indexes.

#!/bin/bash

(echo X; seq 1 22; echo Y) | parallel -j 14 "./build_gbwt.sh {}"

Script build_gbwt.sh contains the following:

#!/bin/bash
# $1 - chromosome

VG=~/vg/bin/vg
DATA=~/data
GRAPHS=~/graphs
LOGS=~/logs
BASENAME=chr${1}

export TMPDIR=/large_disk/tmp

LOGFILE=${LOGS}/gbwt_${BASENAME}.log
rm -f $LOGFILE

START=$(date +%s)
echo "Start time: $START" >> $LOGFILE
echo >> $LOGFILE

$VG index -G ${BASENAME}.gbwt -v ${DATA}/${BASENAME}.vcf.gz \
    -F ${BASENAME}.threads -p ${GRAPHS}/${BASENAME}.vg 2>> $LOGFILE

STOP=$(date +%s)
echo >> $LOGFILE
echo "Finish time: $STOP" >> $LOGFILE
echo "Total time: $(($STOP-$START)) seconds" >> $LOGFILE

If we want to build chromosome-length haplotypes, we can add options -P -o -B 100 to the vg index command:

  • -P transforms unphased genotypes into randomly phased ones.
  • -o skips overlapping alternate alleles when the overlap cannot be resolved.
  • -B 100 saves memory by generating the haplotypes in batches of 100 samples (default 200).

We then merge the individual GBWT indexes into a single file:

#!/bin/bash

VG=~/vg/bin/vg
LOGS=~/logs
OUTPUT=all.gbwt

LOGFILE=${LOGS}/gbwt_merge.log
rm -f $LOGFILE

$VG gbwt -f -o $OUTPUT -p \
    $(for i in $(seq 1 22; echo X; echo Y); do echo chr${i}.gbwt; done) 2>> $LOGFILE

Results (default parameters)

Chromosome Sequences Time (h) Memory (GB) Disk (GB) GBWT (MB) Identifiers (MB) Total (MB)
1 3681462 8.85 20.82 1.06 574 519 1093
2 3994646 10.20 22.66 1.13 611 570 1180
3 3353940 8.20 18.30 0.97 498 469 967
4 2932492 8.66 17.98 0.99 489 463 952
5 3118314 7.74 16.57 0.86 448 427 875
6 2965410 7.40 15.82 0.89 434 407 841
7 2591106 6.95 14.90 0.80 412 372 784
8 1989534 6.96 14.54 0.75 391 364 755
9 1878266 5.22 11.42 0.59 330 279 608
10 2425746 5.94 12.69 0.68 374 318 692
11 2170912 5.67 12.86 0.67 370 321 692
12 2375914 5.65 12.32 0.65 361 309 671
13 1794956 4.00 8.99 0.49 263 225 487
14 1751932 3.58 8.37 0.44 249 203 452
15 1187578 3.58 7.68 0.40 233 192 425
16 1357316 3.88 8.50 0.43 257 211 469
17 1674504 3.19 7.37 0.38 227 178 406
18 1210872 3.21 7.18 0.38 212 174 386
19 1375318 2.76 5.85 0.32 183 143 326
20 783244 2.78 5.80 0.30 170 137 307
21 526148 1.67 3.53 0.20 112 86 198
22 581150 1.68 3.53 0.19 111 86 197
X 4871814 3.89 11.09 5.25 331 220 551
Y 4270 0.02 0.89 0.00 6 1 7
Merging 0.24 29.38
Total 50596844 10.44 < 244 18.82 7668 7570 15238

Results (chromosome-length haplotypes)

Chromosome Sequences Time (h) Memory (GB) Disk (GB) GBWT (MB) Identifiers (MB) Total (MB)
1 10016 13.11 20.82 1.26 559 427 986
2 10016 15.02 22.66 1.36 595 466 1060
3 10016 11.85 18.30 1.15 484 386 870
4 10016 11.59 17.98 1.17 478 378 855
5 10016 10.71 16.57 1.03 435 335 771
6 10016 9.97 15.83 1.05 422 321 743
7 10016 9.25 14.90 0.95 402 302 703
8 10016 8.84 14.54 0.90 383 294 677
9 10016 6.48 11.42 0.70 322 232 554
10 10016 7.47 12.70 0.81 364 257 621
11 10016 7.51 12.86 0.80 362 260 621
12 10016 7.24 12.32 0.77 352 249 601
13 10016 4.84 8.98 0.58 256 177 433
14 10016 4.59 8.38 0.53 242 166 408
15 10016 4.15 7.68 0.48 228 152 381
16 10016 4.70 8.50 0.51 252 169 421
17 10016 3.97 7.37 0.45 221 147 368
18 10016 3.75 7.19 0.46 207 136 343
19 10016 3.05 5.85 0.38 178 112 290
20 10016 2.89 5.80 0.36 167 111 278
21 10016 1.72 3.84 0.23 110 66 176
22 10016 1.69 3.87 0.22 109 66 175
X 17414 5.86 11.09 5.32 311 187 498
Y 2466 0.02 0.87 0.00 6 1 7
Merging 0.32 26.23
Total 240232 15.35 < 244 21.46 7425 6073 13498