gpu-bdb is derived from TPCx-BB. While this repository implements logic included in TPCx-BB adapted to run on GPUs, it has not been reviewed by the TPC and cannot be considered an official submission. gpu-bdb benchmark results are NOT comparable to TPCx-BB benchmark results.
The GPU Big Data Benchmark (gpu-bdb) is a RAPIDS library based benchmark for enterprises that includes 30 queries representing real-world ETL & ML workflows at various "scale factors": SF1000 is 1 TB of data, SF10000 is 10TB. Each “query” is in fact a model workflow that can include SQL, user-defined functions, careful sub-setting and aggregation, and machine learning.
We provide a conda environment definition specifying all RAPIDS dependencies needed to run our query implementations. To install and activate it:
CONDA_ENV="rapids-gpu-bdb"
conda env create --name $CONDA_ENV -f gpu-bdb/conda/rapids-gpu-bdb.yml
conda activate rapids-gpu-bdb
This repository includes a small local module containing utility functions for running the queries. You can install it with the following:
cd gpu-bdb/gpu_bdb
python -m pip install .
This will install a package named bdb-tools
into your Conda environment. It should look like this:
conda list | grep bdb
bdb-tools 0.2 pypi_0 pypi
Note that this Conda environment needs to be replicated or installed manually on all nodes, which will allow starting one dask-cuda-worker per node.
Queries 10, 18, and 19 depend on two static (negativeSentiment.txt, positiveSentiment.txt) files. As we cannot redistribute those files, you should download the tpcx-bb toolkit and extract them to your data directory on your shared filesystem:
jar xf bigbenchqueriesmr.jar
cp tpcx-bb1.3.1/distributions/Resources/io/bigdatabenchmark/v1/queries/q10/*.txt ${DATA_DIR}/sentiment_files/
For Query 27, we rely on spacy. To download the necessary language model after activating the Conda environment:
python -m spacy download en_core_web_sm
We use the dask-scheduler
and dask-cuda-worker
command line interfaces to start a Dask cluster. We provide a cluster_configuration
directory with a bash script to help you set up an NVLink-enabled cluster using UCX.
Before running the script, you'll make changes specific to your environment.
In cluster_configuration/cluster-startup.sh
:
- Update `GPU_BDB_HOME=...` to location on disk of this repo
- Update `CONDA_ENV_PATH=...` to refer to your conda environment path.
- Update `CONDA_ENV_NAME=...` to refer to the name of the conda environment you created, perhaps using the `yml` files provided in this repository.
- Update `INTERFACE=...` to refer to the relevant network interface present on your cluster.
- Update `CLUSTER_MODE="TCP"` to refer to your communication method, either "TCP" or "NVLINK". You can also configure this as an environment variable.
- You may also need to change the `LOCAL_DIRECTORY` and `WORKER_DIR` depending on your filesystem. Make sure that these point to a location to which you have write access and that `LOCAL_DIRECTORY` is accessible from all nodes.
To start up the cluster on your scheduler node, please run the following from gpu_bdb/cluster_configuration/
. This will spin up a scheduler and one Dask worker per GPU.
bash cluster-startup.sh SCHEDULER
Then run the following on every other node from gpu_bdb/cluster_configuration/
.
bash cluster-startup.sh
This will spin up one Dask worker per GPU. If you are running on a single node, you will only need to run bash cluster-startup.sh SCHEDULER
.
To run a query, starting from the repository root, go to the query specific subdirectory. For example, to run q07:
cd gpu_bdb/queries/q07/
The queries assume that they can attach to a running Dask cluster. Cluster address and other benchmark configuration lives in a yaml file. You will need to fill this out as appropriate.
conda activate rapids-gpu-bdb
python gpu_bdb_query_07.py --config_file=../../benchmark_runner/benchmark_config.yaml
This repository includes optional performance-tracking automation using Google Sheets. To enable logging query runtimes, on the client node:
export GOOGLE_SHEETS_CREDENTIALS_PATH=<path to creds.json>
Then configure the --sheet
and --tab
arguments in benchmark_config.yaml.
The included benchmark_runner.py
script will run all queries sequentially. Configuration for this type of end-to-end run is specified in benchmark_runner/benchmark_config.yaml
.
To run all queries, cd to gpu_bdb/
and:
python benchmark_runner.py --config_file benchmark_runner/benchmark_config.yaml
By default, this will run each Dask query once, and, if BlazingSQL queries are enabled in benchmark_config.yaml
, each BlazingSQL query once. You can control the number of repeats by changing the N_REPEATS
variable in the script.
BlazingSQL implementations of all queries are included. BlazingSQL currently supports communication via TCP. To run BlazingSQL queries, please follow the instructions above to create a cluster using CLUSTER_MODE=TCP
.
The RAPIDS queries expect Apache Parquet formatted data. We provide a script which can be used to convert bigBench dataGen's raw CSV files to optimally sized Parquet partitions.