This repository contains the source code and artifacts for the paper "Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications" (VLDB 2022). It includes the presented linear programming (LP) models that jointly optimize the sorting, compression, indexing, and tiering decisions for spatio-temporal applications.
In case you have any questions, please contact Keven Richly.
Shortened DBLP BibTeX entry:
@article{DBLP:journals/pvldb/RichlySB22,
author = {Keven Richly and Rainer Schlosser and Martin Boissier},
title = {Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications},
journal = {Proc. {VLDB} Endow.},
volume = {15},
number = {13},
pages = {4079--4092},
year = {2022},
url = {https://www.vldb.org/pvldb/vol15/p4079-richly.pdf},
doi = {10.14778/3565838.3565858}
}
To get all dependencies of the Table Configuration Optimizer installed, run
pip install -r requirements.txt
To start the jupyter lab, run
jupyter lab
Additionally, a linear programming solver such as Gurobi must be installed. Common free solvers (e.g., GLPK or LP_Solve) can be used as well; note that solve times can vary significantly between solvers.
To efficiently utilize the scarce DRAM capacities, modern database systems support various tuning possibilities to reduce the memory footprint or increase performance. However, the selection of cost and performance-balancing configurations is challenging due to the vast number of possible setups consisting of mutually dependent individual decisions. We introduce a novel approach to jointly optimize the compression, sorting, indexing, and tiering configuration for spatio-temporal workloads. Further, we consider horizontal data partitioning, which enables the independent application of different tuning options on a fine-grained level. We propose three different linear programming (LP) models addressing cost dependencies at different levels of accuracy to compute optimized tuning configurations for a given workload and memory budgets:
- The Segment-Based Model with Sorting Dependencies (SMS) solves the configuration problem with segment-based costs. It includes intra-chunk dependencies between segments with regard to the chunk-based ordering decision.
- The Relaxed Model with Independent Segment Effects (ISE) relaxes the ordering dependencies of the cost effects between segments to solve the SMS model heuristically.
- The General Model with Chunk-based Configuration Dependencies (CCD) is a solution approach accounting for full cost dependencies within a chunk. In this model, the costs associated with a segment can depend on all specific configuration decisions of all other segments (enabling the inclusion of multi-attribute indexes, etc.).
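To make the underlying selection problem concrete, the following brute-force sketch picks one encoding per segment to minimize workload cost under a memory budget. This is an illustration only, not the paper's LP formulation; all option names and cost numbers are hypothetical:

```python
from itertools import product

# Hypothetical per-segment options: (encoding, scan_cost, memory_cost)
options = {
    "timestamp": [("DictionaryFSBA", 5.0, 4.0), ("LZ4", 9.0, 2.0)],
    "latitude":  [("DictionaryFSBA", 6.0, 5.0), ("RunLength", 7.0, 3.0)],
}

budget = 7.0  # hypothetical memory budget

best = None
for combo in product(*options.values()):
    cost = sum(o[1] for o in combo)      # total workload cost
    memory = sum(o[2] for o in combo)    # total memory footprint
    if memory <= budget and (best is None or cost < best[0]):
        best = (cost, memory, [o[0] for o in combo])

print(best)
```

The LP models solve this selection problem at scale (per chunk and segment, with cost dependencies between decisions), where such enumeration would be infeasible.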
To yield maintainable and robust configurations, we extend our LP-based approach to incorporate reconfiguration costs as well as a worst-case optimization for potential workload scenarios. The extended LP models for the SMS and ISE models can be found here:
- Minimally invasive, state-dependent reconfigurations (SMS, ISE)
- Robust configuration selection for different potential workload scenarios (SMS, ISE)
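The robust variant can be read as a min-max problem: choose the configuration whose worst-case cost over all potential workload scenarios is smallest. A toy sketch with hypothetical configurations, scenarios, and costs (not the paper's LP):

```python
# Hypothetical cost of each candidate configuration under each workload scenario
costs = {
    "config_a": {"weekday": 10.0, "weekend": 30.0},
    "config_b": {"weekday": 14.0, "weekend": 16.0},
}

# Robust choice: minimize the worst-case (maximum) cost over all scenarios
robust = min(costs, key=lambda c: max(costs[c].values()))
print(robust)
```

Here config_b wins despite being worse on weekdays, because its worst case (16.0) beats config_a's worst case (30.0).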
To demonstrate the impact of different tuning decisions on the memory consumption and the runtime of a scan operation, as well as for the evaluation of our optimization approaches, we use the real-world dataset of a transportation network company (TNC) as an example. The spatio-temporal dataset consists of 400 million observed locations of drivers for three consecutive months in the City of Dubai (raw size 15.9 GB). Besides the timestamp, latitude, longitude, and the driver's identifier, a status attribute is tracked for each observed location. The status indicates the driver's current state (free or occupied). All attributes are stored as integers. Based on the insertion order, a certain temporal ordering of the sample points exists, but we cannot guarantee that the timestamp column is sorted due to transmission problems and delayed transmissions.
The data folder provides the necessary input data for the models, determined for the TNC dataset: the measured runtimes of the benchmark queries, the DRAM consumption of the different encodings, and the results of the calibration queries (storage penalty, consecutive scan penalty). All end-to-end measurements were executed on a server equipped with Intel Xeon Platinum 8180 CPUs (2.50 GHz). For the benchmark queries and the evaluation of the determined configurations, we use the research database Hyrise and define the input parameters based on the encoding and indexing properties supported by the database. We explain here how to determine these parameters for other datasets and workloads with Hyrise.
For the evaluation presented in the paper, we used a workload consisting of six query templates to evaluate the models based on the data of a TNC.
ID | Query | Frequency | Skipped Chunks
---|---|---|---
q0 | SELECT * FROM TABLE WHERE ("driver_id" <= {selectivity of the value: 0.0001}) AND ("status" <= {0.7}) | 15% | 0, 1
q1 | SELECT * FROM TABLE WHERE ("timestamp" BETWEEN {0.2}) AND ("latitude" BETWEEN {0.5}) AND ("longitude" BETWEEN {0.5}) AND ("status" <= {0.7}) | 15% | 0, 1, 2, 6, 7, 8, 9
q2 | SELECT * FROM TABLE WHERE ("driver_id" <= {0.01}) AND ("latitude" BETWEEN {0.1}) AND ("longitude" BETWEEN {0.1}) | 10% |
q3 | SELECT * FROM TABLE WHERE ("timestamp" <= {0.05}) AND ("latitude" BETWEEN {0.7}) AND ("longitude" BETWEEN {0.7}) | 25% | 1, 2, 3, 4, 5, 6, 7, 8, 9
q4 | SELECT * FROM TABLE WHERE ("driver_id" <= {0.01}) AND ("timestamp" <= {0.4}) | 15% | 4, 5, 6, 7, 8, 9
q5 | SELECT * FROM TABLE WHERE ("latitude" BETWEEN {0.01}) AND ("longitude" BETWEEN {0.01}) AND ("timestamp" BETWEEN {0.5}) | 20% | 0, 1, 8, 9
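Given per-query runtimes measured for a candidate configuration, the expected workload cost is the frequency-weighted sum over the templates above. A small sketch using the frequencies from the table and hypothetical runtimes:

```python
# Query frequencies from the workload table above
frequencies = {"q0": 0.15, "q1": 0.15, "q2": 0.10, "q3": 0.25, "q4": 0.15, "q5": 0.20}

# Hypothetical measured runtimes (ms) of each template under one configuration
runtimes = {"q0": 12.0, "q1": 40.0, "q2": 55.0, "q3": 18.0, "q4": 22.0, "q5": 30.0}

# Expected cost of the workload: frequency-weighted sum of runtimes
workload_cost = sum(frequencies[q] * runtimes[q] for q in frequencies)
print(round(workload_cost, 2))
```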
Further workload examples are provided in /data/workloads.
To evaluate the performance of the determined table configurations for a given dataset and workload in Hyrise, we provide a benchmark in the krichly/chunk_sort_compression_advanced branch. The benchmark is implemented in the Hyrise playground at src/bin/playground.cpp. To execute the benchmark, Hyrise must be set up first; a step-by-step guide to get, set up, and run Hyrise is provided here. The playground is built with the following command:
make hyrisePlayground
As input for the benchmark, the user has to provide a set of table configurations determined by the models and exported as .csv files, the workload, and the dataset in the data folder. The benchmark is executed with the command:
./hyrisePlayground
For each table configuration, the benchmark applies the defined tuning options (compression, indexing, sorting, and tiering configuration) and executes the workload on the table. Afterward, the performance (runtime of each query) and the memory consumption of each table configuration are exported as .csv files.
Furthermore, we implemented a microbenchmark in src/benchmark/ping_data_micro_benchmark.cpp that determines, for a given dataset, the input parameters for the models (runtime and memory consumption). To build the microbenchmarks, the following command is used:
make hyriseMicroBenchmarks
Based on the dataset, the chunk size and the column names have to be adjusted. The benchmark can be executed by the command:
./hyriseMicroBenchmarks --benchmark_out=../../out/performance.csv --benchmark_out_format=csv --benchmark_repetitions=5 --benchmark_filter="BM_PingData" --benchmark_min_time=5 --benchmark_display_aggregates_only=true
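With --benchmark_out_format=csv, Google Benchmark writes a few context lines (date, machine info) followed by a CSV table whose header starts with name,. Assuming that layout, a small sketch to extract the mean aggregates from the export (the excerpt below is a fabricated example, not real measurement data):

```python
import csv
import io

# Hypothetical excerpt of a Google Benchmark CSV export
raw = """2022-01-01 12:00:00 context line
Run on (56 X 2500 MHz CPUs)
name,iterations,real_time,cpu_time,time_unit
"BM_PingData/0_mean",5,120.5,119.9,ms
"BM_PingData/1_mean",5,98.2,97.7,ms
"""

lines = raw.splitlines()
# Skip the context lines preceding the CSV header
start = next(i for i, line in enumerate(lines) if line.startswith("name,"))
rows = list(csv.DictReader(io.StringIO("\n".join(lines[start:]))))

# Collect the mean real-time aggregate per benchmark
means = {r["name"]: float(r["real_time"]) for r in rows if r["name"].endswith("_mean")}
print(means)
```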