This folder contains a set of examples for the spark-s3-shuffle plugin.
Prerequisites:

- An Apache Spark installation
- A Kubernetes cluster
- An S3 bucket or a local S3 installation (see `config.sh` for the configuration)
- `s3cmd` configured with the S3 credentials to enable cleanup (a minimal setup sketch follows this list)
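`s3cmd` ships with an interactive setup; a minimal sketch, assuming the same credentials and endpoint as the S3 bucket above:

```bash
# Prompts for access key, secret key, and (optionally) a custom endpoint,
# then writes the configuration to ~/.s3cfg.
s3cmd --configure

# Verify access (bucket name is a placeholder).
s3cmd ls s3://my-bucket
```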
Modify `config_X.X.X.sh` based on your setup.
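For orientation, a hedged sketch of the kind of values such a script exports; the variable names below are illustrative, not the actual contents of `config_X.X.X.sh`:

```bash
# Illustrative only: adapt names and values to the real config_X.X.X.sh.
export S3_ENDPOINT=http://my-s3-endpoint:9000   # placeholder S3 endpoint
export S3_ACCESS_KEY=my-access-key              # placeholder credentials
export S3_SECRET_KEY=my-secret-key
export S3_BUCKET=s3a://my-bucket                # placeholder bucket
```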
Run the following commands to build the Docker containers:

```bash
source config_X.X.X.sh
./terasort/build.sh
./sql/build.sh
```
Use the following environment variables to configure Spark:

- `USE_S3_SHUFFLE=1`: enable shuffle on S3 (default)
- `USE_S3_SHUFFLE=0`: disable shuffle on S3
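Under the hood, `USE_S3_SHUFFLE=1` switches the job to the plugin's shuffle I/O implementation. A minimal sketch of the corresponding `spark-submit` flags; the class name and option below are assumptions based on Spark's shuffle I/O plugin interface, not verified settings — see the plugin's top-level README for the authoritative configuration:

```bash
# Illustrative only: the plugin class and rootDir option are assumptions.
spark-submit \
  --conf spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.S3ShuffleDataIO \
  --conf spark.shuffle.s3.rootDir=s3a://my-bucket/shuffle \
  my-app.jar
```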
Run terasort:

```bash
export SIZE=1g           # Options: 1g (default), 10g, 100g
export USE_S3_SHUFFLE=1  # Options: 0, 1 (default)
./terasort/run.sh
```
Run a single query (pick any name from the full list of queries and omit the `.sql` suffix):
```bash
export SIZE=1000         # Options: 10, 100, 1000 (default)
export USE_S3_SHUFFLE=1  # Options: 0, 1 (default)
./sql/run_single_query.sh q67
```
Run the full TPC-DS benchmark:
```bash
export SIZE=100          # Options: 10, 100 (default), 1000
export USE_S3_SHUFFLE=1  # Options: 0, 1 (default); enables shuffle on S3
./sql/run_tpcds.sh
```
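To check that shuffle data actually landed on S3, you can list the bucket with `s3cmd`; the bucket name below is a placeholder:

```bash
# Bucket name is a placeholder for the value in your config.
s3cmd ls --recursive s3://my-bucket/ | head
```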
Deploy InfluxDB and adapt the configuration in `profiler_config.yml`. A sample InfluxDB deployment for Kubernetes can be found in `influxdb_kubernetes.yml`.
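Assuming the sample manifest matches your cluster and namespace, deploying InfluxDB is a single `kubectl` call:

```bash
kubectl apply -f influxdb_kubernetes.yml
```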
To enable profiling, set the environment variable `USE_PROFILER` to `1`:

```bash
export USE_PROFILER=1
```
and run the examples above.
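For example, to profile a terasort run:

```bash
export USE_PROFILER=1
./terasort/run.sh
```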
See this InfoQ article for an overview of how InfluxDB and the JVM-Profiler interact with Grafana.
Assumption: a `sampleinterval` interval variable is configured on the Grafana dashboard.
Show max CPU usage by process:

```sql
SELECT max("processCpuLoad") FROM "CpuAndMemory" WHERE "role" = 'executor' AND $timeFilter GROUP BY time($sampleinterval), processUuid fill(none)
```
Executor heap memory consumption (sum):

```sql
SELECT SUM("heapMemoryTotalUsed") AS Used, SUM("heapMemoryCommitted") AS Committed FROM "autogen"."CpuAndMemory" WHERE "role" = 'executor' AND $timeFilter GROUP BY time($sampleinterval)
```
Per-executor heap memory consumption (max):

```sql
SELECT MAX("heapMemoryTotalUsed") AS Used, MAX("heapMemoryCommitted") AS Committed FROM "autogen"."CpuAndMemory" WHERE "role" = 'executor' AND $timeFilter GROUP BY time($sampleinterval), processUuid fill(none)
```
Measure the S3ShuffleReader threads:

```sql
SELECT time, COUNT(host) AS S3ShuffleReader FROM Stacktrace WHERE "role" = 'executor' AND "stacktrace" =~ /S3ShuffleReader/ AND $timeFilter GROUP BY time($sampleinterval) fill(none)
```
Measure the S3ShuffleWriter threads:

```sql
SELECT time, COUNT(processUuid) AS S3ShuffleWriter FROM Stacktrace WHERE "role" = 'executor' AND "stacktrace" =~ /S3ShuffleWriter/ AND $timeFilter GROUP BY time($sampleinterval) fill(none)
```
Measure the ExternalSorter threads:

```sql
SELECT time, COUNT(processUuid) AS ExternalSorter FROM Stacktrace WHERE "role" = 'executor' AND "stacktrace" =~ /ShuffleExternalSorter/ AND $timeFilter GROUP BY time($sampleinterval)
```
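For debugging outside Grafana, the same queries can be issued against the InfluxDB 1.x HTTP API, with `$timeFilter` and `$sampleinterval` replaced by literals; the host and database name below are assumptions:

```bash
# Host and database name are placeholders; adjust to your deployment.
curl -G 'http://influxdb:8086/query' \
  --data-urlencode 'db=metrics' \
  --data-urlencode "q=SELECT max(\"processCpuLoad\") FROM \"CpuAndMemory\" WHERE \"role\" = 'executor' AND time > now() - 1h GROUP BY time(10s), processUuid fill(none)"
```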