#1721 highlighted that we have some room for improvement in how we handle metrics storage during long-running benchmarks (i.e. days), especially when using the default in-memory metrics store. Long-running benchmarks can end up causing excessive memory usage by `esrally` and ultimately trigger the kernel OOMKiller on the load driver.

This is particularly an issue for benchmarks that have very long-running task(s) where we want to calculate statistics on per-request latencies, as we currently store samples for every request performed (more details below) in memory for the duration of a task. Once a task is completed we serialise the samples and compress them with zlib, and only attempt to reload them into memory once the benchmark is complete.
If you consider that some of these benchmarks can execute hundreds of thousands of requests per second and can run for days, you can quickly see how this becomes an issue.
We create one `Sampler` per `Worker` (i.e. per core) that is shared by all clients on that `Worker` to store their samples during task execution, and we also limit the number of samples that can be stored per `Sampler` to 2^20 (1,048,576). Once that queue is full, further samples are dropped:

`self.logger.warning("Dropping sample for [%s] due to a full sampling queue.", task.operation.name)`

rally/esrally/driver/driver.py, lines 1327 to 1339 in 2470328
rally/esrally/driver/driver.py, lines 1363 to 1413 in 2470328
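For illustration, here is a minimal sketch of that pattern (a bounded, thread-safe queue shared by all clients on a worker, dropping samples once it is full). The `BoundedSampler` name and fields are simplified placeholders, not Rally's actual `Sampler` implementation:

```python
import logging
import queue

logger = logging.getLogger(__name__)


class BoundedSampler:
    """Simplified stand-in for a per-worker sampler: a bounded, thread-safe
    queue shared by all clients on one worker."""

    def __init__(self, buffer_size=2**20):
        self.q = queue.Queue(maxsize=buffer_size)

    def add(self, operation_name, sample):
        try:
            self.q.put_nowait(sample)
        except queue.Full:
            # Mirrors the warning quoted above: once the queue is full,
            # further samples for this task are dropped on the floor.
            logger.warning("Dropping sample for [%s] due to a full sampling queue.", operation_name)
```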
The size of each sample will depend on the operation's metadata etc., but if we assume that a sample is at least 4KB (a rough estimate from testing) then we can use ~4GB of memory per worker, per task before dropping samples on the floor.
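Back-of-the-envelope arithmetic for that estimate, assuming the ~4KB-per-sample figure above:

```python
# Rough worst-case memory held by one Sampler before it starts dropping samples.
samples_per_sampler = 2**20      # per-Sampler queue limit (1,048,576 samples)
bytes_per_sample = 4 * 1024      # ~4KB per sample, rough estimate from testing
total_gib = samples_per_sampler * bytes_per_sample / 2**30
print(total_gib)                 # 4.0 GiB per worker, per task
```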
We actually do some form of compression on in-memory metrics in `to_externalizable()`, but only either after all tasks have reached a join point, or on benchmark completion:

rally/esrally/driver/driver.py, lines 819 to 845 in 2470328
rally/esrally/metrics.py, lines 1153 to 1161 in 2470328
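For reference, a hedged sketch of what "serialise then compress with zlib" can look like; `pickle` is used here purely for illustration and is not necessarily Rally's actual serialisation format:

```python
import pickle
import zlib


def compress_samples(samples):
    """Serialise a batch of sample objects and compress the resulting blob with zlib."""
    return zlib.compress(pickle.dumps(samples))


def decompress_samples(blob):
    """Inverse operation: decompress the blob and deserialise it back into Python objects."""
    return pickle.loads(zlib.decompress(blob))
```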
The `Driver` already wakes up every `POST_PROCESS_INTERVAL_SECONDS` (30 seconds) to flush the collected samples:

rally/esrally/driver/driver.py, lines 295 to 305 in 2470328
rally/esrally/driver/driver.py, lines 950 to 955 in 2470328
rally/esrally/driver/driver.py, lines 1062 to 1068 in 2470328
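A minimal sketch of that periodic wake-up, using a plain thread and a hypothetical `run_periodic_post_processing()` helper rather than Rally's actor-based scheduling:

```python
import threading
import time

POST_PROCESS_INTERVAL_SECONDS = 30


def run_periodic_post_processing(post_process, stop_event):
    """Invoke post_process() every POST_PROCESS_INTERVAL_SECONDS until stop_event is set."""
    # Event.wait() returns False on timeout, so the body runs once per interval.
    while not stop_event.wait(POST_PROCESS_INTERVAL_SECONDS):
        post_process()


if __name__ == "__main__":
    stop = threading.Event()
    threading.Thread(
        target=run_periodic_post_processing,
        args=(lambda: print("flushing collected samples..."), stop),
        daemon=True,
    ).start()
    time.sleep(65)  # let it fire a couple of times
    stop.set()
```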
Ideas:

- Revisit the `2^20` per-`Sampler` sample limit?
- Consider implementing similar serialisation and compression techniques on samples as part of the routine `flush()` call made during a benchmark's execution for in-memory metrics stores? (see the sketch below)
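As a rough illustration of the second idea, a hedged sketch of an in-memory store that keeps already-flushed samples as zlib-compressed chunks instead of raw Python objects; the `CompressingInMemoryStore` class and its methods are hypothetical and do not mirror Rally's metrics store API:

```python
import pickle
import zlib


class CompressingInMemoryStore:
    """Hypothetical in-memory store that compresses each flushed batch of samples,
    trading some CPU per flush for a much smaller steady-state memory footprint."""

    def __init__(self):
        self._chunks = []   # zlib-compressed, already-flushed batches
        self._pending = []  # raw samples gathered since the last flush

    def add(self, sample):
        self._pending.append(sample)

    def flush(self):
        # Called periodically during the benchmark (e.g. every post-processing
        # interval): compress the pending batch and keep only the bytes.
        if self._pending:
            self._chunks.append(zlib.compress(pickle.dumps(self._pending)))
            self._pending = []

    def all_samples(self):
        # Only inflate everything once, at the end of the benchmark.
        samples = []
        for chunk in self._chunks:
            samples.extend(pickle.loads(zlib.decompress(chunk)))
        samples.extend(self._pending)
        return samples
```

Samples would then only be fully inflated once, when the benchmark completes and the final statistics are calculated.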