[TASK] Optimize the storage of accumulables in core tools #1263

amahussein · 2024-08-06T19:13:35Z

While investigating bottlenecks, it was found that most of the allocated objects represent metrics.

In the original code, the accumulables were stored in a huge map, AppBase.taskStageAccumMap:

  // accum id to task stage accum info
  var taskStageAccumMap: HashMap[Long, ArrayBuffer[TaskStageAccumCase]] =
    HashMap[Long, ArrayBuffer[TaskStageAccumCase]]()

There is another hashMap that maps AccumulableIDs to sets of StageIDs. It is used to build the Exec-to-Stage map.

The class TaskStageAccumCase definition is as follows:

case class TaskStageAccumCase(
    stageId: Int,
    attemptId: Int,
    taskId: Option[Long],
    accumulatorId: Long,
    name: Option[String],
    // The total accumulated so far for all tasks
    value: Option[Long],
    // The amount for this particular task/update
    update: Option[Long],
    isInternal: Boolean)

A new TaskStageAccumCase is created for each accumulable when a task/stage is completed.

Let's say there is a stage with 100 tasks.
An accumulator ID will have an ArrayBuffer of size 101 (one entry per task plus one for the stage). All of those entries repeat the same common fields and differ only in the TaskID/update values, if any.
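The redundancy can be illustrated with a minimal sketch. It reuses the TaskStageAccumCase definition quoted above; the object name LegacyLayoutDemo, the IDs, the metric name, and the values are hypothetical and chosen only for illustration. It also assumes that the stage-level entry carries no taskId and no update.

```scala
// Same shape as the case class quoted above.
case class TaskStageAccumCase(
    stageId: Int,
    attemptId: Int,
    taskId: Option[Long],
    accumulatorId: Long,
    name: Option[String],
    value: Option[Long],
    update: Option[Long],
    isInternal: Boolean)

object LegacyLayoutDemo {
  // Hypothetical values, for illustration only.
  val stageId = 3
  val accumId = 42L
  val numTasks = 100
  val accumName: Option[String] = Some("internal.metrics.executorRunTime")

  // One entry per task update.
  val taskEntries: Seq[TaskStageAccumCase] = (0 until numTasks).map { t =>
    TaskStageAccumCase(stageId, 0, Some(t.toLong), accumId, accumName,
      Some((t + 1) * 10L), Some(10L), isInternal = true)
  }

  // One more entry when the stage completes (assumed: no taskId, no update).
  val stageEntry: TaskStageAccumCase =
    TaskStageAccumCase(stageId, 0, None, accumId, accumName,
      Some(numTasks * 10L), None, isInternal = true)

  // What the legacy ArrayBuffer for this accumulator ID would hold.
  val entries: Seq[TaskStageAccumCase] = taskEntries :+ stageEntry

  // The fields that are repeated in every single entry.
  val distinctCommonFields =
    entries.map(e => (e.stageId, e.attemptId, e.accumulatorId, e.name, e.isInternal)).distinct
}
```

In this sketch, entries has 101 elements while distinctCommonFields has exactly one, which is the duplication the PR targets.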

Changes

This PR revisits the way the accumulables are stored.

Accumulable names are stored in a global concurrent hashMap, AccumNameRef. This means that only one string is created per name and it is shared across all threads to represent the same accumulable.
A new class AccumMetaRef holds the pair <accumId, AccumNameRef>. This encapsulation is important for propagating the same optimization when dumping the data.
AccumMetaRef instances are stored in a per-app hashMap because they should not be shared across different threads/applications. Once the analysis is done, the map is garbage collected.
AccumProfileResult is changed to use AccumMetaRef to reduce memory consumption. This reduces the number of allocations since the AccumMetaRef already exists in memory. Finally, the CSV format conversion is also part of AccumNameRef because we should create only one value for each accumulator instead of reformatting a new string for each row (multiplied by the number of stages).
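The storage scheme described above could be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class shapes, the intern and getOrCreate methods, the AppAccumRegistry class, and the CSV quoting rule are all assumptions made for the example.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}
import scala.collection.mutable

// Hypothetical simplified name holder; the real AccumNameRef may differ.
class AccumNameRef(val value: String) {
  // The CSV form is built once per name instead of once per output row.
  // Quoting by doubling embedded quotes is an assumption for this sketch.
  val csvValue: String = "\"" + value.replace("\"", "\"\"") + "\""
}

object AccumNameRef {
  // Global table shared by all threads and applications.
  private val table = new ConcurrentHashMap[String, AccumNameRef]()

  // Returns the single shared instance for a given name.
  def intern(name: String): AccumNameRef =
    table.computeIfAbsent(name, new JFunction[String, AccumNameRef] {
      override def apply(k: String): AccumNameRef = new AccumNameRef(k)
    })
}

// Pairs an accumulator ID with its shared name reference.
final case class AccumMetaRef(id: Long, name: AccumNameRef)

// Hypothetical per-application registry. It is not shared across apps,
// so it can be dropped as a whole once the analysis of that app is done.
class AppAccumRegistry {
  private val metaRefs = mutable.HashMap.empty[Long, AccumMetaRef]

  def getOrCreate(id: Long, name: String): AccumMetaRef =
    metaRefs.getOrElseUpdate(id, AccumMetaRef(id, AccumNameRef.intern(name)))
}
```

With this layout, two applications that report the same metric name end up pointing at the same AccumNameRef instance, while each application keeps its own AccumMetaRef entries.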

Unit tests affected

"test printSQLPlanMetrics": this unit test was affected because the legacy code used a "0" when a task update did not exist. As a result, it forced the minimum to be 0 for all the accumulables, which is incorrect. The new code handles this correctly because it only aggregates stats for records with a valid mapping.
"test dsv1 complex": the estimated GPU speedup is different in the new code.
- The new code creates an AccumID-to-StageID entry before a stage is completed. This should capture the cases where a stage updates an accumulable but is not completed.
- As mentioned earlier, the new code does not force a "0" onto the accumulable when a mapping between stage/task and AccumId does not exist.
- The new code captures the case when a task updates the same accumulable multiple times.
- The maxDuration now correctly looks into the stage records. The legacy code looked for the maximum value in the entire ArrayBuffer, which could eventually pick up task entries if stage entries do not exist (incomplete stages).
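The effect of the "0" default on the minimum can be shown with a small sketch. The object name MinStatDemo and the update values are hypothetical; the two computations only mimic the legacy and new behaviors described above and are not the tool's actual aggregation code.

```scala
object MinStatDemo {
  // Hypothetical per-task updates for one accumulable.
  // Task 2 never reported an update for it.
  val updates: Map[Long, Option[Long]] = Map(
    0L -> Some(40L),
    1L -> Some(25L),
    2L -> None,
    3L -> Some(60L))

  // Legacy-style: a missing update is replaced by 0, which drags the minimum down.
  val legacyMin: Long = updates.values.map(_.getOrElse(0L)).min

  // New-style: only records that actually have an update are aggregated.
  val validUpdates: Seq[Long] = updates.values.collect { case Some(v) => v }.toSeq
  val newMin: Option[Long] = if (validUpdates.isEmpty) None else Some(validUpdates.min)
}
```

Here legacyMin is 0 even though no task reported 0, whereas newMin is Some(25), the smallest update that was actually recorded.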

Future work and Followups

See the list of tasks in #815

Performance Optimizations

@bilalbari please share some performance numbers in this PR description:

heap-dumps before and after
benchmarkSuite before/after with different heap size
Proof of memory savings by showing a case where an eventlog would fail on heap size and eventually succeeds on the new branch with smaller or equal heap size

Heap Usage Before Changes

Heap Usage Post Changes

Example of a Failed Run (OOM)
Setup:

Heap Size - 14G (this continues to OOM up to 18G heap size)
Event Log Size - 1.5G (gz compressed)

OOM BASELINES

Before changes
- Minimum memory required to run - 19G
After changes
- Minimum memory required to run - 5G
Memory improvement - 74% (from 19G to 5G)

BenchmarkSuite output without changes

JVM Name                   :   OpenJDK 64-Bit Server VM 
Java Version               :   1.8.0_422 
OS Name                    :   Linux 
OS Version                 :   6.5.0-41-generic 
MaxHeapMemory              :   21504 MB 
Total Warm Up Iterations   :   3 
Total Runtime Iterations   :   3 
Input Arguments            :    --output-directory /home/sbari/project-repos/scratch_folder/issue-367/output_folder /home/sbari/project-repos/scratch_folder/issue-367/eventlogs/temp-event-logs 
 
================================================================================================
Benchmark_Per_SQL_Arg_Qualification
================================================================================================

Benchmark :                               Best Time(ms)   Avg Time(ms)   Stdev(ms)      Avg GC Time(ms)       Avg GC Count     Stdev GC Count    Max GC Time(ms)       Max GC Count   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Enable_Per_SQL_Arg_Qualification                  87366          87441         100               6607.0               48.0                  4               6632                 51      1.00X
Disable_Per_SQL_Arg_Qualification                 86165          86448         257               6627.0               45.0                  0               6908                 46      1.01X

BenchmarkSuite output with changes

JVM Name                   :   OpenJDK 64-Bit Server VM 
Java Version               :   1.8.0_422 
OS Name                    :   Linux 
OS Version                 :   6.5.0-41-generic 
MaxHeapMemory              :   10240 MB 
Total Warm Up Iterations   :   3 
Total Runtime Iterations   :   3 
Input Arguments            :    --output-directory /home/sbari/project-repos/scratch_folder/issue-367/output_folder /home/sbari/project-repos/scratch_folder/issue-367/eventlogs/temp-event-logs 
 
================================================================================================
Benchmark_Per_SQL_Arg_Qualification
================================================================================================

Benchmark :                               Best Time(ms)   Avg Time(ms)   Stdev(ms)      Avg GC Time(ms)       Avg GC Count     Stdev GC Count    Max GC Time(ms)       Max GC Count   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Enable_Per_SQL_Arg_Qualification                 116229         116712         424               3416.0              142.0                  2               3455                144      1.00X
Disable_Per_SQL_Arg_Qualification                117012         118136        1466               3157.0              129.0                  1               3180                130      0.99X

Signed-off-by: Ahmed Hussein <[email protected]>

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Signed-off-by: Sayed Bilal Bari <[email protected]>

core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/StatisticsMetrics.scala