
Support dynamic calculation of JVM resources in CLI cmd #944

Merged: amahussein merged 5 commits into NVIDIA:dev from spark-rapids-tools-943 on Apr 18, 2024

Conversation

@amahussein amahussein (Collaborator) commented Apr 16, 2024

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #943

This code change reduces the probability of an OOME thrown by the core tools when too many threads are created within the core module. The problem was that a thread processing an eventlog would need around 4-6 GB to succeed. This PR aims at dynamically calculating the number of threads that can fit in the virtual memory of the host. Note that this does not solve the problem; it is an improvement to dynamically pass JVM resources to the java cmd. An OOME can still be thrown if a batch of eventlogs is large enough to exceed the assumed 8 GB per thread.

What has changed:

  • Use G1GC as the GC algorithm. This overrides the default JDK8 parallel GC; G1GC, which stands for Garbage-First GC, could be a better option for short-lived objects.
  • Pull the virtual memory information of the host machine to calculate the default heap size. By default, the heap size is set to 80% of the total virtual memory.
  • Next, calculate the number of threads to be passed to the RAPIDS java cmd. Assuming that a thread needs at least 8 GB of heap memory, the number of threads is calculated as `heap_size / 8` (see the sketch after this list).
  • If the CLI is running in concurrent mode (i.e., estimation_model is enabled), the CLI splits the resources between Profiling and Qualification at a 2:1 ratio, respectively.
  • Add `jvm_heap_size` to the `spark_rapids` CLI.
  • Add a temporary flag to disable concurrency mode. The CLI will run Qualification and Profiling in sequence when estimation_model is set to XGBOOST.
  • Add `jvm_threads` to the `spark_rapids` CLI.
  • Put an upper bound on the number of threads assigned to the JVM.
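A minimal sketch of the heap and thread calculation described above, assuming `psutil` is available; the names `HEAP_FRACTION`, `GB_PER_THREAD`, and `calc_jvm_resources` are illustrative, not taken from the actual CLI implementation:

```python
import psutil

HEAP_FRACTION = 0.80   # per this PR: default heap = 80% of total virtual memory
GB_PER_THREAD = 8      # per this PR: assume each thread needs ~8 GB of heap

def calc_jvm_resources(max_threads: int = 6):
    """Illustrative sketch: derive JVM heap size (GB) and thread count from host memory."""
    total_gb = psutil.virtual_memory().total / (1024 ** 3)
    heap_gb = int(total_gb * HEAP_FRACTION)
    # One thread per 8 GB of heap, with an upper bound as described above.
    num_threads = max(1, min(max_threads, heap_gb // GB_PER_THREAD))
    return heap_gb, num_threads

heap_gb, num_threads = calc_jvm_resources()
# In concurrent mode, the PR splits resources 2:1 between Profiling and Qualification.
prof_heap_gb = (2 * heap_gb) // 3
qual_heap_gb = heap_gb - prof_heap_gb
```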

@amahussein amahussein added the feature request and user_tools labels Apr 16, 2024
@amahussein amahussein self-assigned this Apr 16, 2024
@parthosa parthosa (Collaborator) left a comment

Thanks @amahussein for this change.

In this PR, by default we would run the P and Q tools sequentially.

I was wondering if running them sequentially with explicit values of Xmx would be sufficient (see the sketch after this comment)?

Concerns about splitting memory between the P and Q tools:

  1. For smaller jobs, the total running time would be similar in both the sequential and parallel cases, so splitting should not have a significant impact.
  2. For larger jobs, even with splitting we can still crash due to low memory for the Q tool.
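For illustration, the sequential alternative with explicit `-Xmx` values (plus the G1GC flag this PR adds) could look like the sketch below; the jar path and main class names are assumptions, not taken from this PR:

```python
import subprocess

heap_gb = 24  # e.g., 80% of a 30 GB host, given to each tool in turn
tools = [
    "com.nvidia.spark.rapids.tool.qualification.QualificationMain",  # assumed class name
    "com.nvidia.spark.rapids.tool.profiling.ProfileMain",            # assumed class name
]
for main_class in tools:
    cmd = ["java", f"-Xmx{heap_gb}g", "-XX:+UseG1GC",
           "-cp", "rapids-4-spark-tools.jar",  # assumed jar name
           main_class, "/path/to/eventlogs"]
    subprocess.run(cmd, check=True)  # run the tools one after the other
```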

parthosa previously approved these changes Apr 17, 2024
@amahussein amahussein (Collaborator, Author) left a comment

Added a new argument `jvm_threads` to allow setting the number of threads assigned to the RAPIDS tools, as per @mattahrens' request.
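A hypothetical invocation with the new arguments; the exact flag spelling is illustrative, only the argument names `jvm_heap_size` and `jvm_threads` come from this PR:

```
spark_rapids qualification --eventlogs /path/to/eventlogs --jvm_heap_size 32 --jvm_threads 4
```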

@nartal1 nartal1 (Collaborator) left a comment

Thanks @amahussein! Just a couple of questions on the latest commit.

# Maximum number of threads that can be used in the tools JVM.
# cpu_count returns the logical number of cores, so we take 50% to get a better
# representation of physical cores.
return min(6, (psutil.cpu_count() + 1) // 2)
@nartal1 nartal1 (Collaborator)

Can we use `psutil.cpu_count(logical=False)` to get the number of physical cores?

@amahussein amahussein (Collaborator, Author)

Very good question.
I actually tried that out on my local development machine and got the same return value.
All the docs suggest it should return a different result.
I did not dive deeper into why that's the case, but my intuition is that it could be a kernel (OS) or library compatibility thing.
That's why I decided to lower the returned value by dividing by 2. IIRC, the core tools code comments say that we use number_cores/4 as the default num_threads.

@parthosa parthosa (Collaborator) Apr 18, 2024

I see. I think on our Macs we have the same number of physical and logical cores; on Linux machines these should be different. Although it should not matter, because we have the min operator around it.
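A quick way to check the logical vs. physical counts on a given machine, using documented `psutil` calls:

```python
import psutil

print(psutil.cpu_count())               # logical cores (includes hyper-threads)
print(psutil.cpu_count(logical=False))  # physical cores; may return None on some platforms
```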

@amahussein amahussein (Collaborator, Author) left a comment

Thanks @parthosa and @nartal1

@parthosa parthosa (Collaborator) left a comment

Thanks @amahussein


@amahussein amahussein merged commit c31172b into NVIDIA:dev Apr 18, 2024
15 checks passed
@amahussein amahussein deleted the spark-rapids-tools-943 branch April 18, 2024 20:06