
Support dynamic calculation of JVM resources in CLI cmd #944

Merged: amahussein merged 5 commits into NVIDIA:dev from spark-rapids-tools-943 on Apr 18, 2024

Conversation

@amahussein amahussein (Collaborator) commented Apr 16, 2024

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #943

This code change reduces the probability of an OOME thrown by the core tools when too many threads are created within the core module. The problem was that a thread processing an eventlog would need around 4-6 GB to succeed. This PR aims at dynamically calculating the number of threads that can fit in the virtual memory of the host. Note that this does not solve the problem; it is an improvement to dynamically pass JVM resources to the java cmd. An OOME can still be thrown if a batch of eventlogs is large enough to exceed the assumed 8 GB per thread.

What has changed:

  • Use G1GC as the GC algorithm. This overrides the default JDK8 parallel GC; G1GC, which stands for Garbage-First GC, could be a better option for short-lived objects.
  • Pull the virtual memory information of the host machine to calculate the default heap size. By default, the heap size is set to 80% of the total virtual memory.
  • Next, calculate the number of threads to be passed to the RAPIDS java cmd. Assuming that a thread needs at least 8 GB of heap memory, the number of threads is calculated as `heap_size / 8` (see the sketch after this list).
  • If the CLI is running in concurrent mode (i.e., estimation_model is enabled), the CLI splits the resources between Profiling and Qualification at a 2:1 ratio, respectively.
  • Add `jvm_heap_size` to the `spark_rapids` CLI.
  • Add a temporary flag to disable concurrency mode. The CLI will run Qualification and Profiling in sequence when estimation_model is set to XGBOOST.
  • Add `jvm_threads` to the `spark_rapids` CLI.
  • Put an upper bound on the number of threads assigned to the JVM.
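A minimal sketch of the heap and thread calculation described above, assuming `psutil` is available; the names `HEAP_FRACTION`, `GB_PER_THREAD`, and `calc_jvm_resources` are illustrative, not taken from the actual CLI implementation:

```python
import psutil

HEAP_FRACTION = 0.80   # per this PR: default heap = 80% of total virtual memory
GB_PER_THREAD = 8      # per this PR: assume each thread needs ~8 GB of heap

def calc_jvm_resources(max_threads: int = 6):
    """Illustrative sketch: derive JVM heap size (GB) and thread count from host memory."""
    total_gb = psutil.virtual_memory().total / (1024 ** 3)
    heap_gb = int(total_gb * HEAP_FRACTION)
    # One thread per 8 GB of heap, with an upper bound as described above.
    num_threads = max(1, min(max_threads, heap_gb // GB_PER_THREAD))
    return heap_gb, num_threads

heap_gb, num_threads = calc_jvm_resources()
# In concurrent mode, the PR splits resources 2:1 between Profiling and Qualification.
prof_heap_gb = (2 * heap_gb) // 3
qual_heap_gb = heap_gb - prof_heap_gb
```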

@amahussein amahussein added the feature request and user_tools labels Apr 16, 2024
@amahussein amahussein self-assigned this Apr 16, 2024
@parthosa parthosa (Collaborator) left a comment

Thanks @amahussein for this change.

In this PR, by default we would run the P and Q tools sequentially.

I was wondering if running them sequentially with explicit values of Xmx would be sufficient (see the sketch after this comment)?

Concerns about splitting memory between the P and Q tools:

  1. For smaller jobs, the total running time would be similar in both the sequential and parallel cases, so splitting should not have a significant impact.
  2. For larger jobs, even with splitting we can still crash due to low memory for the Q tool.
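For illustration, the sequential alternative with explicit `-Xmx` values (plus the G1GC flag this PR adds) could look like the sketch below; the jar path and main class names are assumptions, not taken from this PR:

```python
import subprocess

heap_gb = 24  # e.g., 80% of a 30 GB host, given to each tool in turn
tools = [
    "com.nvidia.spark.rapids.tool.qualification.QualificationMain",  # assumed class name
    "com.nvidia.spark.rapids.tool.profiling.ProfileMain",            # assumed class name
]
for main_class in tools:
    cmd = ["java", f"-Xmx{heap_gb}g", "-XX:+UseG1GC",
           "-cp", "rapids-4-spark-tools.jar",  # assumed jar name
           main_class, "/path/to/eventlogs"]
    subprocess.run(cmd, check=True)  # run the tools one after the other
```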

parthosa previously approved these changes Apr 17, 2024
@amahussein amahussein (Collaborator, Author) left a comment

Added a new argument `jvm_threads` to allow setting the number of threads assigned to the RAPIDS tools, as per @mattahrens' request.
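A hypothetical invocation with the new arguments; the exact flag spelling is illustrative, only the argument names `jvm_heap_size` and `jvm_threads` come from this PR:

```
spark_rapids qualification --eventlogs /path/to/eventlogs --jvm_heap_size 32 --jvm_threads 4
```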

@nartal1 nartal1 (Collaborator) left a comment

Thanks @amahussein! Just a couple of questions on the latest commit.

# Maximum number of threads that can be used in the tools JVM.
# cpu_count returns the logical number of cores, so we take 50% to get a better
# representation of physical cores.
return min(6, (psutil.cpu_count() + 1) // 2)
@nartal1 nartal1 (Collaborator)

Can we use `psutil.cpu_count(logical=False)` to get the number of physical cores?

@amahussein amahussein (Collaborator, Author)

Very good question.
I actually tried that out on my local development machine and got the same return value.
All the docs suggest it should return a different result.
I did not dive deeper into why that's the case, but my intuition is that it could be a kernel (OS) or library compatibility thing.
That's why I decided to lower the returned value by dividing by 2. IIRC, the core tools code comments say that we use number_cores/4 as the default num_threads.

@parthosa parthosa (Collaborator) Apr 18, 2024

I see. I think on our Macs we have the same number of physical and logical cores; on Linux machines these should be different. Although it should not matter, because we have the min operator around it.
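A quick way to check the logical vs. physical counts on a given machine, using documented `psutil` calls:

```python
import psutil

print(psutil.cpu_count())               # logical cores (includes hyper-threads)
print(psutil.cpu_count(logical=False))  # physical cores; may return None on some platforms
```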

@amahussein amahussein (Collaborator, Author) left a comment

Thanks @parthosa and @nartal1

@parthosa parthosa (Collaborator) left a comment

Thanks @amahussein


@amahussein amahussein merged commit c31172b into NVIDIA:dev Apr 18, 2024
15 checks passed
@amahussein amahussein deleted the spark-rapids-tools-943 branch April 18, 2024 20:06