Support dynamic calculation of JVM resources in CLI cmd #944
Conversation
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#943

This code change reduces the probability of an OOME being thrown by the core-tools when too many threads are created within the core module. The problem was that a thread processing an eventlog needs around 4-6 GB to succeed. This PR aims at dynamically calculating the number of threads that can fit within the virtual memory of the host. Note that this does not solve the problem entirely; it is an improvement that dynamically passes JVM resources to the java cmd. An OOME can still be thrown if a batch of eventlogs is large enough to exceed the expected 8 GB per thread.

What has changed:
- Use G1GC as the GC algorithm. This overrides the default JDK8 parallel GC; G1GC (Garbage-First GC) can be a better option for targeting short-lived objects.
- Pull the virtual-memory information of the host machine to calculate the default heap size. By default, the heap size is set to 80% of the total virtual memory.
- Next, calculate the number of threads to be passed to the RAPIDS java cmd. Assuming that a thread needs at least 8 GB of heap memory, the number of threads is calculated as (`heap_size / 8`).
- If the CLI is running in concurrent mode (i.e., estimation_model is enabled), the CLI splits the resources between Profiling and Qualification in a 2:1 ratio.
- Add `jvm_heap_size` to the `spark_rapids` CLI
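A minimal sketch of the resource calculation described above, assuming `psutil` is available; the function name `calc_jvm_resources` and the exact rounding are illustrative only, not the actual CLI implementation:

```python
import psutil

def calc_jvm_resources(concurrent: bool = False) -> dict:
    """Illustrative sketch of the heuristic described in this PR (not the real code)."""
    total_mem_gb = psutil.virtual_memory().total / (1024 ** 3)
    # Default heap size: 80% of the host's total memory.
    heap_size_gb = int(total_mem_gb * 0.8)
    # Assume each eventlog-processing thread needs roughly 8 GB of heap.
    num_threads = max(1, heap_size_gb // 8)
    if not concurrent:
        return {'heap_size_gb': heap_size_gb, 'num_threads': num_threads}
    # Concurrent mode: split resources 2:1 between Profiling and Qualification.
    prof_heap = (heap_size_gb * 2) // 3
    qual_heap = heap_size_gb - prof_heap
    return {
        'profiling': {'heap_size_gb': prof_heap, 'num_threads': max(1, prof_heap // 8)},
        'qualification': {'heap_size_gb': qual_heap, 'num_threads': max(1, qual_heap // 8)},
    }
```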
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Thanks @amahussein for this change.
In this PR, by default we would run the P and Q tools sequentially.
I was wondering if running them sequentially with explicit values of Xmx should be sufficient?
Concerns about splitting memory between the P and Q tools:
- For smaller jobs, the total time of running the tools would be similar in both the sequential and parallel cases, so splitting should not have a significant impact.
- For larger jobs, with splitting we can still crash due to low memory for the Q tool.
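For illustration only, the sequential alternative with explicit heap settings might look like the following sketch; `build_jvm_args` is a hypothetical helper, not the CLI's actual wiring:

```python
def build_jvm_args(heap_size_gb: int) -> list:
    # Explicit heap size plus the G1GC flag proposed in this PR.
    return [f'-Xmx{heap_size_gb}g', '-XX:+UseG1GC']

# Sequential run: each tool gets the full heap budget, one after the other.
for tool in ('qualification', 'profiling'):
    jvm_args = build_jvm_args(heap_size_gb=32)
    # ... launch the tool's java cmd with jvm_args prepended ...
```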
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Added a new argument `jvm_threads` to allow setting the number of threads assigned to the RAPIDS tools, as per @mattahrens' request.
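Conceptually, the user-supplied value would just override the computed default; a hypothetical sketch (names like `resolve_num_threads` are illustrative, not the actual CLI code):

```python
def resolve_num_threads(user_jvm_threads, heap_size_gb: int) -> int:
    # Default heuristic: roughly one thread per 8 GB of heap.
    default_threads = max(1, heap_size_gb // 8)
    # An explicit jvm_threads argument takes precedence over the heuristic.
    return user_jvm_threads if user_jvm_threads is not None else default_threads
```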
Thanks @amahussein ! Just a couple of questions on the latest commit.
# Maximum number of threads that can be used in the tools JVM.
# cpu_count returns the logical number of cores. So, we take a 50% to get better representation
# of physical cores.
return min(6, (psutil.cpu_count() + 1) // 2)
Can we use `psutil.cpu_count(logical=False)` to get the number of physical cores?
Very good question.
I actually tried that out in my local development environment and got the same return value. All the docs suggest it should return a different result. I did not dive deeper to see why that's the case, but my intuition is that it could be a kernel (OS) or library compatibility issue.
That's why I decided to lower the returned value by dividing by 2. IIRC, the core-tools code comments say that we use number_cores/4 as the default num_threads.
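For reference, the two `psutil` calls being compared; on some platforms `logical=False` can return the same number as the logical count, or even `None`:

```python
import psutil

logical = psutil.cpu_count()                 # logical cores (hyper-threads included)
physical = psutil.cpu_count(logical=False)   # physical cores; may be None on some platforms
print(logical, physical)

# Heuristic used in this PR: halve the logical count and cap at 6.
max_threads = min(6, (logical + 1) // 2)
print(max_threads)
```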
I see. I think on our Macs we have the same number of physical and logical cores; on Linux machines these should be different. Although it should not matter because we have the min operator around it.
Thanks @amahussein
Signed-off-by: Ahmed Hussein (amahussein) [email protected]
Fixes #943
What has changed:
- Use G1GC as the GC algorithm to override the default JDK8 parallel GC.
- Pull the virtual-memory information of the host machine to calculate the default heap size (80% of the total virtual memory by default).
- Calculate the number of threads to be passed to the RAPIDS java cmd as (`heap_size / 8`), assuming a thread needs at least 8 GB of heap memory.
- Add `jvm_heap_size` to the `spark_rapids` CLI
- If the CLI is running in concurrent mode (i.e., `estimation_model` is set to XGBOOST), the CLI splits the resources between Profiling and Qualification in a 2:1 ratio.
- Add `jvm_threads` to the `spark_rapids` CLI