[FEA] Include bootstrap recommended configs in qualification output #451
Signed-off-by: Cindy Jiang <[email protected]>
We test the qualification tool with the following script:
```
@@ -192,6 +193,51 @@ class Qualification(RapidsJarTool):
    """
    name = 'qualification'

    def __calculate_spark_settings(self, worker_info: NodeHWInfo) -> dict:
```
Would there be a way to move this function into a utility library so that it's shared between qualification and bootstrap?
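One way such a refactor could look, sketched with hypothetical module and class names (the real tool's layout and APIs may differ):

```python
# Hypothetical sketch of hoisting the shared logic into a utility module
# (e.g. a cluster_config.py) that both tools import. All names here are
# illustrative, not the tool's actual API.

class ClusterConfigurator:
    """Single shared home for the Spark-settings arithmetic."""

    @staticmethod
    def calculate_spark_settings(num_executor_cores: int) -> dict:
        # The body currently private to Qualification would move here.
        return {'spark.executor.cores': num_executor_cores}


class Qualification:
    def get_recommendations(self) -> dict:
        # Delegate to the shared helper instead of duplicating the logic.
        return ClusterConfigurator.calculate_spark_settings(16)


class Bootstrap:
    def get_recommendations(self) -> dict:
        return ClusterConfigurator.calculate_spark_settings(16)
```

With this shape, a fix to the arithmetic lands in one place and both tools pick it up.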
```yaml
clusterConfigs:
  constants:
    # Maximum amount of pinned memory to use per executor in megabytes
    maxPinnedMemoryMB: 4096
    # Default pageable pool size per executor in megabytes
    defaultPageablePoolMB: 1024
    # Maximum number of concurrent tasks to run on the GPU
    maxGpuConcurrent: 4
    # Amount of GPU memory to use per concurrent task in megabytes
    # Using a bit less than 8GB here since Dataproc clusters advertise
    # T4s as only having around 14.75 GB and we want to run with
    # 2 concurrent by default on T4s.
    gpuMemPerTaskMB: 7500
    # Ideal amount of JVM heap memory to request per CPU core in megabytes
    heapPerCoreMB: 2048
    # Fraction of the executor JVM heap size that should be additionally reserved
    # for JVM off-heap overhead (thread stacks, native libraries, etc.)
    heapOverheadFraction: 0.1
    # Amount of CPU memory to reserve for system overhead (kernel, buffers, etc.) in megabytes
    systemReserveMB: 2048
    # By default set the spark.sql.files.maxPartitionBytes to 512m
    maxSqlFilesPartitionsMB: 512
```
Similarly, would there be a way to move these constants into a common file so that they are shared between qualification and bootstrap?
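One minimal way to do that, sketched as a hypothetical shared Python module that mirrors the YAML values above (the module name is illustrative):

```python
# Hypothetical cluster_constants.py: define the bootstrap tuning
# constants exactly once, so qualification and bootstrap both import
# the same values instead of each carrying a copy.
CLUSTER_CONSTANTS = {
    'maxPinnedMemoryMB': 4096,        # max pinned memory per executor (MB)
    'defaultPageablePoolMB': 1024,    # pageable pool per executor (MB)
    'maxGpuConcurrent': 4,            # max concurrent GPU tasks
    'gpuMemPerTaskMB': 7500,          # GPU memory per concurrent task (MB)
    'heapPerCoreMB': 2048,            # JVM heap per CPU core (MB)
    'heapOverheadFraction': 0.1,      # extra off-heap overhead fraction
    'systemReserveMB': 2048,          # CPU memory reserved for the OS (MB)
    'maxSqlFilesPartitionsMB': 512,   # spark.sql.files.maxPartitionBytes
}
```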
Can you paste example output of the updates with bootstrap info for some of the qual commands?
Signed-off-by: Cindy Jiang <[email protected]>
An example qualification output with bootstrap recommendations is as follows:
Looks great! Only a minor nit to change:
Signed-off-by: Cindy Jiang <[email protected]>
Thank you @mattahrens! Changes have been updated.
@mattahrens We did not validate bootstrap for DB platforms.
Do you want the configurations to be part of the qualification output even for DB-Azure/DB-AWS?
```python
res = {
    'spark.executor.cores': num_executor_cores,
    'spark.executor.memory': f'{executor_heap}m',
    'spark.executor.memoryOverhead': f'{executor_mem_overhead}m',
    'spark.rapids.sql.concurrentGpuTasks': gpu_concurrent_tasks,
    'spark.rapids.memory.pinnedPool.size': f'{pinned_mem}m',
    'spark.sql.files.maxPartitionBytes': f'{constants.get("maxSqlFilesPartitionsMB")}m',
    'spark.task.resource.gpu.amount': 1 / num_executor_cores,
    'spark.rapids.shuffle.multiThreaded.reader.threads': num_executor_cores,
    'spark.rapids.shuffle.multiThreaded.writer.threads': num_executor_cores,
    'spark.rapids.sql.multiThreadedRead.numThreads': max(20, num_executor_cores)
}
return res
```
The recommendations generated by the bootstrap tool do not match all platforms.
For example, the changes in #440 adjusted the default configurations to fit the Databricks platform.
Given that the bootstrap tool is not available for DB-Azure/DB-AWS, this section should not be valid when running against DB.
@cindyyuanjiang I see you put an example CLI for DB, but have you actually verified that the bootstrap configurations are not part of the output on those platforms?
For now, we cannot include bootstrap output when running the qual tool on DB. File an issue to track adding bootstrap config support for DB qualification.
@amahussein thank you! I will remove the bootstrap recommendations for DB platforms.
@mattahrens thank you! I filed a follow-up issue, which is tracked in this PR's description.
Signed-off-by: Cindy Jiang <[email protected]>
The bootstrap recommendations should be disabled in DB runs as long as we define it as one of the wrapperReporting.
This is how we generate the platform-specific reports as part of the tool.
I will follow up with a commit to get that fixed.
Stdout with the new section:
@cindyyuanjiang I could not push to the feature branch, so I uploaded the changes as a diff-patch.
A couple of comments:
Thanks @mattahrens. The new STDOUT looks like the following:
Signed-off-by: Cindy Jiang <[email protected]>
We added back the bash script for creating the gpu cluster. Latest STDOUT looks like the following:
Fixes #410.
We add the recommended configs that the bootstrap tool would generate to the qualification output, both in the CLI and in the summary log file.
We will generate these recommended configs only when the qualification tool outputs a recommended GPU cluster.
For now, the bootstrap tool is not available for the Databricks-AWS/Databricks-Azure platforms, so we cannot include the bootstrap recommendation output for the Databricks platforms. A follow-up issue is tracked here: #461
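The gating behavior described above can be sketched in a few lines; the platform strings and function name here are illustrative, not the tool's actual API:

```python
# Hypothetical sketch of the platform gate: bootstrap recommendations
# are emitted only when the qualification tool produced a recommended
# GPU cluster AND the platform supports bootstrap (Databricks platforms
# are excluded for now; see the follow-up issue #461).
DB_PLATFORMS = {'databricks-aws', 'databricks-azure'}

def should_emit_bootstrap_section(platform: str,
                                  gpu_cluster_recommended: bool) -> bool:
    return gpu_cluster_recommended and platform.lower() not in DB_PLATFORMS
```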