
[FEA] Include bootstrap recommended configs in qualification output #451

Merged
merged 8 commits into NVIDIA:dev from add-bootstrap-configs on Jul 26, 2023

Conversation

@cindyyuanjiang (Collaborator) commented Jul 19, 2023

Fixes #410.

We add the recommended configs that the bootstrap tool would generate to the qualification output, both in the CLI and in the summary log file.

We will generate these recommended configs only when the qualification tool outputs a recommended GPU cluster.

For now, the bootstrap tool is not available for Databricks-AWS/Databricks-Azure platforms, so we cannot include the bootstrap recommendation output for the Databricks platforms.
A follow-up issue is tracked here: #461
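
For context, a minimal sketch of the gating just described, with hypothetical function and variable names (this is not the actual implementation):

```python
# Hedged sketch of the gating described above; all names are illustrative.
SUPPORTED_BOOTSTRAP_PLATFORMS = {'onprem', 'dataproc', 'emr'}  # Databricks pending (#461)

def calculate_spark_settings(worker_info) -> dict:
    """Placeholder for the actual per-worker config calculation."""
    ...

def maybe_bootstrap_recommendations(platform: str, recommended_gpu_cluster):
    # Emit the bootstrap-style configs only when the qualification run
    # produced a recommended GPU cluster on a platform where the
    # bootstrap tool is available.
    if recommended_gpu_cluster is None:
        return None
    if platform not in SUPPORTED_BOOTSTRAP_PLATFORMS:
        return None
    return calculate_spark_settings(recommended_gpu_cluster.worker_info)
```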

@cindyyuanjiang self-assigned this Jul 19, 2023
@cindyyuanjiang added the `feature request` (New feature or request) and `user_tools` (Scope the wrapper module running CSP, QualX, and reports (python)) labels Jul 19, 2023
@cindyyuanjiang (Collaborator, Author) commented Jul 20, 2023

We tested the qualification tool with the following scripts:

################################################
# running OnPrem qualification tool
################################################

# assume there is a local work-dir $SCRATCH_DIR where all the
# resources are being saved

export SCRATCH_DIR=
export LOCAL_OUTPUT_DIR=$SCRATCH_DIR/output_folder
export CPU_CLUSTER_PROPS=$SCRATCH_DIR/cpu_cluster_props.yaml
export REMOTE_EVENT_LOGS=

spark_rapids_user_tools onprem qualification \
  --cpu_cluster $CPU_CLUSTER_PROPS \
  --eventlogs $REMOTE_EVENT_LOGS \
  --target_platform=dataproc


########################################
# running Dataproc qualification tool
########################################

# assume there is a local work-dir $SCRATCH_DIR where all the resources
# are being saved and a remote work-dir $REMOTE_SCRATCH_DIR on GS

export SCRATCH_DIR=
export LOCAL_OUTPUT_DIR=$SCRATCH_DIR/output_folder
export CPU_CLUSTER_NAME=
export CPU_CLUSTER_PROPS=$SCRATCH_DIR/cpu_cluster_props.json

export REMOTE_SCRATCH_DIR=
export REMOTE_EVENT_LOGS=$REMOTE_SCRATCH_DIR/eventlogs
export REMOTE_OUTPUT_DIR=$REMOTE_SCRATCH_DIR/output_folder

# run command with local cpu cluster-properties 
spark_rapids_user_tools dataproc qualification \
  --cpu_cluster $CPU_CLUSTER_PROPS \
  --local_folder $LOCAL_OUTPUT_DIR \
  --eventlogs $REMOTE_EVENT_LOGS \
  --remote_folder $REMOTE_OUTPUT_DIR \
  --verbose

# run command with cpu cluster name
spark_rapids_user_tools dataproc qualification \
  --cpu_cluster $CPU_CLUSTER_NAME \
  --local_folder $LOCAL_OUTPUT_DIR \
  --eventlogs $REMOTE_EVENT_LOGS \
  --remote_folder $REMOTE_OUTPUT_DIR \
  --verbose


########################################
# running EMR qualification tool
########################################

# assume there is a local work-dir $SCRATCH_DIR where all the resources
# are being saved and a remote work-dir $REMOTE_SCRATCH_DIR on S3

export SCRATCH_DIR=
export LOCAL_OUTPUT_DIR=$SCRATCH_DIR/output_folder
export CPU_CLUSTER_NAME=
export CPU_CLUSTER_PROPS=$SCRATCH_DIR/cpu_cluster_props.json

export REMOTE_SCRATCH_DIR=
export REMOTE_EVENT_LOGS=$REMOTE_SCRATCH_DIR/eventlogs
export REMOTE_OUTPUT_DIR=$REMOTE_SCRATCH_DIR/output_folder

# run command with local cpu cluster-properties 
spark_rapids_user_tools emr qualification \
  --cpu_cluster $CPU_CLUSTER_PROPS \
  --local_folder $LOCAL_OUTPUT_DIR \
  --eventlogs $REMOTE_EVENT_LOGS \
  --remote_folder $REMOTE_OUTPUT_DIR \
  --verbose

# run command with cpu cluster name
spark_rapids_user_tools emr qualification \
  --cpu_cluster $CPU_CLUSTER_NAME \
  --local_folder $LOCAL_OUTPUT_DIR \
  --eventlogs $REMOTE_EVENT_LOGS \
  --remote_folder $REMOTE_OUTPUT_DIR \
  --verbose

########################################
# running DB-AWS/DB-Azure qualification tool
########################################

Not supported for now (follow-up tracked in #461)

@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review July 21, 2023 01:14
@@ -192,6 +193,51 @@ class Qualification(RapidsJarTool):
"""
name = 'qualification'

def __calculate_spark_settings(self, worker_info: NodeHWInfo) -> dict:
A collaborator commented:
Would there be a way to move this function into a utility library so that it's shared between qualification and bootstrap?
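
One possible shape for that refactor, sketched with illustrative module and function names (not the actual package layout):

```python
# rapids_tools_common/spark_settings.py -- hypothetical shared module, so
# Qualification and Bootstrap reuse one implementation instead of each
# keeping a private __calculate_spark_settings method.

def calculate_spark_settings(worker_info, constants: dict) -> dict:
    """Single shared implementation; the body of __calculate_spark_settings
    would move here unchanged."""
    ...

# Both callers would then import the shared helper:
# from rapids_tools_common.spark_settings import calculate_spark_settings
```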

Comment on lines 159 to 180
clusterConfigs:
  constants:
    # Maximum amount of pinned memory to use per executor in megabytes
    maxPinnedMemoryMB: 4096
    # Default pageable pool size per executor in megabytes
    defaultPageablePoolMB: 1024
    # Maximum number of concurrent tasks to run on the GPU
    maxGpuConcurrent: 4
    # Amount of GPU memory to use per concurrent task in megabytes
    # Using a bit less than 8GB here since Dataproc clusters advertise
    # T4s as only having around 14.75 GB and we want to run with
    # 2 concurrent by default on T4s.
    gpuMemPerTaskMB: 7500
    # Ideal amount of JVM heap memory to request per CPU core in megabytes
    heapPerCoreMB: 2048
    # Fraction of the executor JVM heap size that should be additionally reserved
    # for JVM off-heap overhead (thread stacks, native libraries, etc.)
    heapOverheadFraction: 0.1
    # Amount of CPU memory to reserve for system overhead (kernel, buffers, etc.) in megabytes
    systemReserveMB: 2048
    # By default set the spark.sql.files.maxPartitionBytes to 512m
    maxSqlFilesPartitionsMB: 512
A collaborator commented:
Similarly, would there be a way to move these constants into a common file so that they are shared between qualification and bootstrap?
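
To illustrate how sharing these constants could look, here is a hedged sketch (the file name and loader are assumptions) that reads the YAML above and reproduces the memory arithmetic seen in the example output later in this thread: with 4 executor cores, heap = 4 × 2048 = 8192m and memoryOverhead = int(8192 × 0.1) + 4096 + 1024 = 5939m.

```python
# Hedged sketch: load the constants from one shared YAML file (path is an
# assumption) so qualification and bootstrap read the same values.
import yaml  # PyYAML

def load_cluster_constants(path: str = 'cluster_configs.yaml') -> dict:
    with open(path, encoding='utf-8') as f:
        return yaml.safe_load(f)['clusterConfigs']['constants']

def derive_memory_settings(num_executor_cores: int, constants: dict) -> dict:
    heap_mb = num_executor_cores * constants['heapPerCoreMB']
    # Overhead = heap fraction + pinned pool + pageable pool. Note that the
    # pinned pool may end up smaller than maxPinnedMemoryMB on small hosts
    # (the 103m value in one example below suggests such a cap).
    overhead_mb = (int(heap_mb * constants['heapOverheadFraction'])
                   + constants['maxPinnedMemoryMB']
                   + constants['defaultPageablePoolMB'])
    return {'spark.executor.memory': f'{heap_mb}m',
            'spark.executor.memoryOverhead': f'{overhead_mb}m'}

# Example: 4 cores -> {'spark.executor.memory': '8192m',
#                      'spark.executor.memoryOverhead': '5939m'}
```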

@mattahrens (Collaborator) commented:

Can you paste example output of the updates with bootstrap info for some of the qual commands?

@cindyyuanjiang (Collaborator, Author) commented Jul 24, 2023

Can you paste example output of the updates with bootstrap info for some of the qual commands?

An example qualification output with bootstrap recommendations is as follows:

____________________________________________________________________________________________________
                                        QUALIFICATION Report                                        
____________________________________________________________________________________________________

Output:
--------------------
Qualification tool output: MY_OUTPUT_FOLDER/qual_20230724230150_aD24bdc7/rapids_4_spark_qualification_output
    qual_20230724230150_aD24bdc7
    ├── rapids_4_spark_qualification_output
    │   ├── rapids_4_spark_qualification_output_stages.csv
    │   ├── rapids_4_spark_qualification_output_unsupportedOperators.csv
    │   ├── rapids_4_spark_qualification_output_execs.csv
    │   ├── rapids_4_spark_qualification_output.log
    │   ├── ui
    │   │   └── html
    │   │       ├── raw.html
    │   │       ├── index.html
    │   │       ├── application.html
    │   │       └── sql-recommendation.html
    │   └── rapids_4_spark_qualification_output.csv
    └── qualification_summary.csv
    3 directories, 10 files
    - To learn more about the output details, visit https://nvidia.github.io/spark-rapids/docs/spark-qualification-tool.html#understanding-the-qualification-tool-output
    - Full savings and speedups CSV report: MY_OUTPUT_FOLDER/qual_20230724230150_aD24bdc7/qualification_summary.csv
+----+--------------------------------+-----------------------------+------------------+------------------+---------------+-----------------+-----------------+-----------------+
|    | App ID                         | App Name                    | Speedup Based    | Savings Based    |           App |   Estimated GPU |   Estimated GPU |   Estimated GPU |
|    |                                |                             | Recommendation   | Recommendation   |   Duration(s) |     Duration(s) |         Speedup |      Savings(%) |
|----+--------------------------------+-----------------------------+------------------+------------------+---------------+-----------------+-----------------+-----------------|
|  0 | application_xxxxxxxxxxxxx_xxxx | xxxxxxxxxxx                 | Recommended      | Recommended      |         58.85 |           31.43 |            1.87 |           24.71 |
|  4 | application_xxxxxxxxxxxxx_xxxx | xxxxxxxxxxxxxxxxxxxxxxxxxxx | Recommended      | Recommended      |         62.20 |           35.54 |            1.75 |           19.45 |
+----+--------------------------------+-----------------------------+------------------+------------------+---------------+-----------------+-----------------+-----------------+

Report Summary:
------------------------------  ------
Total applications                   2
RAPIDS candidates                    2
Overall estimated speedup         1.81
Overall estimated cost savings  22.01%
------------------------------  ------

Notes:
--------------------
 - Apps with the same name are grouped together and their metrics are averaged

Instance types conversions:
---------------  --  --------------------
Standard_DS3_v2  to  Standard_NC4as_T4_v3
---------------  --  --------------------
To support acceleration with T4 GPUs, switch the worker node instance types

Recommended Spark configurations:
-------------------------------------------------  -----
spark.executor.cores                               4
spark.executor.memory                              8192m
spark.executor.memoryOverhead                      5939m
spark.rapids.sql.concurrentGpuTasks                2
spark.rapids.memory.pinnedPool.size                4096m
spark.sql.files.maxPartitionBytes                  512m
spark.task.resource.gpu.amount                     0.25
spark.rapids.shuffle.multiThreaded.reader.threads  4
spark.rapids.shuffle.multiThreaded.writer.threads  4
spark.rapids.sql.multiThreadedRead.numThreads      20
-------------------------------------------------  -----

The recommended Spark configurations will also be included in the file rapids_4_spark_qualification_output.log.

@mattahrens (Collaborator) commented:

Looks great! Only minor nit to change:

Recommended Spark configurations:

to

Recommended Spark configurations for running on GPUs:

@cindyyuanjiang (Collaborator, Author) commented:

Thank you @mattahrens! The changes have been applied.

@amahussein amahussein self-requested a review July 25, 2023 15:14
@amahussein (Collaborator) left a comment:
@mattahrens We did not validate bootstrap for DB platforms.
Do you want the configurations to be part of the Qualification output even for DB-Azure/DB-AWS?

Comment on lines +353 to +365
res = {
    'spark.executor.cores': num_executor_cores,
    'spark.executor.memory': f'{executor_heap}m',
    'spark.executor.memoryOverhead': f'{executor_mem_overhead}m',
    'spark.rapids.sql.concurrentGpuTasks': gpu_concurrent_tasks,
    'spark.rapids.memory.pinnedPool.size': f'{pinned_mem}m',
    'spark.sql.files.maxPartitionBytes': f'{constants.get("maxSqlFilesPartitionsMB")}m',
    'spark.task.resource.gpu.amount': 1 / num_executor_cores,
    'spark.rapids.shuffle.multiThreaded.reader.threads': num_executor_cores,
    'spark.rapids.shuffle.multiThreaded.writer.threads': num_executor_cores,
    'spark.rapids.sql.multiThreadedRead.numThreads': max(20, num_executor_cores)
}
return res
A collaborator commented:
The recommendations generated by the bootstrap tool do not match all the platforms.
As an example, see the changes in #440, which adjusted the default configurations to fit the Databricks platform.
Given that the bootstrap tool is not available for DB-Azure/DB-AWS, this section should not be valid when running against DB.
@cindyyuanjiang I see you put example CLI commands for DB, but have you actually verified that the bootstrap configurations are not part of the output on those platforms?

A collaborator commented:
For now, we cannot include the bootstrap output when running the qual tool on DB. File an issue to track adding bootstrap config support for DB qualification.

@cindyyuanjiang (Collaborator, Author) commented:
@amahussein thank you! I will remove the bootstrap recommendations for DB platforms.

@cindyyuanjiang (Collaborator, Author) commented:
@mattahrens thank you! I filed a follow-up issue; it is tracked in this PR's description.

@amahussein (Collaborator) left a comment:
The bootstrap recommendations should be disabled in DB runs as long as we define it as one of the wrapperReporting entries.
This is how we generate the platform-specific reports as part of the tool.

I will follow up with a commit to get that fixed.

@amahussein (Collaborator) commented:

Stdout with the new section:

  • The stdout still shows how to run bootstrap in case the user wants to regenerate the recommendations on the new GPU cluster.
  • @mattahrens does this output look fine to you?

____________________________________________________________________________________________________
                                        QUALIFICATION Report                                        
____________________________________________________________________________________________________

Output:
--------------------
Qualification tool output: /output_folder/qual_20230726164047_cF92b80f/rapids_4_spark_qualification_output
    qual_20230726164047_cF92b80f
    ├── qualification_summary.csv
    └── rapids_4_spark_qualification_output
        ├── ui
        │   └── html
        │       ├── sql-recommendation.html
        │       ├── index.html
        │       ├── application.html
        │       └── raw.html
        ├── rapids_4_spark_qualification_output_stages.csv
        ├── rapids_4_spark_qualification_output_unsupportedOperators.csv
        ├── rapids_4_spark_qualification_output.csv
        ├── rapids_4_spark_qualification_output_execs.csv
        └── rapids_4_spark_qualification_output.log
    3 directories, 10 files
    - To learn more about the output details, visit https://nvidia.github.io/spark-rapids/docs/spark-qualification-tool.html#understanding-the-qualification-tool-output
    - Full savings and speedups CSV report: /output_folder/qual_20230726164047_cF92b80f/qualification_summary.csv
+----+-------------------------+---------------------+----------------------+------------------+---------------+-----------------+-----------------+-----------------+
|    | App ID                  | App Name            | Speedup Based        | Savings Based    |           App |   Estimated GPU |   Estimated GPU |   Estimated GPU |
|    |                         |                     | Recommendation       | Recommendation   |   Duration(s) |     Duration(s) |         Speedup |      Savings(%) |
|----+-------------------------+---------------------+----------------------+------------------+---------------+-----------------+-----------------+-----------------|
|  0 | app-20200423035604-0002 | spark_data_utils.py | Strongly Recommended | Recommended      |       1201.72 |          317.25 |            3.79 |           10.96 |
+----+-------------------------+---------------------+----------------------+------------------+---------------+-----------------+-----------------+-----------------+

Report Summary:
------------------------------  ------
Total applications                   2
RAPIDS candidates                    2
Overall estimated speedup         3.23
Overall estimated cost savings  -4.33%
------------------------------  ------

Notes:
--------------------
 - Apps with the same name are grouped together and their metrics are averaged


Initialization Scripts:
-----------------------
To launch a GPU-accelerated cluster with RAPIDS Accelerator for Apache Spark, add the
  following to your cluster creation script:
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/spark-rapids/spark-rapids.sh \
    --worker-accelerator type=nvidia-tesla-t4,count=1

To create a GPU cluster, run the following script:

```bash
#!/bin/bash

export CLUSTER_NAME="dataproc-cluster"

gcloud dataproc clusters create $CLUSTER_NAME \
    --image-version=2.1.5-ubuntu20 \
    --region us-central1 \
    --zone us-central1-a \
    --master-machine-type n1-standard-2 \
    --num-workers 2 \
    --worker-machine-type n1-standard-2 \
    --num-worker-local-ssds 2 \
    --enable-component-gateway \
    --subnet=default \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/spark-rapids/spark-rapids.sh \
    --worker-accelerator type=nvidia-tesla-t4,count=1 \
    --properties 'spark:spark.driver.memory=50g'

```

Recommended Spark configurations for running on GPUs:
-----------------------------------------------------

For the new GPU-accelerated cluster with RAPIDS Accelerator for Apache Spark,
  it is recommended to set the following Spark configurations:

-------------------------------------------------  -----
spark.executor.cores                               2
spark.executor.memory                              4096m
spark.executor.memoryOverhead                      1536m
spark.rapids.sql.concurrentGpuTasks                2
spark.rapids.memory.pinnedPool.size                103m
spark.sql.files.maxPartitionBytes                  512m
spark.task.resource.gpu.amount                     0.5
spark.rapids.shuffle.multiThreaded.reader.threads  2
spark.rapids.shuffle.multiThreaded.writer.threads  2
spark.rapids.sql.multiThreadedRead.numThreads      20
-------------------------------------------------  -----

Regenerating recommended configurations for existing clusters:
--------------------------------------------------------------

To generate the recommended configurations on an existing GPU Cluster,
  re-run the Bootstrap tool to provide optimized RAPIDS Accelerator
  for Apache Spark configs based on GPU cluster shape.
  Notes:
    - Overriding the Apache Spark default configurations on the cluster
      requires SSH access.
    - If SSH access is unavailable, you can still dump the recommended
      configurations by enabling the `dry_run` flag.

```bash
# To see all options, run `spark_rapids_user_tools dataproc bootstrap -- --help`

# The following cmd overrides the default Apache Spark configurations
# on the cluster (requires SSH)
spark_rapids_user_tools dataproc bootstrap \
    --cluster $CLUSTER_NAME \
    --verbose \
    --nodry_run

# The following cmd dumps the recommended configurations to the output
# without overriding the existing cluster configurations
spark_rapids_user_tools dataproc bootstrap \
    --cluster $CLUSTER_NAME \
    --verbose

```

@amahussein (Collaborator) commented:

pr-451-ammend.patch

@cindyyuanjiang I could not push to the feature branch, so I uploaded the changes as a diff patch.

@mattahrens (Collaborator) commented:

A couple of comments:

  1. I don't think we should have the bash with escape quotes in there. It should be obvious what commands to run to create the cluster.
  2. I don't think we want to give the bootstrap command since we've already provided the settings.

@amahussein (Collaborator) commented:

A couple of comments:

1. I don't think we should have the bash with escape quotes in there. It should be obvious what commands to run to create the cluster.

2. I don't think we want to give the bootstrap command since we've already provided the settings.

Thanks @mattahrens.
I disabled the bash script snippets and removed the bootstrap cmd.

The new STDOUT looks like the following:

Report Summary:
------------------------------  ------
Total applications                   2
RAPIDS candidates                    2
Overall estimated speedup         3.23
Overall estimated cost savings  -4.33%
------------------------------  ------

Notes:
--------------------
 - Apps with the same name are grouped together and their metrics are averaged


Initialization Scripts:
-----------------------
To launch a GPU-accelerated cluster with RAPIDS Accelerator for Apache Spark, add the
  following to your cluster creation script:
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/spark-rapids/spark-rapids.sh \
    --worker-accelerator type=nvidia-tesla-t4,count=1

Recommended Spark configurations for running on GPUs:
-----------------------------------------------------

For the new GPU-accelerated cluster with RAPIDS Accelerator for Apache Spark,
  it is recommended to set the following Spark configurations:

-------------------------------------------------  -----
spark.executor.cores                               2
spark.executor.memory                              4096m
spark.executor.memoryOverhead                      1536m
spark.rapids.sql.concurrentGpuTasks                2
spark.rapids.memory.pinnedPool.size                103m
spark.sql.files.maxPartitionBytes                  512m
spark.task.resource.gpu.amount                     0.5
spark.rapids.shuffle.multiThreaded.reader.threads  2
spark.rapids.shuffle.multiThreaded.writer.threads  2
spark.rapids.sql.multiThreadedRead.numThreads      20
-------------------------------------------------  -----

@cindyyuanjiang (Collaborator, Author) commented Jul 26, 2023

We added back the bash script for creating the GPU cluster. The latest STDOUT looks like the following:

Report Summary:
------------------------------  ------
Total applications                   1
RAPIDS candidates                    1
Overall estimated speedup         1.43
Overall estimated cost savings  49.10%
------------------------------  ------

Instance types conversions:
------------  --  ------------
m5zn.6xlarge  to  g4dn.4xlarge
------------  --  ------------
To support acceleration with T4 GPUs, switch the worker node instance types

Initialization Scripts:
-----------------------

To create a GPU cluster, run the following script:

```bash
#!/bin/bash

export CLUSTER_NAME="user-tools-qualification-emr-20"

aws emr create-cluster \
    --name "$CLUSTER_NAME"  \
    --release-label emr-6.9.0 \
    --log-uri s3://$LOG_BUCKET/logs \
    --applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway  \
    --bootstrap-actions '[{"Path":"s3://BUCKET_NAME/aws-emr-bootstrap.sh","Name":"My Spark Rapids Bootstrap action"}]'  \
    --ec2-attributes '{"KeyName":"MY_KEY_NAME","InstanceProfile":"EMR_EC2_DefaultRole","AvailabilityZone":"us-west-2b"}'  \
    --service-role EMR_DefaultRole \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5a.12xlarge \
                      InstanceGroupType=CORE,InstanceCount=4,InstanceType=g4dn.4xlarge \
    --configurations file://aws-emr-configuration.json \
    --ebs-root-volume-size 100

```

Recommended Spark configurations for running on GPUs:
-----------------------------------------------------

For the new GPU-accelerated cluster with RAPIDS Accelerator for Apache Spark,
  it is recommended to set the following Spark configurations:

-------------------------------------------------  ------
spark.executor.cores                               16
spark.executor.memory                              32768m
spark.executor.memoryOverhead                      8396m
spark.rapids.sql.concurrentGpuTasks                2
spark.rapids.memory.pinnedPool.size                4096m
spark.sql.files.maxPartitionBytes                  512m
spark.task.resource.gpu.amount                     0.0625
spark.rapids.shuffle.multiThreaded.reader.threads  16
spark.rapids.shuffle.multiThreaded.writer.threads  16
spark.rapids.sql.multiThreadedRead.numThreads      20
-------------------------------------------------  ------

@cindyyuanjiang cindyyuanjiang merged commit 73ad590 into NVIDIA:dev Jul 26, 2023
@cindyyuanjiang cindyyuanjiang deleted the add-bootstrap-configs branch July 26, 2023 23:25