Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve AQE support by capturing SQLPlan versions #1354

Merged
merged 6 commits into from
Sep 23, 2024

Conversation

amahussein
Copy link
Collaborator

@amahussein amahussein commented Sep 20, 2024

Contributes to #1172, Fixes #1351

This issue is to change the tools structure to support multiple version of SqlPlan.
Before that PR, if AQE is enabled, only the last plan is kept in the AppBase.SqlPlans map.

For sake of memory optimization, this PR adds the full implementation that supports capturing multiple versions but only caches the DSInformation extracted from old plans.
This allows the tools to generate the metadata of ReadV1 from original plans in case the metadata has been truncated in the finalPlan.

Change in output

  • New columns added to data_source_information.csv
    • sql_plan_version: an integer that represents the version number of the SqlPlaninfo where this row comes from.
    • from_final_plan : Boolean True/False to indicate whether this row comes from final plan or not.
  • Impact on API:
    • If users are interested in looking into all the information (ReadV1 and ReadV2) across all the versions
    • If users only want the datasource from finalPlan, then they should filter the rows using "from_final_plan=true"

Sample output file after the change:

data_source_information_after_change.csv

  • In this file the versions [1-5] are skipped because they have no V1 metadata
  • The last 4 rows represent the final plan
  • The first 4 rows show the detailed datasource from the original plan.

Headers of the new file:

appIndex,sqlID,sql_plan_version,nodeId,format,buffer_time,scan_time,data_size,decode_time,location,pushedFilters,schema,data_filters,partition_filters,from_final_plan

Code Changes:

  • Get rid of the AppBase.sqlPlans, replacing it with SqlManager class
  • Each Sql is represented by SqlPlanModel that keeps track of properties related to SQLplan and the versions.
  • The entity that contains the SparkPlanInfo is SqlPlanVersion.
  • For sake of memory optimization, a derived class SQLPlanModelWithDSCaching from SqlPlanModel will not keep track of all previous PlanInfo. Instead, it only caches the DataSourceRecord if any.

Future Work:

  • When we are confident on how to map the metadata between originalPlans and finalPlan, then we can eventually replace the truncated data with the new one.
  • Do some cleanup to get rid of all the maps inside AppBase and move the fields into the new class SQLPlanModel
  • Some cleanups related to extracting information of the readSchema and readParser.
  • File an issue on Spark open source to avoid truncating this information with AQE.

@amahussein amahussein added feature request New feature or request core_tools Scope the core module (scala) affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) labels Sep 20, 2024
@amahussein amahussein self-assigned this Sep 20, 2024
@amahussein
Copy link
Collaborator Author

@wjxiz1992 and @leewyang
Let me know if you have any feedback on the workaround to Spark truncating the schema from the AQE.

@leewyang
Copy link
Collaborator

Tested with eventlogs that previously failed to generate data_source_information.csv, and this PR seems to do better, so LGTM!

Signed-off-by: Ahmed Hussein <[email protected]>
Signed-off-by: Ahmed Hussein <[email protected]>
@amahussein
Copy link
Collaborator Author

Tested with eventlogs that previously failed to generate data_source_information.csv, and this PR seems to do better, so LGTM!

Thanks @leewyang !

Signed-off-by: Ahmed Hussein <[email protected]>
Copy link
Collaborator

@nartal1 nartal1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amahussein for adding this feature. LGTM!

@amahussein amahussein merged commit 14a4213 into NVIDIA:dev Sep 23, 2024
14 checks passed
@amahussein amahussein deleted the rapids-tools-1351 branch September 23, 2024 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) core_tools Scope the core module (scala) feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Improve AQE support by capturing SQLPlan versions
3 participants