Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix prediction CSV files for multiple qual directories #1267

Merged
merged 3 commits into from
Aug 8, 2024

Conversation

leewyang
Copy link
Collaborator

@leewyang leewyang commented Aug 7, 2024

This PR fixes an open TODO item. Currently, if the path passed to --qual_output contains more than one qual tool output directory, the code will loop over the qual tool output directories, making predictions and saving out various CSV files (e.g. per_app.csv, per_sql.csv, shap_values.csv) in the xgboost_predictions output folder. Unfortunately, these files will be overwritten each with each iteration of the loop. Note, however, that the final dataset_summaries contains the full, concatenated results of all of the iterations, so only these CSV files were impacted.

This PR combines the qual tool output directories into a single prediction "dataset", so the various debugging files now contain data for all qual tool output directories found in --qual_output. This has the side-benefit of speeding up prediction in these cases. If the user wants individual results per qual tool output directory, they can still invoke the spark_rapids prediction command for each of those directories to produce one output directory per input directory.

I have confirmed that the final prediction output matches the prior version code (aside from ordering), while the CSV files now contain the full, expected data.

Test

Following CMDs have been tested.

External Usage:

spark_rapids prediction

Internal Usage:

python qualx_main.py predict

@leewyang leewyang self-assigned this Aug 7, 2024
@leewyang leewyang requested a review from parthosa August 7, 2024 18:50
@leewyang leewyang added the user_tools Scope the wrapper module running CSP, QualX, and reports (python) label Aug 7, 2024
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @leewyang. I tested the changes using the CMD:

spark_rapids prediction --qual_output test_dir

Directory Structure:

test_dir
├── qual_20240807213120_796364d1
├── qual_20240807213202_bdee1EF3
├── qual_20240807213609_FBbF7EBC

The tool now correctly generates the values for each app whereas previously it would overwrite and write the results for only the last one.

However, the appName seems to be inconsistent.

In features.csv and per_sql.csv, appName is test_dir
In per_app.csv and prediction.csv, appName is qual_2024xxx
In the previous version, all four files had appName as qual_2024xxx

Now, in actual, appName should be (from qualification_summary.csv)

  • NDS - query72 for qual_20240807213120_796364d1
  • Databricks Shell for qual_20240807213202_bdee1EF3
  • NDS - Power Run for qual_20240807213609_FBbF7EBC

We can fix this bug in a separate PR if needed.

@parthosa parthosa added the bug Something isn't working label Aug 7, 2024
@leewyang
Copy link
Collaborator Author

leewyang commented Aug 7, 2024

@parthosa Thanks for catching that. I was actually trying to match the behavior of the current code, but it should be simpler to just keep the original appName (e.g. NDS - query72), so I'll try to make that change.

Signed-off-by: Lee Yang <[email protected]>
@leewyang leewyang requested a review from parthosa August 8, 2024 00:04
Signed-off-by: Lee Yang <[email protected]>
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @leewyang for the fix. LGTME.

@leewyang leewyang merged commit 3cf66ce into NVIDIA:dev Aug 8, 2024
14 checks passed
@leewyang leewyang deleted the qualx_predict_csv branch August 8, 2024 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working user_tools Scope the wrapper module running CSP, QualX, and reports (python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants