Fix prediction CSV files for multiple qual directories #1267
This PR fixes an open TODO item. Currently, if the path passed to `--qual_output` contains more than one qual tool output directory, the code loops over those directories, making predictions and saving various CSV files (e.g. `per_app.csv`, `per_sql.csv`, `shap_values.csv`) in the `xgboost_predictions` output folder. Unfortunately, these files are overwritten on each iteration of the loop, so they end up holding only the last directory's data. Note, however, that the final `dataset_summaries` contains the full, concatenated results of all iterations, so only these CSV files were impacted.

This PR combines the qual tool output directories into a single prediction "dataset", so the various debugging files now contain data for all qual tool output directories found in `--qual_output`. This has the side benefit of speeding up prediction in these cases. If the user wants individual results per qual tool output directory, they can still invoke the `spark_rapids prediction` command for each of those directories to produce one output directory per input directory.

I have confirmed that the final prediction output matches that of the prior version of the code (aside from ordering), while the CSV files now contain the full, expected data.
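
For reviewers who want to see the failure mode in isolation, here is a minimal sketch of the two write patterns. It is not the code in this PR: `predict`, `write_per_directory`, `write_combined`, the `QUAL_DIRS` data, and the column names are all hypothetical stand-ins, and only the `per_app.csv` file name comes from the description above.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical input: two qual tool output directories and the apps found in each.
QUAL_DIRS = {"qual_a": ["app-1", "app-2"], "qual_b": ["app-3"]}
FIELDS = ["qual_dir", "appId", "speedup"]


def predict(apps):
    """Hypothetical stand-in for the model: one output row per (qual_dir, appId)."""
    return [{"qual_dir": q, "appId": a, "speedup": "1.0"} for q, a in apps]


def write_csv(path, rows):
    """Write rows to path, replacing any existing file (mode 'w')."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read a CSV file back as a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def write_per_directory(out_dir):
    """Old pattern: predict and save inside the loop, so each pass overwrites the file."""
    out = Path(out_dir) / "per_app.csv"
    for qual_dir, app_ids in QUAL_DIRS.items():
        write_csv(out, predict([(qual_dir, a) for a in app_ids]))
    return read_csv(out)


def write_combined(out_dir):
    """New pattern: merge all directories into one dataset, predict once, save once."""
    out = Path(out_dir) / "per_app.csv"
    dataset = [(q, a) for q, app_ids in QUAL_DIRS.items() for a in app_ids]
    write_csv(out, predict(dataset))
    return read_csv(out)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(len(write_per_directory(tmp)), "row(s) survive the per-directory loop")
        print(len(write_combined(tmp)), "row(s) in the combined file")
```

In the per-directory variant only the rows from the last directory processed remain in the file, while the combined variant keeps a row for every app across both directories and calls the (stand-in) model once.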
Test
The following commands have been tested.
External Usage:
Internal Usage: