-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for AtomicCreateTableAsSelectExec #895
Conversation
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]> Fixes NVIDIA#881 - update the support level for existing Execs such as OverwriteByExpr, AppendDataExecV1, AtomicReplaceTableAsSelect - The Op is mainly support if the provider is Delta. Otherwise, the GPU should fallback to CPU. - We check if the provider is Delta by matching regex on node description. Ideally we should check the secondArgument of TableSpec but we don't have an AST for that expression. - Updates the score sheets assigning a default score as DataWritingCmd - Fixes hardcoded exec. Before that change, all writeCMDs mapped to DataWringCMDExec. This code changes that behavior when the Delta provider is satisfied. - updates the override_supported_configs.json file
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amahussein. Do we need unit tests for these as we have delta tables involved as well?
) | ||
|
||
// Checks whether a node is a write CMD Exec | ||
def isWritingCmdExec(nodeName: String): Boolean = { | ||
logicalWriteCommands.exists(nodeName.contains(_)) || DeltaLakeHelper.accepts(nodeName) | ||
} | ||
|
||
def getPhysicalExecName(opName: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this look risky? Earlier if the logical operator was not present in logicalToPhysicalCmdMap
it would throw an error. Now we would be returning defaultPhysicalCMD
value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is safer to have a default value. For example, this allows us to use the same method to get a physical representation of Delta Operators that are missing in the CSV files (i.e., MergeIntoCommandEdge
, WriteIntoDeltaCommand
, and SaveIntoDataSrcCMD
).
Otherwise, we will have fill in the content of the map with all the Operators in Hive/Delta, which can be crash-able if someone forget to update the map with the new entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. This is definitely reducing redundancy. We should be aware of this default behaviour.
We are lacking some of the tests in this area due to the complexity of env. Do you have bandwidth to take over to try this out by committing to this PR new tests? |
Sure. I was looking into delta tables few weeks ago. This would be a good starting point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amahussein. Based on offline discussion, we need to bump up delta package version to 3.x to support unit tests for spark 3.5.0. Created a separate issue #899 to add unit tests for delta tables.
Signed-off-by: Ahmed Hussein (amahussein) [email protected]
Fixes #881