Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode the file path from Iceberg when converting to a PartitionedFile [databricks] #9717

Merged
merged 2 commits into from
Nov 15, 2023

Conversation

firestarman
Copy link
Collaborator

close #9697

Iceberg always gives the raw data path but rapids multi-file readers expect the file path is encoded and url safe.
So this PR adds support to encode the file path string for Iceberg multi-files reads.

Because Iceberg always gives the raw data path but rapids multi-file readers
expect the file path is encoded and url safe.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

Arrays.stream(pFiles).forEach(pFile -> {
String partFilePathString = pFile.filePath().toString();
FileScanTask fst = files.get(partFilePathString);
FilteredParquetFileInfo filteredInfo = filterParquetBlocks(fst, partFilePathString);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Let's add some comments why we need to pass in partFilePathString other than using fst.file().path().toString() to get it.

#9697 (comment)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use toEncodedPathString(fst) instead of this partFilePathString here, but I don't want to do this encoding again since we already have the encoded path string in the PartitionedFile.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

winningsix
winningsix previously approved these changes Nov 15, 2023
Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

@jlowe jlowe merged commit 90789ba into NVIDIA:branch-23.12 Nov 15, 2023
37 checks passed
@firestarman firestarman deleted the fix-iceberg-multi-read branch November 16, 2023 00:25
@sameerz sameerz added the bug Something isn't working label Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Iceberg multiple file readers can not read files if the file paths contain encoded URL unsafe chars
4 participants