Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS S3 IO through KvikIO #16499

Open
wants to merge 14 commits into
base: branch-24.12
Choose a base branch
from
Open

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Aug 6, 2024

Implement remote IO read using KvikIO's S3 backend.

For now, this is an experimental feature for parquet read only. Enable by defining CUDF_KVIKIO_REMOTE_IO=ON.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@madsbk madsbk added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 6, 2024
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Aug 6, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 2 times, most recently from dd1b46d to 55cc220 Compare August 12, 2024 12:52
@madsbk madsbk mentioned this pull request Aug 13, 2024
3 tasks
@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Aug 22, 2024
@madsbk madsbk changed the base branch from branch-24.10 to branch-24.12 October 2, 2024 08:00
@github-actions github-actions bot added pylibcudf Issues specific to the pylibcudf package CMake CMake build issue labels Oct 2, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 2 times, most recently from ad2e46d to ea983e1 Compare October 3, 2024 11:34
@github-actions github-actions bot removed the CMake CMake build issue label Oct 3, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 2 times, most recently from 4f94346 to 22c0222 Compare October 3, 2024 17:32
@github-actions github-actions bot added the CMake CMake build issue label Oct 3, 2024
@madsbk madsbk force-pushed the kvikio-remote-io branch 3 times, most recently from d8e0c8d to a1a7ee1 Compare October 10, 2024 09:48
@madsbk madsbk force-pushed the kvikio-remote-io branch 3 times, most recently from 50f30c2 to f6c710a Compare October 22, 2024 10:29
@madsbk madsbk marked this pull request as ready for review October 24, 2024 09:26
@madsbk madsbk requested review from a team as code owners October 24, 2024 09:26
*/
static bool is_supported_remote_url(std::string const& url)
{
// Regular expression to match "<s3|http|https>://"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering about long-term: If we're given an http or https URI, how would we tell whether that's on amazon s3 vs some other cloud service?

Copy link
Contributor

@pmattione-nvidia pmattione-nvidia Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. if users rely on https automatically working for s3, then we need to change how https works in the future, could their code break.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I have limited the support to s3:// urls for now.
KvikIO also supports regular http servers but let's address that in a later PR.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff!
few small comments

cpp/src/io/utilities/datasource.cpp Outdated Show resolved Hide resolved
cpp/src/io/utilities/datasource.cpp Show resolved Hide resolved
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me!

CUDF_EXPECTS(supports_device_read(), "Device reads are not supported for this file.");

auto const read_size = std::min(size, this->size() - offset);
return _kvikio_file.pread(dst, read_size, offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: If insufficient memory is allocated in dst, does pread catch the exception thrown?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: In Progress
Status: In Progress
Status: Burndown
Development

Successfully merging this pull request may close these issues.

5 participants