Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 support #426

Closed
wants to merge 101 commits into from
Closed

S3 support #426

wants to merge 101 commits into from

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Jul 31, 2024

Implements AWS S3 read support.

Tasks

@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jul 31, 2024
@madsbk madsbk force-pushed the s3_support branch 2 times, most recently from 25d4fe9 to f76ff32 Compare August 12, 2024 08:43
cmake

RemoteHandle::read()

remove StreamFuture(StreamFuture&&)

python bindings and tests

read to device memory

dependencies: aws-sdk-cpp

parse_s3_path

cmake: adding AWSSDK to python built

RemoteFile.from_url()

benchmark

benchmark: --api numpy

ibenchmark: --api cudf

cleanup

benchmark: adding --api cudf-fsspec

read(): use PushAndPopContext

impl. pread

ensure_aws_s3_api_is_initalized

AwsS3Client

S3Context

clean up

SameThreadExecutor

use shared pointer

cmake: AWSSDK COMPONENTS s3 transfer

create_transfer_manager

clean up

BufferAsStream

test_read: larger data size

KVIKIO_NVTX_FUNC_RANGE

benchmark: use random

RemoteHandle::read_to_host(): print bandwidth

benchmark

clean up

remove the use of the transfer module

ci: some more aws-sdk-cpp

benchmark:  pytest.importorskip("moto")

remote_handle.pyx

remote_file.py

make remote IO optional

don't use typing_extensions

dependencies: boto3 and moto

more aws-sdk-cpp

trigger CI error if remote module wasn't built
@madsbk madsbk marked this pull request as ready for review August 13, 2024 11:22
@madsbk madsbk requested review from a team as code owners August 13, 2024 11:22
@madsbk madsbk requested a review from jameslamb August 13, 2024 11:22
@madsbk madsbk changed the title [WIP] S3 support S3 support Aug 13, 2024
Comment on lines 14 to 15
# TODO: remove before PR merge. Trigger CI error if the remote module wasn't built
import kvikio._lib.remote_handle # isort: skip
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[debug] remove before PR merging

KyleFromNVIDIA added a commit to KyleFromNVIDIA/ci-imgs that referenced this pull request Aug 15, 2024
@madsbk
Copy link
Member Author

madsbk commented Sep 3, 2024

@KyleFromNVIDIA, what is the status of the building work? Is it ready for review/merge?

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about the handling of the aws-sdk-cpp dependency and how it will constrain our compatibility between RAPIDS and the conda-forge ecosystem. This dependency is used by many packages and it is tightly pinned, meaning that RAPIDS will need to track and continually migrate its dependency to align exactly with the conda-forge ecosystem. See threads in this review.

clean - remove all existing build artifacts and configuration (start over)
libkvikio - build and install the libkvikio C++ code
kvikio - build and install the kvikio Python package (requires libkvikio)
--no-s3 - build with no AWS S3 support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be defined as a part of --cmake-args, with no special --no-s3 flag? It is possible for users to specify the option via --cmake-args. If the user gives a value there, we probably want to respect that with higher priority than defaulting to "ON" based on the absence of a --no-s3 flag.

@@ -64,11 +64,13 @@ requirements:
- rapids-build-backend >=0.3.0,<0.4.0.dev0
- scikit-build-core >=0.10.0
- libkvikio ={{ version }}
- aws-sdk-cpp>=1.11.267
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ We will have to be very careful with this pinning. We must match conda-forge exactly because aws-sdk-cpp requires a nearly-exact match at runtime (version x.x.x) and we need compatibility with conda-forge. Note that we will need to maintain this pinning very actively and align with conda-forge for every RAPIDS release, or we risk breaking environment compatibility with conda-forge. This is a large part of why I am hesitant on adding this dependency -- it constrains us tightly, and we do not benefit from the automated migrations that are used for conda-forge recipes to "rebuild the world" when pinnings like aws-sdk-cpp are updated.

We always need to be matching this line: https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/2e592a612d925988747cd6daa9e271328ceb3bfc/recipe/conda_build_config.yaml#L268

Suggested change
- aws-sdk-cpp>=1.11.267
- aws-sdk-cpp==1.11.379

Copy link
Member Author

@madsbk madsbk Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my ignorance, but why do we need an exact match? Naively, I would have thought that a lower-limit would be more flexible? Do all packages in conda-forge needs to maintain this pinning if they use aws-sdk-cpp?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdice, would it help if we remove the version requirement?

Copy link
Contributor

@bdice bdice Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I'll explain the migration logic. Then I'll answer the questions you posed about other options.

Migrations and why they're needed

The aws-sdk-cpp release cadence is very frequent. Some of those releases (there's a huge number of them) are picked up by conda-forge, but taking the new version requires "rebuilding the world" because aws-sdk-cpp is a core C++ dependency of many packages (see mamba repoquery whoneeds -c conda-forge aws-sdk-cpp). Updates happen via a conda-forge migration (tracked here) which rebuild everything that depends on aws-sdk-cpp.

See the number of recent AWS-related rebuilds of libarrow, for instance:
https://github.com/conda-forge/arrow-cpp-feedstock/commits/main/recipe/meta.yaml

I'll ignore the aws-crt-cpp and other rebuilds, and only focus on aws-sdk-cpp because that's the dependency we are introducing here.

In the past 12 months, there have been 6 rebuilds of libarrow for new versions of aws-sdk-cpp (1.11.379, 1.11.329, 1.11.267, 1.11.242, 1.11.210, 1.11.182).

Every time that conda-forge migrates to a new version, we must use the same version so that RAPIDS (KvikIO specifically) can be installed alongside the latest conda-forge packages.

Try this:

mamba create -n test --dry-run libarrow

Note that aws-sdk-cpp 1.11.379 is included in the environment because there has been a recent rebuild of libarrow.

Can we use a lower limit (or no version pinning)?

If we were to use a lower limit (or no version pinning), kvikio would use the latest aws-sdk-cpp when it is built. That would impose a runtime pinning on the latest aws-sdk-cpp. However, if a conda-forge migration has not happened yet (or is incomplete), those kvikio packages would be incompatible with the latest conda-forge packages of things like libarrow. We need to match the migrators so we don't get "latest aws-sdk-cpp" and instead align with "current version used by conda-forge's ecosystem-wide pinnings". Typically conda-forge maintainers will start a migration as soon as a new aws-sdk-cpp is released, but that's not guaranteed to happen immediately. Sometimes migrations are delayed or skipped, which would leave our packages unusable if we had picked up an aws-sdk-cpp version that conda-forge did not adopt with a migration.

So what do we do?

RAPIDS is not built with conda-forge infrastructure, but aims to be compatible with the conda-forge ecosystem. We do not benefit from conda-forge's automation for version migrations but we must track them in order to be compatible. (The conda-forge bots that "rebuild the world" do not rebuild RAPIDS recipes, but maybe we could create such a tool.)

To remain compatible with conda-forge, we need to match the exact pinnings for core packages like fmt, spdlog, aws-sdk-cpp, and others. We have a lot of pain from matching conda-forge migrations for fmt and spdlog (see Keith's warning and this problem in cuspatial that comes from us being behind on spdlog/fmt migrations)

Our best course of action for now is to:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened rapidsai/build-planning#100 to think more about conda-forge migration support in RAPIDS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, thanks for the explanation @bdice! I now see why you were hesitant from the get-go 😊

What if we move to use libcurl instead? They take API and ABI stability very serious: https://curl.se/libcurl/features.html#stableapi

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking of using libcurl from the start since we only need the very basic S3 operations. I don’t think it will be too hard to implement and it will make Azure Blob Storage support straightforward.

Copy link
Contributor

@bdice bdice Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be a better option than aws-sdk-cpp if we can use libcurl. It seems like libcurl is already a dependency of some packages we depend on in RAPIDS, so that gives me more confidence in it. See $ mamba repoquery whoneeds -c conda-forge libcurl 2>&1 | awk '{print $1;}' | sort | uniq.

Copy link
Member Author

@madsbk madsbk Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I will give libcurl a try

run:
- python
- numpy >=1.23,<3.0a0
- cupy >=12.0.0
- zarr
- aws-sdk-cpp>=1.11.267
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not list a run dependency, this will be handled by run-exports: https://github.com/conda-forge/aws-sdk-cpp-feedstock/blob/f82d968670bdb9939ed7604a5fb7bb4885e2e6ba/recipe/meta.yaml#L14

Suggested change
- aws-sdk-cpp>=1.11.267

@@ -52,6 +52,7 @@ requirements:
{% else %}
- libcufile-dev # [linux]
{% endif %}
- aws-sdk-cpp>=1.11.267
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match conda-forge.

Suggested change
- aws-sdk-cpp>=1.11.267
- aws-sdk-cpp==1.11.379

@@ -83,6 +84,7 @@ outputs:
{% else %}
- libcufile-dev # [linux]
{% endif %}
- aws-sdk-cpp>=1.11.267
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this and add a host dependency on aws-sdk-cpp==1.11.379 here so that we can have compatible runtime pinnings inserted automatically by the run-exports.

python/kvikio/kvikio/remote_file.py Outdated Show resolved Hide resolved
python/kvikio/tests/test_benchmarks.py Outdated Show resolved Hide resolved
python/kvikio/tests/test_benchmarks.py Outdated Show resolved Hide resolved

retcode = run_cmd(
cmd=[
sys.executable or "python",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sys.executable should always be defined? Are there cases where that isn't safe?

],
cwd=benchmarks_path,
)
assert retcode == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this assert anything about the stdout/stderr outputs?

@GregoryKimball
Copy link

@madsbk Would you please share the status of this work? Are you planning to include AWS C++ SDK changes in 24.12?

@madsbk madsbk marked this pull request as draft September 24, 2024 06:30
@madsbk madsbk mentioned this pull request Sep 24, 2024
@madsbk
Copy link
Member Author

madsbk commented Sep 24, 2024

@madsbk Would you please share the status of this work? Are you planning to include AWS C++ SDK changes in 24.12?

The plan is to use libcurl instead of aws-s3-sdk to support S3. First step implemented by #464.

rapids-bot bot pushed a commit that referenced this pull request Oct 8, 2024
Support read directly from a http server like:
```python
import kvikio
import cupy

with kvikio.RemoteFile.from_http_url("http://127.0.0.1:9000/myfile") as f:
    ary = cupy.empty(f.nbytes, dtype="uint8")
    f.read(ary)
```

This PR is the first step to support S3 using libcurl instead of [aws-s3-sdk](#426), which has some pros and cons:

* Pros 
    * The [global conda pinning issue](#426 (comment)) is less of a problem.
    * We can support other protocols such as http, ftp, and Azure’s storage, without much work. 
    * We avoid the [free-after-main issue in aws-s3-sdk](https://github.com/rapidsai/kvikio/blob/000126516db430988ab9af5ee1576ca3fe6afe27/cpp/include/kvikio/remote_handle.hpp#L87-L94). This is huge since we would otherwise have to pass around a `S3Context` in libcudf and cudf to handle shutdown correctly.  This is not a problem in libcurl, see https://curl.se/libcurl/c/libcurl.html under `Global constants`.
* Cons 
    * Hard to support the AWS configuration file. We will require the user to either specify the options programmatically or through environment variables like `AWS_ACCESS_KEY_ID ` and `AWS_SECRET_ACCESS_KEY `.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #464
@madsbk madsbk mentioned this pull request Oct 9, 2024
@madsbk madsbk closed this Oct 22, 2024
@vyasr
Copy link
Contributor

vyasr commented Oct 22, 2024

Replaced by #479.

rapids-bot bot pushed a commit that referenced this pull request Oct 22, 2024
Implements AWS S3 read support using libcurl:
```python
import kvikio
import cupy

with kvikio.RemoteFile.from_s3_url("s://my-bucket/my-file") as f:
    ary = cupy.empty(f.nbytes, dtype="uint8")
    f.read(ary)
```

Supersedes #426

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #479
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

7 participants