Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] BufferReader should zero-copy if and only if given a Buffer, not a pointer and length #37212

Closed
lidavidm opened this issue Aug 16, 2023 · 8 comments · Fixed by #37360
Closed

Comments

@lidavidm
Copy link
Member

Describe the enhancement requested

BufferReader can be constructed from a pointer and length:

BufferReader::BufferReader(std::string_view data)
: BufferReader(reinterpret_cast<const uint8_t*>(data.data()),
static_cast<int64_t>(data.size())) {}

In that case, it will happily hand you a Buffer backed by that pointer:

return std::make_shared<Buffer>(data_ + position, nbytes);

However, I don't think the APIs make it clear that this ties the lifetime of any data read from the reader to the lifetime of the input. It would be clearer if it would copy in this case, and require you to explicitly construct BufferReader from a Buffer if you want to enable zero-copy.

Component(s)

C++

@lidavidm
Copy link
Member Author

lidavidm commented Aug 17, 2023

There's a second issue here: BufferReader has no constructor for std::string, but the implicit conversion from string to string_view means you can still construct a BufferReader that is implicitly going to blow up on you. We should add an explicit constructor to avoid this.

@mapleFU
Copy link
Member

mapleFU commented Aug 21, 2023

So, supports_zero_copy should also decided by if the data is owned?

bool BufferReader::supports_zero_copy() const { return true; }

And can I try to work on this issue?

@lidavidm
Copy link
Member Author

Yeah, I think:

  1. Add an explicit constructor for std::string that wraps it in an Arrow buffer, or even prevent construction via string (since it's owned, but not an Arrow buffer)
  2. Make zero copy only happen if we are given a buffer, which is a promise about ownership. Otherwise always copy. I think you have to modify ReadAt as well.

Feel free to take this up. Thanks!

@wgtmac
Copy link
Member

wgtmac commented Aug 23, 2023

The ctor of Buffer has a comment implies that the passed memory must be kept alive through some other means.

  /// \brief Construct from buffer and size without copying memory
  ///
  /// \param[in] data a memory buffer
  /// \param[in] size buffer size
  ///
  /// \note The passed memory must be kept alive through some other means
  Buffer(const uint8_t* data, int64_t size)

Does it mean BufferReader should inherit the same implication? The buffer itself cannot tell if it is owned or not after construction. If we have that info on hand, could we handle zero-copy better?

@lidavidm
Copy link
Member Author

There's more discussion on the PR itself, FWIW

@mikelui
Copy link
Contributor

mikelui commented Aug 27, 2023

FWIW I think there'd be a lot of value in an "examples" PR that has conventions for doing common patterns (e.g. given I own this buffer/input stream/message stream, how do I pass that ownership to the Arrow RecordBatch/Table I create?)

@mapleFU
Copy link
Member

mapleFU commented Aug 27, 2023

e.g. given I own this buffer/input stream/message stream, how do I pass that ownership to the Arrow RecordBatch/Table I create?

Personally, I think BufferReader should always build from std::string or std::shared_ptr<Buffer>. Other constructor is deprecated, and when using std::shared_ptr<Buffer>, user should gurantee the lifetime of output buffer.

@lidavidm
Copy link
Member Author

@mikelui There's https://arrow.apache.org/cookbook/ / https://github.com/apache/arrow-cookbook ; more examples would always be welcome!

pitrou pushed a commit that referenced this issue Aug 30, 2023
### Rationale for this change

Previously, when input an non-owned string, `arrow::io::BufferReader` would zero-copy it. It would cause lifetime problem. This patch add `FromString` to help build from `std::string`.

### What changes are included in this PR?

* Add a ctor for FromString(std::string), and deprecate non-owning ctors

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Some APIs are being deprecated. Users can use the new interface.

* Closes: #37212

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 14.0.0 milestone Aug 30, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…apache#37360)

### Rationale for this change

Previously, when input an non-owned string, `arrow::io::BufferReader` would zero-copy it. It would cause lifetime problem. This patch add `FromString` to help build from `std::string`.

### What changes are included in this PR?

* Add a ctor for FromString(std::string), and deprecate non-owning ctors

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Some APIs are being deprecated. Users can use the new interface.

* Closes: apache#37212

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
raulcd pushed a commit that referenced this issue Nov 23, 2023
…structor (#38721)

### Rationale for this change

The PR [GH-37360](#37360) for issue [GH-37212](#37212) deprecated three BufferReader constructors and I noticed one of them was missing a `\deprecated` command.

### What changes are included in this PR?

This adds a `\deprecated` command one of the deprecated constructors with a message. The other two didn't need it because they weren't already documented. i.e., they weren't listed under https://arrow.apache.org/docs/cpp/api/io.html so documenting them at this point just to add `\deprecated` wouldn't make sense.

### Are these changes tested?

No, this is a simple docs change.

### Are there any user-facing changes?

Yes, this adds a notice to the docs for this particular method.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…apache#37360)

### Rationale for this change

Previously, when input an non-owned string, `arrow::io::BufferReader` would zero-copy it. It would cause lifetime problem. This patch add `FromString` to help build from `std::string`.

### What changes are included in this PR?

* Add a ctor for FromString(std::string), and deprecate non-owning ctors

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Some APIs are being deprecated. Users can use the new interface.

* Closes: apache#37212

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…structor (apache#38721)

### Rationale for this change

The PR [apacheGH-37360](apache#37360) for issue [apacheGH-37212](apache#37212) deprecated three BufferReader constructors and I noticed one of them was missing a `\deprecated` command.

### What changes are included in this PR?

This adds a `\deprecated` command one of the deprecated constructors with a message. The other two didn't need it because they weren't already documented. i.e., they weren't listed under https://arrow.apache.org/docs/cpp/api/io.html so documenting them at this point just to add `\deprecated` wouldn't make sense.

### Are these changes tested?

No, this is a simple docs change.

### Are there any user-facing changes?

Yes, this adds a notice to the docs for this particular method.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment