Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37212: [C++] IO: BufferReader always owned buffer #37271

Closed

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Aug 21, 2023

Rationale for this change

Previously, when input an non-owned string, arrow::io::BufferReader would zero-copy it. It would cause lifetime problem. This patch force it copy in this case.

What changes are included in this PR?

  1. Force arrow::io::BufferReader to be "owned"
  2. Add a ctor for BufferReader(std::string), and remove non-owned ctors

Are these changes tested?

Yes

Are there any user-facing changes?

User can use BufferReader safely, but might get little more memcopy

@github-actions
Copy link

⚠️ GitHub issue #37212 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Aug 21, 2023

Seems I meet a problem like nlohmann/json#586

Should I use BufferReader::From(std::string) like ::arrow::Buffer, or using the json like method to solve this?

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach will work, but I'm reluctant to introduce a code path where we allocate-and-copy on read in a class from which users could otherwise reasonably expect trivial IO costs.

As a potential alternative:

These non-owning constructors don't seem widely used; mostly tests and deserialization of FunctionOptions. Perhaps instead demoting BufferReader to only sometimes zero-copy, it'd be better to delete these constructors and require users to own the buffers to be read (probably doing a single copy all at once on construction instead of one copy per read).

@lidavidm

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 21, 2023
@lidavidm
Copy link
Member

Also reasonable to me, though you'd have to go through a deprecation cycle I believe.

@lidavidm
Copy link
Member

Or you could have those constructors do the copy for you (and if you really want to zero-copy from a temporary, you can always explicitly construct the Buffer yourself)

@mapleFU
Copy link
Member Author

mapleFU commented Aug 22, 2023

This approach will work, but I'm reluctant to introduce a code path where we allocate-and-copy on read in a class from which users could otherwise reasonably expect trivial IO costs.

I also vote for this, let me implement it

@mapleFU mapleFU changed the title GH-37212: [C++] IO: BufferReader not zero-copy if and only if given a non-owned buffer GH-37212: [C++] IO: BufferReader always owned buffer Aug 22, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 22, 2023
@@ -83,7 +83,8 @@ Result<std::unique_ptr<FunctionOptions>> GenericOptionsType::Deserialize(

Result<std::unique_ptr<FunctionOptions>> DeserializeFunctionOptions(
const Buffer& buffer) {
io::BufferReader stream(buffer);
// FIXME: Would ToString to heavy?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder would this be a bit heavy? Should I implement a DeserializeFunctionOptions(const std::shared_ptr<Buffer>& buffer)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can explicitly std::make_shared<Buffer>(data_, nbytes) to preserve the prior behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I'm not fully understand.

io::BufferReader stream(buffer.ToString());

is same as io::BufferReader. I mean should change DeserializeFunctionOptions(const Buffer& buffer) to DeserializeFunctionOptions(const std::shared_ptr<Buffer>& buffer)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that you can keep const Buffer& and just explicitly wrap it in a non-owning shared_ptr<Buffer> to pass to BufferReader, which keeps the old behavior (zero-copy without owning)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply your advice and change the implemention, thanks!

@mapleFU mapleFU requested a review from wgtmac as a code owner August 22, 2023 10:51
@mapleFU
Copy link
Member Author

mapleFU commented Aug 22, 2023

@bkietz @lidavidm I've remove the non-owned style, and find #37271 (comment) might introduce cost, would you mind take a look?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 22, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 22, 2023
@mapleFU
Copy link
Member Author

mapleFU commented Aug 22, 2023

C:\projects\arrow>mamba update -q -y -c conda-forge --all   || exit /B 
Traceback (most recent call last):
  File "C:\Miniconda38-x64\Scripts\mamba-script.py", line 6, in <module>
    from mamba.mamba import main
  File "C:\Miniconda38-x64\lib\site-packages\mamba\mamba.py", line 49, in <module>
    import libmambapy as api
  File "C:\Miniconda38-x64\lib\site-packages\libmambapy\__init__.py", line 7, in <module>
    raise e
  File "C:\Miniconda38-x64\lib\site-packages\libmambapy\__init__.py", line 4, in <module>
    from libmambapy.bindings import *  # noqa: F401,F403
ImportError: DLL load failed while importing bindings: The specified module could not be found.
C:\projects\arrow>set lastexitcode=1 
C:\projects\arrow>set  1>C:\Users\appveyor\AppData\Local\Temp\1\tmp82BE.tmp 
C:\projects\arrow>echo C:\projects\arrow  1>C:\Users\appveyor\AppData\Local\Temp\1\tmp82BF.tmp 
C:\projects\arrow>exit /b 1 

No idea why it failed...

Comment on lines -149 to -154
explicit BufferReader(const Buffer& buffer);
BufferReader(const uint8_t* data, int64_t size);

/// \brief Instantiate from std::string or std::string_view. Does not
/// own data
explicit BufferReader(std::string_view data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should deprecate instead of remove?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could even introduce static factories that have the same behavior as these constructors, but whose names explicitly indicate that they may be unsafe

Copy link
Member Author

@mapleFU mapleFU Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, seems that I should mark them deprecated, but keep them as before (including zero-copy)?

Then how do I fix #37271 (comment) ? Using a BufferReader::FromString? Or other way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's annoying.

FromString is probably best.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 22, 2023
@mapleFU mapleFU closed this Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] BufferReader should zero-copy if and only if given a Buffer, not a pointer and length
3 participants