Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use IO.copy_stream when possible #383

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

casperisfine
Copy link

@casperisfine casperisfine commented Jan 24, 2018

Fix: #66

Context

Ref: googleapis/google-cloud-ruby#1897

We noticed that Google Cloud Storage's ruby library performance on download was heavily impacted by CPU usage on the host, especially for big files. After some digging it was clear it's due to how the data has to transit through read() and write() instead of leveraging sendfile().

An experiment using a quick and dirty patch showed a reduction from 15s to 5s for a 500MB download.

The patch

To leverage sendfile() in ruby, the best and simplest API is IO.copy_stream as suggested in #66.

The problem is that copy_stream need IO or IO like objects to work with, and httpclient's API mostly deal with blocks, so I had to adapt the API somehow.

One important thing to note, is that we can only leverage sendfile if there is no modifications to apply on the request body, e.g. no chunking, no compression.

I'll add comments on specific parts of the patch in a later comments.

@@ -651,8 +651,8 @@ def redirect_uri_callback=(redirect_uri_callback)
# use get method. get returns HTTP::Message as a response and you need to
# follow HTTP redirect by yourself if you need.
def get_content(uri, *args, &block)
query, header = keyword_argument(args, :query, :header)
success_content(follow_redirect(:get, uri, query, nil, header || {}, &block))
query, header, to = keyword_argument(args, :query, :header, :to)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can do better in term of API. But it seemed logical to expose this optimized code path through get_content only.

So: client.get_content('/big-file.bin', to: '/tmp/big-file.bin')

@inflater = inflater
end

def write(chunk)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that is not very well documented, is that copy_steam do accept fake IO objects as long as they respond to #write(). However it's kind of a fallback codepath, as it won't be able to use sendfile() so there will be no speed up.

@@ -829,16 +845,6 @@ def test_get_with_block_arity_2_and_redirects
assert_nil(res.content)
end

def test_get_with_block_string_recycle
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel bad about removing that test, but unfortunately read_block_size doesn't make any sense in case sendfile() is used.

But the next test, test the same behavior with chunked response, so I think it's okish to remove it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants