
[build_manager] add support for remote zip #4263

Merged

Conversation

paulsemel (Collaborator)

This adds support for remote ZIP.

As of now, performance is quite good locally, and the read-ahead mechanism should keep it reasonable. Also, given that the ClusterFuzz bots have HDDs, the numbers might be even better there, since we only write to disk when unpacking the build.

The memory consumption of this new feature is constant: it uses at most (and most of the time) 50 MB of RAM.
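
For intuition, here is a minimal, self-contained sketch of the general technique: reading a remote ZIP through HTTP Range requests with a bounded read-ahead cache. The class and constant names are illustrative only and are not the ones used in this PR:

import io

import requests

READAHEAD_BYTES = 50 * 1024 * 1024  # illustrative ~50 MB cache bound


class HttpRangeFile(io.RawIOBase):
  """Read-only, seekable file over HTTP using Range requests.

  Small reads are served from a read-ahead cache, so the many short reads
  issued by ZIP parsers don't each go over the wire.
  """

  def __init__(self, url: str):
    self._url = url
    self._pos = 0
    self._cache = b''
    self._cache_start = 0
    # One HEAD request up front to learn the total file size.
    response = requests.head(url, allow_redirects=True)
    response.raise_for_status()
    self.file_size = int(response.headers['Content-Length'])

  def readable(self) -> bool:
    return True

  def seekable(self) -> bool:
    return True

  def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
    if whence == io.SEEK_SET:
      self._pos = offset
    elif whence == io.SEEK_CUR:
      self._pos += offset
    else:  # io.SEEK_END
      self._pos = self.file_size + offset
    return self._pos

  def _fetch(self, start: int, end: int) -> bytes:
    response = requests.get(
        self._url, headers={'Range': f'bytes={start}-{end}'})
    response.raise_for_status()
    return response.content

  def read(self, size: int = -1) -> bytes:
    if size < 0:
      size = self.file_size - self._pos
    size = min(size, self.file_size - self._pos)
    if size <= 0:
      return b''
    cache_end = self._cache_start + len(self._cache)
    if not (self._cache_start <= self._pos and
            self._pos + size <= cache_end):
      # Cache miss: fetch at least `size` bytes, reading ahead up to the
      # cache bound so subsequent small reads are served locally.
      fetch_end = min(self._pos + max(size, READAHEAD_BYTES),
                      self.file_size) - 1
      self._cache = self._fetch(self._pos, fetch_end)
      self._cache_start = self._pos
    offset = self._pos - self._cache_start
    data = self._cache[offset:offset + size]
    self._pos += len(data)
    return data

Python's zipfile accepts any seekable file-like object, so something like zipfile.ZipFile(HttpRangeFile(url)) can then list members and extract only the ones needed, without ever downloading the whole archive.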

paulsemel force-pushed the add-support-for-remote-zip branch 3 times, most recently from 09b2ab2 to bb92946, on September 23, 2024 09:12
letitz (Collaborator) left a comment:

Only had time to review part of this - did not get down to archive.py, but hopefully this helps already.


def unzip_over_http_compatible(build_url: str) -> bool:
  """Whether the build URL is compatible with unzipping over HTTP.
  As for now, we're only checking for chromium compatible URLs.
Collaborator:

Why?

paulsemel (Author):

I removed that.

Collaborator:

Comment needs an update.

src/clusterfuzz/_internal/bot/fuzzers/utils.py (outdated; resolved)
paulsemel force-pushed the add-support-for-remote-zip branch 6 times, most recently from 2bff45a to b3b9fa0, on September 26, 2024 08:38
letitz (Collaborator) left a comment:

Looking good overall! A bunch of small comments.

src/clusterfuzz/_internal/bot/fuzzers/utils.py (outdated; resolved)
- local_file_handle.seek(0)
- result = utils.search_bytes_in_file(pattern, local_file_handle)
+ file_handle.seek(0)
+ result = utils.search_bytes_in_file(pattern, file_handle)
Collaborator:

Does this mean we'll end up downloading fuzz target binaries twice, once to search for magic bytes, and once again to unzip them to disk? Maybe not, depending on how we cache downloaded bytes - I have not read that far yet.

In any case, this has me yearning for a manifest file of some kind that lists fuzz targets in the zip.

paulsemel (Author):

Yes, but:

  1. We ultimately only download a fuzz target twice, because the other ones won't be selected for unpacking, so those are only downloaded once.
  2. This is already true anyway. And yes, I agree this is bad, especially the code you're pointing to.

logs.info("Opening an archive over HTTP, skipping archive download.")
assert http_build_url
with build_archive.open_uri(http_build_url) as build:
  yield build
Collaborator:

You can reduce nesting in this function by putting this condition first and returning early:

if can_unzip_over_http:
  with build_archive.open_uri(http_build_url) as build:
    yield build
  return

# Download build archive locally.
...

In fact, if you can just return the archive instead of yielding it, as suggested above, this becomes very clean:

if can_unzip_over_http:
  return build_archive.open_uri(http_build_url)

# Download build archive locally.
...

paulsemel (Author):

OK, I scratched my head a bit to make this cleaner while still keeping the correct constraints (see the other comment). WDYT?

Comment on lines 456 to 457
with build_archive.open(build_local_archive) as build:
  yield build
Collaborator:

Can you just return the build archive? I think it will do the same thing, since ArchiveReader (of which BuildArchive is a subclass) is already a context manager. Then I think you can also annotate this function as returning BuildArchive, and get rid of the contextlib import.

paulsemel (Author):

No, I kind of need the context manager here because I want to delete the file after we've used it. I changed the code a bit, and I think it's cleaner now. WDYT?
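
For illustration, a minimal sketch of the constraint being described, assuming a hypothetical helper name and that build_archive imports from build_management (the actual code may differ):

import contextlib
import os

from clusterfuzz._internal.build_management import build_archive


@contextlib.contextmanager
def _open_local_build(build_local_archive):
  # Yielding instead of returning lets cleanup run after the caller is
  # done: the downloaded archive is deleted on exit, which a plain
  # `return build_archive.open(...)` could not arrange.
  try:
    with build_archive.open(build_local_archive) as build:
      yield build
  finally:
    os.remove(build_local_archive)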

src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
read_size = min(self.file_size - self._pos, size)
end_range = self._pos + read_size - 1
if read_size > REMOTE_HTTP_MIN_READ_SIZE:
  content = self._fetch_from_http(self._pos, end_range)
Collaborator:

This means we won't cache these bytes - is that intentional? Why not always delegate to _fetch_from_cache() and let it determine whether or not to fetch more bytes from the wire?

paulsemel (Author):

Yep. That's because these archive libraries only read what's needed. If a read is greater than 50 MB, that means we're reading a huge file, and we won't need those bytes in subsequent reads. The cache is only useful for small reads, tbh.
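
Concretely, that rationale could live right next to the branch in question; a sketch reusing the quoted snippet above, where the comment carries the "why" (names as in the diff, _fetch_from_cache as mentioned earlier in this thread):

read_size = min(self.file_size - self._pos, size)
end_range = self._pos + read_size - 1
if read_size > REMOTE_HTTP_MIN_READ_SIZE:
  # Reads this large are one-off reads of a single big archive member.
  # Those bytes won't be read again, so caching them would only evict
  # the read-ahead data that serves the many small metadata reads.
  content = self._fetch_from_http(self._pos, end_range)
else:
  content = self._fetch_from_cache(self._pos, end_range)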

Collaborator:

Ah, I see. Can you explain that in a comment for the future reader?

Optional comment / musing follows below:

It results in a behavior I find counterintuitive, though that does not matter terribly:

  • read(49M) -> fetches 99MB and caches it
  • read(50M) -> fetches 50MB, does not cache it

I think this could be addressed by doing something like:

HTTP_MAX_BYTES = 100 * 1024 * 1024

def fetch_size(size):
  """How many bytes to fetch over HTTP to serve `read(size)`."""
  cache_size = min(size + HTTP_READAHEAD_BYTES, HTTP_MAX_BYTES)
  return max(size, cache_size)

Then at least you don't get a cliff effect for network fetches. But is it useful? Probably not?

Collaborator:

Thanks for adding a comment. It only explains what is happening, but not why, which is really what's interesting here. In other words, I can see for myself from reading the code that we don't cache large requests, but it does not tell me why that is. You explained it well in your PR comment, can you do that in the code?

jonathanmetzman (Collaborator) left a comment:

I'll leave my comments from my first pass as well. I'll try to get to the rest of it today.

paulsemel force-pushed the add-support-for-remote-zip branch 2 times, most recently from 424537a to e9f8035, on September 27, 2024 14:46
letitz (Collaborator) left a comment:

Near-LGTM with a few last comments.

The only major thing remaining is _maybe_get_http_build_url(). I would love to switch that to a straightforward string manipulation instead of relying on environment variables.

try:
  with file_opener(file_path) as file_handle:
    result = False
    for pattern in FUZZ_TARGET_SEARCH_BYTES:
Collaborator:

Note for future work: if searching for these bytes takes a long time, we might want to search for them "in parallel" by pushing the pattern list into search_bytes_in_file, which can scan each chunk of the file for all patterns (avoiding the need to re-read the file X times), or even reach for something like Aho-Corasick: https://pyahocorasick.readthedocs.io/en/latest/ (see the sketch below).
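
A sketch of that single-pass variant, with a hypothetical helper name (the real search_bytes_in_file takes a single pattern):

from typing import BinaryIO, Sequence

CHUNK_SIZE = 1 << 20  # 1 MiB


def search_any_bytes_in_file(patterns: Sequence[bytes],
                             handle: BinaryIO) -> bool:
  """Returns True if any pattern occurs in the file, reading it once.

  Keeps an overlap of (max pattern length - 1) bytes between chunks so a
  match straddling a chunk boundary is not missed.
  """
  overlap = max(len(p) for p in patterns) - 1
  tail = b''
  while True:
    chunk = handle.read(CHUNK_SIZE)
    if not chunk:
      return False
    window = tail + chunk
    if any(p in window for p in patterns):
      return True
    tail = window[-overlap:] if overlap else b''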

src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
  Returns:
    the build URL.
  """
  http_build_url_pattern = environment.get_value('RELEASE_BUILD_URL_PATTERN')
Collaborator:

I think debug and symbolized builds are a full chrome thing, and aren't used for engine fuzzer builds.

@@ -1076,6 +1142,35 @@ def _get_latest_revision(bucket_paths):
  return None


def _maybe_get_http_build_url(revision) -> Optional[str]:
Collaborator:

One thing to maybe note is that by converting the build URL to HTTP, I think we lose the ability to authenticate without any extra work. This may not be a problem in current chrome builds but may be a problem in the future.

Collaborator:

E.g. we fuzzed widevine-based builds in the past.
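
For instance, if private builds came back, every range request would need credentials attached. A hedged sketch using google-auth, assuming OAuth2 bearer tokens and a read-only storage scope:

import google.auth
import google.auth.transport.requests
import requests


def fetch_range_authenticated(url: str, start: int, end: int) -> bytes:
  """Fetches a byte range with an OAuth2 bearer token attached."""
  credentials, _ = google.auth.default(
      scopes=['https://www.googleapis.com/auth/devstorage.read_only'])
  credentials.refresh(google.auth.transport.requests.Request())
  response = requests.get(
      url,
      headers={
          'Range': f'bytes={start}-{end}',
          'Authorization': f'Bearer {credentials.token}',
      })
  response.raise_for_status()
  return response.content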

paulsemel force-pushed the add-support-for-remote-zip branch 2 times, most recently from e002874 to c0e84e6, on October 1, 2024 12:05
oliverchang (Collaborator) left a comment:

Happy to get this merged if this resolves Chrome's immediate issues.

That said, would it also be worthwhile scoping out what it would take for Chrome to migrate to split builds? I.e.:

class SplitTargetBuild(RegularBuild):

There's a fair bit of complexity already with the different build types we have, and long term we should aim to simplify this.

letitz (Collaborator) left a comment:

LGTM with a couple last comments.


src/clusterfuzz/_internal/system/archive.py (outdated; resolved)

letitz (Collaborator) commented Oct 3, 2024:

@oliverchang right, I had filed crbug.com/333965940 to look into that. I think it's the right direction longer term, but it's significantly more work than this change.

paulsemel (Author):

Note: whenever this lands, don't forget that you need to add the ALLOW_UNPACK_OVER_HTTP = True environment variable on the job so that this kicks in.
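
For reference, ClusterFuzz job definitions are newline-separated KEY = VALUE environment pairs, so the opt-in would look something like this (the bucket path is purely illustrative; ALLOW_UNPACK_OVER_HTTP is the variable named above):

RELEASE_BUILD_BUCKET_PATH = gs://example-builds/linux-release/build-([0-9]+).zip
ALLOW_UNPACK_OVER_HTTP = True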

jonathanmetzman merged commit 734a0f0 into google:master on Oct 8, 2024. 7 checks passed.