[build_manager] add support for remote zip #4263
Conversation
Force-pushed from 09b2ab2 to bb92946.
Only had time to review part of this - did not get down to archive.py, but hopefully this helps already.
def unzip_over_http_compatible(build_url: str) -> bool:
  """Whether the build URL is compatible with unzipping over HTTP.
  As for now, we're only checking for chromium compatible URLs.
Why?
I removed that.
Comment needs an update.
Force-pushed from 2bff45a to b3b9fa0.
Looking good overall! A bunch of small comments.
- local_file_handle.seek(0)
- result = utils.search_bytes_in_file(pattern, local_file_handle)
+ file_handle.seek(0)
+ result = utils.search_bytes_in_file(pattern, file_handle)
Does this mean we'll end up downloading fuzz target binaries twice, once to search for magic bytes, and once again to unzip them to disk? Maybe not, depending on how we cache downloaded bytes - I have not read that far yet.
In any case, this has me yearning for a manifest file of some kind that lists fuzz targets in the zip.
Yes, but:
- We ultimately only download one fuzz target twice; the other ones won't be selected for unpacking, so they are only downloaded once.
- This is already true anyway. And yes, I agree this is bad, especially the code you're pointing to.
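(For illustration, the kind of manifest being wished for here might look like the sketch below; the manifest name, schema, and helper function are hypothetical and not part of this PR.)

import io
import json
import zipfile

MANIFEST_NAME = 'fuzz_targets.json'  # Hypothetical manifest name.

def list_fuzz_targets_from_manifest(zip_bytes: bytes) -> list:
  """Returns fuzz target paths declared in a manifest, if the zip has one."""
  with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    if MANIFEST_NAME not in archive.namelist():
      return []  # No manifest: fall back to scanning binaries for magic bytes.
    with archive.open(MANIFEST_NAME) as manifest:
      return json.load(manifest)['fuzz_targets']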
logs.info("Opening an archive over HTTP, skipping archive download.")
assert http_build_url
with build_archive.open_uri(http_build_url) as build:
  yield build
You can reduce nesting in this function by putting this condition first and returning early:

if can_unzip_over_http:
  with build_archive.open_uri(http_build_url) as build:
    yield build
  return

# Download build archive locally.
...

In fact, if you can just return the archive instead of yielding it, as suggested above, this becomes very clean:

if can_unzip_over_http:
  return build_archive.open_uri(http_build_url)

# Download build archive locally.
...
Ok, I scratched my head a bit to make this cleaner while still keeping the correct constraints (see the other comment). WDYT?
with build_archive.open(build_local_archive) as build:
  yield build
Can you just return the build archive? I think it will do the same thing, since ArchiveReader (of which BuildArchive is a subclass) is already a context manager. Then I think you can also annotate this function as returning BuildArchive, and get rid of the contextlib import.
No, I kind of need the context manager here because I want to delete the file after we've used it. I changed the code a bit, and I think that's cleaner. WDYT?
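(Illustrative sketch of that pattern - yield the archive, then remove the local file - with hypothetical names; not the PR's actual code.)

import contextlib
import os

from clusterfuzz._internal.build_management import build_archive  # Assumed import path.

@contextlib.contextmanager
def open_local_build(build_local_archive):
  """Yields the opened build archive, deleting the local file afterwards."""
  try:
    with build_archive.open(build_local_archive) as build:
      yield build
  finally:
    # The local archive is only needed while the build is open.
    os.remove(build_local_archive)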
read_size = min(self.file_size - self._pos, size)
end_range = self._pos + read_size - 1
if read_size > REMOTE_HTTP_MIN_READ_SIZE:
  content = self._fetch_from_http(self._pos, end_range)
This means we won't cache these bytes - is that intentional? Why not always delegate to _fetch_from_cache() and let it determine whether or not to fetch more bytes from the wire?
Yep, that's because of the way these archive libraries work: they only read what's needed. If the read is greater than 50 MB, that means we're reading a huge file, and we won't need those bytes in subsequent reads. The cache is only useful for small reads, tbh.
Ah, I see. Can you explain that in a comment for the future reader?
Optional comment / musing follows below:
It results in a behavior I find counterintuitive, though that does not terribly matter:
- read(49M) -> fetches 99MB and caches it
- read(50M) -> fetches 50MB, does not cache it
I think this could be addressed by doing something like:

HTTP_MAX_BYTES = 100 * 1024 * 1024

def fetch_size(size):
  """How many bytes to fetch over HTTP to serve `read(size)`."""
  cache_size = min(size + HTTP_READAHEAD_BYTES, HTTP_MAX_BYTES)
  return max(size, cache_size)

Then at least you don't get a cliff effect for network fetches. But is it useful? Probably not?
Thanks for adding a comment. It only explains what is happening, but not why, which is really what's interesting here. In other words, I can see for myself from reading the code that we don't cache large requests, but it does not tell me why that is. You explained it well in your PR comment, can you do that in the code?
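(To illustrate the point being asked for - a sketch of what such a "why" comment in the read path could look like; the class shape, names, and sizes below are assumptions, not the actual implementation.)

REMOTE_HTTP_MIN_READ_SIZE = 50 * 1024 * 1024  # Assumed 50 MB threshold.

class RemoteZipFileSketch:
  """Minimal sketch of a file-like object reading a zip over HTTP."""

  def __init__(self, file_size):
    self.file_size = file_size
    self._pos = 0

  def _fetch_from_http(self, start, end):
    raise NotImplementedError  # HTTP range request in the real code.

  def _fetch_from_cache(self, start, end):
    raise NotImplementedError  # Read-ahead cache in the real code.

  def read(self, size):
    read_size = min(self.file_size - self._pos, size)
    end_range = self._pos + read_size - 1
    if read_size > REMOTE_HTTP_MIN_READ_SIZE:
      # Archive libraries only read what they actually need, so a read this
      # large means we're extracting one big member. Those bytes won't be
      # read again, so caching them would only waste memory: fetch directly.
      content = self._fetch_from_http(self._pos, end_range)
    else:
      # Small reads (central directory, headers) are typically followed by
      # nearby reads, so serve them through the read-ahead cache.
      content = self._fetch_from_cache(self._pos, end_range)
    self._pos += read_size
    return content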
Resolved (outdated) review comments on src/clusterfuzz/_internal/tests/core/build_management/build_manager_test.py.
I'll leave my comments from my first pass as well. I'll try to get to the rest of it today.
Resolved (outdated) review comment on src/clusterfuzz/_internal/tests/core/build_management/build_manager_test.py.
Force-pushed from 424537a to e9f8035.
Near-LGTM with a few last comments.
The only major thing remaining is _maybe_get_http_build_url(). I would love to switch that to a straightforward string manipulation instead of relying on environment variables.
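(For illustration, the string manipulation being suggested could look roughly like this; the helper name is made up, and it assumes the bucket is publicly readable via storage.googleapis.com.)

def gcs_to_http_url(build_url: str) -> str:
  """Converts a gs://bucket/path/build.zip URL to an HTTPS URL."""
  prefix = 'gs://'
  if not build_url.startswith(prefix):
    raise ValueError(f'Not a GCS URL: {build_url}')
  return 'https://storage.googleapis.com/' + build_url[len(prefix):]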
try:
  with file_opener(file_path) as file_handle:
    result = False
    for pattern in FUZZ_TARGET_SEARCH_BYTES:
Note for future work: if searching these bytes takes a long time, we might want to search for them "in parallel" by pushing the pattern list into search_bytes_in_file, which can scan each chunk of the file for all patterns (avoiding the need to re-read the file X times), or even reach for something like Aho-Corasick: https://pyahocorasick.readthedocs.io/en/latest/
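(A rough sketch of that chunked multi-pattern scan; the function name and chunk size are assumptions, and the overlap keeps matches that straddle chunk boundaries from being missed.)

def search_any_bytes_in_file(patterns, file_handle, chunk_size=1024 * 1024):
  """Returns True if any of the byte patterns occurs in the file."""
  overlap = max(len(p) for p in patterns) - 1
  previous_tail = b''
  while True:
    chunk = file_handle.read(chunk_size)
    if not chunk:
      return False
    # Check the current chunk together with the tail of the previous one.
    window = previous_tail + chunk
    if any(pattern in window for pattern in patterns):
      return True
    previous_tail = window[-overlap:] if overlap else b''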
Resolved (outdated) review comment on src/clusterfuzz/_internal/tests/core/build_management/build_manager_test.py.
  Returns:
    the build URL.
  """
  http_build_url_pattern = environment.get_value('RELEASE_BUILD_URL_PATTERN')
I think debug and symbolized builds are a full Chrome thing, and aren't used for engine fuzzer builds.
@@ -1076,6 +1142,35 @@ def _get_latest_revision(bucket_paths):
  return None


def _maybe_get_http_build_url(revision) -> Optional[str]:
One thing to note is that by converting the build URL to HTTP, I think we lose the ability to authenticate without extra work. This may not be a problem for current Chrome builds, but it may become one in the future.
E.g., we fuzzed Widevine-based builds in the past.
Force-pushed from e002874 to c0e84e6.
Happy to get this merged if this resolves Chrome's immediate issues.
That said, would it also be worthwhile scoping out what it would take for Chrome to migrate to split builds? i.e.
class SplitTargetBuild(RegularBuild):
There's a fair bit of complexity already with the different build types we have, and long term we should aim to simplify this.
LGTM with a couple last comments.
@oliverchang right, I had filed crbug.com/333965940 to look into that. I think it's the right direction longer term, but it's significantly more work than this change.
This adds support for remote ZIP.
As of now, performance is quite good locally, and the read-ahead mechanism should keep it reasonable. Also, given that the ClusterFuzz bots have HDDs, numbers might be even better there, as we only write to disk when unpacking the build.
The memory consumption of this new feature is constant: it uses at most (and most of the time) 50 MB of RAM.
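(As a usage illustration only - the URL below is hypothetical, and build_archive.open_uri is the entry point shown in the diff quotes above.)

from clusterfuzz._internal.build_management import build_archive  # Assumed import path.

# Hypothetical build URL for illustration.
http_build_url = 'https://storage.googleapis.com/example-bucket/chrome-build.zip'

with build_archive.open_uri(http_build_url) as build:
  # Reads are served via HTTP range requests with a bounded read-ahead cache;
  # only the files that actually get unpacked are ever written to disk.
  pass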
Force-pushed from c0e84e6 to 532a130.
Note: whenever this lands, don't forget that you need to add the |