
[build_manager] add support for remote zip #4263

Merged

Conversation

paulsemel (Collaborator)

This adds support for remote ZIP.

As of now, performance is quite good locally, and the read-ahead mechanism should keep it reasonable. Also, given that the ClusterFuzz bots have HDDs, the numbers might be even better there, since we only write to disk when unpacking the build.

The memory consumption of this new feature is constant: it uses at most (and most of the time) 50 MB of RAM.
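
For intuition, here is a minimal, self-contained sketch of the general technique: reading a remote ZIP through HTTP Range requests with a bounded read-ahead cache. The class and constant names are illustrative only and are not the ones used in this PR:

import io

import requests

READAHEAD_BYTES = 50 * 1024 * 1024  # illustrative ~50 MB cache bound


class HttpRangeFile(io.RawIOBase):
  """Read-only, seekable file over HTTP using Range requests.

  Small reads are served from a read-ahead cache, so the many short reads
  issued by ZIP parsers don't each go over the wire.
  """

  def __init__(self, url: str):
    self._url = url
    self._pos = 0
    self._cache = b''
    self._cache_start = 0
    # One HEAD request up front to learn the total file size.
    response = requests.head(url, allow_redirects=True)
    response.raise_for_status()
    self.file_size = int(response.headers['Content-Length'])

  def readable(self) -> bool:
    return True

  def seekable(self) -> bool:
    return True

  def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
    if whence == io.SEEK_SET:
      self._pos = offset
    elif whence == io.SEEK_CUR:
      self._pos += offset
    else:  # io.SEEK_END
      self._pos = self.file_size + offset
    return self._pos

  def _fetch(self, start: int, end: int) -> bytes:
    response = requests.get(
        self._url, headers={'Range': f'bytes={start}-{end}'})
    response.raise_for_status()
    return response.content

  def read(self, size: int = -1) -> bytes:
    if size < 0:
      size = self.file_size - self._pos
    size = min(size, self.file_size - self._pos)
    if size <= 0:
      return b''
    cache_end = self._cache_start + len(self._cache)
    if not (self._cache_start <= self._pos and
            self._pos + size <= cache_end):
      # Cache miss: fetch at least `size` bytes, reading ahead up to the
      # cache bound so subsequent small reads are served locally.
      fetch_end = min(self._pos + max(size, READAHEAD_BYTES),
                      self.file_size) - 1
      self._cache = self._fetch(self._pos, fetch_end)
      self._cache_start = self._pos
    offset = self._pos - self._cache_start
    data = self._cache[offset:offset + size]
    self._pos += len(data)
    return data

Python's zipfile accepts any seekable file-like object, so something like zipfile.ZipFile(HttpRangeFile(url)) can then list members and extract only the ones needed, without ever downloading the whole archive.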

paulsemel force-pushed the add-support-for-remote-zip branch 3 times, most recently from 09b2ab2 to bb92946, on September 23, 2024 09:12
letitz (Collaborator) left a comment:

Only had time to review part of this - did not get down to archive.py, but hopefully this helps already.


def unzip_over_http_compatible(build_url: str) -> bool:
  """Whether the build URL is compatible with unzipping over HTTP.
  As for now, we're only checking for chromium compatible URLs.
Collaborator:

Why?

paulsemel (Author):

I removed that.

Collaborator:

Comment needs an update.

src/clusterfuzz/_internal/bot/fuzzers/utils.py (outdated; resolved)
paulsemel force-pushed the add-support-for-remote-zip branch 6 times, most recently from 2bff45a to b3b9fa0, on September 26, 2024 08:38
letitz (Collaborator) left a comment:

Looking good overall! A bunch of small comments.

src/clusterfuzz/_internal/bot/fuzzers/utils.py (outdated; resolved)
- local_file_handle.seek(0)
- result = utils.search_bytes_in_file(pattern, local_file_handle)
+ file_handle.seek(0)
+ result = utils.search_bytes_in_file(pattern, file_handle)
Collaborator:

Does this mean we'll end up downloading fuzz target binaries twice, once to search for magic bytes, and once again to unzip them to disk? Maybe not, depending on how we cache downloaded bytes - I have not read that far yet.

In any case, this has me yearning for a manifest file of some kind that lists fuzz targets in the zip.

paulsemel (Author):

Yes, but:

  1. We ultimately only download a fuzz target twice, because the other ones won't be selected for unpacking, so those are only downloaded once.
  2. This is already true anyway. And yes, I agree this is bad, especially the code you're pointing to.

logs.info("Opening an archive over HTTP, skipping archive download.")
assert http_build_url
with build_archive.open_uri(http_build_url) as build:
  yield build
Collaborator:

You can reduce nesting in this function by putting this condition first and returning early:

if can_unzip_over_http:
  with build_archive.open_uri(http_build_url) as build:
    yield build
  return

# Download build archive locally.
...

In fact, if you can just return the archive instead of yielding it, as suggested above, this becomes very clean:

if can_unzip_over_http:
  return build_archive.open_uri(http_build_url)

# Download build archive locally.
...

paulsemel (Author):

OK, I scratched my head a bit to make this cleaner while still keeping the correct constraints (see the other comment). WDYT?

Comment on lines 456 to 457
with build_archive.open(build_local_archive) as build:
  yield build
Collaborator:

Can you just return the build archive? I think it will do the same thing, since ArchiveReader (of which BuildArchive is a subclass) is already a context manager. Then I think you can also annotate this function as returning BuildArchive, and get rid of the contextlib import.

paulsemel (Author):

No, I kind of need the context manager here because I want to delete the file after we've used it. I changed the code a bit, and I think it's cleaner now. WDYT?
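
For illustration, a minimal sketch of the constraint being described, assuming a hypothetical helper name and that build_archive imports from build_management (the actual code may differ):

import contextlib
import os

from clusterfuzz._internal.build_management import build_archive


@contextlib.contextmanager
def _open_local_build(build_local_archive):
  # Yielding instead of returning lets cleanup run after the caller is
  # done: the downloaded archive is deleted on exit, which a plain
  # `return build_archive.open(...)` could not arrange.
  try:
    with build_archive.open(build_local_archive) as build:
      yield build
  finally:
    os.remove(build_local_archive)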

src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
read_size = min(self.file_size - self._pos, size)
end_range = self._pos + read_size - 1
if read_size > REMOTE_HTTP_MIN_READ_SIZE:
  content = self._fetch_from_http(self._pos, end_range)
Collaborator:

This means we won't cache these bytes - is that intentional? Why not always delegate to _fetch_from_cache() and let it determine whether or not to fetch more bytes from the wire?

paulsemel (Author):

Yep. That's because these archive libraries only read what's needed. If a read is greater than 50 MB, that means we're reading a huge file, and we won't need those bytes in subsequent reads. The cache is only useful for small reads, tbh.
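
Concretely, that rationale could live right next to the branch in question; a sketch reusing the quoted snippet above, where the comment carries the "why" (names as in the diff, _fetch_from_cache as mentioned earlier in this thread):

read_size = min(self.file_size - self._pos, size)
end_range = self._pos + read_size - 1
if read_size > REMOTE_HTTP_MIN_READ_SIZE:
  # Reads this large are one-off reads of a single big archive member.
  # Those bytes won't be read again, so caching them would only evict
  # the read-ahead data that serves the many small metadata reads.
  content = self._fetch_from_http(self._pos, end_range)
else:
  content = self._fetch_from_cache(self._pos, end_range)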

Collaborator:

Ah, I see. Can you explain that in a comment for the future reader?

Optional comment / musing follows below:

It results in a behavior I find counterintuitive, though that does not matter terribly:

  • read(49M) -> fetches 99MB and caches it
  • read(50M) -> fetches 50MB, does not cache it

I think this could be addressed by doing something like:

HTTP_MAX_BYTES = 100 * 1024 * 1024

def fetch_size(size):
  """How many bytes to fetch over HTTP to serve `read(size)`."""
  cache_size = min(size + HTTP_READAHEAD_BYTES, HTTP_MAX_BYTES)
  return max(size, cache_size)

Then at least you don't get a cliff effect for network fetches. But is it useful? Probably not?

Collaborator:

Thanks for adding a comment. It only explains what is happening, but not why, which is really what's interesting here. In other words, I can see for myself from reading the code that we don't cache large requests, but it does not tell me why that is. You explained it well in your PR comment, can you do that in the code?

jonathanmetzman (Collaborator) left a comment:

I'll leave my comments from my first pass as well. I'll try to get to the rest of it today.

paulsemel force-pushed the add-support-for-remote-zip branch 2 times, most recently from 424537a to e9f8035, on September 27, 2024 14:46
letitz (Collaborator) left a comment:

Near-LGTM with a few last comments.

The only major thing remaining is _maybe_get_http_build_url(). I would love to switch that to a straightforward string manipulation instead of relying on environment variables.

try:
  with file_opener(file_path) as file_handle:
    result = False
    for pattern in FUZZ_TARGET_SEARCH_BYTES:
Collaborator:

Note for future work: if searching for these bytes takes a long time, we might want to search for them "in parallel" by pushing the pattern list into search_bytes_in_file, which can scan each chunk of the file for all patterns (avoiding the need to re-read the file X times), or even reach for something like Aho-Corasick: https://pyahocorasick.readthedocs.io/en/latest/ (see the sketch below).
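
A sketch of that single-pass variant, with a hypothetical helper name (the real search_bytes_in_file takes a single pattern):

from typing import BinaryIO, Sequence

CHUNK_SIZE = 1 << 20  # 1 MiB


def search_any_bytes_in_file(patterns: Sequence[bytes],
                             handle: BinaryIO) -> bool:
  """Returns True if any pattern occurs in the file, reading it once.

  Keeps an overlap of (max pattern length - 1) bytes between chunks so a
  match straddling a chunk boundary is not missed.
  """
  overlap = max(len(p) for p in patterns) - 1
  tail = b''
  while True:
    chunk = handle.read(CHUNK_SIZE)
    if not chunk:
      return False
    window = tail + chunk
    if any(p in window for p in patterns):
      return True
    tail = window[-overlap:] if overlap else b''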

src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
src/clusterfuzz/_internal/system/archive.py (outdated; resolved)
  Returns:
    the build URL.
  """
  http_build_url_pattern = environment.get_value('RELEASE_BUILD_URL_PATTERN')
Collaborator:

I think debug and symbolized builds are a full chrome thing, and aren't used for engine fuzzer builds.

@@ -1076,6 +1142,35 @@ def _get_latest_revision(bucket_paths):
  return None


def _maybe_get_http_build_url(revision) -> Optional[str]:
Collaborator:

One thing to maybe note is that by converting the build URL to HTTP, I think we lose the ability to authenticate without any extra work. This may not be a problem in current chrome builds but may be a problem in the future.

Collaborator:

E.g. we fuzzed widevine-based builds in the past.
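
For instance, if private builds came back, every range request would need credentials attached. A hedged sketch using google-auth, assuming OAuth2 bearer tokens and a read-only storage scope:

import google.auth
import google.auth.transport.requests
import requests


def fetch_range_authenticated(url: str, start: int, end: int) -> bytes:
  """Fetches a byte range with an OAuth2 bearer token attached."""
  credentials, _ = google.auth.default(
      scopes=['https://www.googleapis.com/auth/devstorage.read_only'])
  credentials.refresh(google.auth.transport.requests.Request())
  response = requests.get(
      url,
      headers={
          'Range': f'bytes={start}-{end}',
          'Authorization': f'Bearer {credentials.token}',
      })
  response.raise_for_status()
  return response.content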

paulsemel force-pushed the add-support-for-remote-zip branch 2 times, most recently from e002874 to c0e84e6, on October 1, 2024 12:05
oliverchang (Collaborator) left a comment:

Happy to get this merged if this resolves Chrome's immediate issues.

That said, would it also be worthwhile scoping out what it would take for Chrome to migrate to split builds? I.e.:

class SplitTargetBuild(RegularBuild):

There's a fair bit of complexity already with the different build types we have, and long term we should aim to simplify this.

letitz (Collaborator) left a comment:

LGTM with a couple last comments.


src/clusterfuzz/_internal/system/archive.py (outdated; resolved)

letitz (Collaborator) commented Oct 3, 2024:

@oliverchang right, I had filed crbug.com/333965940 to look into that. I think it's the right direction longer term, but it's significantly more work than this change.

paulsemel (Author):

Note: whenever this lands, don't forget that you need to add the ALLOW_UNPACK_OVER_HTTP = True environment variable on the job so that this kicks in.
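
For reference, ClusterFuzz job definitions are newline-separated KEY = VALUE environment pairs, so the opt-in would look something like this (the bucket path is purely illustrative; ALLOW_UNPACK_OVER_HTTP is the variable named above):

RELEASE_BUILD_BUCKET_PATH = gs://example-builds/linux-release/build-([0-9]+).zip
ALLOW_UNPACK_OVER_HTTP = True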

jonathanmetzman merged commit 734a0f0 into google:master on Oct 8, 2024. 7 checks passed.