Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code fastpath scanning for valid jump destinations #348

Merged
merged 24 commits into from
Aug 22, 2024

Conversation

karmacoma-eth
Copy link
Collaborator

jumpi_id() keeps showing up when I profile tests, because it causes us to eagerly compute valid_jump_destinations()

Fundamentally, valid_jump_destinations() does a linear scan of the whole code for a contract. Even though ByteVec avoids z3 operations for this, it's still not ideal to do 10k+ individual byte lookups in a bytevec for almost contiguous data.

The insight here is that it is quite common to process contracts that have a large concrete prefix, sometimes followed by some symbolic data. In this case, we extract the concrete prefix in _fastcode and scan that first for valid jump destinations.

With this change, the regression tests run about 15% faster for me (44s to 37s)

@karmacoma-eth karmacoma-eth changed the title Code fastpath rebased Code fastpath scanning for valid jump destinations Aug 16, 2024
@karmacoma-eth karmacoma-eth marked this pull request as draft August 16, 2024 00:19
@karmacoma-eth karmacoma-eth marked this pull request as ready for review August 19, 2024 22:22
@karmacoma-eth
Copy link
Collaborator Author

karmacoma-eth commented Aug 19, 2024

main baseline for my maze benchmark: 62.89s
new time: 52.56s (-16.4%) 50.10s (-21%)

shaves another second from the maze bench (52.56s -> 51.67s)
profiling showed that the last remaining major cost in jumpi_id was actually the cost of unbox_int

This improves the maze bench by another 1.5s (51.67s -> 50.10s), about 3% faster
valid_jumpdests are enumated when symbolic jumps are enabled, so let's separate it into an int set and a string set
@@ -297,9 +297,6 @@ def unbox_int(x: Any) -> Any:
if is_bv_value(x):
return x.as_long()

if is_bv(x):
x = simplify(x)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine if the result of unbox_int is immediately pushed to the stack. but, there are cases where a simplified result is needed, e.g.,:

  • and unbox_int(ex.context.output.data) == ASSERT_FAIL
    the structural equality test may fail against unsimplified results, while it would pass with simplified ones.
  • return unbox_int(data)
    the result of get_word() is better to be simplified, as it is used in many cheatcodes, and passed through further logic.

simplifying symbolic terms into their normalized form is leveraged by several other routines, so passing around unsimplified terms may cause unexpected issues, which may be difficult to debug later. this also increases code maintenance cost, while we still need further large iterations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's weird for unbox_int to try to convert to an int, but also simplify if it's not.

I think simplify should happen at well defined boundaries, like:

  • in and out of the stack
  • in and out of memory
  • not in an intermediate place like unbox_int

unbox_int(ex.context.output.data) == ASSERT_FAIL

☝️ this is not a great use of unbox_int, if the result is a bv we produce a new cond rather than evaluating a bool. In this case we probably want to call simplify first, then try to unbox, and skip if the unboxed output is still a bv

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the result of get_word() is better to be simplified, as it is used in many cheatcodes, and passed through further logic

I agree, that's why the output of unwrap is simplified

Copy link
Collaborator

@daejunpark daejunpark Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we probably want to call simplify first, then try to unbox, and skip if the unboxed output is still a bv

i agree. could you please fix that as you suggested? btw, for the long term, we'd need to generalize this condition to cover arbitrary Panic(k) where k is given by users.

I agree, that's why the output of unwrap is simplified

thanks for clarification, now i see that it's already simplified before being passed into unbox_int

--

i agree that it would make more sense to not simplify inside unbox_int. then, let me double-check other places to make sure:

  • halmos/src/halmos/sevm.py

    Lines 727 to 728 in c3f45dd

    def current_opcode(self) -> UnionType[int, BitVecRef]:
    return unbox_int(self.pgm[self.pc])
    this is fine because self.pgm[self.pc] is already simplified, right?

  • halmos/src/halmos/utils.py

    Lines 265 to 269 in c3f45dd

    def extract_funsig(calldata: Bytes) -> Any:
    """Extracts the function signature (first 4 bytes) from calldata"""
    if hasattr(calldata, "__getitem__"):
    return unbox_int(calldata[:4])
    return extract_bytes(calldata, 0, 4)
    if i understand correctly, calldata[:4] may not be simplified, right? for now, it's fine because the extract_funsig() result is immediately passed to int_of, which would revert anyway if it's a bv. (there is another use of extract_funsig(), but it will be removed anyway in feat: branching over symbolic call addresses #349, so it doesn't matter.) however, for the longer term, should we simplify calldata[:4] here? wdty?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please fix that as you suggested?

yea 👍

self.pgm[self.pc] is already simplified, right?

it defers to ByteVec.get_byte(), which calls chunk.get_byte()

  • for a concrete chunk, we get an int
  • for a symbolic chunk, it relies on extract_bytes, which does return simplified bitvecs

so rn yes, but it's not strongly part of the contract (it just so happens that extract_bytes does simplify)

should we simplify calldata[:4] here?

yes, instead of "trying" to unbox it, returning calldata[:4].unwrap() would do it (return either a concrete value or a simplified bv)

Copy link
Collaborator Author

@karmacoma-eth karmacoma-eth Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please fix that as you suggested?

more robust version in af3df48 and test in afd5c11. I confirmed that unbox_int(...) == ASSERT_FAIL does produce a BoolRef of the form Concat(1313373041, p_x_uint256) == 152078208365357342262005707660225848957176981554335715805457651098985835139029979365377, which for some reason evaluates to False in the condition (instead of generating an exception as I was expecting).

Copy link
Collaborator

@daejunpark daejunpark Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we simplify calldata[:4] here?

yes, instead of "trying" to unbox it, returning calldata[:4].unwrap() would do it (return either a concrete value or a simplified bv)

great, let's do it! actually, i realized that unbox_int() calls unwrap() inside, so it's already simplified there. also, if i understand correctly, unwrap() may return a bv_value, so calldata[:4].unwrap() may not be fully equivalent to unbox_int(calldata[:4]). while it might be fine for extract_funsig() to return a bv_value, i'd like to just leave it as is.

src/halmos/sevm.py Show resolved Hide resolved

# bytevec equality check, will take care of length check, bv vs symbolic, etc.
return returndata == expected

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! but why do we need to compare only the prefix of data, not the full data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equality check fails if length is different, do you think that's closer to what we want?

@daejunpark daejunpark merged commit 536299a into main Aug 22, 2024
57 checks passed
@daejunpark daejunpark deleted the code-fastpath-rebased branch August 22, 2024 04:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants