Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISLE: lexer simplifications #9108

Merged
merged 9 commits into from
Aug 14, 2024
Merged

Conversation

Kmeakin
Copy link
Contributor

@Kmeakin Kmeakin commented Aug 11, 2024

  • Instead of pushing every byte in a lexed integer into a Vec<u8>, use a slice of the original buffer, and only reallocate if there are underscores that need to be removed
  • Don't track line/column in Pos, so don't need to update line/column in advance_pos()
  • Store text being lexed as &str rather than &[u8] to avoid checking substrings for UTF-8 validity
  • Don't try to parse integers twice

@Kmeakin Kmeakin requested a review from a team as a code owner August 11, 2024 20:37
@Kmeakin Kmeakin requested review from fitzgen and removed request for a team August 11, 2024 20:37
Instead of creating a temporary `Vec<u8>`, use a slice of the original
underlying `buf`, and only allocate a temporary `String` if it contains
an `_`.

Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
`Vec` can be compared against arrays, since both deref to slices.

Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
Centralize all file related arenas in `Files` struct.

Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
They are already tracked in `Files`, so no need to track them in `Pos`
as well. This lets us simply the implementation of `Lexer::advance_pos`
a bit.

Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
`Files` was being threaded through a lot of passes where it wasn't
needed. It is only needed for reporting errors in `compile.rs` and for
reporting line numbers when printing in `codegen.rs`.

Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
@Kmeakin Kmeakin changed the title ISLE: reduce allocations when lexing integers ISLE: lexer simplifications Aug 11, 2024
Store the text being lexed as `&str`, rather than `&[u8]`, so that
substrings don't need to be rechecked for UTF-8 validity when lexing
identifiers or integers.

Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language labels Aug 12, 2024
Copy link

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@pchickey
Copy link
Contributor

Hi, thanks for these contributions! If I can nerd-snipe you into some unrelated lexer work, we have a logos lexer over in the wasm-wave crate that we need to be rewritten entirely in safe rust. If you are interested in learning more, ping me over on #8872

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic, thanks!

Apologies for the slow review turn around, I've been having ISP issues.

One comment below that I'd like to get your take on, but shouldn't be a blocker for landing this. Can address it in a follow up, if necessary.

Comment on lines +248 to +250
fn peek_byte(&self) -> Option<u8> {
self.src.as_bytes().get(self.pos.offset).copied()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not fn peek_char(&self) -> Option<char>? I guess a byte is mildly more performant, in theory if LLVM doesn't clean things up, but that doesn't seem super compelling compared to the simplification of "we are dealing with strings, so we should also deal with chars".

@fitzgen fitzgen added this pull request to the merge queue Aug 14, 2024
Instead of trying to parse an integer as an `i128`, and then as an
`u128` if that fails, parse it only as a `u128` and then check for
`i128::MIN`.

Copyright (c) 2024, Arm Limited.

Signed-off-by: Karl Meakin <[email protected]>
Merged via the queue into bytecodealliance:main with commit e0a907a Aug 14, 2024
35 checks passed
@Kmeakin Kmeakin deleted the km/isle-lexer branch October 29, 2024 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants