Shrink the size of wast text tokens (#1103)
This commit is another improvement towards addressing #1095; the goal here is to shrink the size of `Token` and reduce the allocated memory that it retains. Currently the entire input string is tokenized and stored as a list of tokens for `Parser` to process, which means that the size of a token has a large effect on the size of this vector for large inputs. Even before this commit tokens had been slightly optimized for size, with some variants heap-allocated behind a `Box`. In profiling with DHAT, however, it appears that a large portion of peak memory was these boxes, namely for integer/float tokens, which appear quite a lot in many inputs. The changes in this commit are:

* Shrink the size of `Token` to two words. This is done by removing all pointers from `Token` and instead storing only a `TokenKind`, which is packed into 32 bits or less. Span information is still stored in a `Token`, however.

* With no more payloads, tokens which previously carried one, such as integers, strings, and floats, are now re-parsed. They're sort of parsed once while lexing, then again when the token is interpreted later on. Some of this is fundamental, since parsing currently happens in a type-specific context that isn't known during lexing (e.g. if something is parsed as a `u8` then it shouldn't accept `256` as input).

The hypothesis behind this is that tokens are far more often keywords, whitespace, and comments than integers, strings, and floats. If those rarer tokens require some extra work, that should hopefully "come out in the wash", and this representation otherwise allows for other speedups.

Locally the example in #1095 has its peak memory usage reduced from 5G to 4G by this commit, and additionally the parsing time drops from 8.9s to 7.6s.