This directory contains a Rust runner program for benchmarking ICU Regular Expressions. It uses ICU4C's regex API.
This program makes the following choices:
- It will error if given a haystack that contains invalid UTF-8. Namely, as far as I can tell, there is no way to use ICU's regex engine on arbitrary bytes. Its API seems to suggest that it is only possible to run it on sequences of UTF-16 code units.
For obvious reasons, ICU has excellent Unicode support. It is also impossible to disable Unicode mode, which makes sense, because its Unicode features are probably the reason why you would use ICU's regex engine in the first place.
Note that we don't currently enable its UREGEX_UWORD
option under any
circumstances, and instead let \b
just be Unicode-aware in the same way that
most other regex engines are. This is because ICU is probably unparalleled
here, although it might be nice to add a benchmark for it.