Inspired by this library (thanks for the great work @NMZivkovic!), I built FastBertTokenizer (nuget). It shouldn't suffer from this issue. If you give it a try and stumble upon an issue, please let me know.
The following code:
never returns if `™` cannot be matched in the vocabulary.
The issue was introduced in this commit:
0f29cef#diff-82215a359c504385d48356d59d6635f3b968278cca935c73977e16cea13f4174
Specifically this line:
BertTokenizers/src/Base/TokenizerBase.cs
Line 122 in 150e40a
Changing it back to `2` fixes it. Otherwise the code keeps looping, adding and removing `#` symbols until stopped.
Plus, I think the following two lines might cause additional issues:
BertTokenizers/src/Base/TokenizerBase.cs
Lines 142 to 143 in 150e40a
The matched token shouldn't be treated as a regular expression. Those two lines can be replaced by:
This replacement is also most likely much more efficient, and it more closely resembles Google's original tokenizer implementation.
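To illustrate why a matched token shouldn't be used as a pattern, here is a minimal, self-contained sketch. It is not the library's code and not necessarily the exact replacement proposed above. The names `StripWithRegex` and `StripLiteral` are hypothetical, and the sketch assumes the token was already found as a literal prefix of the remaining text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixStripDemo
{
    // Pattern-based variant (hypothetical): the token is interpreted as a regular expression.
    public static string StripWithRegex(string remaining, string prefix)
    {
        var regex = new Regex(prefix);
        return regex.Replace(remaining, "##", 1);
    }

    // Literal variant (hypothetical): drop the prefix by length and mark the rest as a continuation.
    public static string StripLiteral(string remaining, string prefix)
    {
        if (!remaining.StartsWith(prefix, StringComparison.Ordinal))
        {
            throw new ArgumentException("prefix must be a literal prefix of remaining");
        }
        return "##" + remaining.Substring(prefix.Length);
    }

    public static void Main()
    {
        // As a pattern, "$" matches the empty position at the end of the input,
        // so the prefix is not removed and the string grows instead of shrinking.
        Console.WriteLine(StripWithRegex("$5", "$")); // $5##
        Console.WriteLine(StripLiteral("$5", "$"));   // ##5

        // As a pattern, "(" is not even valid, so constructing the Regex throws.
        try
        {
            StripWithRegex("(x", "(");
        }
        catch (ArgumentException)
        {
            Console.WriteLine("invalid pattern");
        }
        Console.WriteLine(StripLiteral("(x", "("));   // ##x
    }
}
```

If a regular expression were required for some other reason, passing the token through `Regex.Escape` first would avoid the metacharacter problem, but the plain substring avoids building a `Regex` per token at all.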