It seems that backwards quotation marks in the text (“end“) cause tokenization results that differ from the Python implementations. This is the only inconsistency I've run into so far. Curious whether anyone else has seen similar issues.
Example: “ends -> ##end and ##s, instead of ##ends
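One way to narrow this down is to check how a Python reference would treat the quotation mark before WordPiece runs. The sketch below is an assumption about what "Python implementations" means here: it mirrors the punctuation pre-split used by BERT-style basic tokenizers (ASCII symbol ranges plus any Unicode category starting with "P"), and the helper names `is_punctuation` and `split_on_punctuation` are made up for illustration. If the port being compared does not treat U+201C as punctuation, the quote would stay attached to the word, which could explain a different split.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    """Return True for ASCII symbols and any Unicode 'P*' category character."""
    cp = ord(ch)
    # ASCII ranges that BERT-style tokenizers treat as punctuation.
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    # U+201C (left double quotation mark) has category "Pi", so it lands here.
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(text: str) -> List[str]:
    """Split text so that every punctuation character becomes its own piece."""
    pieces: List[str] = []
    current = ""
    for ch in text:
        if is_punctuation(ch):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    word = "\u201cends"  # the example from this issue: “ends
    print(unicodedata.category(word[0]))
    print(split_on_punctuation(word))
```

Comparing the pieces printed here with what the other implementation hands to its WordPiece step should show whether the quote is being split off before subword matching.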
rghavimi changed the title from "Backwards quotation mark causing different tokenization results" to "Backwards quotation mark causing inaccurate tokenization results" on Feb 16, 2023
rghavimi changed the title from "Backwards quotation mark causing inaccurate tokenization results" to "Words surrounded by backwards quotation marks causing inaccurate tokenization results" on Feb 16, 2023