-
-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rtext] Fix GetCodepointNext() to return default value on invalid input with size=0 #2997
Conversation
…nput. Modify LoadCodepoints to work when GetCodepointNext returns a size of 0. All internal use of GetCodepointNext and GetCodepointPrev checked. This fix may break external code dealing with invalid input as the old code erroneously never returned a size of 0, external code that doesn't properly check for size=0 may endlessly loop or overflow a buffer on invalid input.
|
Exactly that. Not something anyone is likely to actually encounter but I mention it for completeness.
All subsequent bytes in a multibyte UTF-8 encoding have the pattern 10xxxxxx, a 2 bit multibyte identifier followed by a 6 bit payload. This
Yes. I don't know how long it's been like this, but bad input should be rare enough that even if there are bugs in code it's unlikely they'll be triggered.
The size==0 is so that validation can happen elsewhere. It is a little odd to me too that a default value and an error state is returned, but I just restored the intent. It has led to some places internally checking the size for an error, and others checking for ?.
That is the intent, with a weird default edge-case of size==0 that has to be checked. Maybe when the default ? is triggered it should return size=1. This would simplify things including the fix no longer potentially breaking flakey external code, but also makes the function unsuitable for validation as we're papering over the cracks. As it stands it's in a bit of a hybrid state where we expect good input but can't assume good output.
All internal use of the function checks for size=0 and replaces it with size=1 anyway, meaning it keeps reading beyond an invalid code. Ignoring the error is another indicator that size=0 should probably be eliminated entirely.
Valid first byte values are 0..127 for size=1, 192..223 for size=2, 224..239 for size=3, 240..247 for size=4. Everything else is invalid and returns size=0 with this PR, before this PR invalid first-byte values would erroneously be returned as valid with
raylib uses it internally to read UTF-8 strings where allowing bad input is desirable so it doesn't fail, and it's available for external use. As the error state of size=0 is not really used internally you have to assume it's for external use only, in which case a separate function would probably be better that also does full validation. Personally I would have the default case return size=1 and also check 10xxxxxx so that all invalid input is replaced with ?, and let the programmer overallocate externally if they want to be cautious instead of adding expensive bounds checking. This would have the benefit of not breaking dodgy external code and also simplify internal code that uses GetCodepointNext, GetCodepointNext tends to be used like this
, eliminating size=0 would eliminate the branch
|
I'm included to lean toward that also. There are two possible code points to return other than "?" (in case that's the actual codepoint). There's the ASCII SUB code, U+001A, which is intended for precisely this purpose and also U+FFFD SUBSTITUTE which has this nice "?"-in-diamond glyph and which I see occasionally in Twitter subject-lines in my reader. Either way, I think we are talking about breaking changes and I'm unclear what provides the softest landing. Also, I can see using U+001A as pretty universally sensible and a client could substitute U+FFFD as part of any presentation. This would have users see that something was amiss without stopping the train. |
We probably need to be mindful of basic fonts that might only define ascii or even just the obviously visible part of ascii. If U+001A is pretty universally defined then it's suitable but I don't think U+FFFD can be suitable (not a font guy). I've assumed 0x3f was used specifically because it is universal. This function is used by rtextures to convert a string to an image, I could be wrong but it looks part of the main way text is displayed by raylib, guaranteeing a visual representation of bad input seems important so the dev can notice and fix it.
The current PR is a breaking change, not from the intended functionality but from the functionality that's been in place for who knows how long. IMO the softest landing is definitely to remove size=0 as a possibility (after all it has been accidentally removed already and no one has seemed to notice), and fixing it so that erroneous codepoints cannot be returned at all just makes sense (if we want to replace bad input with ? why are we only partially doing that?). I'll modify the PR to that effect later today. |
Excuse the late response, I'm on vacation and a bit out of the loop but I think avoiding the |
…stead of 0. This matches existing prod behaviour and guarantees size 1..4 is returned. Simplify internal code that uses GetCodepointNext that previously had to account for size=0.
…ureTextEx. This change matches existing precedent in DrawTextEx
The advantage of U+001A is that it does not confuse with an actual "?". There is the problem of having no visible graphic usually, and "?" is a common practice. I agree that it is rare to find programs that deal with U+001A and practice trumps purism here. :( We are assuming that the app does not check on these and the "?" is rather automatically incorporated in whatever the text is being processed on behalf of. It would be nice to have a way to be more defensive but I concede that's too much breaking change for edge cases, despite my general preference. PS: I assume there are cases of these errors where size ends up being greater than 1, so that advancement past whatever was bogus/incomplete happens properly, but not going beyond what was consumed as valid, when more than 1. |
If U+001A is visible with raylib's default font then there's an argument that U+001A is a good default choice. I have no preference.
What we are assuming is that the input is good and if not that it's better to push through than error. This function could be used more defensively even if the return value remains 0x3f, all it would take is comparing IO on 0x3f return to notice the error.
Not in the PR implementation, on error size is always 1 regardless of the size the first byte of input indicates the codepoint is meant to be. Normally this means for example an invalid 3 byte UTF-8 is decoded to three 0x3f codepoints, but returning size=1 on error always means it is still possible for invalid codepoints to be returned as valid (ie an invalid codepoint followed by a "misaligned" valid codepoint). This is unlikely unless the input is crafted or garbage, a single bit error in the control bits of a UTF-8 character is not enough to generate a misaligned valid codepoint. If the input is crafted or garbage all bets are off anyway. We could instead on error return the size of the invalid UTF8 character. The problem with that is that good characters could be skipped by a bad character basically hiding them (three ascii characters can be skipped by prepending 11110xxx). Not to mention that sloppy looped use of this function to read a UTF-8 c-string could skip over the |
I agree and I did not mean that it be done that easily. At least one octet has to be consumed and valid successive characters (in the case of a good multibyte first-octet) can be consumed, stopping prematurely before the first improper octet. That makes for one "?" and whatever number of octets were consumed before there is an unexpected one (including the end of the string). It is true that if someone treats an ISO 8-bit code as UTF8, there can be many "?" either way. I'm planting this here in case I am ever foolish enough to dig into the proper treatment of Unicode in all its glory, especially composed characters, reverse-writing, and all of that. I do think some sensible (and TTF quality) handling of CJK would be valuable along with Arabic and Hebrew, and (shudder) some highly-composed languages from Southeast Asia. Maybe not in my life-time :) |
@chocolate42 Thanks for the review! Current implementation looks ok to me, actually I like that it simplifies code in other functions. |
Fixes #2996
There's a few remaining issues to become bulletproof that haven't been fixed, probably won't be implemented but mentioned for completeness:
Tested on Linux with this.