-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new API function utf8_to_uv() #22541
base: blead
Are you sure you want to change the base?
Conversation
inline.h
Outdated
A negative return from this function indicates an error. If you take its | ||
negative (turning it into a positive), the value will be the same as that | ||
returned in C<*error> parameter to C<L</utf8n_to_uvchr_error>>, so you can | ||
determine the exact malformations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem strange to me that the return value between the two functions differs.
I can see someone switching from the non-_flags
version to the _flags
version to supply some flag and not realizing the return value behaviour is different.
inline.h
Outdated
*cp = utf8n_to_uvchr_error(s, send - s, advance, | ||
(flags | (UTF8_ALLOW_ANY | UTF8_ALLOW_EMPTY)), | ||
&errors); | ||
return -errors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further to the return value, negating the U32 adds a warning on MSVC:
D:\a\perl5\perl5\inline.h(3178): warning C4146: unary minus operator applied to unsigned type, result still unsigned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You convinced me to just remove the second function
After this is in, is it possible for Devel::PPPort to generate a warning when utf8_to_uvchr is used, with a reference to its replacement? (I'm not suggesting that you do this Tony, just wondering if we can and should) |
It is easy in ppport.h to either output a hint or a warning about using any particular function. It's just text. Here's an example:
|
This is designed to replace the problematic utf8_to_uvchr(), which is problematic. Its behavior varies depending on if <utf8> warnings are enabled or not, and no code in core actually takes that into account If warnings are enabled: A zero return can mean both success or failure Hence a zero return must be disambiguated. Success would come from the next character being a NUL. If failure, <retlen> will be -1, so can't be used to find where to start parsing again. If disabled: Both the return and <retlen> will be usable values, but the return of the REPLACEMENT CHARACTER is ambiguous. It could mean failure, or it could mean that that was the next character in the input and was successfully decoded. It may very well not matter to you what the source of this particular value was. It likely means a failure somewhere. But there are occasions where you might care. utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler. The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of if the input is malformed or not. It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases. And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.
b3321c0
to
1e35e2d
Compare
The name I used for this function was the first name used for this functionality, being removed in 5.7. A succession of names has followed, as the previous incarnation was found to be deficient for one reason or another. But I don't know. I'm opening this up for discussion |
If it returns a native code point rather than a Unicode code point I'd be inclined to avoid "cp". We've generally used "uvchr" for this but utf8_to_uvchr() is what we're trying to replace. Maybe Maybe |
I'm then inclined to go with |
This is designed to replace the problematic utf8_to_uvchr(), which is problematic. Its behavior varies depending on if warnings are enabled or not, and no code in core actually takes that into account
If warnings are enabled:
If disabled:
utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler.
The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of if the input is malformed or not.
It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.
And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.
There is another function utf8_to_uv_flags() which merely extends this API for more flexible use, and doesn't offer the advantages over the existing API function that does the same thing. I included it because the main function is just a small wrapper around it, and the API is similar and some may prefer it.