Add new API function utf8_to_uv() #22541

khwilliamson · 2024-08-25T22:32:20Z

This is designed to replace the problematic utf8_to_uvchr(), which is problematic. Its behavior varies depending on if warnings are enabled or not, and no code in core actually takes that into account

If warnings are enabled:

A zero return can mean both success or failure

    Hence a zero return must be disambiguated.  Success would come
    from the next character being a NUL.

If failure, <retlen> will be -1, so can't be used to find where to
start parsing again.

If disabled:

Both the return and <retlen> will be usable values, but the return
of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
or it could mean that that was the next character in the input and
was successfully decoded.

utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler.

The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of if the input is malformed or not.

It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.

And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.

There is another function utf8_to_uv_flags() which merely extends this API for more flexible use, and doesn't offer the advantages over the existing API function that does the same thing. I included it because the main function is just a small wrapper around it, and the API is similar and some may prefer it.

inline.h

tonycoz · 2024-08-29T00:14:23Z

inline.h

+A negative return from this function indicates an error.  If you take its
+negative (turning it into a positive), the value will be the same as that
+returned in C<*error> parameter to C<L</utf8n_to_uvchr_error>>, so you can
+determine the exact malformations.


It does seem strange to me that the return value between the two functions differs.

I can see someone switching from the non-_flags version to the _flags version to supply some flag and not realizing the return value behaviour is different.

tonycoz · 2024-08-29T00:14:53Z

inline.h

+    *cp = utf8n_to_uvchr_error(s, send - s, advance,
+                               (flags | (UTF8_ALLOW_ANY | UTF8_ALLOW_EMPTY)),
+                               &errors);
+    return -errors;


Further to the return value, negating the U32 adds a warning on MSVC:

D:\a\perl5\perl5\inline.h(3178): warning C4146: unary minus operator applied to unsigned type, result still unsigned

You convinced me to just remove the second function

karenetheridge · 2024-08-29T01:42:16Z

After this is in, is it possible for Devel::PPPort to generate a warning when utf8_to_uvchr is used, with a reference to its replacement?

(I'm not suggesting that you do this Tony, just wondering if we can and should)

khwilliamson · 2024-08-29T02:24:25Z

It is easy in ppport.h to either output a hint or a warning about using any particular function. It's just text. Here's an example:

*** WARNING: my_sprintf
*** It's safer to use my_snprintf instead

This is designed to replace the problematic utf8_to_uvchr(), which is problematic. Its behavior varies depending on if <utf8> warnings are enabled or not, and no code in core actually takes that into account If warnings are enabled: A zero return can mean both success or failure Hence a zero return must be disambiguated. Success would come from the next character being a NUL. If failure, <retlen> will be -1, so can't be used to find where to start parsing again. If disabled: Both the return and <retlen> will be usable values, but the return of the REPLACEMENT CHARACTER is ambiguous. It could mean failure, or it could mean that that was the next character in the input and was successfully decoded. It may very well not matter to you what the source of this particular value was. It likely means a failure somewhere. But there are occasions where you might care. utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler. The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of if the input is malformed or not. It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases. And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.

khwilliamson · 2024-09-02T04:03:27Z

The name I used for this function was the first name used for this functionality, being removed in 5.7. A succession of names has followed, as the previous incarnation was found to be deficient for one reason or another. uvchr apparently came about as a result of supporting non-ASCII systems, and I presumed was chosen to somehow note the fluidity of the underlying character set. I thought that it had been long enough since utf8_to_uv' had been in use that it could safely be reused again. But I came up with an alternative name in the meantime utf8_to_cp (We would want to make an inversecp_to_utf8. I think cp` is in common enough usage these days that the name would be self-explanatory to modern programmers.

But I don't know. I'm opening this up for discussion

tonycoz · 2024-09-04T02:10:45Z

If it returns a native code point rather than a Unicode code point I'd be inclined to avoid "cp".

We've generally used "uvchr" for this but utf8_to_uvchr() is what we're trying to replace.

Maybe utf8_decode_uvchr(), but that has it's own potential for confusion.

Maybe utf8_to_uvchr4() to emphasize that this is an alternative to utf8_to_uvchr().

khwilliamson · 2024-09-04T19:48:19Z

I'm then inclined to go with utf8_to_uv

tonycoz reviewed Aug 28, 2024

View reviewed changes

inline.h Outdated Show resolved Hide resolved

tonycoz reviewed Aug 29, 2024

View reviewed changes

khwilliamson force-pushed the utf8_to_uv branch from b3321c0 to 1e35e2d Compare August 29, 2024 03:02

tonycoz approved these changes Sep 2, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new API function utf8_to_uv() #22541

Add new API function utf8_to_uv() #22541

khwilliamson commented Aug 25, 2024

tonycoz Aug 29, 2024

tonycoz Aug 29, 2024

khwilliamson Aug 29, 2024

karenetheridge commented Aug 29, 2024 •

edited

Loading

khwilliamson commented Aug 29, 2024

khwilliamson commented Sep 2, 2024

tonycoz commented Sep 4, 2024

khwilliamson commented Sep 4, 2024

Add new API function utf8_to_uv() #22541

Are you sure you want to change the base?

Add new API function utf8_to_uv() #22541

Conversation

khwilliamson commented Aug 25, 2024

tonycoz Aug 29, 2024

Choose a reason for hiding this comment

tonycoz Aug 29, 2024

Choose a reason for hiding this comment

khwilliamson Aug 29, 2024

Choose a reason for hiding this comment

karenetheridge commented Aug 29, 2024 • edited Loading

khwilliamson commented Aug 29, 2024

khwilliamson commented Sep 2, 2024

tonycoz commented Sep 4, 2024

khwilliamson commented Sep 4, 2024

karenetheridge commented Aug 29, 2024 •

edited

Loading