Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new API function utf8_to_uv() #22541

Open
wants to merge 1 commit into
base: blead
Choose a base branch
from

Conversation

khwilliamson
Copy link
Contributor

This is designed to replace the problematic utf8_to_uvchr(), which is problematic. Its behavior varies depending on if warnings are enabled or not, and no code in core actually takes that into account

If warnings are enabled:

A zero return can mean both success or failure

    Hence a zero return must be disambiguated.  Success would come
    from the next character being a NUL.

If failure, <retlen> will be -1, so can't be used to find where to
start parsing again.

If disabled:

Both the return and <retlen> will be usable values, but the return
of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
or it could mean that that was the next character in the input and
was successfully decoded.

utf8_to_uv() solves these. This commit includes a few changes to use it, to show it works. I have WIP that changes the rest of core to use it. I found that it makes coding simpler.

The new function returns true upon success; false on failure. And it is passed pointers to return the computed code point and byte length into. These values always contain the correct information, regardless of if the input is malformed or not.

It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases.

And if you don't care particularly if it succeeds or not, you just use it without testing the result. This happens when you are confident that the input is well-formed, or say in converting a string for display.

There is another function utf8_to_uv_flags() which merely extends this API for more flexible use, and doesn't offer the advantages over the existing API function that does the same thing. I included it because the main function is just a small wrapper around it, and the API is similar and some may prefer it.

inline.h Outdated Show resolved Hide resolved
inline.h Outdated
Comment on lines 3154 to 3157
A negative return from this function indicates an error. If you take its
negative (turning it into a positive), the value will be the same as that
returned in C<*error> parameter to C<L</utf8n_to_uvchr_error>>, so you can
determine the exact malformations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem strange to me that the return value between the two functions differs.

I can see someone switching from the non-_flags version to the _flags version to supply some flag and not realizing the return value behaviour is different.

inline.h Outdated
*cp = utf8n_to_uvchr_error(s, send - s, advance,
(flags | (UTF8_ALLOW_ANY | UTF8_ALLOW_EMPTY)),
&errors);
return -errors;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further to the return value, negating the U32 adds a warning on MSVC:

D:\a\perl5\perl5\inline.h(3178): warning C4146: unary minus operator applied to unsigned type, result still unsigned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You convinced me to just remove the second function

@karenetheridge
Copy link
Member

karenetheridge commented Aug 29, 2024

After this is in, is it possible for Devel::PPPort to generate a warning when utf8_to_uvchr is used, with a reference to its replacement?

(I'm not suggesting that you do this Tony, just wondering if we can and should)

@khwilliamson
Copy link
Contributor Author

It is easy in ppport.h to either output a hint or a warning about using any particular function. It's just text. Here's an example:

*** WARNING: my_sprintf
*** It's safer to use my_snprintf instead


This is designed to replace the problematic utf8_to_uvchr(), which
is problematic.  Its behavior varies depending on if <utf8> warnings are
enabled or not, and no code in core actually takes that into account

If warnings are enabled:

    A zero return can mean both success or failure

        Hence a zero return must be disambiguated.  Success would come
        from the next character being a NUL.

    If failure, <retlen> will be -1, so can't be used to find where to
    start parsing again.

If disabled:

    Both the return and <retlen> will be usable values, but the return
    of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
    or it could mean that that was the next character in the input and
    was successfully decoded.  It may very well not matter to you what
    the source of this particular value was.  It likely means a failure
    somewhere.  But there are occasions where you might care.

utf8_to_uv() solves these.  This commit includes a few changes to use
it, to show it works.  I have WIP that changes the rest of core to use
it.  I found that it makes coding simpler.

The new function returns true upon success; false on failure.  And it is
passed pointers to return the computed code point and byte length into.
These values always contain the correct information, regardless of if
the input is malformed or not.

It is easy to test for failure in a conditional and then to take
appropriate action.  However, most often it seems the appropriate action
is to use, going forward, the REPLACEMENT CHARACTER returned in failure
cases.

And if you don't care particularly if it succeeds or not, you just use
it without testing the result.  This happens when you are confident that
the input is well-formed, or say in converting a string for display.
@khwilliamson
Copy link
Contributor Author

The name I used for this function was the first name used for this functionality, being removed in 5.7. A succession of names has followed, as the previous incarnation was found to be deficient for one reason or another. uvchr apparently came about as a result of supporting non-ASCII systems, and I presumed was chosen to somehow note the fluidity of the underlying character set. I thought that it had been long enough since utf8_to_uv' had been in use that it could safely be reused again. But I came up with an alternative name in the meantime utf8_to_cp (We would want to make an inversecp_to_utf8. I think cp` is in common enough usage these days that the name would be self-explanatory to modern programmers.

But I don't know. I'm opening this up for discussion

@tonycoz
Copy link
Contributor

tonycoz commented Sep 4, 2024

If it returns a native code point rather than a Unicode code point I'd be inclined to avoid "cp".

We've generally used "uvchr" for this but utf8_to_uvchr() is what we're trying to replace.

Maybe utf8_decode_uvchr(), but that has it's own potential for confusion.

Maybe utf8_to_uvchr4() to emphasize that this is an alternative to utf8_to_uvchr().

@khwilliamson
Copy link
Contributor Author

I'm then inclined to go with utf8_to_uv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants