Add new API function utf8_to_uv() #22541
Open
khwilliamson wants to merge 11 commits into Perl:blead from khwilliamson:utf8_to_uv
Commits on Oct 28, 2024
This is a one line function that just calls another function.
Commit 9d64eb3

Merge utf8_to_uvchr_buf() and its helper
The helper adds no value
Commit 7e4b18b

Convert utf8n_to_uvchr_error to macro
It was a macro, but had a long-name function as well. This converts to using two macros.
Commit ccf4ea4

Convert utf8n_to_uvchr() to macro
It was a macro, but had a long-name function as well. This converts to using two macros.
Commit 8a5653c

This is the first of several functions with the naming style utf8_to_uv(), which are designed to be used instead of the problematic current ones that are like utf8_to_uvchr(). The previous ones basically throw away crucial information in their returns upon failure, creating hassles for the caller. With them it is hard to recover from malformed input and continue parsing, which is what modern UTF-8 handlers have settled on doing.

Originally I planned to replace just the most problematic one, utf8_to_uvchr_buf(), but I realized that each level threw away information, so it would be better to start at the base-level one, which utf8_to_uvchr_buf() eventually calls with a bunch of 0 parameters. The previous functions all had to disambiguate failure returns. This stops that at the root.

The new series all return a boolean indicating success, with a consistent API throughout. The old series had one outlier, again utf8_to_uvchr_buf(), which had a different calling convention and different returns.

The basic logic in the base-level function, which this commit handles, was sound. It just failed to return relevant information upon failure.

The new API has somewhat different formal parameter names and uses Size_t instead of STRLEN for one of the parameters. It also passes the end-of-string position instead of a length. A length is problematic because a value that should go negative instead becomes a huge positive number.

The old base function now merely calls the new one and throws away the relevant information, as it always has.
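The point about a length that "could go negative" can be illustrated with a small standalone C sketch. It is not code from this pull request; the helper names demo_length_after() and demo_more_input() are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length-based bookkeeping. The caller subtracts
 * the bytes it believes it consumed from an unsigned remaining length.
 * If consumed > len, the result cannot be negative; it wraps around to
 * a huge positive value, which then looks like "plenty of input left". */
static size_t demo_length_after(size_t len, size_t consumed)
{
    return len - consumed;
}

/* Hypothetical helper: end-pointer bookkeeping. Both pointers must
 * refer to the same buffer (or one past its end). A position at or
 * beyond the end simply compares as "no more input". */
static int demo_more_input(const unsigned char *s, const unsigned char *e)
{
    return s < e;
}
```

With a remaining length of 3 and 5 bytes consumed, demo_length_after() yields SIZE_MAX - 1 rather than an error, whereas the pointer comparison in demo_more_input() reports that nothing is left.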
Commit fff476c

This is just utf8n_to_uvchr_error() with a more convenient API that is harder to misuse. New code should use this new function instead of the old.
Commit 5f3b889

This is just utf8n_to_uvchr() with a more convenient API that is harder to misuse. New code should use this new function instead of the old.
Commit 2848afb

This performs the same function as utf8_to_uvchr_buf() with a more convenient API that is much harder to misuse. All code should convert to use this new function instead of the old.

The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8> warnings are enabled, and no code in core actually takes that into account.

If warnings are enabled:

A zero return can mean either success or failure, hence a zero return must be disambiguated. Success would come from the next character being a NUL. If failure, <retlen> will be -1, so it can't be used to find where to start parsing again.

If disabled:

Both the return and <retlen> will be usable values, but the return of the REPLACEMENT CHARACTER is ambiguous. It could mean failure, or it could mean that that was the next character in the input and was successfully decoded. It may very well not matter to you what the source of this particular value was. It likely means a failure somewhere. But there are occasions where you might care.

The new function returns true upon success and false on failure. It is passed pointers through which it returns the computed code point and byte length. These values always contain the correct information, regardless of whether the input is malformed. It is easy to test for failure in a conditional and then to take appropriate action. However, most often it seems the appropriate action is to use, going forward, the REPLACEMENT CHARACTER returned in failure cases. And if you don't particularly care whether it succeeds, you just use it without testing the result. This happens when you are confident that the input is well-formed, or, say, when converting a string for display.
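The calling convention described above (boolean return, end pointer, code point and byte length returned through pointers) can be sketched in standalone C. This is a minimal illustration, not the Perl implementation: demo_utf8_to_uv() and demo_decode_all() are hypothetical names, the parameter order is an assumption, and the stand-in accepts only standard Unicode scalar values, which may differ from what the real function accepts. How many bytes it skips on malformed input is also a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical stand-in decoder. Returns true on success. On failure it
 * returns false, stores U+FFFD in *cp, and still stores a usable, nonzero
 * byte count in *advance so the caller can keep parsing. */
static bool demo_utf8_to_uv(const unsigned char *s, const unsigned char *e,
                            uint32_t *cp, size_t *advance)
{
    assert(s < e);                      /* caller guarantees some input */
    unsigned char b0 = s[0];
    size_t need;                        /* continuation bytes expected */
    uint32_t value, min;                /* min rejects overlong forms */

    if (b0 < 0x80) { *cp = b0; *advance = 1; return true; }
    else if (b0 >= 0xC2 && b0 <= 0xDF) { need = 1; value = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0)      { need = 2; value = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { need = 3; value = b0 & 0x07; min = 0x10000; }
    else { *cp = DEMO_REPLACEMENT_CHARACTER; *advance = 1; return false; }

    size_t i = 1;
    for (; i <= need; i++) {
        /* Stop at the end of the buffer or at a non-continuation byte. */
        if (i >= (size_t)(e - s) || (s[i] & 0xC0) != 0x80) break;
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (i <= need || value < min || value > 0x10FFFF
        || (value >= 0xD800 && value <= 0xDFFF))
    {
        *cp = DEMO_REPLACEMENT_CHARACTER;
        *advance = i;                   /* bytes examined, always >= 1 */
        return false;
    }
    *cp = value;
    *advance = need + 1;
    return true;
}

/* Decode a whole buffer, keeping U+FFFD for bad input and continuing.
 * out must have room for at least (e - s) code points. */
static size_t demo_decode_all(const unsigned char *s, const unsigned char *e,
                              uint32_t *out, size_t *failures)
{
    size_t n = 0;
    *failures = 0;
    while (s < e) {
        uint32_t cp;
        size_t advance;
        if (!demo_utf8_to_uv(s, e, &cp, &advance))
            (*failures)++;              /* testing the result is optional */
        out[n++] = cp;
        s += advance;
    }
    return n;
}
```

In this sketch a decoded NUL is an ordinary success, so a zero code point needs no disambiguation, and a caller that does not care about failures can drop the test and use the returned code point directly.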
Commit 1a3aac8

Implement utf8_to_uvchr_buf in terms of utf8_to_uv_flags
This is simpler than the existing one.
Commit c56befe

One of these is a more explicit synonym for that function; the other two restrict what's acceptable to Unicode's legal interchange or their C9 legal interchange.
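As a rough illustration of the two restrictions, the sketch below classifies code points under the assumption that "legal interchange" excludes surrogates, non-characters, and code points above U+10FFFF, while the Corrigendum 9 variant permits non-characters. That reading, and the names demo_ok_strict() and demo_ok_c9(), are assumptions for illustration and are not taken from this pull request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Surrogates: U+D800..U+DFFF. */
static bool demo_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
 * each plane (low 16 bits equal to FFFE or FFFF). */
static bool demo_is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* Beyond the Unicode code space. */
static bool demo_is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFF;
}

/* Assumed strict policy: reject all three classes. */
static bool demo_ok_strict(uint32_t cp)
{
    return !demo_is_surrogate(cp) && !demo_is_noncharacter(cp)
        && !demo_is_above_unicode(cp);
}

/* Assumed Corrigendum 9 policy: as strict, but non-characters pass. */
static bool demo_ok_c9(uint32_t cp)
{
    return !demo_is_surrogate(cp) && !demo_is_above_unicode(cp);
}
```

Under these assumptions U+FFFE is rejected by the strict check but accepted by the C9 check, while U+D800 and 0x110000 are rejected by both.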
Commit 6a6c5a5

Commit bd1b0f7