Add new API function utf8_to_uv() #22541

Open
wants to merge 11 commits into
base: blead

Commits on Oct 28, 2024

  1. Inline utf8_to_uvchr_buf

    This is a one-line function that just calls another function.
    khwilliamson committed Oct 28, 2024
    9d64eb3
  2. Merge utf8_to_uvchr_buf() and its helper

    The helper adds no value.
    khwilliamson committed Oct 28, 2024
    7e4b18b
  3. Convert utf8n_to_uvchr_error to macro

    It was a macro, but had a long-name function as well.  This converts to
    using two macros.
    khwilliamson committed Oct 28, 2024
    ccf4ea4
  4. Convert utf8n_to_uvchr() to macro

    It was a macro, but had a long-name function as well.  This converts to
    using two macros.
    khwilliamson committed Oct 28, 2024
    8a5653c
  5. Add utf8_to_uv_msgs()

    This is the first of several functions named in the style of
    utf8_to_uv(), designed to be used instead of the problematic
    existing ones named like utf8_to_uvchr().
    
    The previous ones basically throw away crucial information in their
    returns upon failure, creating hassles for the caller.  With them it
    is hard to recover from malformed input and continue parsing, which
    is what modern UTF-8 handlers have settled on doing.
    
    Originally I planned to replace just the most problematic one,
    utf8_to_uvchr_buf(), but I realized that each level threw away
    information, so it would be better to start at the base level one, which
    utf8_to_uvchr_buf() eventually calls with a bunch of 0 parameters.  The
    previous functions all had to disambiguate failure returns.  This stops
    that at the root.
    
    The new series all return a boolean as to their success, with a
    consistent API throughout.  The old series had one outlier, again
    utf8_to_uvchr_buf(), which had a different calling convention and
    returns.
    
    The basic logic in the base level function, which this commit handles,
    was sound.  It just failed to return relevant information upon failure.
    
    The new API has somewhat different formal parameter names and uses
    Size_t instead of STRLEN for one of the parameters.  It also passes the
    end-of-string position instead of a length.  A length is problematic
    because a value that should be negative instead becomes a huge
    positive number.
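
A minimal sketch, in standalone C, of why an end position is safer than a length. The helper names are hypothetical and are not part of the Perl API; they only contrast a bounds check done on an unsigned length with one done on positions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Remaining length as a caller might compute it from offsets.
 * size_t is unsigned, so this wraps around when pos > len. */
static size_t remaining_from_offsets(size_t len, size_t pos)
{
    return len - pos;
}

/* Length style: trusts whatever length it is handed. */
static bool has_room_len(size_t remaining, size_t need)
{
    return need <= remaining;
}

/* End-position style: compares positions first, so a position that has
 * overshot the end is reported as having no room. */
static bool has_room_end(const uint8_t *s, const uint8_t *e, size_t need)
{
    assert(s != NULL && e != NULL);
    return s < e && need <= (size_t)(e - s);
}
```

With a 4-byte string and a position that has overshot to offset 6, `remaining_from_offsets(4, 6)` wraps to `SIZE_MAX - 1`, so the length-style check accepts a 2-byte read, while the end-position check rejects it.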
    
    The old base function now merely calls the new one, and throws away the
    relevant information, as it always has.
    khwilliamson committed Oct 28, 2024
    fff476c
  6. Add utf8_to_uv_error(s)

    This is just utf8n_to_uvchr_error() with a more convenient API that is
    harder to misuse.
    
    New code should use this new function instead of the old.
    khwilliamson committed Oct 28, 2024
    5f3b889
  7. Add utf8_to_uv_flags()

    This is just utf8n_to_uvchr() with a more convenient API that is harder
    to misuse.
    
    New code should use this new function instead of the old.
    khwilliamson committed Oct 28, 2024
    2848afb
  8. Add utf8_to_uv()

    This performs the same function as utf8_to_uvchr_buf() with a more
    convenient API that is much harder to misuse.
    
    All code should convert to use this new function instead of the old.
    
    The behavior of utf8_to_uvchr_buf() varies depending on whether <utf8>
    warnings are enabled, and no code in core actually takes that into
    account.
    
    If warnings are enabled:
    
     A zero return can mean either success or failure.
    
         Hence a zero return must be disambiguated.  Success would come
         from the next character being a NUL.
    
     On failure, <retlen> will be -1, so it can't be used to find where
     to start parsing again.
    
    If disabled:
    
     Both the return and <retlen> will be usable values, but the return
     of the REPLACEMENT CHARACTER is ambiguous.  It could mean failure,
     or it could mean that that was the next character in the input and
     was successfully decoded.  It may very well not matter to you what
     the source of this particular value was.  It likely means a failure
     somewhere.  But there are occasions where you might care.
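
The two ambiguities described above can be modeled in a few lines of standalone C. This is only a sketch of the information loss, not Perl's code: `sketch_decoded`, `old_style_warn()` and `old_style_nowarn()` are hypothetical names, and `ptrdiff_t` stands in for the real length type so that -1 can be shown directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What a decoder knows internally about one character (hypothetical). */
typedef struct {
    bool      ok;    /* was the input well-formed? */
    uint32_t  cp;    /* decoded code point, if ok */
    ptrdiff_t len;   /* bytes examined */
} sketch_decoded;

/* Model of the "warnings enabled" convention: failure is reported as a
 * zero return with the length set to -1. */
static uint32_t old_style_warn(sketch_decoded d, ptrdiff_t *retlen)
{
    if (!d.ok) { *retlen = -1; return 0; }
    *retlen = d.len;
    return d.cp;
}

/* Model of the "warnings disabled" convention: failure is reported as
 * the REPLACEMENT CHARACTER with a usable length. */
static uint32_t old_style_nowarn(sketch_decoded d, ptrdiff_t *retlen)
{
    *retlen = d.len;
    return d.ok ? d.cp : 0xFFFDu;
}

/* Usage: different internal results collapse to the same return value. */
static void sketch_old_style_self_check(void)
{
    ptrdiff_t len_a, len_b;
    sketch_decoded nul       = { true,  0,       1 };
    sketch_decoded malformed = { false, 0,       1 };
    sketch_decoded real_fffd = { true,  0xFFFDu, 3 };
    sketch_decoded bad3      = { false, 0,       3 };

    /* Zero means NUL or failure; only the length tells them apart,
     * and on failure the length says nothing about where to resume. */
    assert(old_style_warn(nul, &len_a) == old_style_warn(malformed, &len_b));
    assert(len_a == 1 && len_b == -1);

    /* U+FFFD means a real U+FFFD or a failure; nothing tells them apart. */
    assert(old_style_nowarn(real_fffd, &len_a) ==
           old_style_nowarn(bad3, &len_b));
    assert(len_a == len_b);
}
```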
    
    The new function returns true upon success; false on failure.  It is
    passed pointers through which it returns the computed code point and
    byte length.  These values always contain the correct information,
    whether or not the input is malformed.
    
    It is easy to test for failure in a conditional and then to take
    appropriate action.  However, most often it seems the appropriate action
    is to use, going forward, the REPLACEMENT CHARACTER returned in failure
    cases.
    
    And if you don't care particularly if it succeeds or not, you just use
    it without testing the result.  This happens when you are confident that
    the input is well-formed, or say in converting a string for display.
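
The calling pattern described above might look roughly like the following standalone C sketch. `sketch_utf8_to_uv()` is a hypothetical stand-in, not the Perl function: it is a deliberately strict toy decoder (no extended UTF-8, surrogates rejected), and the number of bytes it consumes on failure is an assumption of this sketch. What it does share with the description is the shape: a boolean return, an end position, and out-parameters that are always usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Decode one character from [s, e).  Always sets *cp and *advance
 * (*advance >= 1); on malformed input *cp is U+FFFD. */
static bool sketch_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                              uint32_t *cp, size_t *advance)
{
    assert(s < e);
    size_t avail = (size_t)(e - s);
    uint8_t b0 = s[0];
    size_t need, i;
    uint32_t v, min;

    if (b0 < 0x80)                { *cp = b0; *advance = 1; return true; }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; v = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; v = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; v = b0 & 0x07; min = 0x10000; }
    else { *cp = SKETCH_REPLACEMENT; *advance = 1; return false; }

    /* Accumulate continuation bytes, stopping at the end of the buffer
     * or at the first byte that is not a continuation. */
    for (i = 1; i < need; i++) {
        if (i >= avail || (s[i] & 0xC0) != 0x80) break;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Truncated, overlong, surrogate, or above U+10FFFF: report failure
     * but still hand back a code point and a resume distance. */
    if (i < need || v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
        *cp = SKETCH_REPLACEMENT;
        *advance = i;
        return false;
    }
    *cp = v;
    *advance = need;
    return true;
}

/* Decode a whole buffer.  The loop never needs the result to keep going;
 * counting failures here is optional.  Returns code points written. */
static size_t sketch_decode_all(const uint8_t *s, const uint8_t *e,
                                uint32_t *out, size_t out_cap,
                                size_t *malformed)
{
    size_t n = 0;
    *malformed = 0;
    while (s < e && n < out_cap) {
        uint32_t cp;
        size_t advance;
        if (!sketch_utf8_to_uv(s, e, &cp, &advance))
            (*malformed)++;
        out[n++] = cp;
        s += advance;
    }
    return n;
}
```

A caller that only wants something displayable can drop the `malformed` count entirely and use the code points as they come.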
    khwilliamson committed Oct 28, 2024
    1a3aac8
  9. Implement utf8_to_uvchr_buf in terms of utf8_to_uv_flags

    This is simpler than the existing one.
    khwilliamson committed Oct 28, 2024
    c56befe
  10. Add utf8_to_uv() flavors

    One of these is a more explicit synonym for that function; the other two
    restrict what's acceptable to Unicode's legal interchange or to its
    Corrigendum #9 (C9) legal interchange.
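
As a rough illustration of the two restrictions, the following standalone C sketch classifies code points. It assumes the usual reading in which strict interchange excludes surrogates, noncharacters, and code points above U+10FFFF, while Corrigendum #9 lifts the restriction on noncharacters. The function names are hypothetical and are not the ones this commit adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool sketch_is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* U+FDD0..U+FDEF, plus the last two code points of every plane. */
static bool sketch_is_noncharacter(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

static bool sketch_is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFF;
}

/* Assumed rule for strict interchange. */
static bool sketch_ok_strict(uint32_t cp)
{
    return !sketch_is_surrogate(cp)
        && !sketch_is_noncharacter(cp)
        && !sketch_is_above_unicode(cp);
}

/* Assumed rule under Corrigendum #9: noncharacters are allowed. */
static bool sketch_ok_c9(uint32_t cp)
{
    return !sketch_is_surrogate(cp) && !sketch_is_above_unicode(cp);
}

/* Usage: the two rules differ only on noncharacters. */
static void sketch_interchange_self_check(void)
{
    assert(sketch_ok_strict(0x41)      && sketch_ok_c9(0x41));
    assert(!sketch_ok_strict(0xFFFE)   && sketch_ok_c9(0xFFFE));
    assert(!sketch_ok_strict(0xD800)   && !sketch_ok_c9(0xD800));
    assert(!sketch_ok_strict(0x110000) && !sketch_ok_c9(0x110000));
}
```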
    khwilliamson committed Oct 28, 2024
    6a6c5a5
  11. bd1b0f7