refcounted_he_(new|fetch)_pvn: Don't roll-own code #22638

khwilliamson · 2024-10-01T21:20:28Z

The function bytes_from_utf8() already does what these two instances of duplicated code do.

This set of changes does not require a perldelta entry.

tonycoz · 2024-10-02T04:26:45Z

My only issue with this is that bytes_from_utf8() always pays the cost of an allocation, even when all of the characters are invariants, while the original code only pays that cost if a variant is found.

khwilliamson · 2024-10-03T02:46:06Z

I don't think these hv functions have any business re-implementing UTF-8 translations. It's unfortunate the API is the way it is. I first tried the equivalent function utf8_to_bytes(), which is destructive of the input. We know that the translated version will be no longer, and likely shorter, than the original, so that should work. But I kept getting segfaults on what appeared to be valid writes. I came to the conclusion it must be writing to some shared or read-only memory.

So what to do. What strikes me immediately is to change the len parameter from STRLEN to SSize_t. I think we can get away with that without affecting existing code; but if I'm wrong this won't fly. If that is feasible, then a negative length would signal the function to just return the input if there are no variants. The caller would check if the return pointer is the same as the passed one.
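A rough standalone sketch of the scan that a "return the input when there are no variants" mode would rely on. It assumes an ASCII platform, where the invariants are exactly the bytes below 0x80; sketch_first_variant() is a hypothetical name used only for illustration, not a Perl function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * Returns the offset of the first byte that is not an ASCII invariant,
 * or len if every byte is an invariant. */
size_t
sketch_first_variant(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    size_t i = 0;
    while (i < len && s[i] < 0x80)
        i++;
    return i;
}
```

A caller would compare the result with the length: if they are equal, the original pointer can be handed back without any allocation, and only otherwise does the allocating path need to run.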

khwilliamson · 2024-10-04T04:16:11Z

Another option is to convert bool *is_utf8p to an enum with an extra value that says: turn off this flag and return the original source if everything is an invariant.

khwilliamson · 2024-10-04T21:39:33Z

That's not going to work either. How about a new function Size_t utf8_to_bytes_flags(U8 ** s, int type)

Returns 0 if **s could not be converted; otherwise returns the number of bytes in the converted version. Does nothing if **s doesn't need conversion; otherwise *s will point to the converted version. By default, the conversion is destructive, meaning *s is converted itself. The type is an enum with three states:
a) destructive
b) non-destructive, so that new memory is allocated for the conversion should there need to be converting.
c) non-destructive plus mortalized, so that any new memory will be deallocated without the caller needing to deal with it.

khwilliamson · 2024-10-10T13:29:40Z

I have worked on this, and come up with the following API proposal. Feedback welcome.

"utf8_to_bytes_type"
"utf8_to_bytes"
"bytes_from_utf8"
    NOTE: "utf8_to_bytes" is experimental and may change or be removed
    without notice.

    These each convert a string encoded as UTF-8 into the equivalent
    native byte representation, if possible.

    "utf8_to_bytes_type" has the more modern API, and is the easiest to
    use. On input, "s_ptr" is a pointer to the string to be converted
    (so that the first byte will be at (*s_ptr)[0]), and *lenp is its
    length.

    It returns non-zero if the end result is native bytes; zero if there
    are code points in the string not expressible in native byte
    encoding. In many situations, treating the return as a boolean is
    sufficient, but it actually returns an enum "xxx" so you can tease
    apart the reason it succeeded. "noop" means that the input already
    was in bytes, so no action was necessary. "converted" means that the
    conversion was successful. "cant_convert" is a synonym you can use
    for 0 or "false".

    In all cases, *s_ptr and *lenp will have correct and consistent
    values, unchanged if the return is either 0 or "noop"; otherwise
    updated as necessary.

    If the return is "converted" and "type" is 0 (or the equivalent
    "overwrite"), the converted value overwrote the input string. *s_ptr
    will be unchanged, but *lenp will be updated to its new (shortened)
    length.

    If the return is "converted" and "type" is "preserve", the original
    string is never changed. Instead a new "NUL"-terminated string is
    allocated, and *s_ptr is changed to point to that new memory. The
    caller is responsible for freeing that memory. "type" can also be
    set to "use_temp". In this case the original string is also
    preserved, and the converted string placed in newly allocated
    memory, but that will have been marked for automatic destruction (by
    using "SAVEFREEPV").

    Note that the caller has to free the memory at *s_ptr if and only if
    it called this function with "type" set to "preserve", and the
    function returned "converted".

    Upon successful return, the number of variants in the string can be
    computed by having saved the value of *lenp before the call, and
    subtracting the after-call value of *lenp from it. This is also true
    for the other two functions described below.

    Plain "utf8_to_bytes" also converts a UTF-8 encoded string to bytes,
    but there are more glitches that the caller has to be prepared to
    handle.

    The input string is passed with one less indirection level, "s".

    If the conversion was successful or a noop
        The function returns "s" (unchanged) and *lenp will contain the
        correct length.

    If the conversion failed
        The function returns NULL and sets *lenp to -1, cast to
        "STRLEN". This means that you will have to use a temporary
        containing the string length to pass to the function if you will
        need the value afterwards.

    "bytes_from_utf8" also converts a potentially UTF-8 encoded string
    "s" to bytes. It preserves "s", allocating new memory for the
    converted string.

    In contrast to the other two functions, the input string to this one
    need not be UTF-8. If not, the caller has set *is_utf8p to be
    "false", and the function does nothing, returning the original "s".

    Also do nothing if there are code points in the string not
    expressible in native byte encoding, returning the original "s".

    Otherwise, *is_utf8p is set to 0, and the return value is a pointer
    to a newly created string containing the native byte equivalent of
    "s", and whose length is returned in *lenp, updated. The new string
    is "NUL"-terminated. The caller is responsible for arranging for the
    memory used by this string to get freed.

    The major problem with this function is that memory is allocated
    and filled even when no conversion was necessary.

        U8         utf8_to_bytes_type(      U8 **s_ptr, STRLEN *lenp,
                                            U32 type)
        U8 *       utf8_to_bytes     (      U8 *s, STRLEN *lenp)
        U8 *  Perl_utf8_to_bytes     (pTHX_ U8 *s, STRLEN *lenp)
        U8 *       bytes_from_utf8   (      const U8 *s, STRLEN *lenp,
                                            bool *is_utf8p)
        U8 *  Perl_bytes_from_utf8   (pTHX_ const U8 *s, STRLEN *lenp,
                                            bool *is_utf8p)
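The remark in the proposal about computing the number of variants from the before and after lengths can be illustrated with a small standalone model of the "overwrite" mode. This is only a sketch under stated assumptions: an ASCII platform, well-formed UTF-8 input, and a hypothetical function sketch_overwrite() that is not the proposed Perl function and does not share its signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "overwrite" mode; not the Perl function.
 * Converts s in place and shortens *lenp.  Returns false, leaving s and
 * *lenp untouched, if any code point is above 0xFF. */
bool
sketch_overwrite(unsigned char *s, size_t *lenp)
{
    assert(s != NULL && lenp != NULL);
    const size_t len = *lenp;

    /* Check first, so that a failure never leaves s half converted. */
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80)
            continue;
        if ((s[i] != 0xC2 && s[i] != 0xC3) || i + 1 >= len
            || (s[i + 1] & 0xC0) != 0x80)
            return false;
        i++;                    /* skip the continuation byte */
    }

    /* The write index never passes the read index, so this is safe. */
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = s[i];
        if (c >= 0x80)
            c = (unsigned char)(((c & 0x03) << 6) | (s[++i] & 0x3F));
        s[o++] = c;
    }
    *lenp = o;
    return true;
}
```

Each variant below 0x100 occupies two bytes in UTF-8 and one byte afterwards, so for the five-byte UTF-8 string "été" the length drops from 5 to 3, and 5 - 3 = 2 is the number of variants.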

bulk88 · 2024-10-17T18:29:56Z

 but that will have been marked for automatic destruction (by
    using "SAVEFREEPV").

I didn't read the full commit, but the API looks flexible and efficient.

It covers 3 states.
- input is output, array of U8s not changed
- input was converted in place to output, and input contents destroyed (but alloc was reused/in place)
- input was converted, input preserved, output is new malloc

One question, is there any way to undo in perlapi SAVEFREEPV? like for sv_usepvn_flags?

khwilliamson · 2024-10-22T15:06:37Z

One question, is there any way to undo in perlapi SAVEFREEPV? like for sv_usepvn_flags?

I don't believe so

tonycoz · 2024-10-28T01:15:55Z

Rather than requiring the caller to decode the return value to decide whether the string should be released, perhaps add another pointer to pointer:

U8         utf8_to_bytes_type(      U8 **s_ptr, STRLEN *lenp,
                                            U32 type, void **free_me)

utf8_to_bytes_type() would set *free_me to NULL, and if and only if the caller needs to free or take ownership of the PV, set *free_me to that pointer. This way a caller can just:

U8 *orig = ...;
STRLEN len = ...;
void *free_me; // possibly U8 * instead
utf8_to_bytes_type(&orig, &len, preserve, &free_me);
// process orig for len bytes
Safefree(free_me); // Safefree() handles NULL like free()

To make the call even less error prone, supply the original string pointer/length by value to avoid the caller having to remember to initialize them:

U8 *outstr;
STRLEN outlen;
void *free_me;
utf8_to_bytes_type(&outstr, &outlen, origstr, origlen, preserve, &free_me);
// process outstr for outlen bytes
Safefree(free_me);
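For readers who want to try the free_me convention outside of perl, here is a minimal standalone model of it. Everything in it is an assumption made for illustration: the name sketch_to_bytes_preserve(), the parameter order, plain malloc()/free() in place of perl's allocator, an ASCII platform, and well-formed UTF-8 input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of the suggested calling convention; the name and
 * parameter order are illustrative only.  On success *out / *outlen
 * describe the native-byte string.  *free_me is NULL when *out is simply
 * the input, or the malloc()ed buffer that the caller now owns. */
bool
sketch_to_bytes_preserve(const unsigned char **out, size_t *outlen,
                         const unsigned char *in, size_t inlen,
                         void **free_me)
{
    assert(out && outlen && in && free_me);
    *free_me = NULL;
    *out = in;
    *outlen = inlen;

    size_t i = 0;
    while (i < inlen && in[i] < 0x80)
        i++;
    if (i == inlen)
        return true;            /* no variants: hand back the input */

    unsigned char *buf = malloc(inlen + 1);
    if (buf == NULL)
        return false;

    size_t o = 0;
    for (i = 0; i < inlen; i++) {
        unsigned char c = in[i];
        if (c >= 0x80) {
            if ((c != 0xC2 && c != 0xC3) || i + 1 >= inlen
                || (in[i + 1] & 0xC0) != 0x80) {
                free(buf);
                return false;   /* not expressible as native bytes */
            }
            c = (unsigned char)(((c & 0x03) << 6) | (in[++i] & 0x3F));
        }
        buf[o++] = c;
    }
    buf[o] = '\0';
    *out = buf;
    *outlen = o;
    *free_me = buf;
    return true;
}
```

Because free(NULL) is a no-op, the caller can call free(free_me) unconditionally after using the result, without decoding which of the outcomes occurred.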

If utf8_to_bytes_type() returns or accepts an enum, its return and parameter types should reflect that.

Please document the enum values for the type parameter and return types as lists.

Hopefully the enum names will be decorated to avoid name conflicts.

One question, is there any way to undo in perlapi SAVEFREEPV? like for sv_usepvn_flags?

It's not needed here, since you can call with type = preserve.

But in general there's no easy way to do that (like release() for std::unique_ptr in C++).

It would be nice to have that for both this and sv_2mortal(), and there's at least one place that manually does this for mortal SVs.

khwilliamson · 2024-10-28T21:00:12Z

I forgot about this pull request when I created a more refined version of the utf8_to_bytes proposal. I did that in #22703, now closed. And this has been updated to the latest.

One thing I realized when converting existing calls to the new way was that there was a problem with const. The same function now has to be callable in cases where it both changes the input, and is required to not change the input, depending on a flag. To solve this, I made the function internal, and created macros that don't cast away const except when the actual function isn't going to change it. This protects the caller from calling the destructive version with a const string.

The macros are currently named utf8_to_bytes_overwrite(), utf8_to_bytes_new_pv(), and utf8_to_bytes_temp_pv(). Better name ideas welcome.

These hide from the user the new enum parameter that the internal function is called with.

Using @tonycoz's idea of an extra parameter that the internal function sets to indicate that the converted string needs freeing means the enum return type can be gotten rid of, and the functions return bool. Only the new-pv case actually needs that, so having the macros as the public interface hides this parameter except when needed.

I had this parameter as a pointer to a bool. @tonycoz's idea of making it point to the memory to be freed is a better idea, but in all cases in the core, that doesn't simplify things; they each need to do more things than free the memory when the function has created that memory.

Included in this PR are the conversions to use the new function in all the core files. These are not intended to actually be merged at the same time as the rest, but show how this would work out in practice.

tonycoz · 2024-10-30T23:30:05Z

utf8.h

+        Perl_utf8_to_bytes_(aTHX_ s, l, (bool *) 1,                         \
+                                  PL_utf8_to_bytes_overwrite)


This cast (bool *) 1 is unsafe.

Such casts are implementation-defined and may produce an invalid pointer; using that pointer (which the function call itself does) may result in undefined behaviour.

khwilliamson · 2024-10-31

I was too lazy to look up how to use INT2PTR, which would have been less effort than what it ended up costing me. I changed to use that, and also to use U8 * instead of bool *. Presumably that is no longer UB.

The previous version did not make sure that it wasn't reading beyond the end of the buffer in all cases, and the first pass through the input string already ruled out it having most problems. Thus we don't need the full generality here of the macro UTF8_IS_DOWNGRADEABLE_START; and this simplifies things

This causes this function to be able to both overwrite the input, and to instead create new memory. It changes bytes_from_utf8() to use this new capability instead of being a near duplication of the core code of this function. Prior to this commit, bytes_from_utf8() just allocated memory the size of the original string, and started copying into it. When it came to a sequence that wasn't convertible, it stopped, and freed up the copy. The new behavior has it checking first before the malloc that the string is convertible. That has the advantage that there is no malloc without being sure it will be useful; but the disadvantage that there is an extra pass through the input string, but that pass is per-word. The next commit will introduce another advantage. Thanks to Tony Cook for the 'free_me' idea

Prior to this commit, the size malloced was just the same as the length of the input string, which is a worst case scenario. This commit changes things so that the new pass through the input (introduced in the previous commit) also calculates the needed length. The additional cost of doing this is minimal. It has advantages on a very long string with lots of sequences that are convertible.
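The check-then-size pass described in these two commit messages can be sketched in standalone C as follows. This is not the code from utf8.c: it works a byte at a time rather than per word, assumes an ASCII platform, and sketch_downgraded_length() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API.
 * One pass that both checks convertibility and computes the exact size
 * of the converted string.  Returns false if some code point is above
 * 0xFF (or the input is malformed), in which case *needed is not set. */
bool
sketch_downgraded_length(const unsigned char *s, size_t len, size_t *needed)
{
    assert(s != NULL && needed != NULL);

    size_t variants = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80)
            continue;
        if ((s[i] != 0xC2 && s[i] != 0xC3) || i + 1 >= len
            || (s[i + 1] & 0xC0) != 0x80)
            return false;
        variants++;
        i++;                    /* skip the continuation byte */
    }

    /* Each convertible two-byte sequence shrinks to a single byte. */
    *needed = len - variants;
    return true;
}
```

A caller would allocate *needed + 1 bytes (allowing for a trailing NUL) only after this returns true, so no allocation is made for a string that turns out not to be convertible.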

This is a non-destructive conversion of the input into native bytes, and with any new memory required set for destruction via SAVEFREEPV. This allows the caller to not have to be concerned at all if memory was created or not. A new macro is created that calls this internal function with the correct parameter to force this behavior.

khwilliamson force-pushed the hv_fetch branch from 3d24e84 to 1c817a2 Compare October 28, 2024 18:56

khwilliamson mentioned this pull request Oct 28, 2024

Combine the conversion of UTF-8 to bytes into a single base function #22703

Closed

github-actions bot added the hasConflicts label Oct 29, 2024

khwilliamson force-pushed the hv_fetch branch from 1c817a2 to 82fa339 Compare October 29, 2024 13:58

github-actions bot removed the hasConflicts label Oct 30, 2024

tonycoz reviewed Oct 30, 2024

khwilliamson added 15 commits October 31, 2024 05:53

utf8.c: Move declaration to first use

d602b15

utf8.c: White-space only

ef094c4

Outdent and reflow some comments and code in preparation for them to be moved out of the loop

utf8_to_bytes() Move failure code out of loop

50be063

This is for clarity. All this very-unlikely-to-be-used code was in the middle of what is really going on, creating a distraction.

utf8_to_bytes: Update and fix comments.

9055913

These were misleading. On ASCII platforms, many calls to this function won't use the per-word algorithm. That's only done for long-enough strings.

utf8_to_bytes: Rename variable

43df601

The new name, s0, is used in more other places for this meaning, and is more descriptive.

Add preliminary utf8_to_bytes_()

8150a3c

This is an internal function, designed to be an extension of utf8_to_bytes(), with a slightly different API. This commit just adds it and calls it only from utf8_to_bytes(). Future commits will extend this API.

utf8_to_bytes_: Add const

aaee005

This variable should not be changed by the function.

utf8_to_bytes_: Add argument, macro

62a4328

The argument is currently unused. The macro is a public facing API that calls this function with the correct argument

utf8_to_bytes_: Slight refactor

8bfe955

This makes the next commit smaller

Document new utf8_to_bytes() variants

9aeb0aa

pp_sys

fedcb20

khwilliamson added 12 commits October 31, 2024 05:53

locale1

9d1cb08

locale2

3a46442

doio

b653137

pp1

96a823d

pp2

9c7e3a4

sv

786157e

hv.c: White-space only

92d12ab

The indentation for some lines in one function was only 2 spaces. Remove trailing blanks. This will minimize the diff listing in the next commit when white space isn't ignored.

hv1

d37edf0

hv2

1144d5a

hv3

5fb351c

hv4

73bceda

hv5

6d240a9

khwilliamson force-pushed the hv_fetch branch from 82fa339 to 6d240a9 Compare October 31, 2024 14:25
