-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refcounted_he_(new|fetch)_pvn: Don't roll-own code #22638
base: blead
Are you sure you want to change the base?
Conversation
My only issue with this is bytes_from_utf8() always pays the cost of an allocation, even when all of the characters are invariants, while the original code only pays that cost if an invariant is found. |
I don't thing these hv functions have any business re-implementing utf8 translations. It's unfortunate the API is the way it is. I first tried the equivalent function So what to do. What strikes me immediately is to change the len parameter from STRLEN to SSize_t. I think we can get away with that without affecting existing code; but if I'm wrong this won't fly. If that is feasible, then a negative length would signal the function to just return the input if there are no variants. The caller would check if the return pointer is the same as the passed one. |
Another option is to convert |
That's not going to work either. How about a new function
|
I have worked on this, and come with the following API proposal. Feedback welcome
|
I didnt read the full commit, but the api looks flexible and efficient. It covers 3 states. One question, is there any way to undo in perlapi |
I don't believe so |
Rather than requiring the caller to decode the return value to decide whether the string should be released, perhaps add another pointer to pointer:
utf8_to_bytes_type() would set
To make the call even less error prone, supply the original string pointer/length by value to avoid the caller having to remember to initialize them:
If utf8_to_bytes_type() returns or accepts an enum, its return and parameter types should reflect that. Please document the enum values for the Hopefully the the enum names will decorated to avoid name conflicts.
It's not needed here, since you can call with type = preserve. But in general there's no easy way to do that (like release() for std::unique_ptrin C++. It would be nice to have that for both this and sv_2mortal(), and there's at least one place that manually does this for mortal SVs. |
3d24e84
to
1c817a2
Compare
I forgot about this pull request when I created a more refined version of the utf8_to_bytes proposal. I did that in #22703, now closed. And this has been updated to the latest. One thing I realized when converting existing calls to the new way was that there was a problem with const. The same function now has to be callable in cases where it both changes the input, and is required to not change the input, depending on a flag. To solve this, I made the function internal, and created macros that don't cast away const except when the actual function isn't going to change it. This protects the caller from calling the destructive version with a const string. The macros are currently named utf8_to_bytes_overwrite() utf8_to_bytes_new_pv(), and utf8_to_bytes_temp_pv(). Better name ideas welcome These hide from the user the new enum parameter that the internal function is called with. Using @tonycoz idea of an extra parameter that the internal function sets to indicate that there is a need for freeing the converted string means the enum return type can be gotten rid of, and the functions return bool. Only the new-pv case actually needs that, so having the macros as the public interface hides this parameter except when needed. I had this parameter as a pointer to a bool. @tonycoz idea of making it point to the memory to be freed is a better idea, but in all cases in the core, that doesn't simplify things; they each need to do more things than free the memory when the function has created that memory. Included in this pr are the conversions to use the new function in all the core files. These are intended to not be actually merged at the same time as the rest, but showed how this would work out in practice. |
1c817a2
to
82fa339
Compare
utf8.h
Outdated
Perl_utf8_to_bytes_(aTHX_ s, l, (bool *) 1, \ | ||
PL_utf8_to_bytes_overwrite) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cast (bool *) 1
is unsafe.
Such casts are implementation defined, and may result in an invalid pointer, use (which the function call itself does) of which may result in undefined behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was too lazy to look up how to use INT2PTR, which would have been less effort than what it ended up costing me. I changed to use that and also to using U8* instead of bool. Presumably that is no longer UB
Outdent and reflow some comments and code in preparation for them to be moved out of the loop
This is for clarity. All this very-unlikely-to-be-used code was in the middle of what is really going on, creating a distraction.
The previous version did not make sure that it wasn't reading beyond the end of the buffer in all cases, and the first pass through the input string already ruled out it having most problems. Thus we don't need the full generality here of the macro UTF8_IS_DOWNGRADEABLE_START; and this simplifies things
These were misleading. On ASCII platforms, many calls to this function won't use the per-word algorithm. That's only done for long-enough strings.
The new name, s0, is used in more other places for this meaning, and is more descriptive.
This is an internal function, designed to be an extension of utf8_to_bytes(), with a slightly different API. This function just adds it and calls it from just utf8_to_bytes. Future commits will extend this API.
This variable should not be being changed by the function
The argument is currently unused. The macro is a public facing API that calls this function with the correct argument
This makes the next commit smaller
This causes this function to be able to both overwrite the input, and to instead create new memory. It changes bytes_from_utf8() to use this new capability instead of being a near duplication of the core code of this function. Prior to this commit, bytes_from_utf8() just allocated memory the size of the original string, and started copying into it. When it came to a sequence that wasn't convertible, it stopped, and freed up the copy. The new behavior has it checking first before the malloc that the string is convertible. That has the advantage that there is no malloc without being sure it will be useful; but the disadvantage that there is an extra pass through the input string, but that pass is per-word. The next commit will introduce another advantage. Thanks to Tony Cook for the 'free_me' idea
Prior to this commit, the size malloced was just the same as the length of the input string, which is a worst case scenario. This commit changes so the new pass through the input (introduced in the previous commit) also calculates the needed length. The additional cost of doing this is minimal. It has advantages on a very long string with lots of sequences that are convertible.
This is a non-destructive conversion of the input into native bytes, and with any new memory required set for destruction via SAVEFREEPV. This allows the caller to not have to be concerned at all if memory was created or not. A new macro is created that calls this internal function with the correct parameter to force this behavior.
The indentation for some lines in one function was only 2 spaces. Remove trailing blanks And This will minimize the diff listing in the next commit when white space isn't ignored.
82fa339
to
6d240a9
Compare
The function bytes_from_utf8() already does what what these two instances of duplicated code do.