Problems with exposing is_utf8() #20419
Comments
I disagree. utf8::upgrade is a string function, so it's appropriate to coerce its argument into a string in order to operate on it, much like |
If $foo is undef, $foo++ adds 1 to something unknown, which can be taken to be
In any case the execution warning should trigger the developer, the logs, the
If $foo is an undef string marking the end of a while read() loop from a file
Reasoning: utf8::upgrade() acts on the internal representation of scalars in
If a string is coded in binary and undef, why should its equivalent coded in
open(my $fh, "<:encoding(UTF-8)", $filename) then the $line would have been undef. |
Regarding this I tried an experiment to see if utf8::upgrade and
use utf8;
use Data::Dumper;
my $vanilla_name = "Sébastien";
my $combining_name = "Se\x{0301}bastien"; # U+0301 COMBINING ACUTE ACCENT
print " vanilla_name: " . Dumper($vanilla_name);
print "combining_name: " . Dumper($combining_name);
utf8::downgrade($vanilla_name);
utf8::downgrade($combining_name);
does not work as I would have fantasized:
So in any case I expect utf8::upgrade and utf8::downgrade to work like this:
and this breaks for undef. |
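For readers following the experiment above, here is a minimal sketch of what utf8::upgrade and utf8::downgrade do and do not do. It assumes only the documented utf8 functions; the variable names are illustrative. It shows that upgrading changes the internal flag but not the logical string, and that a string containing U+0301 cannot be downgraded, because these functions change storage only and never normalize or compose characters.

```perl
use strict;
use warnings;

# "\x{e9}" is below 0x100, so this literal starts out in the native
# (downgraded) representation.
my $latin1 = "S\x{e9}bastien";
my $copy   = $latin1;
utf8::upgrade($copy);

# The logical character sequence is unchanged; only storage differs.
print "equal after upgrade\n" if $copy eq $latin1;
print "flags differ\n"
    if utf8::is_utf8($copy) && !utf8::is_utf8($latin1);

# U+0301 COMBINING ACUTE ACCENT has no single-octet form, so the
# downgrade cannot succeed. Passing a true second argument (FAIL_OK)
# makes utf8::downgrade return false instead of dying.
my $combining = "Se\x{0301}bastien";
my $ok = utf8::downgrade($combining, 1);
print "downgrade not possible\n" unless $ok;

# Still ten characters: "e" and U+0301 were not merged into one.
print length($combining), "\n";   # prints 10
```

Turning "e" plus U+0301 into a single "é" is a normalization step (for example NFC from Unicode::Normalize), which is a separate operation from upgrading or downgrading.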
On Thu, 20 Oct 2022 at 12:20, Dan Book ***@***.***> wrote:
I disagree. utf8::upgrade is a string function, so it's appropriate to
coerce its argument into a string in order to operate on it, much like
$foo++ would define an undef variable.
length is a string function, and it does not coerce undef to the empty
string, it returns undef.
Personally I am surprised by this, I would not have guessed that
utf8::upgrade() would convert its argument to a string, including refs. I
guess it makes sense from an internals point of view. But it strikes me as
quite odd from the perl level; I would expect the XS glue to check the var
is defined and not a ref at the least.
The docs do not explicitly state this happens, they refer to strings only:
(Since Perl v5.8.0) Converts in-place the internal representation of
the string from an octet sequence in the native encoding (Latin-1 or
EBCDIC) to UTF-8. The logical character sequence itself is
unchanged. If *$string* is already upgraded, then this is a no-op.
Returns the number of octets necessary to represent the string as
UTF-8.
I would consider it pretty reasonable to see "if $string is a reference or
undefined then this is a no-op".
Seems like a bug to me.
cheers,
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
length is an exception, and also does not operate in place. All other string functions treat their argument as empty string if operating on undef (and warn about it) |
On Sat, 22 Oct 2022 at 11:53, Dan Book ***@***.***> wrote:
On Thu, 20 Oct 2022 at 12:20, Dan Book *@*.***> wrote: I disagree.
utf8::upgrade is a string function, so it's appropriate to coerce its
argument into a string in order to operate on it, much like $foo++ would
define an undef variable.
length is a string function, and it does not coerce undef to the empty
string, it returns undef.
length is an exception, and also does not operate in place. All other
string functions treat their argument as empty string if operating on undef
(and warn about it)
Well, as I said /personally/ I would consider this a bug, but if we are not
going to consider it a bug we should document it as I think many people
would find the behavior strange.
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
On Sat, Oct 22, 2022 at 03:02:18AM -0700, Yves Orton wrote:
Well, as I said /personally/ I would consider this a bug, but if we are not
going to consider it a bug we should document it as I think many people
would find the behavior strange.
Its behaviour is certainly inconsistent with other common "modify in
place" functions and length():
sub d { printf "%9s %s\n", defined $_[0] ? "defined" : "undefined", $_[1] }
$y1 = length $x1; d($x1, "length");
chomp $x2; d($x2, "chomp");
chop $x3; d($x3, "chop");
utf8::upgrade $x4; d($x4, "utf8::upgrade");
utf8::downgrade $x5; d($x5, "utf8::downgrade");
outputs:
undefined length
undefined chomp
undefined chop
defined utf8::upgrade
undefined utf8::downgrade
…--
All wight. I will give you one more chance. This time, I want to hear
no Wubens. No Weginalds. No Wudolf the wed-nosed weindeers.
-- Life of Brian
|
I submitted a PR to fix this for comments. Note I don't really know what I'm doing with this area of the code. |
Not that this shouldn’t change, but @sblondeel, did you report the other bugs/issues you’ve found? In my own experience, utf8::upgrade is useful in testing, but generally some encode/decode combination leads to a happier place. FWIW. |
This fixes GH Perl#20419
Hi, I already reported shlomif/perl-XML-LibXML#72
Motivated by your interest, I tried to remember/reproduce other such problems I
Either Excel::Writer::XLSX got corrected or the bug occurred with XLS?
Anyway the following still looks buggy to me:
#! /usr/bin/perl
use warnings;
use strict;
use utf8;
use feature 'say';
use URI;
my $u = URI->new("http://www.perl.com");
my $label = "Sébastien";
utf8::upgrade($label); # useless, to make it clearer
$u->query_keywords($label);
say " UTF-8: " . $u->as_string;
utf8::downgrade($label);
$u->query_keywords($label);
say "binary: " . $u->as_string;
Output:
Apparently the URI module infers the encoding to use to URL-encode non-ASCII
Reading https://en.wikipedia.org/wiki/Percent-encoding leads me to believe the
But what is worrisome is that something internal to Perl can be shown outside
Oh no, the author was aware of this, look: /usr/share/perl5/URI/Escape.pm
# XXX FIXME escape_char is buggy as it assigns meaning to the string's storage format.
sub escape_char {
# Old versions of utf8::is_utf8() didn't properly handle magical vars (e.g. $1).
# The following forces a fetch to occur beforehand.
my $dummy = substr($_[0], 0, 0);
if (utf8::is_utf8($_[0])) {
my $s = shift;
utf8::encode($s);
unshift(@_, $s);
}
return join '', @URI::Escape::escapes{split //, $_[0]};
}
...which leads me to believe the utf8::is_utf8 function should never be allowed
Regards, |
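One way to sidestep the storage-format dependency described above is to encode to UTF-8 octets explicitly before percent-escaping, so the result never depends on utf8::is_utf8. The following is only a sketch under that assumption: percent_encode_utf8 is a hypothetical helper written for this illustration, not part of URI or URI::Escape, and it escapes everything outside the RFC 3986 unreserved set.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: always treat the input as characters, encode it
# to UTF-8 octets, then percent-escape each octet that is not in the
# RFC 3986 "unreserved" set.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $octets;
}

my $label = "S\x{e9}bastien";

my $upgraded = $label;
utf8::upgrade($upgraded);

my $downgraded = $label;
utf8::downgrade($downgraded);

# Same logical string, different internal storage, identical result.
my $from_upgraded   = percent_encode_utf8($upgraded);
my $from_downgraded = percent_encode_utf8($downgraded);

print "$from_upgraded\n";     # prints S%C3%A9bastien
print "$from_downgraded\n";   # prints S%C3%A9bastien
```

The design choice here is that the caller, not the string's internal flag, decides the encoding; a caller that really wants Latin-1 escapes would encode to that explicitly instead.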
@sblondeel Indeed, lots of XS modules—and even Perl built-ins—expose Perl strings’ internal representation to the outside world. The problem is that fixing these cases may break existing applications. This was actually the topic of my presentation at the last Perl/Raku conference: https://www.youtube.com/watch?v=yH5IyYyvWHU The solution I’ve wondered about is to repurpose 2 bits from the refcount in order to store string state:
… but I’ve not gotten much further than that. |
While migrating from an old system I had to add utf8::upgrade($line) after each
$line = <$fh>
type code snippet.
Module: utf8
Description
If a scalar $var is undef, utf8::upgrade($var) should not turn it into q{}.
Steps to Reproduce
Output
Expected output
Expected behavior
An undef value should still be undef after being upgraded to the utf8 internal
Perl representation of scalars.
undef is a special value of scalars, unrelated to any encoding scheme.
This breaks loops like
Note: the utf8::upgrade call is itself a workaround for other Perl
bugs/module bugs I have reported.
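Until the behaviour is settled, a guard around the call keeps undef intact in read loops of the kind described above. This is a minimal sketch, not the fix proposed for Perl itself: upgrade_if_defined is a hypothetical helper name, and the in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the utf8 module: upgrade only
# defined, non-reference scalars, so the undef returned by readline
# at end of file stays undef.
sub upgrade_if_defined {
    return unless defined $_[0] && !ref $_[0];
    utf8::upgrade($_[0]);   # $_[0] aliases the caller's variable
    return;
}

# In-memory file used only to keep the sketch self-contained.
my $data = "first\nsecond\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my @lines;
while (1) {
    my $line = <$fh>;
    upgrade_if_defined($line);
    last unless defined $line;   # end of file is still detectable
    chomp $line;
    push @lines, $line;
}
close $fh;

print scalar(@lines), " lines read\n";   # prints "2 lines read"
```

Calling utf8::upgrade($line) directly in this loop, on a perl that coerces undef to the empty string, would make the defined test succeed forever and the loop would never end.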
Perl configuration