piconv bug of decoding UTF-16 (with fix) #19

Closed
numa2666 opened this issue Mar 26, 2014 · 2 comments
@numa2666

Hi,

I've found a bug in piconv 2.5, included in the perl 5.18.2 distribution.
piconv cannot handle UTF-16 input correctly, especially when UTF-16BE or
UTF-16LE is specified. The cause is that piconv does not slurp the input
(which is necessary when reading UTF-16/UTF-32) and instead tries to read
it line by line, which makes the conversion fail. The BE/LE variants must
be treated the same as plain UTF-16 and UTF-32. piconv also decides
whether slurping is needed by checking the $to encoding instead of the
$from encoding, which is wrong. I've included a patch below.
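
To illustrate why line-by-line reading breaks UTF-16, here is a minimal sketch (not piconv's actual code; the sample text and variable names are made up for illustration). In UTF-16LE a newline is the byte pair 0A 00, so a reader that splits after the 0A byte leaves the 00 byte at the start of the next chunk:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text, encoded as UTF-16LE octets: 61 00 62 00 0A 00 63 00 ...
my $text   = "ab\ncd\n";
my $octets = encode('UTF-16LE', $text);

# Read the octets "line by line" from an in-memory handle. Each read stops
# after a 0x0A byte, i.e. in the middle of the 2-byte newline code unit.
open my $fh, '<', \$octets or die "Can't open in-memory handle: $!";
binmode $fh;
my @lines = <$fh>;
close $fh;

# Decode each chunk separately. The eval guards against a decoder that
# dies on the misaligned input instead of returning wrong characters.
my $by_line    = eval { join '', map { decode('UTF-16LE', $_) } @lines };
my $by_line_ok = ( defined $by_line && $by_line eq $text ) ? 1 : 0;

# Decode the whole input at once (slurp): code units stay intact.
my $slurped    = decode('UTF-16LE', $octets);
my $slurped_ok = ( $slurped eq $text ) ? 1 : 0;

print "slurped round-trips: ", ( $slurped_ok ? "yes" : "no" ), "\n";
print "by-line round-trips: ", ( $by_line_ok ? "yes" : "no" ), "\n";
```

Only the slurped decode should reproduce the original text; the chunked decode sees an odd-length first chunk and misaligned code units after it.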

The following patch also includes some minor fixes.

  • piconv fails to detect the locale-specific encoding if the LANG
    or LC_* environment variables are not set, or if they are set but
    have no encoding field. This is quite a common situation, and using
    Encode::Locale's encoding value instead of the environment variables
    is better and more portable, especially to non-Unix systems.
  • The help message is fixed to display the backslash-escaped string
    correctly.
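
As a rough illustration of the first point, the sketch below pulls the encoding field out of a POSIX-style locale name. The helper `codeset_from_locale` is hypothetical (it is not part of piconv or Encode) and assumes the usual `language_TERRITORY.codeset@modifier` layout:

```perl
use strict;
use warnings;

# Hypothetical helper: return the codeset part of a locale name such as
# "ja_JP.UTF-8", or undef when the name is unset or has no encoding field.
sub codeset_from_locale {
    my ($locale) = @_;
    return undef unless defined $locale;
    return $locale =~ /\.([^@]+)/ ? $1 : undef;
}

for my $locale ( 'ja_JP.UTF-8', 'de_DE.ISO-8859-1@euro', 'en_US', 'C', undef ) {
    my $codeset = codeset_from_locale($locale);
    printf "%-24s => %s\n",
        defined $locale  ? $locale  : '(unset)',
        defined $codeset ? $codeset : '(no encoding field)';
}
```

Values like `en_US` or `C`, or an unset variable, give no usable encoding name, which is the situation described above.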

Hope this helps,

Toshinori Numata
http://blog.livedoor.jp/numa2666/ (in Japanese)

--- c:/strawberry/perl/bin/piconv   Tue Jan  7 19:19:14 2014
+++ piconv  Sun Mar 23 19:30:29 2014
@@ -5,6 +5,7 @@
 use strict;
 use Encode ;
 use Encode::Alias;
+use Encode::Locale;
 my %Scheme =  map {$_ => 1} qw(from_to decode_encode perlio);

 use File::Basename;
@@ -34,7 +35,7 @@

 $Opt{help} and help();
 $Opt{list} and list_encodings();
-my $locale = $ENV{LC_CTYPE} || $ENV{LC_ALL} || $ENV{LANG};
+my $locale = $Encode::Locale::ENCODING_LOCALE;
 defined $Opt{resolve} and resolve_encoding($Opt{resolve});
 $Opt{from} || $Opt{to} || help();
 my $from = $Opt{from} || $locale or help("from_encoding unspecified");
@@ -68,14 +69,14 @@
 EOT
 }

-my %use_bom = map { $_ => 1 } qw/UTF-16 UTF-32/;
+my %use_bom = map { $_ => 1 } qw/UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE/;

 # we do not use <> (or ARGV) for the sake of binmode()
 @ARGV or push @ARGV, \*STDIN;

 unless ( $scheme eq 'perlio' ) {
     binmode STDOUT;
-    my $need2slurp = $use_bom{ find_encoding($to)->name };
+    my $need2slurp = $use_bom{ find_encoding($from)->name };
     for my $argv (@ARGV) {
         my $ifh = ref $argv ? $argv : undef;
    $ifh or open $ifh, "<", $argv or warn "Can't open $argv: $!" and next;
@@ -169,7 +170,7 @@
   -D,--debug          show debug information
   -S,--scheme scheme  use the scheme for conversion
 Those are handy when you can only see ASCII characters:
-  -p,--perlqq         transliterate characters missing in encoding to \x{HHHH}
+  -p,--perlqq         transliterate characters missing in encoding to \\x{HHHH}
                       where HHHH is the hexadecimal Unicode code point
   --htmlcref          transliterate characters missing in encoding to &#NNN;
                       where NNN is the decimal Unicode code point
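
The slurp decision changed by the patch can be tried in isolation with a small sketch like the following. The helper `needs_slurp` is hypothetical and only wraps the `%use_bom` lookup from the patch; `find_encoding` and `->name` are the regular Encode calls already used by piconv:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Same table as in the patched piconv: BE/LE variants are included.
my %use_bom = map { $_ => 1 }
    qw/UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE/;

# Hypothetical helper: true if input in this encoding must be slurped.
# Unknown encoding names yield 0 instead of dying on ->name.
sub needs_slurp {
    my ($encoding) = @_;
    my $enc = find_encoding($encoding) or return 0;
    return $use_bom{ $enc->name } ? 1 : 0;
}

# Converting from UTF-16LE to utf8: checking $to (old behaviour) says
# "no slurp", checking $from (patched behaviour) says "slurp".
my ( $from, $to ) = ( 'UTF-16LE', 'utf8' );
printf "check on \$to:   %d\n", needs_slurp($to);
printf "check on \$from: %d\n", needs_slurp($from);
```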
@dankogai
Owner

Thank you for your work. I am sorry, but I cannot take this patch. Encode is a core module, so it cannot depend on a non-core module, and Encode::Locale is not in core.

I'll come up with another solution…

Dan the Maintainer Thereof

@dankogai
Owner

Encode 2.58 released -- with your fixes except for the Encode::Locale dependency.

Dan the Maintainer Thereof

jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Jul 16, 2014
2.62 2014/05/31 12:12:39
! Encode.pm
  s/2013/2014/ on COPYRIGHT section
! Byte/Makefile.PL
  CN/Makefile.PL
  EBCDIC/Makefile.PL
  Encode/Makefile_PL.e2x
  Encode.xs
  JP/Makefile.PL
  KR/Makefile.PL
  Symbol/Makefile.PL
  TW/Makefile.PL
  bin/enc2xs
  Merged from perl.git: "Fix Encode 2.60 with g++"
  http://perl5.git.perl.org/perl.git/commit/89c2544cd3

2.61 2014/05/31 09:48:48
! bin/piconv
  Applied: piconv nit
  + Better error handling when the encoding name is nonexistent
  Message-Id: <[email protected]>
! Encode.xs
  Applied: RT #95466:
   fallback definition of SvIsCOW() is wrong
   (and hence breaks on 5.8.2 and earlier)
  https://rt.cpan.org/Ticket/Display.html?id=95466

2.60 2014/04/29 16:25:06
! Byte/Makefile.PL
  CN/Makefile.PL
  EBCDIC/Makefile.PL
  Encode/Makefile_PL.e2x
  Encode/encode.h
  JP/Makefile.PL
  KR/Makefile.PL
  Symbol/Makefile.PL
  TW/Makefile.PL
  bin/enc2xs
  encengine.c
  Applied: more Fix Windows build (of Encode) with VC++ 6.0
  http://perl5.git.perl.org/perl.git/commit/9e9002efd1609c7d154f98af43a026320df7582c
! Unicode/Unicode.xs
  Addressed: sign extension issue found by Coverity #21
  dankogai/p5-encode#21
! Encode/encode.h Encode.xs Unicode/Unicode.xs
  removed #define U8 U8
  https://rt.perl.org/Ticket/Display.html?id=121554
  http://perl5.git.perl.org/perl.git/commit/2f2b4ff2c154a8e461857f2e82cb815c238d0d94

2.59 2014/04/06 17:23:55
! Byte/Makefile.PL
  CN/Makefile.PL
  EBCDIC/Makefile.PL
  Encode.pm
  Encode.xs
  Encode/Makefile_PL.e2x
  JP/Makefile.PL
  KR/Makefile.PL
  Symbol/Makefile.PL
  TW/Makefile.PL
  bin/enc2xs
  Restored the signature of Encode_XSEncoding() to address RT#94478
  * While dankogai/p5-encode#20
    pulls the symnames via argument thus breaks the compatibility
    with Encode::XX modules with *.ucm, the restored version
    pulls the symnames via enc->name[0] so the added 2nd argument
    is no longer needed.
  https://rt.cpan.org/Public/Bug/Display.html?id=94478

2.58 2014/03/28 02:37:42
! bin/piconv
  Addressed: piconv bug of decoding UTF-16 (with fix)
  dankogai/p5-encode#19
! Byte/Makefile.PL
  CN/Makefile.PL
  EBCDIC/Makefile.PL
  Encode.pm
  Encode.xs
  Encode/Makefile_PL.e2x
  JP/Makefile.PL
  KR/Makefile.PL
  Symbol/Makefile.PL
  TW/Makefile.PL
  bin/enc2xs
  Pulled: Remap symname [RT #94221]
  dankogai/p5-encode#20
  https://rt.cpan.org/Public/Bug/Display.html?id=94221
! Encode.pm
  Pulled: [doc] clarify that CHECK coderefs return octets #18
  dankogai/p5-encode#18
jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Oct 11, 2014