XML::LibXML->load_xml( string => ...) fails to build a correct object depending on Perl's scalar internal representation #72

Open
sblondeel opened this issue Sep 20, 2022 · 0 comments

Hello,

I like to hang out in cafés or pubs. Hence this XSD:

File schema.xsd (ASCII):

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="place">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="caf&#xe9;" />
        <xsd:enumeration value="pub" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>

What matters is that the word café contains a non-ASCII character.

This is a valid XML document for this XSD (at least xmllint --schema agrees on that):

<?xml version="1.0" encoding="utf-8"?><place>café</place>

where the "é" character is coded as bytes 0xc3 0xa9:

$ hexdump -C doc-u8.xml 
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 75 74  |.0" encoding="ut|
00000020  66 2d 38 22 3f 3e 3c 70  6c 61 63 65 3e 63 61 66  |f-8"?><place>caf|
00000030  c3 a9 3c 2f 70 6c 61 63  65 3e 0a                 |..</place>.|
0000003b

This is another valid XML document with another choice of encoding in the preamble:

<?xml version="1.0" encoding="iso-8859-1"?><place>café</place>

where the "é" character is coded as byte 0xe9:

$ hexdump -C doc-l1.xml 
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 69 73  |.0" encoding="is|
00000020  6f 2d 38 38 35 39 2d 31  22 3f 3e 3c 70 6c 61 63  |o-8859-1"?><plac|
00000030  65 3e 63 61 66 e9 3c 2f  70 6c 61 63 65 3e 0a     |e>caf.</place>.|
0000003f

However, this code

$document = XML::LibXML->load_xml( string => $xml_string );
print $document->serialize();

(A source of confusion is that the argument is called "string" when it
should be called "octet_stream" or something similar. A string is an array of
Unicode characters, whereas an XML serialization is an array of bytes with no
inherent meaning, which only become characters once an encoding is chosen.)

does not return consistent results (and even fails sometimes), considering the Cartesian product of:

  • choice of the XML preamble's encoding (iso-8859-1 or utf-8)
  • choice of the internal representation of the scalar in Perl (flag utf8::is_utf8 is on or off)
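
To make the second axis concrete, here is a minimal sketch using only core Perl (the variable names are illustrative and not taken from the program below). It shows that upgrading a scalar changes its internal storage and the utf8::is_utf8 flag, while the string itself, as seen from Perl code, is unchanged:

```perl
use strict;
use warnings;

# "caf\x{e9}" is a 4-character string; \x{e9} is U+00E9 (é).
# Built this way, Perl stores it with one byte per character (flag off).
my $raw = "caf\x{e9}";

# Copy it and switch the copy to the internal UTF-8 representation.
# The characters are the same; only the storage and the flag change.
my $upgraded = $raw;
utf8::upgrade($upgraded);

printf "same string?  %s\n", ( $raw eq $upgraded ? 'yes' : 'no' );
printf "lengths:      %d and %d\n", length $raw, length $upgraded;
printf "is_utf8 flag: %d and %d\n",
  ( utf8::is_utf8($raw)      ? 1 : 0 ),
  ( utf8::is_utf8($upgraded) ? 1 : 0 );
```

Both scalars compare equal and have length 4; only the flag differs, which is why code that behaves differently for the two is surprising.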

Below is a Perl program that uses the XSD schema above and demonstrates this (inlined, because I cannot drag and drop attachments on my platform):

  • $XML_U8 has utf-8 preamble and utf8 internal representation; all works well
  • $XML_L1 has latin-1 preamble and raw internal representation; all works well

However,

  • $xml_l1 has latin-1 preamble and utf8 internal representation; XML::LibXML fails to (re)serialise it as it was and therefore XML::LibXML::Schema fails to validate it
  • $xml_u8 has utf-8 preamble and raw internal representation; XML::LibXML cannot even load it as an XML document

Perl's behaviour should not depend on the internal representation of scalars (perldoc utf8).
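
A possible caller-side workaround, sketched under the assumption that load_xml behaves predictably when it receives octets already encoded to match the XML declaration (this has not been verified against any particular XML::LibXML version). The helper xml_octets is hypothetical and not part of XML::LibXML; it relies only on the core Encode module:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not part of XML::LibXML): turn a character string
# into octets in the encoding named by its own XML declaration.
sub xml_octets {
  my ($chars) = @_;
  my ($encoding) = $chars =~ m{\A<\?xml[^>]*?\bencoding=["']([^"']+)["']};
  $encoding //= 'UTF-8';    # XML default when no encoding is declared
  return encode($encoding, $chars);
}

my $l1 = "<?xml version='1.0' encoding='iso-8859-1'?><place>caf\x{e9}</place>";
my $u8 = "<?xml version='1.0' encoding='utf-8'?><place>caf\x{e9}</place>";

# Reproduce the two problematic internal representations from the report.
utf8::upgrade($l1);
utf8::downgrade($u8);

for my $xml ( $l1, $u8 ) {
  my $octets = xml_octets($xml);
  my ($text) = $octets =~ m{<place>(.*)</place>}s;
  # Hex dump of the element content as it would be handed to the parser.
  printf "%s\n", unpack 'H*', $text;
}
# prints 636166e9 (iso-8859-1), then 636166c3a9 (utf-8)
```

The resulting octets no longer depend on how Perl happened to store the original scalar, and could then be passed as string => $octets. This only sidesteps the inconsistency on the caller's side; it does not address the behaviour reported here.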

Regards,

File validate.pl (UTF-8):

#! /usr/bin/perl
use utf8;
use warnings;
use strict;
use feature 'say';
use Carp;
use Data::Dumper;
use English qw( -no_match_vars );
use File::Slurp qw( read_file );
use Term::ANSIColor qw( colored );
use XML::LibXML;
use Readonly;

Readonly my $XML_U8   => '<?xml version="1.0" encoding="utf-8"?><place>café</place>';
Readonly my $XML_L1   => "<?xml version='1.0' encoding='iso-8859-1'?><place>caf\x{e9}</place>";
Readonly my $XSD_FILE => 'schema.xsd';

sub dumper {
  my ($str) = @_;
  my $res = Dumper($str);
  $res =~ s{\$VAR1 \s+ = \s+ (.*) ;\s*$}{$1}sx;
  $res =~ s{([^\x00-\x7e])}{colored((sprintf '[0x%02x]', ord $1), 'bold red')}gsex;
  return $res;
}

sub validation_error {
  say STDERR "VALIDATION ERROR: " . Dumper($EVAL_ERROR);
}

sub validate {
  my ($xsd_file, $xml_string) = @_;

  say STDERR sprintf "\nValidating [%s]...", dumper($xml_string);

  my ($schema, $document);

  eval { $schema = XML::LibXML::Schema->new( location => $xsd_file ); 1 } or return validation_error();

  eval { $document = XML::LibXML->load_xml( string => $xml_string ); 1 } or return validation_error();
  say STDERR sprintf "XML parsed document reserialized: %s\n", dumper($document->serialize());

  eval { $schema->validate($document); 1 } or return validation_error();

  my $utf8_flag = utf8::is_utf8($xml_string) ? 1 : 0;

  say STDERR sprintf "xml_string [%s] -- is_utf8? [%s] -- validates with respect to [%s]", dumper($xml_string), $utf8_flag, $xsd_file;
  return;
}

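# utf-8 declaration, internal UTF-8 representation: works.
validate($XSD_FILE, $XML_U8);

# iso-8859-1 declaration, raw (one byte per character) representation: works.
validate($XSD_FILE, $XML_L1);

# Same characters as $XML_L1, but stored internally as UTF-8.
my $xml_l1 = $XML_L1;
utf8::upgrade($xml_l1);
validate($XSD_FILE, $xml_l1);

# Same characters as $XML_U8, but stored internally as single bytes.
my $xml_u8 = $XML_U8;
utf8::downgrade($xml_u8);
validate($XSD_FILE, $xml_u8);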

exit 0;