(A source of confusion is that the argument is called "string" when it
should be called "octets_stream" or something similar. A string is an array of Unicode
characters, whereas an XML serialization is an array of bytes with no inherent meaning,
which become characters only once an encoding is chosen, and that is what
an XML document serialization is.)
Loading XML through the "string" argument does not return consistent results (and sometimes even fails) across the Cartesian product of:
- choice of the XML preamble's encoding (iso-8859-1 or utf-8)
- choice of the internal representation of the scalar in Perl (flag utf8::is_utf8 is on or off)
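The second axis, the internal representation, can be observed without XML::LibXML. The following is a minimal sketch assuming only core Perl; the variable names are illustrative and not taken from any library. It shows that utf8::upgrade and utf8::downgrade flip the utf8::is_utf8 flag while leaving the string, as Perl sees it, unchanged.

```perl
use strict;
use warnings;

# The same four characters twice; the e-acute is escaped so this file stays ASCII.
my $upgraded   = "caf\x{e9}";
my $downgraded = "caf\x{e9}";

utf8::upgrade($upgraded);        # stored internally as UTF-8, flag on
utf8::downgrade($downgraded);    # stored internally one byte per character, flag off

printf "upgraded:   is_utf8=%d\n", utf8::is_utf8($upgraded)   ? 1 : 0;
printf "downgraded: is_utf8=%d\n", utf8::is_utf8($downgraded) ? 1 : 0;

# At the Perl level the two scalars are the same string.
printf "equal=%s length=%d/%d\n",
    ( $upgraded eq $downgraded ? 'yes' : 'no' ),
    length $upgraded, length $downgraded;
```

Any code that treats these two scalars differently is looking at the storage format rather than at the string value.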
Below is a Perl program that uses this XSD schema to demonstrate the problem (I cannot drag and drop files on my platform):
- $XML_U8 has utf-8 preamble and utf8 internal representation; all works well
- $XML_L1 has latin-1 preamble and raw internal representation; all works well

However,
- $xml_l1 has latin-1 preamble and utf8 internal representation; XML::LibXML fails to (re)serialise it as it was and therefore XML::LibXML::Schema fails to validate it
- $xml_u8 has utf-8 preamble and raw internal representation; XML::LibXML cannot even load it as an XML document
Perl's behaviour should not depend on the internal representation of scalars (perldoc utf8).
Regards,
File validate.pl (UTF-8):
    #! /usr/bin/perl
    use utf8;
    use warnings;
    use strict;
    use feature 'say';
    use Carp;
    use Data::Dumper;
    use English qw( -no_match_vars );
    use File::Slurp qw( read_file );
    use Term::ANSIColor qw( colored );
    use XML::LibXML;
    use Readonly;

    Readonly my $XML_U8   => '<?xml version="1.0" encoding="utf-8"?><place>café</place>';
    Readonly my $XML_L1   => "<?xml version='1.0' encoding='iso-8859-1'?><place>caf\x{e9}</place>";
    Readonly my $XSD_FILE => 'schema.xsd';

    # Dump a scalar, showing every character above 0x7e as a highlighted [0xNN].
    sub dumper {
        my ($str) = @_;
        my $res = Dumper($str);
        $res =~ s{\$VAR1 \s+ = \s+ (.*) ;\s*$}{$1}sx;
        $res =~ s{([^\x00-\x7e])}{colored((sprintf '[0x%02x]', ord $1), 'bold red')}gsex;
        return $res;
    }

    # Report the exception raised by the last failed eval.
    sub validation_error {
        say STDERR "VALIDATION ERROR: " . Dumper($EVAL_ERROR);
    }

    # Load the schema, parse the XML string, reserialize it, then validate it.
    sub validate {
        my ($xsd_file, $xml_string) = @_;
        say STDERR sprintf "\nValidating [%s]...", dumper($xml_string);
        my ($schema, $document);
        eval { $schema = XML::LibXML::Schema->new( location => $xsd_file ); 1 } or return validation_error();
        eval { $document = XML::LibXML->load_xml( string => $xml_string ); 1 } or return validation_error();
        say STDERR sprintf "XML parsed document reserialized: %s\n", dumper($document->serialize());
        eval { $schema->validate($document); 1 } or return validation_error();
        my $utf8_flag = utf8::is_utf8($xml_string) ? 1 : 0;
        say STDERR sprintf "xml_string [%s] -- is_utf8? [%s] -- validates with respect to [%s]", dumper($xml_string), $utf8_flag, $xsd_file;
        return;
    }

    # The two documents as defined above.
    validate($XSD_FILE, $XML_U8);
    validate($XSD_FILE, $XML_L1);

    # Same latin-1 document, internal representation switched to utf8.
    my $xml_l1 = $XML_L1;
    utf8::upgrade($xml_l1);
    validate($XSD_FILE, $xml_l1);

    # Same utf-8 document, internal representation switched to raw.
    my $xml_u8 = $XML_U8;
    utf8::downgrade($xml_u8);
    validate($XSD_FILE, $xml_u8);

    exit 0;
Hello,
I like to hang out in cafés or pubs. Hence this XSD:
File schema.xsd (ASCII):
What matters is that the word café contains a non-ASCII character.
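This is a valid XML document for this XSD (at least xmllint --schema agrees on that); it is $XML_U8 in validate.pl:
    <?xml version="1.0" encoding="utf-8"?><place>café</place>
where the "é" character is encoded as the bytes 0xc3 0xa9.
This is another valid XML document, with a different choice of encoding in the preamble; it is $XML_L1 in validate.pl:
    <?xml version='1.0' encoding='iso-8859-1'?><place>café</place>
where the "é" character is encoded as the byte 0xe9.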
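The byte values quoted for the two encodings can be reproduced with the core Encode module. This is a small sketch rather than part of the original report; hex_octets is a hypothetical helper name introduced only for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: encode a character string and list the resulting octets in hex.
sub hex_octets {
    my ($encoding, $characters) = @_;
    my $octets = encode($encoding, $characters);
    return join ' ', map { sprintf '0x%02x', ord($_) } split //, $octets;
}

my $word = "caf\x{e9}";    # c, a, f, e-acute (escaped so this file stays ASCII)

for my $encoding ('UTF-8', 'iso-8859-1') {
    printf "%-10s %s\n", $encoding, hex_octets($encoding, $word);
}
```

The last character comes out as 0xc3 0xa9 under UTF-8 and as the single byte 0xe9 under iso-8859-1, so the same character string yields two different octet streams depending on the encoding named in the preamble.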
However, this code