HTML entities in element content confuses xpath #28

Open
mhsdef opened this issue Mar 14, 2016 · 6 comments

@mhsdef

mhsdef commented Mar 14, 2016

HTML entities in element content appear to confuse xpath. It either truncates the string at certain valid entities (e.g., &lt;) or blows up entirely.

Example failures:
_the_following_data_ |> SweetXml.xpath( ~x"//soapenv:Body/*[1]/*", message: ~x"name(.)", part: ~x"./text()")

<?xml version=\"1.0\" encoding=\"UTF-8\"?><soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><soapenv:Body><ns1:loginResponse soapenv:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:ns1=\"http://www.someplace.com/webservices/\"><loginReturn xsi:type=\"soapenc:string\" xmlns:soapenc=\"http://schemas.xmlsoap.org/soap/encoding/\">vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&lt;separator&gt;LfhRIM7U9B0=+_+Blahblah</loginReturn></ns1:loginResponse></soapenv:Body></soapenv:Envelope>
<?xml version=\"1.0\" encoding=\"UTF-8\"?><soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><soapenv:Body><ns1:loginResponse soapenv:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:ns1=\"http://www.someplace.com/webservices/\"><loginReturn xsi:type=\"soapenc:string\" xmlns:soapenc=\"http://schemas.xmlsoap.org/soap/encoding/\">vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&dlt;separator&xgt;LfhRIM7U9B0=+_+Blahblah</loginReturn></ns1:loginResponse></soapenv:Body></soapenv:Envelope>

Remove the ampersands in the loginReturn bodies and the query works.

@awetzel
Collaborator

awetzel commented Mar 14, 2016

Hello. First of all, SweetXml is just a wrapper around xmerl from the Erlang standard library.
Your issue seems to come from xmerl itself: yourstr |> to_char_list |> :xmerl_scan.string.

I will try to investigate when I find some time this week.

@awetzel
Collaborator

awetzel commented Mar 14, 2016

The error I get when running your command on the first xml occurs because your xpath //soapenv:Body/[1]/ is not correct.

Maybe you mean: //soapenv:Body/*[1] ? (*[1] instead of [1], and do not end your xpath with /.)

So here is what I found about your error:

  • first, your xpath is malformed, which leads to a :xmerl_xpath_parse badmatch exception
  • with a correct xpath (//soapenv:Body/*[1]), the first xml in your issue works well
  • for the second xml, there is another error: it contains two xml entities, &dlt; and &xgt;, which do not exist. This leads to an xmerl error: :error_scanning_entity_ref.
  • if I correct these entities (using &lt; and &gt;, or escaping the &: &amp;dlt; and &amp;xgt;), then the second xml works well too

(I tested it with erlang 18.1)
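
For anyone wanting to reproduce the second failure without SweetXml in the way, here is a minimal sketch that calls xmerl directly and catches whatever it throws. The sample document is shortened and made up for illustration, and the exact shape of the failure reason is an assumption that may differ between OTP versions, so the code catches every kind of failure rather than matching on a specific one.

```elixir
# Shortened, made-up sample containing an entity that is not defined anywhere.
bad = "<v>abc&dlt;def</v>"

result =
  try do
    # quiet: true asks xmerl not to print its own diagnostics.
    {doc, _rest} = bad |> String.to_charlist() |> :xmerl_scan.string(quiet: true)
    {:ok, doc}
  catch
    # Catch any kind (exit, error, throw) since the exact kind is an assumption.
    kind, reason -> {:error, {kind, reason}}
  end

IO.inspect(result, label: "scan result")
```

Wrapping the call this way at least lets the caller decide what to do with a malformed document instead of crashing the calling process.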

@mhsdef
Author

mhsdef commented Mar 14, 2016

Hi there!

Sorry, I should have wrapped the xpath the first time with backticks. GH applied markdown. I've corrected so the xpath shows as intended in the OP.

@mhsdef
Author

mhsdef commented Mar 14, 2016

The behavior I see with the first example is truncation of the vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&lt;separator&gt;LfhRIM7U9B0=+_+Blahblah string at &lt;. I get back vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk instead of the desired whole thing.

The second example, yeah: the parser is really unhappy because it thinks it sees an entity, but an invalid one. I'm not sure what (if anything) we can do, but I added that example because the failure felt non-graceful, and it is problematic if your data has random characters that happen to look like an entity.

@awetzel
Collaborator

awetzel commented Mar 14, 2016

Hi :)
ok I understand your issue.
Again, SweetXml is only a wrapper around xmerl, and xmerl splits the text content into a list of text() nodes around xml entities.

Still, the string modifier of SweetXml (/s) joins the text nodes to help you handle this case. So, after an import SweetXml:

iex> xml |> xpath( ~x"//soapenv:Body/*[1]/*", message: ~x"name(.)", part: ~x"./text()"s) 
%{message: 'loginReturn', part: "vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk<separator>LfhRIM7U9B0=+_+Blahblah"}

What you observed is this: when the list specifier (/l) is not used and the result contains multiple nodes, only the first one is returned. That is why you got only the first of the several text() nodes produced by the xmerl parsing.
To highlight this, here is another way of handling this kind of input:

iex> xml |> xpath( ~x"//soapenv:Body/*[1]/*", message: ~x"name(.)", 
                              part: ~x"./text()"l |> transform_by(&Enum.join/1))
%{message: 'loginReturn',
  part: "vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk<separator>LfhRIM7U9B0=+_+Blahblah"}
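
To look at the underlying node list without SweetXml, here is a small sketch that calls xmerl directly. The document is a made-up miniature of the one in the issue, and reading the value out of the xmlText record by tuple position is an assumption based on the record layout in xmerl.hrl (parents, pos, language, value, type).

```elixir
# Made-up miniature of the loginReturn element from the issue.
xml = "<r><v>abc&lt;sep&gt;def</v></r>"
{doc, _rest} = xml |> String.to_charlist() |> :xmerl_scan.string()

# :xmerl_xpath.string/2 returns the matching nodes, here xmlText records.
text_nodes = :xmerl_xpath.string(~c"//v/text()", doc)

# Assumed record layout: {:xmlText, parents, pos, language, value, type},
# so the value (chardata) sits at tuple index 4.
values =
  Enum.map(text_nodes, fn node ->
    node |> elem(4) |> IO.chardata_to_string()
  end)

IO.inspect(values, label: "text() nodes")
IO.inspect(Enum.join(values), label: "joined")
```

If xmerl splits the text as described above, the first inspect shows several entries, and taking only the first one reproduces the truncation reported in the issue. The joined value is the full string either way.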

An XML text node containing a & character that is neither inside a CDATA section nor escaped as &amp;, and that does not begin a known XML entity, is malformed according to the XML spec. So the xmerl behavior is not faulty.
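
If the producer of the XML cannot be fixed, one possible workaround is to escape stray ampersands before handing the string to the parser. The sketch below is only that: AmpersandEscaper and escape_stray/1 are hypothetical names, not part of SweetXml or xmerl. It assumes the document uses only the five predefined XML entities and numeric character references, so it would wrongly escape entities declared in a DTD, and it would also rewrite ampersands inside CDATA sections.

```elixir
defmodule AmpersandEscaper do
  # Hypothetical helper, not part of SweetXml or xmerl.

  # Replaces every & that does not start a predefined entity
  # (&amp; &lt; &gt; &quot; &apos;) or a numeric character reference
  # (&#38; or &#x26;) with &amp;.
  def escape_stray(xml) when is_binary(xml) do
    stray = ~r/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/
    Regex.replace(stray, xml, fn _match -> "&amp;" end)
  end
end

raw = "<v>abc&dlt;sep&xgt;def &lt;ok&gt; a & b</v>"
escaped = AmpersandEscaper.escape_stray(raw)
IO.puts(escaped)
```

After this step the unknown &dlt; and &xgt; sequences reach the parser as literal text rather than entity references, which matches the "escaping &" fix mentioned earlier in the thread.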

Both behaviors can still be cumbersome, but since they are standard Erlang xmerl behaviors, SweetXml cannot bypass them without becoming a complete XML parser and xpath implementation itself.

Still, I think working around them with the "sigil with modifiers" approach is sufficient.

@Shakadak
Member

Shakadak commented Feb 3, 2021

Hi, is it still relevant to keep this open?
