Illegal character entity using XMLStreamReader on value encoded by external service #165

Closed
Magmaruss opened this issue Feb 6, 2023 · 4 comments

@Magmaruss
Contributor

Hello.
While reading the response from an external service, I ran into a problematic value. One of the XML elements contains an emoji encoded as two character references, one per UTF-16 surrogate, instead of a single reference to the code point. XMLStreamReader rejects this and throws an exception.

Provided value (throws exception):
Merry Christmas &#55356;&#57221;

The same value encoded as a single character reference (works correctly):
Merry Christmas 🎅
Merry Christmas &#x1F385;

Exception:

Exception in thread "main" com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xd83c
 at [row,col {unknown-source}]: [1,31]
	at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:634)
	at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:504)
	at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2465)
	at com.ctc.wstx.sr.StreamScanner.validateChar(StreamScanner.java:2394)
	at com.ctc.wstx.sr.StreamScanner.resolveCharEnt(StreamScanner.java:2378)
	at com.ctc.wstx.sr.StreamScanner.fullyResolveEntity(StreamScanner.java:1526)
	at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2838)
	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1122)
	at com.test.Test.main(Test.java:40)

Reproduction code:

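import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import com.ctc.wstx.stax.WstxInputFactory;

public class Test {
    public static void main(String[] args) throws XMLStreamException {
        // Value before encoding: Merry Christmas 🎅 (if no emoji font installed, the last character should be displayed as santa claus emoji U+1F385)
        final String inputElement = "<value>Merry Christmas &#55356;&#57221;</value>";
        // Use an explicit charset so the bytes do not depend on the platform default
        final ByteArrayInputStream is = new ByteArrayInputStream(inputElement.getBytes(StandardCharsets.UTF_8));
        WstxInputFactory wstxInputFactory = new WstxInputFactory();
        wstxInputFactory.configureForSpeed();
        final XMLStreamReader reader = wstxInputFactory.createXMLStreamReader(is);

        while (reader.hasNext()) {
            // Per the stack trace above, the exception surfaces from next()
            reader.next();
            if (reader.getEventType() == XMLStreamConstants.CHARACTERS) {
                reader.getTextLength();
                // or reader.getText(), reader.getTextStart(), reader.getTextCharacters()
            }
        }
    }
}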
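
For reference, the two decimal references in the reproduction (55356 and 57221) are the UTF-16 high and low surrogates 0xD83C and 0xDF85. The sketch below shows how they combine into the single code point U+1F385 using only JDK methods. The class name `SurrogatePairDemo` and the method `combine` are made up for illustration and are not part of Woodstox.

```java
// Hypothetical demo class; only JDK APIs are used.
final class SurrogatePairDemo {
    /** Combines two UTF-16 surrogate code units into one Unicode code point. */
    static int combine(int high, int low) {
        // Reject values outside the 16-bit range before narrowing to char
        if ((high >>> 16) != 0 || (low >>> 16) != 0
                || !Character.isSurrogatePair((char) high, (char) low)) {
            throw new IllegalArgumentException("not a high/low surrogate pair");
        }
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        int cp = combine(55356, 57221); // 0xD83C, 0xDF85
        System.out.println("U+" + Integer.toHexString(cp).toUpperCase()); // prints U+1F385
        System.out.println("&#" + cp + ";"); // prints &#127877;
    }
}
```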
@cowtowncoder
Member

OK, but isn't this invalid XML content, and as such something to be fixed by whatever produced it?
Meaning it would make sense to file a bug report against said external service.

This is based on the assumption that the encoder makes the mistake of blindly encoding 2 Java (?) UTF-16 surrogate characters as separate entities, producing what is not well-formed XML per the XML specification.
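
To illustrate the encoder-side fix described here, below is a minimal sketch of escaping by code point rather than by Java `char`, so that a surrogate pair becomes one character reference. It is not the external service's code; `EntityEncoder` and `escapeNonAscii` are hypothetical names, and the method deliberately ignores the escaping of `<`, `>` and `&`.

```java
// Hypothetical helper; only JDK APIs are used.
final class EntityEncoder {
    /** Writes each non-ASCII code point as a single decimal character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() yields one int per valid surrogate pair, not one per char.
        // An unpaired surrogate is passed through as-is and would still produce
        // an invalid reference; a real encoder would need to reject or replace it.
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }
}
```

Iterating with `charAt` instead would visit the two surrogates separately, which is the likely source of output such as `&#55356;&#57221;`.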

@Magmaruss
Contributor Author

Magmaruss commented Feb 7, 2023

Right. I did some research on XML parsing problems with surrogate pairs and found bugs reported against the external systems and libraries that produce such output.
My case is problematic because the service (IBM FileNet P8) cannot be upgraded, but I'll try to report this. Can you show me a workaround, i.e. how to inject some code into WstxInputFactory or the stream reader to convert this pair?

EDIT: A second problem with the reader comes from Exchange Web Service, where an email message contains the character &#x5;, which causes com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x5
I think it would be good to add a P_INPUT_INVALID_CHAR_HANDLER config property for WstxInputFactory, like the one in WstxOutputFactory.

@cowtowncoder
Member

@Magmaruss yes, it's common to have legacy systems that cannot really be fixed.

I don't have a good recipe for this: if this were at a lower level, you could implement a wrapping Reader or InputStream, but given that these come via character entities, that wouldn't work. Unfortunately, too, configurability in Woodstox is focused more on general parsed and external entities; I blogged about these settings at some point:

https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

but I don't think anything in there (or in 2 earlier ones linked from it) would help. May be worth reading just in case.

If you have time and interest, adding a new configuration property that allows surrogates to be included could be acceptable as well: something disabled by default that can be enabled. It would then produce what looks, I think, like a valid pair of Java chars (UTF-16 code units), and would essentially "work" (although violating the XML spec).
I don't think I have time to work on this, but if you or someone else does, I'd be happy to help get a contribution in.
And I think it would be a valuable addition.
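
Until such a property exists, one possible stopgap outside the parser, assuming the whole response can be buffered as a String, is to merge adjacent surrogate references into a single reference before parsing. This is only a sketch and not something proposed in this thread: `SurrogateEntityFixer` and `mergePairs` are hypothetical names, and because it works on raw text it would also rewrite matching sequences inside CDATA sections or comments, where they are literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing helper; only JDK APIs are used.
final class SurrogateEntityFixer {
    // Two adjacent decimal or hex character references, e.g. &#55356;&#57221;
    private static final Pattern TWO_REFS = Pattern.compile(
            "&#(x[0-9A-Fa-f]{1,6}|[0-9]{1,7});&#(x[0-9A-Fa-f]{1,6}|[0-9]{1,7});");

    private static int parse(String ref) {
        return ref.charAt(0) == 'x'
                ? Integer.parseInt(ref.substring(1), 16)
                : Integer.parseInt(ref);
    }

    /** Replaces each high/low surrogate reference pair with one hex reference. */
    static String mergePairs(String xml) {
        Matcher m = TWO_REFS.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int copied = 0; // end of the text already written to out
        int from = 0;   // where the next search starts
        while (m.find(from)) {
            int hi = parse(m.group(1));
            int lo = parse(m.group(2));
            if (hi <= 0xFFFF && lo <= 0xFFFF
                    && Character.isSurrogatePair((char) hi, (char) lo)) {
                int cp = Character.toCodePoint((char) hi, (char) lo);
                out.append(xml, copied, m.start())
                   .append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
                copied = from = m.end();
            } else {
                // Not a pair: retry from the second reference, which may start one
                from = m.start(2) - 2;
            }
        }
        return out.append(xml, copied, xml.length()).toString();
    }
}
```

The cleaned string could then be handed to the factory through a `java.io.StringReader`. The `&#x5;` case from the EDIT above is a different problem (a control character that XML 1.0 does not allow at all) and is not addressed by this sketch.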

@cowtowncoder
Member

Merged, to be included in 6.6.0 release.
