Can't parse odt file #7091

hcf-n · 2021-02-08T13:29:29Z

I'm trying to parse a simple odt file but Pandoc tells me Couldn't parse odt file.

Is there a way to get more informative output from the pandoc reader? I have trouble figuring out why the parser fails.

I have attached the failing file. Please rename it to mwe.odt

mwe.zip


mb21 · 2021-02-08T14:02:13Z

Weird indeed... https://odfvalidator.org says the file is conformant as well...

mb21 · 2021-02-08T14:18:43Z

Source seems to be this line... so looks like toArchiveOrFail from Codec.Archive.Zip package doesn't like the zip file... how was it created? using what program?

jgm · 2021-02-08T17:14:07Z

Actually it comes from line 87. (I tried changing the earlier line to give more informative output, and this showed me.)

See #7091.

jgm · 2021-02-08T17:42:56Z

I've added some more fine-grained error reporting to the reader, and now I can see it's failing on

runConverter' read_body startState contentElem

But unfortunately it's not easy to get more information than that without more radical changes. (The author of this module included a Failure type but currently it's just ()!)

I do note something odd about the content.xml file in the container, the line:

<?xtpipes  file='oo-text.4xt'?>

When I remove this, pandoc can convert the file.
I'm not sure why this is there??

jgm · 2021-02-08T17:54:56Z

Maybe @MarLinn can help with this.

hcf-n · 2021-02-08T18:17:32Z

The odt file is produced by make4ht. I guess that explains why <?xtpipes file='oo-text.4xt'?> is there.

jgm · 2021-02-08T18:58:19Z

Aha, that makes sense. Our parser shouldn't fall apart because of that one line. But the odt reader is foreign territory for me; I'm hoping @MarLinn can see a way to fix this easily.

michal-h21 · 2021-02-08T19:13:35Z

Thanks for the pointer to the <?xtpipes ?> instructions. It is an internal instruction used by TeX4ht for some XML post-processing. It is unnecessary in the resulting ODT files, so make4ht will remove it from the final XML.

MarLinn · 2021-02-08T20:10:27Z

The ODT reader is based on the Text.XML.Light library. Which, from a few quick tests, seems to misunderstand processing instructions like this one. So in turn the reader does as well.
It never occurred to me to test this before because I naïvely assumed this to be basic functionality for an XML library. The easiest fix would probably be a pull request against Text.XML.Light.
Or, seeing how ODT files rarely contain processing instructions, we could wait and see if the summer-of-code xml library idea turns into something good.

jgm · 2021-02-08T23:37:54Z

@MarLinn, I tried running xml-light on this input, and it parsed the processing instruction as a regular element whose name begins with ?; it wasn't clear why the ODT parser would get confused by this parsing, since it could just be skipped. But maybe you saw something different?

Elem (Element {elName = QName {qName = "?xtpipes", qURI = Nothing, qPrefix = Nothing}, elAttribs = [Attr {attrKey = QName {qName = "file", qURI = Nothing, qPrefix = Nothing}, attrVal = "oo-text.4xt"}], elContent = [], elLine = Just 2})

jgm · 2021-02-08T23:39:21Z

Of course, if there is a problem, we could strip out processing instructions before passing the input to xml-light. But I'd like to be convinced first that xml-light is really the problem in this case.
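To make the "strip out processing instructions first" idea concrete, here is a minimal sketch using only base. The names stripPIs, splitPI and isXmlDecl are made up for this illustration and are not pandoc or xml-light functions. It works on the raw text, keeps the XML declaration, and naively assumes that "<?" never occurs inside comments or CDATA sections.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- | Remove processing instructions ("<?target ...?>") from raw XML text,
-- keeping the XML declaration ("<?xml ...?>").
-- Hypothetical helper; naive about comments and CDATA sections.
stripPIs :: String -> String
stripPIs [] = []
stripPIs s@(c:cs)
  | "<?" `isPrefixOf` s
  , Just (instr, rest) <- splitPI s
  = if isXmlDecl instr then instr ++ stripPIs rest else stripPIs rest
  | otherwise = c : stripPIs cs

-- | Split text starting with "<?" just after the first "?>".
-- Returns Nothing if the instruction is never terminated, in which
-- case stripPIs leaves the text unchanged.
splitPI :: String -> Maybe (String, String)
splitPI s = go (drop 2 s) "?<"  -- accumulator holds the reversed prefix
  where
    go ('?':'>':rest) acc = Just (reverse ('>':'?':acc), rest)
    go (x:xs)         acc = go xs (x:acc)
    go []             _   = Nothing

-- | True only for the XML declaration, i.e. the target is exactly "xml"
-- (so "<?xml-stylesheet ...?>" still counts as an ordinary instruction).
isXmlDecl :: String -> Bool
isXmlDecl p = case drop 2 p of
  'x':'m':'l':c:_ -> isSpace c || c == '?'
  _               -> False
```

Under these assumptions, stripPIs would be applied to the contents of content.xml before handing them to the XML parser.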

MarLinn · 2021-02-09T00:43:13Z

Here's a slightly larger example:

<div>A<?pi?>B</div>

gets parsed as (simplified)

Element
    { elName = QName "div"
    , elContent =
        [ Text "A"
        , Element
            { elName    = QName "?pi?"
            , elContent = [ Text "B"]
            }
        ]
    }

Notice where B ends up. It's not on the same level as A any more. Instead xml-light interprets the processing instruction as a starting tag with a missing end tag. So all the content after such an instruction up until the next closing tag is interpreted as a child of the processing instruction "element" instead of as a sibling. And that's the structure that gets delivered to the ODT extractor. Now the reader could identify those spurious elements and flatten them away in some preprocessing step or similar, but that's fixing the wrong problem.
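For illustration only, the "flatten them away" preprocessing step could look roughly like the sketch below. It uses a small stand-in tree type (Node, TextNode, ElemNode are assumptions of this example, not xml-light's Content and Element types) and treats every element whose name starts with "?" as a misparsed processing instruction, hoisting its children back to sibling level.

```haskell
-- | Simplified stand-in for a parsed XML tree (hypothetical type,
-- not the one from Text.XML.Light).
data Node
  = TextNode String
  | ElemNode String [Node]  -- element name and children
  deriving (Eq, Show)

-- | Drop spurious processing-instruction "elements" and splice their
-- children into the parent's list of children, recursively.
flattenPIs :: [Node] -> [Node]
flattenPIs = concatMap go
  where
    go (ElemNode name kids)
      | isPIName name = flattenPIs kids
      | otherwise     = [ElemNode name (flattenPIs kids)]
    go t = [t]

-- | A name like "?pi?" or "?xtpipes" marks a misparsed instruction.
isPIName :: String -> Bool
isPIName ('?':_) = True
isPIName _       = False
```

Applied to the tree shown above, this puts B back next to A inside the div. As noted in the comment, it only papers over the parser's behaviour rather than fixing it.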

jgm · 2021-02-09T01:02:58Z

That does look like a serious issue (and should be reported or fixed in xml-light). But can it be the issue that we're facing here? My test above suggests that, in this case at least, the element is being closed before the rest of the content in the content.xml we have here. Have you tried running parseXML directly on the contents of content.xml in this mwe?

MarLinn · 2021-02-09T02:25:47Z

I tried it now and it seems the issue gets even weirder when the processing instruction is at the top level like in this case.

xml-light has two parse functions, parseXML and parseXMLDoc.
When I parse the mwe with parseXML, the list of content chunks looks correct with no bad nesting. Somehow the processing instruction is handled correctly in this case. But the function we actually want, parseXMLDoc, takes the output of parseXML and throws away everything except for what it thinks is the first element. In this case the "first element" is the processing instruction, that's why it throws most of the document away and returns the result @jgm got.

So there are at least two bugs in the library.

We may be able to fix the exact one from this issue by rolling our own modified parseXMLDoc, but that doesn't fix the second issue if processing instructions occur deeper in the file.
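A rough model of the difference between the behaviour described above and a modified root-picking function, again with a stand-in tree type rather than xml-light's own types. The names firstElement and firstRealElement are invented for this sketch, and firstElement is a simplification of what the thread describes, not a copy of the library's code.

```haskell
import Data.Maybe (listToMaybe)

-- | Simplified stand-in for parsed top-level content (hypothetical type).
data Node
  = TextNode String
  | ElemNode String [Node]  -- element name and children
  deriving (Eq, Show)

-- | Behaviour as described in the thread: keep whatever looks like the
-- first element and discard the rest.
firstElement :: [Node] -> Maybe Node
firstElement nodes = listToMaybe [n | n@(ElemNode _ _) <- nodes]

-- | Modified variant: skip anything whose name starts with '?', so a
-- leading processing instruction cannot be mistaken for the root.
firstRealElement :: [Node] -> Maybe Node
firstRealElement nodes =
  listToMaybe [n | n@(ElemNode name _) <- nodes, not (isPIName name)]

isPIName :: String -> Bool
isPIName ('?':_) = True
isPIName _       = False
```

With a top-level list that starts with the ?xtpipes pseudo-element, firstElement returns that pseudo-element, while firstRealElement returns the actual document root. This does nothing for instructions nested deeper in the file.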

jgm · 2021-02-09T06:22:20Z

I'm going to try to create a quick little bridge module so we can use xml-conduit's parser to parse to an xml-light type. That's an intermediate solution; if it works, we could gradually transition to xml-conduit, which seems well tested and performant. (We already transitively depend on xml-conduit via citeproc.)

This exports a function that uses xml-conduit's parser to produce an Element from Text.XML.Light, so existing pandoc code can be made to use the better parser with a minimum of modification. See #7091.

jgm · 2021-02-09T07:37:53Z

@MarLinn I've pushed some code to the xmlbridge branch if you want to take a look. With this change, the odt reader tests all fail -- I think it's because of changes in the URI (namespace) part of qualified names, but I haven't had a chance to investigate yet.

jgm · 2021-02-09T17:15:44Z

@MarLinn I see what is happening; my new parseXMLDoc based on xml-conduit differs in one important way from the old one. The old one parses all the xmlns: prefixed attributes on the root element as attributes. The new one leaves these off, since the information is used in interpreting the prefixes while the document is being parsed. Your code seems to presuppose that these attributes will be there -- I wonder if there's a simple way to change that; then I think this will work.

To clarify, attributes like this are present on the root node with xml-light's parser, but not with xml-conduit's:

Attr {attrKey = QName {qName = "style", qURI = Nothing, qPrefix = Just "xmlns"}, attrVal = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"},

MarLinn · 2021-02-09T23:40:47Z

Ah, good old namespaces.
ODT has a lot of them, so they are deeply integrated into the reader: no element or attribute converter is defined without one. To facilitate that, the reader has its own namespace management. Another reason is that xml-light doesn't have one.
The upside is that there's only a small handful of places deep in the bowels where xml namespaces and reader-managed namespaces are tied together and that would have to be changed.

What I'm not sure about yet is if xml-conduit keeps enough information to be integrated there. If it does, this should be doable.

jgm · 2021-02-10T01:37:35Z

Ah never mind, I've figured out a way to restore compatibility!

jgm · 2021-02-10T02:06:22Z

Got it working now, and in addition to parsing the mwe.odt correctly, we now get a big performance boost.
I'm going to change some of the other parsers to use this first before merging with master branch.

mb21 added format:ODT reader labels Feb 8, 2021

jgm added a commit that referenced this issue Feb 8, 2021

ODT reader: finer-grained errors on parse failure.

f70795d

See #7091.

hcf-n mentioned this issue Feb 8, 2021

Problems with .odt when converting to docx with LibreOffice michal-h21/make4ht#43

Closed

jgm closed this as completed in 8ca1916 Feb 11, 2021
