Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can´t parse odt file #7091

Closed
hcf-n opened this issue Feb 8, 2021 · 20 comments
Closed

Can´t parse odt file #7091

hcf-n opened this issue Feb 8, 2021 · 20 comments

Comments

@hcf-n
Copy link

hcf-n commented Feb 8, 2021

I'm trying to parse a simple odt file but Pandoc tells me Couldn't parse odt file.

Is there a way to get more informative output from the pandoc reader? I have trouble figurering out why the parser fails.

I have attached failing file. Please rename to mwe.odt

mwe.zip

@mb21
Copy link
Collaborator

mb21 commented Feb 8, 2021

Weird indeed... https://odfvalidator.org says the file is conformant as well...

@mb21
Copy link
Collaborator

mb21 commented Feb 8, 2021

Source seems to be this line... so looks like toArchiveOrFail from Codec.Archive.Zip package doesn't like the zip file... how was it created? using what program?

@jgm
Copy link
Owner

jgm commented Feb 8, 2021

Actually it comes from line 87. (I tried changing the earlier line to give more informative output, and this showed me.)

jgm added a commit that referenced this issue Feb 8, 2021
@jgm
Copy link
Owner

jgm commented Feb 8, 2021

I've added some more fine-grained error reporting to the reader, and now I can see it's failing on

runConverter' read_body startState contentElem

But unfortunately it's not easy to get more information than that without more radical changes. (The author of this module included a Failure type but currently it's just ()!)

I do note something odd about the content.xml file in the container, the line:

<?xtpipes  file='oo-text.4xt'?> 

When I remove this, pandoc can convert the file.
I'm not sure why this is there??

@jgm
Copy link
Owner

jgm commented Feb 8, 2021

Maybe @MarLinn can help with this.

@hcf-n
Copy link
Author

hcf-n commented Feb 8, 2021

The odt file is produced by make4ht. I guess that explains why <?xtpipes file='oo-text.4xt'?>is there.

@jgm
Copy link
Owner

jgm commented Feb 8, 2021

Aha, that makes sense. Our parser shouldn't fall apart because of that one line. But the odt reader is foreign territory for me; I'm hoping @MarLinn can see a way to fix this easily.

@michal-h21
Copy link

Thanks for pointer to the <?xtpipes ?> instructions. It is internal instruction used by TeX4ht for some XML post-processing. It is unnecessary in the resulting ODT files, so make4ht will remove it from the final XML.

@MarLinn
Copy link
Contributor

MarLinn commented Feb 8, 2021

The ODT reader is based on the Text.XML.Light library. Which, from a few quick tests, seems to misunderstand processing instructions like this one. So in turn the reader does as well.
It never occurred to me to test this before because I naïvely assumed this to be basic functionality for an XML library. The easiest fix would probably be a pull request against Text.XML.Light.
Or, seeing how ODT files rarely contain processing instructions, we could wait and see if the summer-of-code xml library idea turns into something good.

@jgm
Copy link
Owner

jgm commented Feb 8, 2021

@MarLinn, I tried running xml-light on this input, and it parsed the processing instruction as a regular element whose name begins with ?; it wasn't clear why the ODT parser would get confused by this parsing, since it could just be skipped. But maybe you saw something different?

Elem (Element {elName = QName {qName = "?xtpipes", qURI = Nothing, qPrefix = Nothing}, elAttribs = [Attr {attrKey = QName {qName = "file", qURI = Nothing, qPrefix = Nothing}, attrVal = "oo-text.4xt"}], elContent = [], elLine = Just 2})

@jgm
Copy link
Owner

jgm commented Feb 8, 2021

Of course, if there is a problem, we could strip out processing instructions before passing the input to xml-light. But I'd like to be convinced first that xml-light is really the problem in this case.

@MarLinn
Copy link
Contributor

MarLinn commented Feb 9, 2021

Here's a slightly larger example:

<div>A<?pi?>B</div>

gets parsed as (simplified)

Element
    { elName = QName "div"
    , elContent =
        [ Text "A"
        , Element
            { elName    = QName "?pi?"
            , elContent = [ Text "B"]
            }
        ]
    }

Notice where B ends up. It's not on the same level as A any more. Instead xml-light interprets the processing instruction as a starting tag with a missing end tag. So all the content after such an instruction up until the next closing tag is interpreted as a child of the processing instruction "element" instead of as a sibling. And that's the structure that gets delivered to the ODT extractor. Now the reader could identify those spurious elements and flatten them away in some preprocessing step or similar, but that's fixing the wrong problem.

@jgm
Copy link
Owner

jgm commented Feb 9, 2021

That does look like a serious issue (and should be reported or fixed in xml-light). But can it be the issue that we're facing here? My test above suggests that, in this case at least, the element is being closed before the rest of the content in the content.xml we have here. Have you tried running parseXML directly on the contents of content.xml in this mwe?

@MarLinn
Copy link
Contributor

MarLinn commented Feb 9, 2021

I tried it now and it seems the issue gets even weirder when the processing instruction is at the top level like in this case.

xml-light has two parse functions, parseXML and parseXMLDoc.
When I parse the mwe with parseXML, the list of content chunks looks correct with no bad nesting. Somehow the processing instruction is handled correctly in this case. But the function we actually want, parseXMLDoc, takes the output of parseXML and throws away everything except for what it thinks is the first element. In this case the "first element" is the processing instruction, that's why it throws most of the document away and returns the result @jgm got.

So there's at least two bugs in the library.

We may be able to fix the exact one from this issue by rolling our own modified parseXMLDoc, but that doesn't fix the second issue if processing instructions occur deeper in the file.

@jgm
Copy link
Owner

jgm commented Feb 9, 2021

I'm going to try to create a quick little bridge module so we can use xml-conduit's parser to parse to an xml-light type. That's an intermediate solution; if it works, we could gradually transition to xml-conduit, which seems well tested and performant. (We already transitively depend on xml-conduit via citeproc.)

jgm added a commit that referenced this issue Feb 9, 2021
This exports a function that uses xml-conduit's parser to
produce an Element from Text.XML.Light, so existing pandoc
code can be made to use the better parser with a minimum
of modification.

See #7091.
@jgm
Copy link
Owner

jgm commented Feb 9, 2021

@MarLinn I've pushed some code to the xmlbridge branch if you want to take a look. With this change, the odt reader tests all fail -- I think it's because of changes in the URI (namespace) part of qualified names, but I haven't had a chance to investigate yet.

@jgm
Copy link
Owner

jgm commented Feb 9, 2021

@MarLinn I see what is happening; my new parseXMLDoc based on xml-conduit differs in one important way from the old one. The old one parses all the xmlns: prefixed attributes on the root element as attributes. The new one leaves these off, since the information is used in interpreting the prefixes while the document is being parsed. Your code seems to presuppose that these attributes will be there -- I wonder if there's a simple way to change that; then I think this will work.

To clarify, attributes like this are present on the root note with xml-light's parser, but not with xml-conduit's:

Attr {attrKey = QName {qName = "style", qURI = Nothing, qPrefix = Just "xmlns"}, attrVal = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"},

@MarLinn
Copy link
Contributor

MarLinn commented Feb 9, 2021

Ah, good old namespaces.
Odt has a lot of them, so they are deeply integrated into the reader, no element or attribute converter is defined without one. To facilitate that, the reader has its own namespace management. Another reason is that xml-light doesn't have one.
The upside is that there's only a small hand full of places deep in the bowels where xml namespaces and reader managed namespaces are tied together and that would have to be changed.

What I'm not sure about yet is if xml-conduit keeps enough information to be integrated there. If it does, this should be doable.

@jgm
Copy link
Owner

jgm commented Feb 10, 2021

Ah never mind, I've figured out a way to restore compatibility!

@jgm
Copy link
Owner

jgm commented Feb 10, 2021

Got it working now, and in addition to parsing the mwe.odt correctly, we now get a big performance boost.
I'm going to change some of the other parsers to use this first before merging with master branch.

@jgm jgm closed this as completed in 8ca1916 Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants