Can't parse odt file #7091
Comments
Weird indeed... https://odfvalidator.org says the file is conformant as well... |
Source seems to be this line... so looks like |
Actually it comes from line 87. (I tried changing the earlier line to give more informative output, and this showed me.) |
I've added some more fine-grained error reporting to the reader, and now I can see it's failing on runConverter' read_body startState contentElem. Unfortunately, it's not easy to get more information than that without more radical changes. (The author of this module included a Failure type, but currently it's just ()!) I do note something odd about the content.xml file in the container, the line:
When I remove this, pandoc can convert the file. |
Maybe @MarLinn can help with this. |
The odt file is produced by make4ht. I guess that explains why |
Aha, that makes sense. Our parser shouldn't fall apart because of that one line. But the odt reader is foreign territory for me; I'm hoping @MarLinn can see a way to fix this easily. |
Thanks for the pointer to the |
The ODT reader is based on the Text.XML.Light library, which, from a few quick tests, seems to misunderstand processing instructions like this one. So in turn the reader does as well. |
@MarLinn, I tried running xml-light on this input, and it parsed the processing instruction as a regular element whose name begins with "?": Elem (Element {elName = QName {qName = "?xtpipes", qURI = Nothing, qPrefix = Nothing}, elAttribs = [Attr {attrKey = QName {qName = "file", qURI = Nothing, qPrefix = Nothing}, attrVal = "oo-text.4xt"}], elContent = [], elLine = Just 2}) |
Of course, if there is a problem, we could strip out processing instructions before passing the input to xml-light. But I'd like to be convinced first that xml-light is really the problem in this case. |
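For illustration, the "strip out processing instructions" workaround mentioned above could look roughly like the following. This is only a sketch using base, not code from pandoc: the names stripProcessingInstructions and splitAtClose are hypothetical, and the pass is purely textual, so it would also touch anything that merely looks like a processing instruction inside a comment or CDATA section.

```haskell
import Data.List (isPrefixOf)

-- | Split a string just after its first "?>" (inclusive).
-- If no "?>" occurs, the whole input ends up in the first component.
-- (Hypothetical helper, for illustration only.)
splitAtClose :: String -> (String, String)
splitAtClose = go []
  where
    go acc ('?':'>':rest) = (reverse acc ++ "?>", rest)
    go acc (x:xs)         = go (x : acc) xs
    go acc []             = (reverse acc, [])

-- | Drop every processing instruction (<?target ...?>) from the input,
-- but keep the XML declaration (<?xml ...?>).
-- Assumption: a naive textual filter is acceptable as a stopgap.
stripProcessingInstructions :: String -> String
stripProcessingInstructions [] = []
stripProcessingInstructions s@(c:cs)
  | "<?xml " `isPrefixOf` s = decl ++ stripProcessingInstructions afterPI
  | "<?" `isPrefixOf` s     = stripProcessingInstructions afterPI
  | otherwise               = c : stripProcessingInstructions cs
  where
    -- Both guards above start at a "<?", so one split serves both.
    (decl, afterPI) = splitAtClose s
```

The filtered string would then be handed to the XML parser in place of the raw content.xml. Whether that is enough depends on whether xml-light really is the culprit, which is the open question here.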
Here's a slightly larger example:
gets parsed as (simplified)
Notice where |
That does look like a serious issue (and should be reported or fixed in xml-light). But can it be the issue that we're facing here? My test above suggests that, in this case at least, the element is being closed before the rest of the content in the content.xml we have here. Have you tried running parseXML directly on the contents of content.xml in this mwe? |
I tried it now and it seems the issue gets even weirder when the processing instruction is at the top level like in this case. xml-light has two parse functions, parseXML and parseXMLDoc. So there are at least two bugs in the library. We may be able to fix the exact one from this issue by rolling our own modified parseXMLDoc, but that doesn't fix the second issue if processing instructions occur deeper in the file. |
I'm going to try to create a quick little bridge module so we can use xml-conduit's parser to parse to an xml-light type. That's an intermediate solution; if it works, we could gradually transition to xml-conduit, which seems well tested and performant. (We already transitively depend on xml-conduit via citeproc.) |
This exports a function that uses xml-conduit's parser to produce an Element from Text.XML.Light, so existing pandoc code can be made to use the better parser with a minimum of modification. See #7091.
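A minimal sketch of what such a bridge might look like, assuming the xml-conduit and xml-light packages are available. This is not the code that was actually pushed: the function names are made up for illustration, line numbers are not tracked, and comments and processing instructions are simply dropped.

```haskell
import qualified Data.Map as M
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Text.XML as Conduit
import qualified Text.XML.Light as Light

-- | Parse with xml-conduit, then convert the root element to xml-light's type.
-- (Hypothetical name; the real bridge module may differ.)
parseXMLElement :: TL.Text -> Either String Light.Element
parseXMLElement t =
  case Conduit.parseText Conduit.def t of
    Left err  -> Left (show err)
    Right doc -> Right (convertElement (Conduit.documentRoot doc))

-- | Convert one element, recursing into its children.
convertElement :: Conduit.Element -> Light.Element
convertElement (Conduit.Element name attrs nodes) =
  Light.Element
    (convertName name)
    [ Light.Attr (convertName k) (T.unpack v) | (k, v) <- M.toList attrs ]
    (concatMap convertNode nodes)
    Nothing  -- xml-light's elLine is not reconstructed in this sketch

-- | Keep elements and text; drop comments and processing instructions.
convertNode :: Conduit.Node -> [Light.Content]
convertNode (Conduit.NodeElement e) = [Light.Elem (convertElement e)]
convertNode (Conduit.NodeContent t) =
  [Light.Text (Light.CData Light.CDataText (T.unpack t) Nothing)]
convertNode _ = []

-- | Map xml-conduit's Name onto xml-light's QName field by field.
convertName :: Conduit.Name -> Light.QName
convertName (Conduit.Name local ns prefix) =
  Light.QName (T.unpack local) (T.unpack <$> ns) (T.unpack <$> prefix)
```

A conversion this naive does not recreate namespace declaration attributes (xmlns:...), so code that looks for them on the root node would need extra handling.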
@MarLinn I've pushed some code to the |
@MarLinn I see what is happening; my new To clarify, attributes like this are present on the root node with xml-light's parser, but not with xml-conduit's: Attr {attrKey = QName {qName = "style", qURI = Nothing, qPrefix = Just "xmlns"}, attrVal = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"}, |
Ah, good old namespaces. What I'm not sure about yet is if xml-conduit keeps enough information to be integrated there. If it does, this should be doable. |
Ah never mind, I've figured out a way to restore compatibility! |
Got it working now, and in addition to parsing the mwe.odt correctly, we now get a big performance boost. |
I'm trying to parse a simple odt file but Pandoc tells me "Couldn't parse odt file". Is there a way to get more informative output from the pandoc reader? I have trouble figuring out why the parser fails.
I have attached the failing file. Please rename it to mwe.odt.
mwe.zip