ODT proposal #136

Open
7 tasks
hmdne opened this issue Sep 30, 2024 · 1 comment

Comments

@hmdne
Contributor

hmdne commented Sep 30, 2024

While working on #135, I realized the idea is solid. This issue briefly describes what I plan to do; the milestones will need to change a little, though.

The idea in short: to support DOCX files, I plan to implement an ODT parser and a converter to Coradoc. This will not remove the LibreOffice dependency (unless users generate the ODT file themselves). In my experience, ODT is very close to HTML, yet it preserves a lot more semantics than LibreOffice HTML, so this should be fairly easy to do, at least compared to DOCX. I would describe the difference as follows: the ODT format was designed for document interchange, while the DOCX format was designed to represent internal MS Word structures serialized to XML (and, as @opoudjis noted, this isn't even well documented).
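
To illustrate the point about semantics, here is a minimal sketch (not the planned gem) of how ODT markup could map to AsciiDoc. It runs REXML over an inline, made-up `content.xml` fragment rather than a real ODT archive. The `odt_to_adoc` and `plain_text` helpers, the sample XML, and the heading-level mapping are all assumptions for illustration only.

```ruby
require "rexml/document"

# Namespace URI used by ODF for text content (text:h, text:p, text:span).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical content.xml fragment; a real ODT file keeps this inside a ZIP archive.
ODT_SAMPLE = <<~XML
  <office:document-content
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body>
      <office:text>
        <text:h text:outline-level="2">Scope</text:h>
        <text:p>This document specifies <text:span>requirements</text:span>.</text:p>
      </office:text>
    </office:body>
  </office:document-content>
XML

# Concatenate all character data below a node, ignoring inline markup.
def plain_text(node)
  node.children.map do |child|
    case child
    when REXML::Text then child.value
    when REXML::Element then plain_text(child)
    else ""
    end
  end.join
end

# Assumed mapping: text:h with outline level N becomes an AsciiDoc heading
# with N + 1 "=" signs; text:p becomes a plain paragraph.
def odt_to_adoc(xml)
  blocks = []
  REXML::Document.new(xml).root.each_recursive do |el|
    next unless el.namespace == TEXT_NS

    case el.name
    when "h"
      # Relies on the conventional "text" prefix being used in the document.
      level = (el.attributes["text:outline-level"] || "1").to_i
      marker = "=" * (level + 1)
      blocks << "#{marker} #{plain_text(el)}"
    when "p"
      blocks << plain_text(el)
    end
  end
  blocks.join("\n\n")
end

puts odt_to_adoc(ODT_SAMPLE)
```

The heading level comes straight from the `text:outline-level` attribute, which is the kind of semantic information that is harder to recover reliably from LibreOffice HTML.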

The plan is as follows:

  • vendor in the word-to-markdown dependency (part of "Remove unsuitable gem dependencies", #121)
    • the rationale for that:
      • while our new implementation will parse ODT directly, there will always be LibreOffice HTML documents in the wild
      • this implementation is in use and, for the most part, it works
      • there are some (small) issues with word-to-markdown that we may be able to fix locally
      • it's not a big thing, most of the work is done in HTML already
  • create a gem, that will map ODT format using Rubyzip and Lutaml::Model
    • Rubyzip won't work with Opal, but we could polyfill it with a Node.js library, or not ship this part
  • use the above gem to create Coradoc::Input::Odt (this would supersede "Ability to convert Word into Coradoc (and to adoc)", #115; I recommend reading the discussion on that issue, as it refers to this one)
  • benchmark the implementation using ISO Simple Template (Update implementation to be able to transform the ISO Simple Template docx #87)
  • ensure the implementation works with MS Word-generated ODT files
    • optional, and would require me to buy an MS Word license
    • rationale:
      • would allow users to export ODT directly from MS Word
      • in the future, we could perhaps script an option to export ODT using the MS Word executable
  • switch default of DOCX from current Coradoc::Input::Docx to Coradoc::Input::Odt
    • I think even at this point, we should keep the old implementation, so that users will be able to choose another if the first one breaks (those implementations could be called descriptively DocxViaHtml and DocxViaOdt).
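
A rough sketch of what choosing between the two implementations might look like. Everything here is hypothetical: the `DocxViaOdt` and `DocxViaHtml` modules are stand-in stubs (not the real Coradoc classes), and `convert_docx` with its `via:` option is an invented name used only to show the idea of a switchable default.

```ruby
# Stub for the proposed ODT-based route (assumption, not Coradoc code).
module DocxViaOdt
  def self.convert(path)
    "#{path}: DOCX -> ODT -> Coradoc"
  end
end

# Stub for the current LibreOffice HTML route (assumption, not Coradoc code).
module DocxViaHtml
  def self.convert(path)
    "#{path}: DOCX -> HTML -> Coradoc"
  end
end

DOCX_CONVERTERS = { odt: DocxViaOdt, html: DocxViaHtml }.freeze

# The ODT route is the default; callers can opt back into the HTML route.
def convert_docx(path, via: :odt)
  converter = DOCX_CONVERTERS.fetch(via) do
    raise ArgumentError, "unknown DOCX converter: #{via.inspect}"
  end
  converter.convert(path)
end

puts convert_docx("sample.docx")
puts convert_docx("sample.docx", via: :html)
```

Keeping both routes behind one entry point means switching the default is a one-line change, and a user hitting a bug in one route can fall back to the other explicitly.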

Any opinions on that plan?

@ronaldtse @ReesePlews @opoudjis @webdev778 @xyz65535

@ronaldtse
Contributor

@hmdne I think this is doable, but I don't want to spend too many resources on this, given we have other priorities.

> create a gem, that will map ODT format

Technically this means we create an ODT gem that can read (and possibly, write) ODT, using lutaml-model and rubyzip. This is reasonable and contained as a task (and allows contained testing).

> the DOCX format was designed to represent internal MS Word structures serialized to XML

Nonetheless, the ultimate goal remains to support DOCX format input. At this moment I would consider ODT the "easier of two evils": an intermediary step between Coradoc and DOCX. I really think DOCX is within reach.

The current html2doc mechanism (MHT) already prevents people using Word on Windows from directly loading files generated by Metanorma. Microsoft has removed MHT functionality from Word for Windows, and therefore we must switch to generating DOCX in the future.

Resources

How long do you think this will take?
