HTML document work #61

Merged
merged 44 commits into from
May 29, 2024

hmdne
Contributor

@hmdne hmdne commented May 23, 2024

This branch aims to make it possible to convert the HTML file from metanorma/reverse_adoc#90.

Metanorma PR checklist

codecov bot commented May 23, 2024

Codecov Report

Attention: Patch coverage is 97.44246%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 98.46%. Comparing base (defb04a) to head (d8963e8).
Report is 13 commits behind head on main.

Files                                          Patch %   Lines
lib/coradoc/reverse_adoc/html_converter.rb     87.50%    8 Missing ⚠️
lib/coradoc/reverse_adoc/converters/table.rb   97.97%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      metanorma/reverse_adoc#61      +/-   ##
==========================================
+ Coverage   96.67%   98.46%   +1.78%     
==========================================
  Files          42       46       +4     
  Lines        1054     1306     +252     
==========================================
+ Hits         1019     1286     +267     
+ Misses         35       20      -15     


@hmdne
Contributor Author

hmdne commented May 23, 2024

I use AsciiDoctor to round-trip a document. This is one of the first issues I found that turned out to be an issue with AsciiDoctor actually (unless I am mistaken and this is not possible in AsciiDoc):

asciidoctor/asciidoctor#4595

Anyway, the document round trips successfully at this point, though there are still a lot of issues remaining.

@ronaldtse
Contributor

That's fine. We will need to ensure we test Coradoc against AsciiDoctor behavior.

Coradoc is meant to be a replacement for AsciiDoctor:

  • Coradoc should parse HTML and return AsciiDoc
  • Coradoc should parse AsciiDoc and return the resulting Document model tree
  • Coradoc should convert that Document model tree into other formats, including HTML (AsciiDoctor processes AsciiDoc into HTML)

@ronaldtse
Contributor

> I use AsciiDoctor to round-trip a document. This is one of the first issues I found that turned out to be an issue with AsciiDoctor actually (unless I am mistaken and this is not possible in AsciiDoc):
>
> asciidoctor/asciidoctor#4595

A normal AsciiDoc table cell cannot hold block content such as a block image. To allow the image in a table cell, you need to declare the cell as an "AsciiDoc table cell" with the a prefix:

[cols="1,1"]
|===
|cell1
a|image::images/004.webp["",200,100]
|===
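
As a rough illustration of the rule above, a converter could emit the a| prefix only for cells that hold a block macro. This is a minimal Ruby sketch under that assumption; `table_cell` and `BLOCK_MACRO` are hypothetical names, not Coradoc's API.

```ruby
# Hypothetical helper (not part of Coradoc): choose the cell prefix
# depending on whether the cell content starts with a block macro.
BLOCK_MACRO = /^(image|video|audio)::/.freeze

def table_cell(content)
  # Block macros only render inside an "AsciiDoc table cell" (a| prefix).
  style = content.match?(BLOCK_MACRO) ? "a" : ""
  "#{style}|#{content}"
end

cells = ["cell1", 'image::images/004.webp["",200,100]']
rows = cells.map { |cell| table_cell(cell) }
```

With the two cells from the table above, `rows` holds one plain cell line and one a| cell line.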

@hmdne
Contributor Author

hmdne commented May 23, 2024

I just realized this was a bogus issue report, and it's an issue on our side actually.

@ronaldtse
Contributor

Let's gather up any questions within Coradoc first and the team will answer any questions so we don't affect others' repositories.

cc @opoudjis @Intelligent2013 @manuelfuenmayor @anermina

@hmdne
Contributor Author

hmdne commented May 23, 2024

6c4a059 makes it so that tables are now computed correctly (mostly, still in testing).

With this commit, the following fragment:

image

is round-tripped into:

image

One apparent difference is the column widths: I add an attribute such as cols="3*" to each table, which makes the resulting HTML carry predefined column widths. The original document just relies on the web browser to deduce column widths. I have found no way to disable this behavior.

Another difference is a lack of BGCOLOR. Should I pass this attribute along? Perhaps when some setting is enabled?
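
For reference, a cols="3*" style attribute can be derived from the table itself by counting columns. The sketch below is only an illustration of that idea, assuming each row is an array of cells with an optional colspan; `column_count` and `cols_attribute` are hypothetical names, and rowspans are deliberately ignored.

```ruby
# Hypothetical sketch: a row is an array of cells, a cell is a Hash
# that may carry a :colspan. Rowspans are not handled here.
def column_count(rows)
  rows.map { |row| row.sum { |cell| cell.fetch(:colspan, 1) } }.max || 0
end

# Builds an equal-width column specification such as [cols="3*"].
def cols_attribute(rows)
  "[cols=\"#{column_count(rows)}*\"]"
end

rows = [
  [{}, {}, {}],            # three plain cells
  [{ colspan: 2 }, {}]     # one cell spanning two columns, one plain
]
attribute = cols_attribute(rows)
```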

@hmdne
Contributor Author

hmdne commented May 23, 2024

After this commit, the document is mostly readable in my opinion. There are still some crucial issues that I can see, but the document is now, let's say, testable.

Note: I still haven't implemented the --split-sections option, so only a single .adoc file is output.

Below is an archive that contains an adoc file created using this branch and also an HTML file that is the result of processing that file with AsciiDoctor:
document.tar.gz

@ronaldtse
Contributor

Thanks @hmdne, this is respectable progress!

The only thing is that the document is to be tested using Metanorma, not AsciiDoctor. The sample document for that is in the mn-samples-plateau repository (001-v3 is the v3 of this document, the new HTML version is 001-v4)

This HTML document was developed to adhere to Metanorma styling.

hmdne added 2 commits May 24, 2024 03:29
By this, we mean: if a link is preceded by a space or sits at the
beginning of a block, we don't need to add another space.

In fact, we shouldn't. Consider code like:

<div><a href="test">test</a></div>

If we add a space before a link, we open a code block and thus
we just get a source code and not a link.
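
The rule in this commit message could be sketched as follows. This is a minimal Ruby illustration, not the code from the commit; `append_link` is a hypothetical helper that works on the AsciiDoc text emitted so far.

```ruby
# Hypothetical helper: append a link to the output generated so far.
# A separating space is added only when the output is non-empty and
# does not already end in whitespace (a block start counts as "no text").
def append_link(output, link)
  needs_space = !output.empty? && !output.match?(/\s\z/)
  output + (needs_space ? " " : "") + link
end

at_block_start = append_link("", "link:test[test]")
after_word     = append_link("see", "link:test[test]")
after_space    = append_link("see ", "link:test[test]")
```

Only `after_word` gains a space; at a block start the link stays in the first column, so the line is not read as a literal block.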

In particular, I was curious what caused a performance problem on
a large document I'm working on. It turned out to be the
remove_inner_whitespace procedure in Cleaner. With a simple fix I
managed to make it finish in 1 second instead of 170s.

All the rest of the processing combined takes 10s, so we will be
able to progress much faster on next issues.
hmdne added 3 commits May 24, 2024 04:23
Happened to me once, but could happen at any time in production.

The idea here is that HTML content generators often introduce a lot
of unnecessary markup that only makes sense in an HTML+CSS context.
Certain cases can be simplified so that the result is equivalent but
much simpler, allowing us to generate nicer AsciiDoc syntax for
those cases.
@hmdne
Contributor Author

hmdne commented May 24, 2024

@ronaldtse Thanks for the clarification. I will take a deeper look at how they compare. For now, I need to work a little more on tables, so that we reliably produce correct AsciiDoc output.

@hmdne
Contributor Author

hmdne commented May 24, 2024

@ronaldtse A question: this document is not strictly semantic HTML; it sometimes relies on styling instead. For instance:

Instead of <h2> it uses <div class="subtitledata">. Instead of <th> it uses <td BGCOLOR="#dddddd">

Creating a proper document won't be possible with that in mind. We can't add exceptions like this to the reverse_adoc logic, since they are specific to this document and its styling (or should we? I think the purpose of reverse_adoc is to be agnostic to formats). Otherwise, we will need to add a script to preprocess it, and perhaps even postprocess it, if Metanorma-compatible content is desired. Can you give us some hints on that? (For example: is it in the scope of this task, and in which repo should such pre/postprocessors land?)
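
To make the preprocessing idea concrete, here is a minimal Ruby sketch that rewrites the two patterns mentioned above into semantic tags before conversion. It is an assumption-laden illustration: `normalize_html` is a hypothetical name, and plain regular expressions are used only to keep the example self-contained. They would not cope with nested elements, so a real preprocessor would more likely work on a parsed HTML tree.

```ruby
# Hypothetical preprocessor for the document-specific markup above.
# Regex-based for brevity; it does not handle nested <div> or <td>.
def normalize_html(html)
  html
    .gsub(%r{<div class="subtitledata">(.*?)</div>}m, '<h2>\1</h2>')
    .gsub(%r{<td BGCOLOR="#dddddd">(.*?)</td>}mi, '<th>\1</th>')
end

heading = normalize_html('<div class="subtitledata">Scope</div>')
header_cell = normalize_html('<td BGCOLOR="#dddddd">Name</td>')
```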

hmdne added 4 commits May 24, 2024 05:22
Let's move the logic of delimiting tables to Coradoc, as I think it
makes more sense to be there. This changes semantics a little - now
one-line rows are generated if there are any AsciiDoc cells. Before
that, it was a logic of Cell to decide if it wanted to be generated
multiline or not. This results in nicer tables.
@hmdne
Contributor Author

hmdne commented May 25, 2024

@ronaldtse Handling lists was very tricky, but it's ready now. I have also uncovered something like a definition list in 7.2.4, but since their use of markup (.text2data, .text3data) is not consistent, I can't reliably detect them.

What I can see as remaining tasks to be done in this PR:

  • Investigate what to do with .text2data and .text3data
  • Correct an issue with \<< Something >> and with \n +
  • Split sections into files
  • Correct an edge case with table column size computation
  • Add some tests for new features introduced

@hmdne
Contributor Author

hmdne commented May 25, 2024

To make things easier, I'm uploading the current version of the document generated:

document.tar.gz

I plan to continue development tomorrow (Sunday) at 4-6 AM GMT+2.

@hmdne
Contributor Author

hmdne commented May 26, 2024

We have generated a section tree at this point, so we can split sections into individual files. I am not entirely sure this approach will carry over correctly to all documents, not only the one we are working on.
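
One way to picture splitting a section tree into files is sketched below in Ruby. It is not the implementation in this branch: the `Section` struct, `split_sections`, `render`, and the file naming scheme are all assumptions made for the example.

```ruby
# Hypothetical section tree node: a title, body text, and child sections.
Section = Struct.new(:title, :body, :children)

# Renders a section and its children as AsciiDoc, starting at `depth`
# heading markers ("==" for depth 2).
def render(section, depth)
  out = "#{'=' * depth} #{section.title}\n\n#{section.body}\n"
  section.children.each { |child| out << "\n" << render(child, depth + 1) }
  out
end

# Writes each top-level section to its own entry in a Hash of
# file name => content, plus an index that includes them in order.
def split_sections(sections)
  files = {}
  includes = []
  sections.each_with_index do |section, i|
    name = format("sections/section-%02d.adoc", i + 1)
    files[name] = render(section, 2)
    includes << "include::#{name}[]"
  end
  files["index.adoc"] = includes.join("\n\n") + "\n"
  files
end

tree = [Section.new("Scope", "Text.", [Section.new("Sub", "More.", [])])]
files = split_sections(tree)
```

Numbered file names are used instead of title-based ones so that non-ASCII titles, as in the document discussed here, need no transliteration.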

@hmdne
Contributor Author

hmdne commented May 27, 2024

Thanks to a suggestion from @xyz65535, I have handled indentation in the document with [none] unordered lists. This should preserve as much of the incoming document's semantics as possible.

In addition, I finalized a plugin implementation. It is now possible to plug in at any meaningful stage of AsciiDoc generation. I suppose this could be used to add something like a Metanorma plugin that would, for instance, try to extract and produce data that is meaningful to Metanorma but not necessarily part of the AsciiDoc standard. The plugin architecture should support using multiple plugins for any conversion.
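
The general shape of such a plugin mechanism might look like the Ruby sketch below. The class name, the two stage names, and both example plugins are hypothetical and do not reflect the actual plugin API added in this branch; the conversion step itself is passed in as a block.

```ruby
# Hypothetical pipeline: every plugin may implement any of the stage
# methods; plugins are applied in order at each stage.
class ConversionPipeline
  def initialize(plugins)
    @plugins = plugins
  end

  def run(html)
    html = apply(:preprocess_html, html)
    adoc = yield(html) # the HTML-to-AsciiDoc conversion, supplied by the caller
    apply(:postprocess_asciidoc, adoc)
  end

  private

  def apply(stage, value)
    @plugins.reduce(value) do |acc, plugin|
      plugin.respond_to?(stage) ? plugin.public_send(stage, acc) : acc
    end
  end
end

# Example plugin hooking only the HTML stage.
class StripComments
  def preprocess_html(html)
    html.gsub(/<!--.*?-->/m, "")
  end
end

# Example plugin hooking only the AsciiDoc stage.
class UpcaseHeadings
  def postprocess_asciidoc(adoc)
    adoc.gsub(/^== (.+)$/) { "== #{Regexp.last_match(1).upcase}" }
  end
end

pipeline = ConversionPipeline.new([StripComments.new, UpcaseHeadings.new])
result = pipeline.run("<h2>scope</h2><!-- note -->") do |html|
  html.sub(%r{<h2>(.*?)</h2>}, '== \1') # stand-in for the real converter
end
```

Checking `respond_to?` per stage lets a plugin implement only the hooks it needs, which is one simple way to support several plugins in a single conversion.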

@hmdne
Contributor Author

hmdne commented May 27, 2024

Here's an example from 7.1.2.4:

Original document:

image

Our document:

image

AsciiDoc for that fragment:

image

@ronaldtse
Contributor

@hmdne the ideal AsciiDoc encoding:

==== 変換規則

===== スキーマ変換規則

* スキーマ変換規則は、i-UR3.0及びCityGML2.0に従う。
* なお、標準製品仕様書は、応用スキーマクラス図及びこれに対応するXMLSchemaを新規に作成するのではなく、i-UR3.0及びCityGML2.0から必要な部分のみを選択し、使用している。
* 応用スキーマクラス図に示す、クラス名、属性名及び関連役割名は、i-UR3.0及びCityGML2.0において定義されたタグに一致させている。
* また、複数の名前空間から選択しているため、全てのクラス名に、i-UR3.0又はCityGML2.0名前空間の接頭辞を付ける。

===== インスタンス変換規則

GMLに準拠する。

* オブジェクト識別子(gml:id)
+
--
データ製品に含まれる全ての地物には、gml:idによる識別可能な値を与えることとし、その値には[接頭辞]_[UUID]を使用する。

[接頭辞]は、CityGML及びi-URの各パッケージに与えられた接頭辞(表7-4)を使用する。

[UUID]は、Universally Unique Identifier(UUID)[2]とする。UUIDとは、ソフトウェア上でオブジェクトを一意に識別するための識別子であり、128ビット(16バイト)の値で表す。先頭から4ビットごとに16進数の値(0~f)に変換し、8桁-4桁-4桁-4桁-12桁に切って表現する。
--

* 集成の実装
+
--
応用スキーマに示された地物間の集成は、部品となるオブジェクトを、全体となるオブジェクトの子要素として記述する。

この時、部品となるオブジェクトの識別子(gml:id)を、全体となるオブジェクト以外のオブジェクトが参照してもよい。
--

* 空間参照系の識別
+
--
幾何オブジェクトに適用される空間参照系は、都市モデル(core:CityModel)に挿入されるEnvelope要素の属性srsNameにおいて、以下のEPSGコードを挿入することにより識別する。

[cols="9,4"]
|===
| 空間参照系の名称 | srsNameに挿入する値

| 日本測地系2011における経緯度座標系と東京湾平均海面を基準とする標高の複合座標参照系
| http://www.opengis.net/def/crs/EPSG/0/6697
|===
--

* schemaLocationの指定
+
i-URの符号化様は、3D都市モデル内のschemasフォルダ(7.2.4)に格納したXMLSchemaファイルへの相対パスによりschemaLocationを指定する。

The interesting thing about the PLATEAU documents is they use the clause scheme like this:

Screenshot 2024-05-27 at 5 53 35 PM

So Level 4 and Level 5 are actually not lists; they are clauses (sections).

@hmdne
Contributor Author

hmdne commented May 27, 2024

The last clause level is not something we can extract programmatically, as the only class we have available is "text2data". All we can deduce from it is that the author intended "level 2 indentation". This class is used a lot in the document; for instance, the underlined parts are also "text2data":

image

We handle this particular example specially, as you requested (it is compiled into a numbered list), but in other parts of the document these are "text2data" as well:

image

I see no way to interpret "text2data" programmatically as anything other than "level 2 indentation", and that's what I try to express with lists.
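
In other words, the only signal available is the number embedded in the class name. A minimal Ruby sketch of reading it, assuming class names follow the textNdata pattern seen in this document (`indent_level` is a hypothetical helper):

```ruby
# Hypothetical helper: extract the indentation level from class names
# such as "text2data" or "text3data"; returns nil for anything else.
def indent_level(css_class)
  match = /\Atext(\d+)data\z/.match(css_class)
  match && match[1].to_i
end

levels = %w[text2data text3data subtitledata].map { |name| indent_level(name) }
```

Anything beyond that level, such as whether a paragraph is a clause title or list content, has to come from manual review or document-specific rules.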

@ronaldtse
Contributor

@hmdne there is always a balance between automated processing and manual processing, and I do agree that there are some portions we have to fix up manually after automated processing. As long as we know what work remains (ping @metanorma/editors), that's fine.

@hmdne
Contributor Author

hmdne commented May 27, 2024

I have completed the last task on this issue. It will still need some testing, but other than that, I don't see any remaining problems with the conversion.

Below is the (hopefully) final version of the document, ready for review:

document.tar.gz

@hmdne hmdne marked this pull request as ready for review May 27, 2024 17:08
@hmdne
Contributor Author

hmdne commented May 27, 2024

@ronaldtse There was a minor fix uncovered by the test suite, but it doesn't affect the document. I think this PR is ready.

@ronaldtse
Contributor

@hmdne can you let me know how you've tested the feature?

This is what I used.

$ bundle exec reverse_adoc -rcoradoc/reverse_adoc/plugins/plateau --split-sections 2 --external-images -o plateau/index.adoc index.html

I have additional issues that I will file separately now.

Contributor

@ronaldtse ronaldtse left a comment

Thank you @hmdne!

@ronaldtse ronaldtse merged commit e85eaa8 into metanorma:main May 29, 2024
15 of 16 checks passed