feat :: add soft-hyphen for multi-cell #308

oleksii-shyman · 2021-12-19T22:10:28Z

Close #291

refactor existing codebase for readability
introduce automatic line break at soft-hyphen location

Checklist:

The GitHub pipeline is OK (green),
meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.
A unit test is covering the code added / modified by this PR
This PR is ready to be merged
In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder
A mention of the change is present in CHANGELOG.md
The PR description or comment contains a picture of a cute animal (not mandatory but encouraged 😉)

- refactor existing codebase for readability - introduce automatic line break at soft-hyphen location

oleksii-shyman · 2021-12-19T22:17:12Z

At this point it is in draft state.
Logics is expected to be valid , but I wasn't able to run / test my code yet

- able to render 20+ lines of lorem ipsum with refactored multi-line - tests are failing

oleksii-shyman · 2021-12-20T05:03:54Z

basic test is working now. able to print 20+ pages of lorem ipsum using refactored approach.
output looks the same to the master version, but tests are failing. identifying the root cause

gmischler · 2021-12-20T22:43:04Z

That looks promising, very cool!

output looks the same to the master version, but tests are failing. identifying the root cause

Tests breaking on a code change are often caused by non-semantic differences in the PostScript code of the PDF (sometimes simply changing insignificant whitespace).
If you look in the temporary directory where the test output is placed, then you'll find versions of both old and new PDF with the PostScript sections decompressed, which you can compare in a text editor.

The previous code uses self._out("0 Tw") several times to reset the word spacing, while yours doesn't. That would probably account for a technical difference in the output. If you find that this doesn't change the appearance of the output, then everything is fine, and you can replace the PDF files in the repository with your new ones (found in the same temp dir) to make the tests pass. Of course, if you should find any visible differences, you'd have to fix those first.

Lucas-C · 2021-12-21T16:17:07Z

Thank you for your contribution @oleksii-shyman!

I second @gmischler answer.
You'll also find more information on tests in the documentation:
https://pyfpdf.github.io/fpdf2/Development.html#why-is-a-test-failing

- only 6 testcases fail

oleksii-shyman · 2022-01-02T03:28:37Z

I was able to fix most of the cases, only 1 pdf updated, visually validated that it has no differences with the original one.

Currently, only 2 markdown related cases are failing:

FAILED test/cells/test_multi_cell_markdown.py::test_multi_cell_markdown
FAILED test/cells/test_multi_cell_markdown.py::test_multi_cell_markdown_with_ttf_fonts

I was able to investigate the root cause, and understand why does it fail.
I originally designed this solution to consist of 2 stages:

Break normalized string into feasible chunks
Wrap feasible chunk with self.cell calls

However, this approach doesn't work on markdown strings, because markdown special characters (*, _) can affect the style of the font. And the style of the font can affect the width of any character.

I want to discuss how to properly solve this issue. One easy way to do it - to give up on the idea of 2 stages, just issue a call to self.cell as soon as feasible substring is found. In other words - keep the existing approach.
However, there are 2 problems with the existing approach:

underscore(_) and asterisk (*) characters width is included into the total width of the cell here -> https://github.com/oleksii-shyman/fpdf2/blob/master/fpdf/fpdf.py#L2341-L2344 . But these characters are not displayed to the user, so they should be ignored.
splitting the markdown string without immediately taking into account the style change leads to improper width calculation. In the existing approach style is changed only after the call to self.cell method.

Based on the problems with the existing approach i propose another way to do it:

Parse normalized string into markdown chunks with self._preload_font_styles(normalized_string, markdown). This function returns chunks of text with associated style. Also underscores and asterisks are removed -> (('Lorem ipsum dolor amet, ', '', False), ('consectetur adipiscing', 'B', False), (' elit, sed do eiusmod ', '', False), ('tempor incididunt', 'I', False), (' ut labore et dolore magna aliqua.', '', False))
Break the string into feasible chunks (taking into the accont the style)
Wrap feasible chunk with self.cell call.

The problem here is that self.cell accepts raw text (unparsed markdown) instead of chunks of text with associated style. I plan to split self.cell to 2 methods

self._cell - method that accepts parsed markdown as the argument.
self.cell - with exactly the same interface as before, no caller will spot the difference. This method will parse the markdown and pass it to self._cell method

@gmischler @Lucas-C What do you think?

oleksii-shyman · 2022-01-02T04:16:44Z

Examples (ttf \ no ttf) of basic test case with soft hyphen

text = (  # Some text where styling occur over line breaks:
    "Lorem ipsum dolor amet, con\u00adsec\u00adte\u00adtur adipiscing elit,"
    " sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
)

As you can see - there are 3 soft hyphens in the same word, but only one is present.

Lucas-C · 2022-01-02T14:20:13Z

Your latest demo-screenshot looks very promising @oleksii-shyman !
And good job really on the thorough code & options analysis!

I introduced _preload_font_styles() myself a few months ago.
I understand the need here to extract a sub-method from FPDF.cell(), that could be called by multi_cell().
I think a better naming for it could be _render_styled_cell_text().
I guess this method would basically contains all the code in the if txt block at line 1963.
AFAIC, you can proceed down this road 😉 And thanks for your time and efforts!

@all-contributors please add @oleksii-shyman for code, design, ideas

allcontributors · 2022-01-02T14:20:22Z

@Lucas-C

I've put up a pull request to add @oleksii-shyman! 🎉

gmischler · 2022-01-04T10:34:06Z

Yeah, I expected that to be a non-trivial task, but you're getting there!

The problem here is that self.cell accepts raw text (unparsed markdown) instead of chunks of text with associated style. I plan to split self.cell to 2 methods
1. `self._cell` - method that accepts parsed markdown as the argument.

2. `self.cell` - with exactly the same interface as before, no caller will spot the difference. This method will parse the markdown and pass it to `self._cell` method

Now allow me to make things even more complicated... 😁

You may have noticed that .write() does something very similar to .multicell(), mostly just with different horizontal boundaries and different wrapping rules. It currently has its own version of the line wrapping code, most of which is a redundant copy of .multicell(). Eventually, the two should really use the same code, for simplicity and maintainability. No, I'm not suggesting you should do this now, that's a seperate task.
But...

When chosing your approach for .multicell(), you might want to check your solution does not unnecessarily stand in the way of doing that later. The approach of .write() is different, in that it doesn't give you all the parts in one go as with the markup option, but it lets the user feed in differently formatted chunks of text one after the other.

I haven't looked at it very closely, but I expect that both .multicell() and .write() would eventually use a common internal method to actually do the wrapping.

Given that you need to do the markdown splitting at a very early stage, wrapping .cell() inside ._cell() (or ._cell_row()/ ._render_styled_cell_text()) seems to make a lot of sense. It can also decide which of its cell()s need border lines on which sides.

As an organisational detail, if you create new classes to do the formatting, then it might be a good idea to place those in a seperate file (assuming they don't use nested scoping). fpdf.py is already long enough, so I think we should take every opportunity to farm out code to other modules.

oleksii-shyman · 2022-01-04T17:47:42Z

I haven't looked at it very closely, but I expect that both .multicell() and .write() would eventually use a common internal method to actually do the wrapping.

Sounds reasonable, I think it is realistic to do it this way

- basic test cases are added - breaking works with variable length (for compatibily with fpdf.write method) - testcases are passing

…break

oleksii-shyman · 2022-01-26T07:29:49Z

The list of things I changed:

moved line breaking functionality into separate module. It is completely decoupled from main class. I've added some basic cases to show the usage and to indicate how other unit tests will look like
added _render_styled_cell_text function (it fixed the bug with markdown rendering)
multi line breaking functionality should be compatible with .write(), because it returns chunks of variable length.

I think this patch is ready for high level review. Please let me know if you have any objections or concerns, and I will add more tests and documentation.

codecov · 2022-01-26T07:55:15Z

Codecov Report

Merging #308 (257dca3) into master (174775a) will increase coverage by 0.44%.
The diff coverage is 95.69%.

❗ Current head 257dca3 differs from pull request most recent head 688c9a9. Consider uploading reports for the commit 688c9a9 to get more accurate results

@@            Coverage Diff             @@
##           master     #308      +/-   ##
==========================================
+ Coverage   89.99%   90.43%   +0.44%     
==========================================
  Files          19       20       +1     
  Lines        5815     5803      -12     
  Branches     1167     1148      -19     
==========================================
+ Hits         5233     5248      +15     
+ Misses        365      339      -26     
+ Partials      217      216       -1

Impacted Files	Coverage Δ
fpdf/fpdf.py	`85.49% <91.66%> (+1.19%)`	⬆️
fpdf/line_break.py	`99.01% <99.01%> (ø)`
fpdf/drawing.py	`93.62% <0.00%> (-0.35%)`	⬇️
fpdf/__init__.py	`100.00% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 174775a...688c9a9. Read the comment docs.

Lucas-C

You really did an excellent job here!
The code will be much more cleaner after merging this.
The tests are also great 👍
I made a few comments, but overall this is very good.
Could you add a mention of this change in CHANGELOG.md as part of this PR please?

fpdf/line_break.py

test/test_line_break.py

fpdf/line_break.py

oleksii-shyman · 2022-01-28T05:58:13Z

You really did an excellent job here! The code will be much more cleaner after merging this. The tests are also great +1 I made a few comments, but overall this is very good. Could you add a mention of this change in CHANGELOG.md as part of this PR please?

Thank you for your feedback, will do my changes shortly

gmischler · 2022-01-29T05:59:49Z

When I submitted #291, I was prepared to wait quite a bit, as I realized it required some substantial refactoring to get it to work.
And just three months later, here we are with an evidently quite competently made implementation.
Color me impressed! 👍

This will serve as a solid basis for someone later adapting .write(), so it can also align its output. Doing that will probably involve caching text chunks until each line is complete. And to make sure the last line gets processed as well, we may need some kind of "text block" context manager to finalize that. Actually, I think such text block could also manage the outline geometry, and solve #245 at the same time.
I wonder how long that would take... 😉

oleksii-shyman · 2022-01-29T15:34:40Z

When I submitted #291, I was prepared to wait quite a bit, as I realized it required some substantial refactoring to get it to work. And just three months later, here we are with an evidently quite competently made implementation. Color me impressed! +1

This will serve as a solid basis for someone later adapting .write(), so it can also align its output. Doing that will probably involve caching text chunks until each line is complete. And to make sure the last line gets processed as well, we may need some kind of "text block" context manager to finalize that. Actually, I think such text block could also manage the outline geometry, and solve #245 at the same time. I wonder how long that would take... wink

I have no bandwidth to handle another change of scope. Do you think that this basis is solid enough for someone later adapting .write() method?

Lucas-C · 2022-01-29T16:32:44Z

I have no bandwidth to handle another change of scope. Do you think that this basis is solid enough for someone later adapting .write() method?

Yes, I think it's better to get this merged first, than anyone who's willing to improve things could submit other PRs.
Please just address the review feedbacks and I'll merge this.
You will also need to rebase your branch on master due to this bugfix I merged earlier today: #333

gmischler · 2022-01-29T17:54:42Z

I have no bandwidth to handle another change of scope. Do you think that this basis is solid enough for someone later adapting .write() method?

I didn't mean to imply that any changes to .write() should be a part of this PR. But yes, your clean and seperated line wrapping implementation should make them quite straightforward.

Lucas-C · 2022-02-03T10:02:40Z

@oleksii-shyman please ping me if I can ever help 😊

- add more tests - address review comments

…break

- updated one markdown test pdf file - removed unused line

oleksii-shyman · 2022-02-03T21:30:09Z

@oleksii-shyman please ping me if I can ever help blush

Thank you! I was able to added much more tests, and update my code (according to review comments)

Lucas-C · 2022-02-04T11:13:24Z

This looks perfect, merging the PR now!

Lucas-C · 2022-02-07T10:55:55Z

I think we should add a mention of soft-hyphen support in the docs, but I'm not sure where...

feat :: add soft-hyphen for multi-cell

2d460cd

- refactor existing codebase for readability - introduce automatic line break at soft-hyphen location

oleksii-shyman mentioned this pull request Dec 19, 2021

Feature Request: Support soft hyphen in Multicell #291

Closed

fix :: basic case is working

3fa53f2

- able to render 20+ lines of lorem ipsum with refactored multi-line - tests are failing

oleksii-shyman added 3 commits December 31, 2021 15:05

fix :: refactor + bugfix

91cbe77

- only 6 testcases fail

fix :: change height of the last cell

3d799ee

fix :: pylint + black

b3fcb66

fix :: make basic case for hyphen work

49a1706

allcontributors bot mentioned this pull request Jan 2, 2022

docs: add oleksii-shyman as a contributor for code, design, ideas #312

Merged

Lucas-C mentioned this pull request Jan 22, 2022

Issue with the use of markdown and justified text (align="J") #327

Closed

oleksii-shyman added 2 commits January 25, 2022 22:57

fix :: move breaking to separate module

d009f74

- basic test cases are added - breaking works with variable length (for compatibily with fpdf.write method) - testcases are passing

Merge remote-tracking branch 'upstream/master' into soft_hyphen_line_…

c501266

…break

oleksii-shyman marked this pull request as ready for review January 26, 2022 07:30

oleksii-shyman requested a review from Lucas-C as a code owner January 26, 2022 07:30

Lucas-C reviewed Jan 27, 2022

View reviewed changes

oleksii-shyman added 3 commits February 3, 2022 12:50

fix :: address review comments

1625377

- add more tests - address review comments

Merge remote-tracking branch 'upstream/master' into soft_hyphen_line_…

adc8bca

…break

fix :: update pdf for failing test

2f272b9

- updated one markdown test pdf file - removed unused line

fix :: update CHANGELOG.md

688c9a9

Lucas-C merged commit 430ff28 into py-pdf:master Feb 4, 2022

This was referenced Feb 7, 2022

WIP: multi-line cells in table #213

Closed

Support multi-lines text in table cells with write_html - fix #91 #211

Closed

Make FPDF.write() use _render_styled_cell_text() #340

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat :: add soft-hyphen for multi-cell #308

feat :: add soft-hyphen for multi-cell #308

oleksii-shyman commented Dec 19, 2021 •

edited by Lucas-C

Loading

oleksii-shyman commented Dec 19, 2021

oleksii-shyman commented Dec 20, 2021

gmischler commented Dec 20, 2021

Lucas-C commented Dec 21, 2021

oleksii-shyman commented Jan 2, 2022

oleksii-shyman commented Jan 2, 2022

Lucas-C commented Jan 2, 2022

allcontributors bot commented Jan 2, 2022

gmischler commented Jan 4, 2022

oleksii-shyman commented Jan 4, 2022

oleksii-shyman commented Jan 26, 2022

codecov bot commented Jan 26, 2022 •

edited

Loading

Lucas-C left a comment •

edited

Loading

oleksii-shyman commented Jan 28, 2022

gmischler commented Jan 29, 2022

oleksii-shyman commented Jan 29, 2022

Lucas-C commented Jan 29, 2022

gmischler commented Jan 29, 2022

Lucas-C commented Feb 3, 2022

oleksii-shyman commented Feb 3, 2022

Lucas-C commented Feb 4, 2022

Lucas-C commented Feb 7, 2022

feat :: add soft-hyphen for multi-cell #308

feat :: add soft-hyphen for multi-cell #308

Conversation

oleksii-shyman commented Dec 19, 2021 • edited by Lucas-C Loading

oleksii-shyman commented Dec 19, 2021

oleksii-shyman commented Dec 20, 2021

gmischler commented Dec 20, 2021

Lucas-C commented Dec 21, 2021

oleksii-shyman commented Jan 2, 2022

oleksii-shyman commented Jan 2, 2022

Lucas-C commented Jan 2, 2022

allcontributors bot commented Jan 2, 2022

gmischler commented Jan 4, 2022

oleksii-shyman commented Jan 4, 2022

oleksii-shyman commented Jan 26, 2022

codecov bot commented Jan 26, 2022 • edited Loading

Codecov Report

Lucas-C left a comment • edited Loading

Choose a reason for hiding this comment

oleksii-shyman commented Jan 28, 2022

gmischler commented Jan 29, 2022

oleksii-shyman commented Jan 29, 2022

Lucas-C commented Jan 29, 2022

gmischler commented Jan 29, 2022

Lucas-C commented Feb 3, 2022

oleksii-shyman commented Feb 3, 2022

Lucas-C commented Feb 4, 2022

Lucas-C commented Feb 7, 2022

oleksii-shyman commented Dec 19, 2021 •

edited by Lucas-C

Loading

codecov bot commented Jan 26, 2022 •

edited

Loading

Lucas-C left a comment •

edited

Loading