Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat :: add soft-hyphen for multi-cell #308

Merged
merged 12 commits into from
Feb 4, 2022
Merged

feat :: add soft-hyphen for multi-cell #308

merged 12 commits into from
Feb 4, 2022

Conversation

oleksii-shyman
Copy link

@oleksii-shyman oleksii-shyman commented Dec 19, 2021

Close #291

  • refactor existing codebase for readability
  • introduce automatic line break at soft-hyphen location

Checklist:

  • The GitHub pipeline is OK (green),
    meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.

  • A unit test is covering the code added / modified by this PR

  • This PR is ready to be merged

  • In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder

  • A mention of the change is present in CHANGELOG.md

  • The PR description or comment contains a picture of a cute animal (not mandatory but encouraged 😉)

  - refactor existing codebase for readability
  - introduce automatic line break at soft-hyphen location
@oleksii-shyman
Copy link
Author

At this point it is in draft state.
Logics is expected to be valid , but I wasn't able to run / test my code yet

  - able to render 20+ lines of lorem ipsum with refactored multi-line
  - tests are failing
@oleksii-shyman
Copy link
Author

basic test is working now. able to print 20+ pages of lorem ipsum using refactored approach.
output looks the same to the master version, but tests are failing. identifying the root cause

@gmischler
Copy link
Collaborator

That looks promising, very cool!

output looks the same to the master version, but tests are failing. identifying the root cause

Tests breaking on a code change are often caused by non-semantic differences in the PostScript code of the PDF (sometimes simply changing insignificant whitespace).
If you look in the temporary directory where the test output is placed, then you'll find versions of both old and new PDF with the PostScript sections decompressed, which you can compare in a text editor.

The previous code uses self._out("0 Tw") several times to reset the word spacing, while yours doesn't. That would probably account for a technical difference in the output. If you find that this doesn't change the appearance of the output, then everything is fine, and you can replace the PDF files in the repository with your new ones (found in the same temp dir) to make the tests pass. Of course, if you should find any visible differences, you'd have to fix those first.

@Lucas-C
Copy link
Member

Lucas-C commented Dec 21, 2021

Thank you for your contribution @oleksii-shyman!

I second @gmischler answer.
You'll also find more information on tests in the documentation:
https://pyfpdf.github.io/fpdf2/Development.html#why-is-a-test-failing

@oleksii-shyman
Copy link
Author

I was able to fix most of the cases, only 1 pdf updated, visually validated that it has no differences with the original one.

Currently, only 2 markdown related cases are failing:

FAILED test/cells/test_multi_cell_markdown.py::test_multi_cell_markdown
FAILED test/cells/test_multi_cell_markdown.py::test_multi_cell_markdown_with_ttf_fonts

I was able to investigate the root cause, and understand why does it fail.
I originally designed this solution to consist of 2 stages:

  1. Break normalized string into feasible chunks
  2. Wrap feasible chunk with self.cell calls

However, this approach doesn't work on markdown strings, because markdown special characters (*, _) can affect the style of the font. And the style of the font can affect the width of any character.

I want to discuss how to properly solve this issue. One easy way to do it - to give up on the idea of 2 stages, just issue a call to self.cell as soon as feasible substring is found. In other words - keep the existing approach.
However, there are 2 problems with the existing approach:

  1. underscore(_) and asterisk (*) characters width is included into the total width of the cell here -> https://github.com/oleksii-shyman/fpdf2/blob/master/fpdf/fpdf.py#L2341-L2344 . But these characters are not displayed to the user, so they should be ignored.
  2. splitting the markdown string without immediately taking into account the style change leads to improper width calculation. In the existing approach style is changed only after the call to self.cell method.

Based on the problems with the existing approach i propose another way to do it:

  1. Parse normalized string into markdown chunks with self._preload_font_styles(normalized_string, markdown). This function returns chunks of text with associated style. Also underscores and asterisks are removed -> (('Lorem ipsum dolor amet, ', '', False), ('consectetur adipiscing', 'B', False), (' elit, sed do eiusmod ', '', False), ('tempor incididunt', 'I', False), (' ut labore et dolore magna aliqua.', '', False))
  2. Break the string into feasible chunks (taking into the accont the style)
  3. Wrap feasible chunk with self.cell call.

The problem here is that self.cell accepts raw text (unparsed markdown) instead of chunks of text with associated style. I plan to split self.cell to 2 methods

  1. self._cell - method that accepts parsed markdown as the argument.
  2. self.cell - with exactly the same interface as before, no caller will spot the difference. This method will parse the markdown and pass it to self._cell method

@gmischler @Lucas-C What do you think?

@oleksii-shyman
Copy link
Author

image
image

Examples (ttf \ no ttf) of basic test case with soft hyphen

text = (  # Some text where styling occur over line breaks:
    "Lorem ipsum dolor amet, con\u00adsec\u00adte\u00adtur adipiscing elit,"
    " sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
) 

As you can see - there are 3 soft hyphens in the same word, but only one is present.

@Lucas-C
Copy link
Member

Lucas-C commented Jan 2, 2022

Your latest demo-screenshot looks very promising @oleksii-shyman !
And good job really on the thorough code & options analysis!

I introduced _preload_font_styles() myself a few months ago.
I understand the need here to extract a sub-method from FPDF.cell(), that could be called by multi_cell().
I think a better naming for it could be _render_styled_cell_text().
I guess this method would basically contains all the code in the if txt block at line 1963.
AFAIC, you can proceed down this road 😉 And thanks for your time and efforts!

@all-contributors please add @oleksii-shyman for code, design, ideas

@allcontributors
Copy link

@Lucas-C

I've put up a pull request to add @oleksii-shyman! 🎉

@gmischler
Copy link
Collaborator

Yeah, I expected that to be a non-trivial task, but you're getting there!

The problem here is that self.cell accepts raw text (unparsed markdown) instead of chunks of text with associated style. I plan to split self.cell to 2 methods

1. `self._cell` - method that accepts parsed markdown as the argument.

2. `self.cell` - with exactly the same interface as before, no caller will spot the difference. This method will parse the markdown and pass it to `self._cell` method

Now allow me to make things even more complicated... 😁

You may have noticed that .write() does something very similar to .multicell(), mostly just with different horizontal boundaries and different wrapping rules. It currently has its own version of the line wrapping code, most of which is a redundant copy of .multicell(). Eventually, the two should really use the same code, for simplicity and maintainability. No, I'm not suggesting you should do this now, that's a seperate task.
But...

When chosing your approach for .multicell(), you might want to check your solution does not unnecessarily stand in the way of doing that later. The approach of .write() is different, in that it doesn't give you all the parts in one go as with the markup option, but it lets the user feed in differently formatted chunks of text one after the other.

I haven't looked at it very closely, but I expect that both .multicell() and .write() would eventually use a common internal method to actually do the wrapping.

Given that you need to do the markdown splitting at a very early stage, wrapping .cell() inside ._cell() (or ._cell_row()/ ._render_styled_cell_text()) seems to make a lot of sense. It can also decide which of its cell()s need border lines on which sides.

As an organisational detail, if you create new classes to do the formatting, then it might be a good idea to place those in a seperate file (assuming they don't use nested scoping). fpdf.py is already long enough, so I think we should take every opportunity to farm out code to other modules.

@oleksii-shyman
Copy link
Author

I haven't looked at it very closely, but I expect that both .multicell() and .write() would eventually use a common internal method to actually do the wrapping.

Sounds reasonable, I think it is realistic to do it this way

  - basic test cases are added
  - breaking works with variable length (for compatibily with fpdf.write
    method)
  - testcases are passing
@oleksii-shyman
Copy link
Author

The list of things I changed:

  1. moved line breaking functionality into separate module. It is completely decoupled from main class. I've added some basic cases to show the usage and to indicate how other unit tests will look like
  2. added _render_styled_cell_text function (it fixed the bug with markdown rendering)
  3. multi line breaking functionality should be compatible with .write(), because it returns chunks of variable length.

I think this patch is ready for high level review. Please let me know if you have any objections or concerns, and I will add more tests and documentation.

@oleksii-shyman oleksii-shyman marked this pull request as ready for review January 26, 2022 07:30
@codecov
Copy link

codecov bot commented Jan 26, 2022

Codecov Report

Merging #308 (257dca3) into master (174775a) will increase coverage by 0.44%.
The diff coverage is 95.69%.

❗ Current head 257dca3 differs from pull request most recent head 688c9a9. Consider uploading reports for the commit 688c9a9 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #308      +/-   ##
==========================================
+ Coverage   89.99%   90.43%   +0.44%     
==========================================
  Files          19       20       +1     
  Lines        5815     5803      -12     
  Branches     1167     1148      -19     
==========================================
+ Hits         5233     5248      +15     
+ Misses        365      339      -26     
+ Partials      217      216       -1     
Impacted Files Coverage Δ
fpdf/fpdf.py 85.49% <91.66%> (+1.19%) ⬆️
fpdf/line_break.py 99.01% <99.01%> (ø)
fpdf/drawing.py 93.62% <0.00%> (-0.35%) ⬇️
fpdf/__init__.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 174775a...688c9a9. Read the comment docs.

Copy link
Member

@Lucas-C Lucas-C left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You really did an excellent job here!
The code will be much more cleaner after merging this.
The tests are also great 👍
I made a few comments, but overall this is very good.
Could you add a mention of this change in CHANGELOG.md as part of this PR please?

fpdf/line_break.py Show resolved Hide resolved
fpdf/line_break.py Outdated Show resolved Hide resolved
fpdf/line_break.py Show resolved Hide resolved
fpdf/line_break.py Outdated Show resolved Hide resolved
fpdf/line_break.py Show resolved Hide resolved
fpdf/line_break.py Show resolved Hide resolved
fpdf/line_break.py Show resolved Hide resolved
test/test_line_break.py Outdated Show resolved Hide resolved
fpdf/line_break.py Show resolved Hide resolved
@oleksii-shyman
Copy link
Author

You really did an excellent job here! The code will be much more cleaner after merging this. The tests are also great +1 I made a few comments, but overall this is very good. Could you add a mention of this change in CHANGELOG.md as part of this PR please?

Thank you for your feedback, will do my changes shortly

@gmischler
Copy link
Collaborator

When I submitted #291, I was prepared to wait quite a bit, as I realized it required some substantial refactoring to get it to work.
And just three months later, here we are with an evidently quite competently made implementation.
Color me impressed! 👍

This will serve as a solid basis for someone later adapting .write(), so it can also align its output. Doing that will probably involve caching text chunks until each line is complete. And to make sure the last line gets processed as well, we may need some kind of "text block" context manager to finalize that. Actually, I think such text block could also manage the outline geometry, and solve #245 at the same time.
I wonder how long that would take... 😉

@oleksii-shyman
Copy link
Author

When I submitted #291, I was prepared to wait quite a bit, as I realized it required some substantial refactoring to get it to work. And just three months later, here we are with an evidently quite competently made implementation. Color me impressed! +1

This will serve as a solid basis for someone later adapting .write(), so it can also align its output. Doing that will probably involve caching text chunks until each line is complete. And to make sure the last line gets processed as well, we may need some kind of "text block" context manager to finalize that. Actually, I think such text block could also manage the outline geometry, and solve #245 at the same time. I wonder how long that would take... wink

I have no bandwidth to handle another change of scope. Do you think that this basis is solid enough for someone later adapting .write() method?

@Lucas-C
Copy link
Member

Lucas-C commented Jan 29, 2022

I have no bandwidth to handle another change of scope. Do you think that this basis is solid enough for someone later adapting .write() method?

Yes, I think it's better to get this merged first, than anyone who's willing to improve things could submit other PRs.
Please just address the review feedbacks and I'll merge this.
You will also need to rebase your branch on master due to this bugfix I merged earlier today: #333

@gmischler
Copy link
Collaborator

I have no bandwidth to handle another change of scope. Do you think that this basis is solid enough for someone later adapting .write() method?

I didn't mean to imply that any changes to .write() should be a part of this PR. But yes, your clean and seperated line wrapping implementation should make them quite straightforward.

@Lucas-C
Copy link
Member

Lucas-C commented Feb 3, 2022

@oleksii-shyman please ping me if I can ever help 😊

  - add more tests
  - address review comments
  - updated one markdown test pdf file
  - removed unused line
@oleksii-shyman
Copy link
Author

@oleksii-shyman please ping me if I can ever help blush

Thank you! I was able to added much more tests, and update my code (according to review comments)

@Lucas-C
Copy link
Member

Lucas-C commented Feb 4, 2022

This looks perfect, merging the PR now!

@Lucas-C
Copy link
Member

Lucas-C commented Feb 7, 2022

I think we should add a mention of soft-hyphen support in the docs, but I'm not sure where...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Support soft hyphen in Multicell
3 participants