Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding non-breakable spaces lua filter #119

Open
wants to merge 37 commits into
base: master
Choose a base branch
from

Conversation

Delanii
Copy link

@Delanii Delanii commented Oct 13, 2020

This adresses #114 .

Its first pull request I have ever made, so hopefully I did not messed anything up. I know it is a whole bunch of files, which should be avoided, but I think that in this case it should be put-and-packaged as one ...

Per #114 it still has not implemented the change in tables and presence-check, since I am having trouble understanding that syntax (idiom) and using it. Hopefully it doesnt mind too much.

Regards, Tomas

@Delanii
Copy link
Author

Delanii commented Oct 13, 2020

OK, it seems that somehow I did not upload the makefile?

I have added makefile to my fork, which looks like added the makefile also here ... Well, I have to learn about git and GitHub :D

@alerque
Copy link
Collaborator

alerque commented Oct 13, 2020

On GitHub a Pull Request is a request to merge one branch into another, in this case your fork's master branch into this repository's. You can update PR's by adding, removing, or rebasing commits in a branch. Typically (and to get more GitHub tooling) you'd open a PR from a branch on your fork other than master. In that case maintainers of the target project can also make changes to the PR before merging. As it is only you can, but anything you do in your master branch will continue to change this PR until it is merged or closed.

@Delanii
Copy link
Author

Delanii commented Oct 13, 2020

Oh, I wasnt sure if I should create new branch in my fork or not. Next time I do better.
Thank you for clarification.

Do I understand you correctly, that now nobody (except me) can make edit to this PR? I thought, that the "Allow edits by maintainers" checker is to control this. I have it checked, which I hoped is for making it possible for project maintainers to correct any errors in this PR (well, if they prefer that over telling me what should I fix).

@alerque
Copy link
Collaborator

alerque commented Oct 13, 2020

I thought, that the "Allow edits by maintainers" checker is to control this. I have it checked

Really? Last I checked (less than a week ago) that checkbox was disabled for all PR's from master branches. If you see the box and have it checked then GitHub has updated something recently, and that's a nice fix that I'm glad to hear about.

@Delanii
Copy link
Author

Delanii commented Oct 13, 2020

I have used this manual:

Pull request from a fork

Step no. 7 is the one which we are talking about, I hope. Except I did the exact opposite: Left it checked to allow maintainers to do with this PR anything they deem neccessary ...

Btw. It seems that bibliography test are misbehaving; maybe because of newest update?

@stroobandt
Copy link
Contributor

Interesting. Today, I was looking at some legacy production code with the exact same purpose; thinking by myself that I better replace it by a Lua filter. The current code is a monster of a nested sed that lives in my makefile.

The sed command creates secondary sed substitution commands from a list of so called short begin words.
These secondary sed commands will replace the ordinary space after a short begin word at the beginning of a sentence with a non-breaking space.

The list of short begin words is language dependent. Here are the English and Dutch lists.

Finally, we do not want such substitutions to occur in <PRE>-formatted code blocks.
That is where the /```/,/```/! is for.

@Delanii
Copy link
Author

Delanii commented Oct 14, 2020

@stroobandt Thank you for posting that!

Sadly, I am not enough versed in makefiles or sed to make much of it, but from the lists I see that you went in simillar direction. I am actually not sure how to implement different behavior according to language (well, I would have to read somehow the metadata variable lang).

If you are concerned about code-block (I was too) I have tested that (and it is even in test file of the PR) - codeblocks and inline code is recognized well and no space substitution happens there.

@tarleb
Copy link
Member

tarleb commented Oct 18, 2020

Thanks for the updates! I'll take a closer look soon, I just need to fix the tests first. This may take a while, as I'm going to spend less time in front of the screen for the next couple of days.

The filter name "nonbreakingspace" is a bit generic and somewhat cryptic. Can we find something more specific? How about using the same name as the TeX package, "vlna"? Would that be too confusing? Maybe "west-slavic-nbsp"? I'm open for ideas.

@Delanii
Copy link
Author

Delanii commented Oct 19, 2020

Well, "pandoc-vlna" in itself could be good, but I would think that it would introduce different "crypticity;" that is, that LaTeX packages will be known only to LaTeX users; and most probably to the subset that doesnt already used theyre own homemade solutions (I would guess that those with homemade solutions will be actually plenty).

I thought that naming the filter by what it does as much as possible will prevent its name from being misleading. Very often I am taking inspiration from lua filter pagebreak - which does exactly what it is named after: inserts pagebreak.

But in the spirit of your previous comment, also "insert-nbsp" (or no-abbreviated form "insert-nonbreakable-space") could work too. I would like to aviod specifying any nationalities or geografic locations to the filter, because it might be offputting to someone (or even any way worse). The code state is mostly caused by me being lua newbie and natively Czech, which leads me to make the default setting suitable for Czech language requirements. I hope that the filter as it is is customizable enough for anybody to make changes according to his native language requirements.

If anybody would know how to make this filter conformant to more than one language, I would be happy to make more improvements. In this regard, my only idea is to have multiple (well, depending on how many languages a lot) tables for each language and then (somehow) read metadata lang variable. But that is highly beyond my linguistic and pandoc-hacking skills ...

@tarleb Let me know which identification is more suitable:

  • pandoc-vlna
  • insert-nbsp
  • insert-nonbreakable-space

At the end of the day, as famous classic says:

"A rose by any other name would smell as sweet"

... as long as its well described in README.md (hopefully it is).

Regards, Tomas

@tarleb
Copy link
Member

tarleb commented Oct 28, 2020

I'd like to ask a favor. I compiled a new contributing guide over the last two weeks and would like to get some feedback on it. I'd appreciate if you could take a look at the document and tell me about all passages that are unclear or that you feel should be improved for other reasons.

I am not commenting on the code yet, as I'm also trying to evaluate if the information I added to the guide is sufficient to prevent some common issues which often come up during code review. Hope that's ok with you?

As for the name: each options seems fine. Since Czech is the only supported language, we could also be more explicit about the filter's purpose and include czeck in the name. I'll leave the choice to you.

@Delanii
Copy link
Author

Delanii commented Nov 1, 2020

Considering the new guide, everything mentioned seems pretty completely clear and OK to me. I can accomodate for the ammount of characters per line. Editorconfig seems like a great suggestion, although I am not familiar with that utility.
I take another look at the diagram-generator filter and try to accomodate for setting this fiter according to lang metadata value. I can try to make another set of rules for english (and make that actually default).
That could even make it valid to name the filter pandoc-vlna (well, in my opinion).
I let you know when all changes are finished (could be pretty fast).

Updated to have max 80 chars per line.
Renamed to `sampleCZ` to allow testing of czech language setting.
Another sample for english language testing.
Expected result from compilation of `sampleCZ.md`. Just renamed.
Updated to read `lang` variable setting with default value for english, also with fallback to english if `lang` is not specified. Also added english `prefixes`.
New file for testing after latest filter update.
Added explicit `lang` setting.
Added test result for english `lang` setting.
Added test for english `lang` setting, renamed filter file.
@Delanii
Copy link
Author

Delanii commented Nov 6, 2020

@tarleb I have finished changes according to the newest contributing guide. All should be good to go!
Only thing I realized that in case of Spans or Divs it doesnt check, if those have set theyre lang (attribute?) differently. It is something I can work on when there will be more languages supported (for two languages it might not make much sense).
The only thing is that I am having trouble changing the directory name itself from nonbreakablespace to pandocVlna (now I dare to name it that way) ... :)

Copy link
Member

@tarleb tarleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes, I like where this is heading! I left some inline comments which I hope are helpful. Ready to merge once these are resolved.

local insert = insert_nonbreakable_space(FORMAT)

for i = 1, #inlines do
if inlines[i].t == 'Space' then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about naming the elements which we are looking at? I find prev or previous much easier to read than inlines[i - 1].

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can accomodate for that, eventhough in this case for me personally was easier to see exactly what is it I am calling at.

I have made required changes, but I am having trouble assigning the replacement string wtih them, meaning if I write:

currentElement = insert

it doesnt work.

But this works:

inlines[i] = insert

works. It makes sense that these make-up variables dont reassing to original inlines list, but how can I accomodate for that?

I have also noticed another issue - writing SoftBreak element instead of Space in the place, where should be &nbsp wont trigger replacement (of course) - I have remedied for that: Now I am testing for Space of SoftBreak elements.

I have uploaded new files.

Please, let me know what you think about it. I personally in this specific case would prefer writing inlines[i]; but modifying the filter as you suggest would allow me to learn more, so I am open to doing that. After resolving this issue, I will work on the next one.

nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
nonbreakablespace/pandocVlna.lua Outdated Show resolved Hide resolved
@Delanii
Copy link
Author

Delanii commented Nov 22, 2020

Added another update with all improvements suggested in review. Also updated filter file header according to new contribution guide.

I was returning from Meta function, which caused destruction of all metadata. That should be fixed now, Hopefully it will fix the filter tests too.
The filter failed if last element in paragraph would be "Space" (result of other filter or from writing), similarly to case when "Space" would be first element of par block. Fixed now in replacement loop beginning (line 122)
@netique
Copy link

netique commented May 6, 2021

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \\% or \\\% does). Help would be greatly appreciated!

@badumont
Copy link
Contributor

badumont commented May 6, 2021 via email

@Delanii
Copy link
Author

Delanii commented May 6, 2021

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \\% or \\\% does). Help would be greatly appreciated!

Hello @netique ,

well, this filter doesnt cover it all. I admit I have settled up to dealing with things that were the easiest -- just list few prepositions and deal with split numbers. The percent character could be thought about as a unit, which has actually the opposite requirement -- has to be coupled with preceeding text element (number, as in "10 %"), not with the following one (word, as in "10 apples").

As I wrote in the beginning, I was and still am a lua newbie, so by cutting off units, chapters (ex: "chapter 9") and such, I have made creating this filter a lot easier.

I am not very sure how I would deal with units and such in this filter. In LaTeX this is being dealt with by means of famous siunitx package, so I could take a peek at its code and try to inspire myself, but I think that that would require actually adding extra markup that would be recognized by filter, something like @Unit{5}{%} or @Unit{10}{m/h}, but I am not sure if everybody would wont to write so much markup just to have nonbreakable spaces between number and unit.

@Delanii
Copy link
Author

Delanii commented May 6, 2021

Just a side note: in lua regexp, % is the escape char, so you have to type %% so that it escapes itself. I wonder if this filter could be made more general by creating some metadata fields taking a list a characters to be associated with a non-breakable space. Something like "non-breakable-space-before", "non-breakable-space-after", "thin-space-before", "thin-space-after". Of course it would be complementary to the built-in configuration files per language that you mentioned. What are your thoughts about that? Le Thursday 06 May 2021 à 03:31:12AM, Jan Netík a écrit :

Hi @badumont !

Well, that is a great idea, I was actually thinking about something like that in the beginning, but then I thought/hoped that this filter could grow by feedback to incorporate more languages with their respective rules. I wanted the filter to be as simple to use as possible so there is no configuration. But the code is open as well, so you can modify it to your respective needs (by just modifying the table with prepositions) to tweak the behaviour as you need.

I also have no idea how to implement configuration of a lua filter. And how would that be stated (configuration file, default file, stating at command line??)

Do you have an idea how to do that?

@badumont
Copy link
Contributor

badumont commented May 6, 2021

I am not an expert either, but here is my guess.

If you wish to support multiple languages, I think that the filter itself should
contain only general-purpose functions, not language-related data, exactly what
you do in l. 117-140. So we could imagine an Inlines function that only calls
functions like: replace-space-by-nbsp and replace-space-by-thin (if you expect
from the user that he or she types a normal space where the nbsp should be, which of
course is debatable). Every function would then read a table containing the
characters that need a nbsp.

For the configuration, there could be two non-exclusive ways: the one that I
mentioned, and another based on configuration files per language. The easiest
way from the programmer's point of view is to write configuration files as
Lua tables, so that they can be directly imported in the lua filter with dofile.
It could look like this (simplified for French):

local nbsp_before = {
  ':',
  '»'
}

local nbsp_after = {
  '«'
}

local thin_before = {
  ';'
}

spaces_config = {
  nbsp_before = nbsp_before,
  nbsp_after = nbsp_after,
  thin_before = thin_before
}

Then simply add dofile "config/FR_config.lua" near the beginning of the file. Then you will be able to get the list of the characters that require a thin space before it with spaces_config.thin_before. Of course, these "characters" can be whole strings like the. That would only require to map the supported IETF language tags to the corresponding configuration files before calling dofile. Such configuration file could be easily written by users by modifying an existing one, I think.

Another thing: why do you define the nbsp character per format? Pandoc does the
conversion itself (try pandoc -t latex -f html <<< '&nbsp').

I can help if you want. As a French user who currently selects manually every special space, I am very interested in this filter! However, in that case, it may be easier to do it on a separate repo for I already have a fork of lua-filters with a pending pull request...

@hugoroy
Copy link

hugoroy commented Aug 31, 2021

Hi both
I am very interested in this as well. I currently do this with a bash script as filter and use sed. Not very clean :-)

What about specifying functions depending on the language set?

For instance if meta.lang = fr then we may provide a list of functions to replace the non-breakable spaces with the right non-breakable space that should appear in French.

Edit: Okay I just saw how this is implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants