Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove substitution of non-breakable spaces #897

Closed
wants to merge 2 commits into from

Conversation

kephas
Copy link

@kephas kephas commented May 22, 2017

Fixes #640
Fixes #363

@joshbruce
Copy link
Member

@kephas: Is it possible add tests to this PR??

@styfle styfle requested review from UziTech and davisjam April 16, 2018 13:08
Copy link
Member

@UziTech UziTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was added in c13a7aa to fix #52

Being that there is a work around (using  ) I say we leave it.

@davisjam
Copy link
Contributor

@kephas

I don't know much about this part of the code...

  1. Can you comment on the potential impact on the rest of the lexer? I imagine that it will all "just work" and the \u00a0 will be preserved in the generated HTML to be rendered by the browser?
  2. @UziTech If \u00a0 is a distinct Unicode character with special behavior not equivalent to plain whitespace, why would we eliminate it? We could instead replace \u00a0 with &nbsp, I suppose, but that seems worse than the approach @kephas has proposed.

@UziTech
Copy link
Member

UziTech commented Jun 27, 2018

@davisjam
It seems this issue happens when taking input from a content editable div. Typing two or more consecutive spaces turns the extra spaces into \u00a0 which @OscarGodson (#52) believes should be interpreted as whitespace and @kephas (#640) believes should be interpreted as non-breaking space characters.

@Martii
Copy link
Contributor

Martii commented Jun 27, 2018

So here's a thought... if someone uses https://github.github.com/gfm/#indented-code-blocks then the "content editable div" will convert those 4 normal typed spaces into 1 normal space and 3 \u00a0 (non-breaking spaces) and then marked currently converts those back to regular spaces (as if they were typed that way and they were from what I understand reading this). Removal of that seems like it would break that part of the spec if they weren't normal spaces. Just something to chew on I hope.

@UziTech
Copy link
Member

UziTech commented Jun 27, 2018

I agree with @Martii, I would think that most people who enter a \u00a0 mean for it to be a regular space and not a non-breaking space character.

If someone is going to enter \u00a0 intentionally they can just as easily enter  

@davisjam
Copy link
Contributor

So let's see if I understand the situation: we can't tell with complete certainty whether a \u00a0 was generated automatically from whitespace (so replacing it with whitespace is fine) or whether it was deliberately entered (so replacing it with whitespace is surprising).

If the majority of the \u00a0 we see are auto-generated from whitespace, then I think returning them to whitespace by default (i.e. the current behavior) is fine.

I suspect that most of the \u00a0's we see are auto-generated. Markdown is generally produced by humans typing into text editors, and none of the humans I know inject unusual Unicode characters into their markdown files. And @UziTech is correct that an &nbsp is a viable way to insert non-breaking spaces if someone really wants them.

@kephas
Copy link
Author

kephas commented Jun 29, 2018

I fear that you are basing your assumptions on english writers. In French, for example, it is a requirement to have a space before double punctuations like [;:?!]. Some text editors actually do this for you and do this, with good reason, not with a normal space but an unbreakable space, because that punctation should not be orphaned, typographically speaking.

@UziTech
Copy link
Member

UziTech commented Jun 29, 2018

Possible solutions that work for both parties:

  1. Leave marked as is and recommend anyone who needs \u00a0 to be a non-breaking space to use
    marked(markdownString.replace(/\u00a0/g, ' '))
  2. Commit this PR and recommend anyone who needs \u00a0 to be whitespace to use
    marked(markdownString.replace(/\u00a0/g, ' '))
  3. Create an option to let the user convert \u00a0 to ' ' or not

@Martii
Copy link
Contributor

Martii commented Jun 29, 2018

3)) NON-BREAKING Could work but a default of 1)) 👍

1)) NON-BREAKING ❓ Couldn't those few languages that punctuate uniquely write their own Lexer.prototype.lex or is that not possible? If it is possible it would make more sense for them to utilize that with the French language.

2)) BREAKING 👎 Mentioned at #897 (comment) and that's probably not the only one that needs real spaces.

Misc note: Russian, Spanish, etc. doesn't use an extra space and it's definitely not English. So moot point.

A)) NON-BREAKING ❓ Seems like CSS and/or the browser default language via should handle this although I've never seen a rule like this.

B)) NON-BREAKING ❓ Some html metadata... Ex. http://nimbupani.com/declaring-languages-in-html-5.html

@davisjam
Copy link
Contributor

@kephas You are correct that I was only considering English-language practices in my summary.

@UziTech Thanks for the proposals.

I like the "offer an option for this" approach (3), and I think @Martii's suggestion of using substitution as the default is the right choice.

I don't want to ask users to supply a mix of options and ad hoc substitutions, as suggested in (1).

@mccraveiro
Copy link
Contributor

Option 1 means we would never be 100% spec compliant as stated on #1398
Option 2 makes a lot of sense and goes in the same direction as #252. I mean: the fewer config options we have the better; and everything thats better done outside the marked core shouldn't be included on the core.

Also, we could create a FAQ section on README to include this case and #252.

@Feder1co5oave
Copy link
Contributor

tl;dr let's just stop replacing \u00a0 with spaces, maybe on the next development branch

I understand the wish for backward compatibility, even for existing "broken" practices like typing nbspaces instead of regular spaces (even though it seems we wouldn't bother as many people as we might think, as @UziTech pointed out in the linked issue), but I'm more in favor of letting the user decide when to use nbspaces, without having to use any hack like String.replace (which tie him to html). This makes us more typography friendly, binary transparent, Unicode aware, and commonmark compliant. And I definitely wouldn't add an option only for this.

If we'd like to retain some backward compatibility, I'm wondering if we should branch the development process into two parallel lines, whereas we would introduce fixes and breaking changes on the e.g. 1.0-dev branch, and cherry pick only bugfixes and non-breaking improvements into 0.5-dev or 0.6-dev. Obviously this puts some additional burden on the maintainers...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-breakable spaces don't work in marked Marked removes non-breaking spaces in the original text
7 participants