Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use unicode-slugify to generate Version.slug #5186

Merged
merged 3 commits into from
Feb 19, 2019

Conversation

humitos
Copy link
Member

@humitos humitos commented Jan 28, 2019

Our regex were replacing "all invalid chars by -" which I think it's not a very good pattern because we ended up with slugs like d-----branch---a (multiple dashes together).

This implementation only replaces /, %, ! and ? with dashes to keep the behavior written in tests. I think that / is a good one to replace because it's a common way to separate words, but I'm not sold with the rest ones.

Besides, this PR uses unicode-slugify which work better with unicode project names using it's "best" ASCII representation.

Closes #1410

Uses ``unicode-slugify`` to generate the slug.
"""

ok_chars = '-._' # dash, dot, underscore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the rfc https://tools.ietf.org/html/rfc1035

The following syntax will result in fewer problems with many

applications that use domain names (e.g., mail, TELNET).

<domain> ::= <subdomain> | " "

<subdomain> ::= <label> | <subdomain> "." <label>

<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]

<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>

<let-dig-hyp> ::= <let-dig> | "-"

<let-dig> ::= <letter> | <digit>

_ underscore isn't allowed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it should start with a letter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I didn't change this behavior. We could consider.

Our current regex was working like that:

# Regex breakdown:
#   [a-z0-9] -- start with alphanumeric value
#   [-._a-z0-9] -- allow dash, dot, underscore, digit, lowercase ascii
#   *? -- allow multiple of those, but be not greedy about the matching
#   (?: ... ) -- wrap everything so that the pattern cannot escape when used in
#                regexes.
VERSION_SLUG_REGEX = '(?:[a-z0-9A-Z][-._a-z0-9A-Z]*?)'

Basically, it does not allow ., - and _ as starting character, but it does allow a letter (lower and upper case) and a number.

If we want to be more strict and follow those directions, I suppose we should not allow the . (dot) either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, this only affects the version's slug, not the project's slug. We are safe here, sorry for the noise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use the same logic for project slugs if it's not explicitly defined by the user?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean removing the _ and .. I mean, "why we don't use the all same logic for generating the slug for a Version than the slug for a Project?"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because Project's are used in DNS, but Versions aren't. They are different things, so we don't need the exact same restrictions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I understand that.

I'm suggesting to use the same logic (unicode-slugify to generate the slug) but apply different restrictions.

Anyway, if so, it's for a different PR.

@davidfischer
Copy link
Contributor

While I don't have a very strong opinion here, the 3rd party slugify method seems to be almost the same as Django's built-in one. Unless there's a good reason, I don't think we need to add a dependency here.

@humitos
Copy link
Member Author

humitos commented Jan 28, 2019

@davidfischer if you take a look at the examples in the README, it works way better with unicode chars (Chinese, for example)

I put the same example at #1410 (comment)

NOTE: that particular is currently broken, but it seems to way better to handle those characters than the one from Django that just drops them.

In [3]: slugify('von außen arbeiten', only_ascii=True)                                                                                                                                        
Out[3]: 'von-aussen-arbeiten'  # more similar to the German pronunciation

In [7]: slugify_django('von außen arbeiten')                                                                                                                                                  
Out[7]: 'von-auen-arbeiten'  # just drops the 'ß' char

another example in Japanese:

In [11]: slugify('家', only_ascii=True)
Out[11]: 'Jia '  # similar to the pronunciation

In [12]: slugify_django('家')
Out[12]: ''

Currently, the Chinese and Japanese example are broken. I reported this at mozilla/unicode-slugify#34 but I think that we should generate proper slug from the these names instead of an empty string.

I'm blocking this PR because of the bug in the 3rd party library for now, but I'm open to more discussion on this PR.

@humitos humitos added the Status: blocked Issue is blocked on another issue label Jan 28, 2019
@humitos
Copy link
Member Author

humitos commented Jan 28, 2019

The trick is at this line: https://github.com/mozilla/unicode-slugify/blob/master/slugify/__init__.py#L80 that uses unidecode.

In [14]: unidecode('家')                                                                                                                                                                      
Out[14]: 'Jia '

Django version doesn't do that.

@humitos
Copy link
Member Author

humitos commented Jan 28, 2019

Sorry, the broken examples are broken only in version 0.1.3 (which is the latest release). Under 0.1.5 they work properly and generate valid slugs:

In [1]: from slugify import slugify                                                                                                                                                           

In [2]: slugify('家', only_ascii=True)                                                                                                                                                        
Out[2]: 'jia'

In [3]: slugify('外から働く', only_ascii=True)                                                                                                                                                
Out[3]: 'wai-karadong-ku'

In [4]: slugify('von außen arbeiten', only_ascii=True)                                                                                                                                        
Out[4]: 'von-aussen-arbeiten'

In [5]:                                                                                                                                                                                       

@humitos humitos removed the Status: blocked Issue is blocked on another issue label Jan 28, 2019
@davidfischer
Copy link
Contributor

I see what you're saying. I take back what I said and I think I'm good with this. Version slugs should have the decimal point and a few other characters.

@humitos
Copy link
Member Author

humitos commented Jan 28, 2019

Version slugs should have the decimal point and a few other characters.

It's configured for [a-z0-9._-] at the moment. Do you think we should have something else? We may consider @stsewd's comment, though: #5186 (comment)

@humitos humitos requested a review from a team January 31, 2019 10:58
@humitos
Copy link
Member Author

humitos commented Jan 31, 2019

We may consider @stsewd's comment, though: #5186 (comment)

Already discussed that: project slug =! version slug since project slug is used as a subdomain and it has bigger restrictions. If we can use the same logic to do "the best" conversion from a Unicode char to its ASCII representation, we should do that in another PR. My idea here is to use "the same logic" but after that "apply different restrictions" to version slug than project slug to fulfill each needing.

ericholscher
ericholscher previously approved these changes Feb 14, 2019
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good change. This will only effect versions going forward right? Any time we change slugs I get scared. Did you test this with an existing project with the old slugs, and doing a build and making sure we didn't regenerate or create a bunch of new versions?

@@ -26,6 +26,7 @@

from django.db import models
from django.utils.encoding import force_text
from slugify import slugify
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have a different import for this:

Suggested change
from slugify import slugify
from slugify import slugify as unicode_slugify

@@ -78,13 +81,37 @@ def get_queryset(self, model_cls, slug_field):
return model._default_manager.all()
return model_cls._default_manager.all()

def _normalize(self, content):
"""
Normalize some invalid characters to become a dash (``-``).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a better docstring -- this tells me what not _why. Is there a reason we only do this for a subset of characters, but not others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the why: which is basically to keep compatibility.

@@ -68,6 +68,9 @@ django-kombu==0.9.4
mock==2.0.0
stripe==2.19.0

# unicode-slugify==0.1.5 is not released on PyPI yet
git+https://github.com/mozilla/unicode-slugify@b696c37#egg=unicode-slugify
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙁

@ericholscher
Copy link
Member

ericholscher commented Feb 14, 2019

I tested this locally and it works 👍

We only compare the version name and id, not the slug.

@humitos
Copy link
Member Author

humitos commented Feb 18, 2019

This will only effect versions going forward right?

Yes. No migration or anything like that is applied.

Any time we change slugs I get scared. Did you test this with an existing project with the old slugs, and doing a build and making sure we didn't regenerate or create a bunch of new versions?

Yes. I tested triggering a build of latest which also syncs the versions and disabling/enabling an old version with unicode chars affected by our old slugify and it was not modified. It cloned and built correctly.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I'm a little hesitant to depend on a prerelease version, but I guess it's fine.

@humitos humitos merged commit 88cffda into master Feb 19, 2019
@delete-merged-branch delete-merged-branch bot deleted the humitos/use-unicode-slugify branch February 19, 2019 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants