Use unicode-slugify to generate Version.slug #5186

humitos · 2019-01-28T14:29:45Z

Our regex were replacing "all invalid chars by -" which I think it's not a very good pattern because we ended up with slugs like d-----branch---a (multiple dashes together).

This implementation only replaces /, %, ! and ? with dashes to keep the behavior written in tests. I think that / is a good one to replace because it's a common way to separate words, but I'm not sold with the rest ones.

Besides, this PR uses unicode-slugify which work better with unicode project names using it's "best" ASCII representation.

Closes #1410

stsewd · 2019-01-28T16:31:10Z

readthedocs/builds/version_slug.py

+    Uses ``unicode-slugify`` to generate the slug.
+    """
+
+    ok_chars = '-._'  # dash, dot, underscore


From the rfc https://tools.ietf.org/html/rfc1035

The following syntax will result in fewer problems with many applications that use domain names (e.g., mail, TELNET). <domain> ::= <subdomain> | " " <subdomain> ::= <label> | <subdomain> "." <label> <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ] <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str> <let-dig-hyp> ::= <let-dig> | "-" <let-dig> ::= <letter> | <digit>

_ underscore isn't allowed

And it should start with a letter

Well, I didn't change this behavior. We could consider.

Our current regex was working like that:

# Regex breakdown: # [a-z0-9] -- start with alphanumeric value # [-._a-z0-9] -- allow dash, dot, underscore, digit, lowercase ascii # *? -- allow multiple of those, but be not greedy about the matching # (?: ... ) -- wrap everything so that the pattern cannot escape when used in # regexes. VERSION_SLUG_REGEX = '(?:[a-z0-9A-Z][-._a-z0-9A-Z]*?)'

Basically, it does not allow ., - and _ as starting character, but it does allow a letter (lower and upper case) and a number.

If we want to be more strict and follow those directions, I suppose we should not allow the . (dot) either.

Wait, this only affects the version's slug, not the project's slug. We are safe here, sorry for the noise.

Shouldn't we use the same logic for project slugs if it's not explicitly defined by the user?

Yeah, we use this for project's slug https://github.com/stsewd/readthedocs.org/blob/f1fd0c9827c10834277d63aa35cb4511ee3ae5ed/readthedocs/core/utils/__init__.py#L209-L219

I don't mean removing the _ and .. I mean, "why we don't use the all same logic for generating the slug for a Version than the slug for a Project?"

Because Project's are used in DNS, but Versions aren't. They are different things, so we don't need the exact same restrictions.

Yeah, I understand that.

I'm suggesting to use the same logic (unicode-slugify to generate the slug) but apply different restrictions.

Anyway, if so, it's for a different PR.

davidfischer · 2019-01-28T16:51:57Z

While I don't have a very strong opinion here, the 3rd party slugify method seems to be almost the same as Django's built-in one. Unless there's a good reason, I don't think we need to add a dependency here.

humitos · 2019-01-28T19:41:26Z

@davidfischer if you take a look at the examples in the README, it works way better with unicode chars (Chinese, for example)

I put the same example at #1410 (comment)

NOTE: that particular is currently broken, but it seems to way better to handle those characters than the one from Django that just drops them.

In [3]: slugify('von außen arbeiten', only_ascii=True)                                                                                                                                        
Out[3]: 'von-aussen-arbeiten'  # more similar to the German pronunciation

In [7]: slugify_django('von außen arbeiten')                                                                                                                                                  
Out[7]: 'von-auen-arbeiten'  # just drops the 'ß' char

another example in Japanese:

In [11]: slugify('家', only_ascii=True)
Out[11]: 'Jia '  # similar to the pronunciation

In [12]: slugify_django('家')
Out[12]: ''

Currently, the Chinese and Japanese example are broken. I reported this at mozilla/unicode-slugify#34 but I think that we should generate proper slug from the these names instead of an empty string.

I'm blocking this PR because of the bug in the 3rd party library for now, but I'm open to more discussion on this PR.

humitos · 2019-01-28T19:47:22Z

The trick is at this line: https://github.com/mozilla/unicode-slugify/blob/master/slugify/__init__.py#L80 that uses unidecode.

In [14]: unidecode('家')                                                                                                                                                                      
Out[14]: 'Jia '

Django version doesn't do that.

humitos · 2019-01-28T19:50:27Z

Sorry, the broken examples are broken only in version 0.1.3 (which is the latest release). Under 0.1.5 they work properly and generate valid slugs:

In [1]: from slugify import slugify                                                                                                                                                           

In [2]: slugify('家', only_ascii=True)                                                                                                                                                        
Out[2]: 'jia'

In [3]: slugify('外から働く', only_ascii=True)                                                                                                                                                
Out[3]: 'wai-karadong-ku'

In [4]: slugify('von außen arbeiten', only_ascii=True)                                                                                                                                        
Out[4]: 'von-aussen-arbeiten'

In [5]:

davidfischer · 2019-01-28T20:20:06Z

I see what you're saying. I take back what I said and I think I'm good with this. Version slugs should have the decimal point and a few other characters.

humitos · 2019-01-28T20:24:55Z

Version slugs should have the decimal point and a few other characters.

It's configured for [a-z0-9._-] at the moment. Do you think we should have something else? We may consider @stsewd's comment, though: #5186 (comment)

humitos · 2019-01-31T11:00:45Z

We may consider @stsewd's comment, though: #5186 (comment)

Already discussed that: project slug =! version slug since project slug is used as a subdomain and it has bigger restrictions. If we can use the same logic to do "the best" conversion from a Unicode char to its ASCII representation, we should do that in another PR. My idea here is to use "the same logic" but after that "apply different restrictions" to version slug than project slug to fulfill each needing.

ericholscher

This looks like a good change. This will only effect versions going forward right? Any time we change slugs I get scared. Did you test this with an existing project with the old slugs, and doing a build and making sure we didn't regenerate or create a bunch of new versions?

ericholscher · 2019-02-14T13:51:09Z

readthedocs/builds/version_slug.py

@@ -26,6 +26,7 @@

 from django.db import models
 from django.utils.encoding import force_text
+from slugify import slugify


I wonder if we should have a different import for this:

Suggested change

from slugify import slugify

from slugify import slugify as unicode_slugify

ericholscher · 2019-02-14T13:52:23Z

readthedocs/builds/version_slug.py

@@ -78,13 +81,37 @@ def get_queryset(self, model_cls, slug_field):
                return model._default_manager.all()
        return model_cls._default_manager.all()

+    def _normalize(self, content):
+        """
+        Normalize some invalid characters to become a dash (``-``).


Could use a better docstring -- this tells me what not _why. Is there a reason we only do this for a subset of characters, but not others?

I added the why: which is basically to keep compatibility.

ericholscher · 2019-02-14T13:52:40Z

requirements/pip.txt

@@ -68,6 +68,9 @@ django-kombu==0.9.4
 mock==2.0.0
 stripe==2.19.0

+# unicode-slugify==0.1.5 is not released on PyPI yet
+git+https://github.com/mozilla/unicode-slugify@b696c37#egg=unicode-slugify


ericholscher · 2019-02-14T14:01:29Z

I tested this locally and it works 👍

We only compare the version name and id, not the slug.

humitos · 2019-02-18T11:52:36Z

This will only effect versions going forward right?

Yes. No migration or anything like that is applied.

Any time we change slugs I get scared. Did you test this with an existing project with the old slugs, and doing a build and making sure we didn't regenerate or create a bunch of new versions?

Yes. I tested triggering a build of latest which also syncs the versions and disabling/enabling an old version with unicode chars affected by our old slugify and it was not modified. It cloned and built correctly.

ericholscher

Looks good. I'm a little hesitant to depend on a prerelease version, but I guess it's fine.

humitos requested a review from a team January 28, 2019 14:29

humitos mentioned this pull request Jan 28, 2019

UTF8 characters on version slugging -- or slugging in general #1410

Closed

humitos force-pushed the humitos/use-unicode-slugify branch from 8991505 to 514bfc3 Compare January 28, 2019 14:33

stsewd reviewed Jan 28, 2019

View reviewed changes

humitos added the Status: blocked Issue is blocked on another issue label Jan 28, 2019

humitos removed the Status: blocked Issue is blocked on another issue label Jan 28, 2019

Use unicode-slugify to generate Version.slug

2cc7d59

humitos force-pushed the humitos/use-unicode-slugify branch from 514bfc3 to 2cc7d59 Compare January 31, 2019 10:57

humitos requested a review from a team January 31, 2019 10:58

ericholscher previously approved these changes Feb 14, 2019

View reviewed changes

humitos added 2 commits February 18, 2019 12:42

Make the usage of unicode slugify explicit

ffe52f4

Improve docstring for normalize method

8baf9fd

humitos dismissed ericholscher’s stale review via 8baf9fd February 18, 2019 11:46

humitos requested a review from ericholscher February 18, 2019 11:53

ericholscher approved these changes Feb 19, 2019

View reviewed changes

humitos merged commit 88cffda into master Feb 19, 2019

delete-merged-branch bot deleted the humitos/use-unicode-slugify branch February 19, 2019 21:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use unicode-slugify to generate Version.slug #5186

Use unicode-slugify to generate Version.slug #5186

humitos commented Jan 28, 2019

stsewd Jan 28, 2019

stsewd Jan 28, 2019

humitos Jan 28, 2019

stsewd Jan 28, 2019

humitos Jan 28, 2019

stsewd Jan 28, 2019

humitos Jan 29, 2019

ericholscher Jan 29, 2019

humitos Jan 29, 2019

davidfischer commented Jan 28, 2019

humitos commented Jan 28, 2019

humitos commented Jan 28, 2019

humitos commented Jan 28, 2019

davidfischer commented Jan 28, 2019

humitos commented Jan 28, 2019

humitos commented Jan 31, 2019

ericholscher left a comment

ericholscher Feb 14, 2019

ericholscher Feb 14, 2019

humitos Feb 18, 2019

ericholscher Feb 14, 2019

ericholscher commented Feb 14, 2019 •

edited

Loading

humitos commented Feb 18, 2019

ericholscher left a comment

	from slugify import slugify
	from slugify import slugify as unicode_slugify

Use unicode-slugify to generate Version.slug #5186

Use unicode-slugify to generate Version.slug #5186

Conversation

humitos commented Jan 28, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidfischer commented Jan 28, 2019

humitos commented Jan 28, 2019

humitos commented Jan 28, 2019

humitos commented Jan 28, 2019

davidfischer commented Jan 28, 2019

humitos commented Jan 28, 2019

humitos commented Jan 31, 2019

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericholscher commented Feb 14, 2019 • edited Loading

humitos commented Feb 18, 2019

ericholscher left a comment

Choose a reason for hiding this comment

ericholscher commented Feb 14, 2019 •

edited

Loading