Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TypeError when trying to sort a string of a number and a string #7

Closed
Kwpolska opened this issue May 7, 2014 · 18 comments
Closed

TypeError when trying to sort a string of a number and a string #7

Kwpolska opened this issue May 7, 2014 · 18 comments

Comments

@Kwpolska
Copy link

Kwpolska commented May 7, 2014

Python 3.3 and Python 3.4 fail to sort collections that contain strings that are also numbers:

>>> import natsort
>>> natsort.natsorted(('a', '1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kwpolska/virtualenvs/nikola-py3/lib/python3.4/site-packages/natsort/natsort.py", line 247, in natsorted
    return sorted(seq, key=lambda x: natsort_key(key(x),
TypeError: unorderable types: float() < str()

This works just fine in Python 2.7.6.

via getnikola/nikola#1275 (cc @devurandom)

@SethMMorton
Copy link
Owner

I was expecting this bug report sooner or later...

The problem unfortunately is that Python3.x changed how it handles sorting objects of different (and unorderable) type. In Python2.x, if sorted is given ('a', 1, ['howdy', 'doody']), it will use the names of the types of the objects to sort them (i.e. "str", "int", "list"). Observe:

>>> a = ('a', 1, ['howdy', 'doody'])
>>> sorted(a)
>>> [1, ['howdy', 'doody'], 'a'] # int, list, str

It was decided that for Python3.x, instead of sorting objects that have no intrinsic order but something arbitrary like the name of the object type, it returns an error. The thinking is that if you are trying sort objects that have no intrinsic order, then you are probably doing something wrong and you should modify your input data. (We could argue if this is the correct approach or not, but either way we must live with this reality).

This affects natsort because under the hood natsort uses python's built-in sorted, but first pre-processes the data so as to give python hints as to sort the data naturally. This means that all strings are split into a tuple of strings and numbers so that python knows to sort the numerical parts numerically, not ASCII-betically. You can observe what natsort is doing under the hood with natsort_key:

>>> [natsort.natsort_key(x) for x in ('a', '1', 1, 'd12g')]
>>> [('a',), (1.0,), (1,), ('d', 12.0, 'g')]

Because of this fact, Python3.x is going to complain if you try and sort ('a', '1'), since what sorted will end up trying to do is compare and "a" to 1, which is not intrinsically orderable. natsort was developed to handle the common problem of sorting strings that are similar that have numbers embedded, and not a combination of numbers and strings (which is not well orderable).

I am reluctant to try and modify natsort to push through this and sort anyway, because it would mean making an arbitrary choice on my part to put numbers either before strings or after strings (and what about numbers to tuples/lists, or some other object I don't know about)?

I am open to suggestions on how to approach this problem. Perhaps you have a more specific example of what your problem is that can more easily be solved? Otherwise, I am afraid that the best solution is to think about what data you are actually trying to sort (as apparently the Python developers feel is the better approach). It might be possible that for this particular set of data sorted might be better.

@Kwpolska
Copy link
Author

Kwpolska commented May 7, 2014

We are sorting user input, names of files/directories. They are best off naturally-sorted — something Windows XP, Mac OS X and other modern OSes do. They can be anything.

@ralsina
Copy link

ralsina commented May 7, 2014

I don't understand the problem. If you want the same behaviour as in python 2, then just make it int < float < string, it's not arbitrary (or at least not more arbitrary than it was in python 2 :-)

@SethMMorton
Copy link
Owner

Unfortunately, implementing the Python2.x behavior on Python3.x will not be easy. The comparisons between the objects are done inside the sorted function, so there is no simple way to override them so that it will compare type names in the even to an unorderable types TypeError.

Python3.x also did away with the cmp argument to sorted. Having, said that, I do notice that functools has a cmp_to_key function that may be promising. I will take a look at how I can use this to simulate the Python2.x behavior.

@ralsina
Copy link

ralsina commented May 7, 2014

You can use key(). For example, you can always compare tuples. The behaviour of sorted() is to compare elements left-to-right, and decide on the first one that's not equal. So, to compare floats and strings, you could compare ('string', '1') to ('float', 2.0). You can even generate the first element by using type()

@SethMMorton
Copy link
Owner

Are you saying that I should take the type of the first element in the tuple and place that in front? I ask because natsort_key currently does the following transformations: "12ha" => (12.0, "ha") and "ha12" => ("ha", 12.0). Is your suggestion that I do "12ha" => ("float", 12.0, "ha") and "ha12" => ("str", "ha", 12.0), or "12ha" => ("float", 12.0, "str", "ha") and "ha12" => ("str", "ha", "float", 12.0)? I'm pretty sure the latter would break my unit tests, but the former might work. It would also be more efficient than creating a cmp function.

@ralsina
Copy link

ralsina commented May 7, 2014

I think the first option you give would make it work pretty much like python 2 :-)

@SethMMorton
Copy link
Owner

Another option might be to make the following transformations: "12ha" => ("", 12.0, "ha") and "ha12" => ("ha", 12.0). This way there will always be a string as the first element, and the empty string will mean it always goes first. It would also avoid the problem of having to append an element to the front of the tuple in the most common case, which is that a non-digit is at the beginning of the string.

I'll give both options a try, and see which is most efficient. Thanks for all your input!

@SethMMorton
Copy link
Owner

OK, I have uploaded a new version of natsort (3.2.0, commit e69383f) that passes the unit tests that I have created to stress the unorderable types issue. Thanks for all the suggestions! I was definitely going to try to solve this in a much more convoluted manner before this discussion.

Can you please try this out and let me know if it solves your problems?

@Kwpolska
Copy link
Author

Kwpolska commented May 8, 2014

It looks sane:

>>> natsort.natsorted(['a2', 'a1', 'a20', 'a11', 'a', '1', '11', '2'])
['1', '2', '11', 'a', 'a1', 'a2', 'a11', 'a20']

The same output is produced by natsort 3.1.1 in Python 2.7. All seems well.

@Kwpolska Kwpolska closed this as completed May 8, 2014
@ralsina
Copy link

ralsina commented May 8, 2014

@SethMMorton cool, thanks for fixing this!

@fake-name
Copy link

I'm sorry to say this issue is not fixed, at least for certain pathological cases:

For the minimal test-case:

import natsort

def go():
    print("Natsort version:", natsort.__version__)
    data = ['c1 -', 'c1-1']
    data = natsort.natsorted(data)


if __name__ == "__main__":
    go()

The output is:


D:\NekoProfile\Desktop>python test.py
Natsort version: 3.2.0
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    go()
  File "test.py", line 8, in go
    data = natsort.natsorted(data)
  File "C:\Python34\lib\site-packages\natsort\natsort.py", line 258, in natsorted
    return sorted(seq, key=lambda x: natsort_key(key(x),
TypeError: unorderable types: float() < str()

Under python 2.7:


D:\NekoProfile\Desktop>py -2 test.py
('Natsort version:', u'3.2.0')

@SethMMorton SethMMorton reopened this Jun 20, 2014
@SethMMorton
Copy link
Owner

I don't have an opportunity to fix now (I'm at work and don't have access to python3), but I know what is going on and will be able to fix it (the previous fix only solved the problem if it occurred at the front of the string, but this issue is in the middle of the string). Having said that, in your case you might reconsider what it is that you are trying to sort, and consider using one of the natsort options to tell it to interpret your data differently.

As you well know because you read the documentation, natsort assumes by default that the '-' and '+' are part of a number if they precede a number. So, when parsing each string, natsort will convert 'c1 -' to ('c', 1, ' -') and 'c1-1' to ('c', 1, -1). Here is the origin of your TypeError, because the third element in the first tuple is a str, and in the second tuple is an int. To get rid of this issue, you should used signed=False option to natsort; this will disable the interpretation of '-' and '+' as part of the number, and will solve your problem in the short term; it will convert 'c1 -' to ('c', 1, ' -') and 'c1-1' to ('c', 1, '-', 1).

The reason I suggest that the problem is less the TypeError and more of an input data problem is because even if we remove the TypeError, you will still get results you don't expect. Consider the following:

>>> a = ['c1 - 4', 'c1 -', 'c1-5', 'c1-1']
>>> natsort.natsorted(a)
['c1-5', 'c1-1', 'c1 -', 'c1 - 4']

I'll wager that you would not have wanted the '5' at the front of the list, but because it is being interpreted as '-5' it gets placed at the front. What happens if we use signed=False?

>>> natsort.natsorted(a, signed=False)
['c1 -', 'c1 - 4', 'c1-1', 'c1-5']

The '5' is correctly placed at the back, but why is the '4' in front of the '1'? Here, it is because you have ' - ' preceding the '4', but only '-' in front of '1' and '5'. Lexicographically, ' ' comes before '-', so the spaces get placed in front.

In summary, I will fix natsort so that the behavior on python3.x mimics that on python2.x, but I would reconsider the data that you are trying to sort and possibly pre-process it so that you are comparing apples to apples when the sorting occurs.

@SethMMorton
Copy link
Owner

I have made the fix. Using your test file, I get the following output:

[Seth@Seths-MacBook-Air natsort]$ python test.py 
('Natsort version:', u'3.2.1')
[Seth@Seths-MacBook-Air natsort]$ python3 test.py 
Natsort version: 3.2.1

Unittests have been added to test this functionality.

@fake-name
Copy link

Cool. To be honest, for certain edge-cases like this, I don't care too much if the sort is correct, just that it actually sorts.

In this case I'm sorting file-names for presentation to the user, and if they have to deal with a slightly awkward sort order because they're using bizarre ordering, that's fine.

@fake-name
Copy link

I'm not sure if you're interested, but I wrote a little fuzz-tester to stress test natsorted:

import string
import natsort
import random
import traceback
random.seed()

def go():
    loops = 0
    while 1:
        stLen = random.randint(3,20)
        str1 = random.sample(string.printable,stLen)
        str2 = str1.copy()
        for dummy_x in range(random.randint(1,5)):
            str2[random.randint(0,stLen-1)] = random.sample(string.printable, 1)[0]

        str1 = "".join(str1)
        str2 = "".join(str2)

        try:
            natsort.natsorted([str1, str2])
        except KeyboardInterrupt:
            print("Breaking")
            break
        except:
            traceback.print_exc()
            print("Str1 '%s'" % str1)
            print("Str2 '%s'" % str2)
            break

        loops += 1
        if loops % 5000 == 0:
            print("Looped", loops)


if __name__ == "__main__":
    go()

It basically generates a random string, permutes a copy of the string, then tries to sort the two strings.

3 million sorted items, and no crashes yet. It's not ideal, but it should probably turn up edge-cases like the one I encountered eventually.

@SethMMorton
Copy link
Owner

I really like that. As of right now, I use doctest to unit test natsort, and things are starting to get a little out of hand. I've been meaning to switch to a legit unit test framework at some point, and with the inclusion of a stress-tester like this natsort could really be rock-solid.

I really appreciate you writing this and letting me know... I can sleep better knowing that it hasn't failed after about 3 million items. I can add it to a more proper test suite by just copying your example, or if you want to make a pull request that's cool too.

@fake-name
Copy link

I actually left it running for a while, and it made it to ~35 million items before I got bored and stopped it.

It doesn't seem like it's really worth bothering pull requesting for, just feel free to copy & paste it wherever.

I'm not sure how one would integrate a actual random-fuzzer into a unit-testing system. They're kind of intrinsically non-repeatable, so they're not guaranteed to catch issues. I guess you could set it up so it just runs a number of iterations. You would need some mechanism to log the string that triggers failure, though.

SethMMorton added a commit that referenced this issue Apr 3, 2015
The fix for issue #7 (preventing the "unorderable types" TypeError
between numbers and strings on Python 3) can raise an "unorderable
types" TypeError between str and bytes on Python 3 if LOCALE is enabled.
This is because the strxfrm function converts str to bytes when
performing the string collation. This is because the empty string that
is inserted for numbers is of the str type but did not get sent to
strxfrm and thus did not get converted to bytes.

The fix was to add either an empty bytes string or str string depending
on if LOCALE is enabled or not.

This fixes #22.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants