TypeError when trying to sort a string of a number and a string #7

Kwpolska · 2014-05-07T14:27:46Z

Python 3.3 and Python 3.4 fail to sort collections that contain strings that are also numbers:

>>> import natsort
>>> natsort.natsorted(('a', '1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kwpolska/virtualenvs/nikola-py3/lib/python3.4/site-packages/natsort/natsort.py", line 247, in natsorted
    return sorted(seq, key=lambda x: natsort_key(key(x),
TypeError: unorderable types: float() < str()

This works just fine in Python 2.7.6.

via getnikola/nikola#1275 (cc @devurandom)

SethMMorton · 2014-05-07T18:57:25Z

I was expecting this bug report sooner or later...

The problem unfortunately is that Python3.x changed how it handles sorting objects of different (and unorderable) type. In Python2.x, if sorted is given ('a', 1, ['howdy', 'doody']), it will use the names of the types of the objects to sort them (i.e. "str", "int", "list"). Observe:

>>> a = ('a', 1, ['howdy', 'doody'])
>>> sorted(a)
>>> [1, ['howdy', 'doody'], 'a'] # int, list, str

It was decided that for Python3.x, instead of sorting objects that have no intrinsic order but something arbitrary like the name of the object type, it returns an error. The thinking is that if you are trying sort objects that have no intrinsic order, then you are probably doing something wrong and you should modify your input data. (We could argue if this is the correct approach or not, but either way we must live with this reality).

This affects natsort because under the hood natsort uses python's built-in sorted, but first pre-processes the data so as to give python hints as to sort the data naturally. This means that all strings are split into a tuple of strings and numbers so that python knows to sort the numerical parts numerically, not ASCII-betically. You can observe what natsort is doing under the hood with natsort_key:

>>> [natsort.natsort_key(x) for x in ('a', '1', 1, 'd12g')]
>>> [('a',), (1.0,), (1,), ('d', 12.0, 'g')]

Because of this fact, Python3.x is going to complain if you try and sort ('a', '1'), since what sorted will end up trying to do is compare and "a" to 1, which is not intrinsically orderable. natsort was developed to handle the common problem of sorting strings that are similar that have numbers embedded, and not a combination of numbers and strings (which is not well orderable).

I am reluctant to try and modify natsort to push through this and sort anyway, because it would mean making an arbitrary choice on my part to put numbers either before strings or after strings (and what about numbers to tuples/lists, or some other object I don't know about)?

I am open to suggestions on how to approach this problem. Perhaps you have a more specific example of what your problem is that can more easily be solved? Otherwise, I am afraid that the best solution is to think about what data you are actually trying to sort (as apparently the Python developers feel is the better approach). It might be possible that for this particular set of data sorted might be better.

Kwpolska · 2014-05-07T19:02:58Z

We are sorting user input, names of files/directories. They are best off naturally-sorted — something Windows XP, Mac OS X and other modern OSes do. They can be anything.

ralsina · 2014-05-07T19:13:07Z

I don't understand the problem. If you want the same behaviour as in python 2, then just make it int < float < string, it's not arbitrary (or at least not more arbitrary than it was in python 2 :-)

SethMMorton · 2014-05-07T19:41:54Z

Unfortunately, implementing the Python2.x behavior on Python3.x will not be easy. The comparisons between the objects are done inside the sorted function, so there is no simple way to override them so that it will compare type names in the even to an unorderable types TypeError.

Python3.x also did away with the cmp argument to sorted. Having, said that, I do notice that functools has a cmp_to_key function that may be promising. I will take a look at how I can use this to simulate the Python2.x behavior.

ralsina · 2014-05-07T19:49:38Z

You can use key(). For example, you can always compare tuples. The behaviour of sorted() is to compare elements left-to-right, and decide on the first one that's not equal. So, to compare floats and strings, you could compare ('string', '1') to ('float', 2.0). You can even generate the first element by using type()

SethMMorton · 2014-05-07T20:09:00Z

Are you saying that I should take the type of the first element in the tuple and place that in front? I ask because natsort_key currently does the following transformations: "12ha" => (12.0, "ha") and "ha12" => ("ha", 12.0). Is your suggestion that I do "12ha" => ("float", 12.0, "ha") and "ha12" => ("str", "ha", 12.0), or "12ha" => ("float", 12.0, "str", "ha") and "ha12" => ("str", "ha", "float", 12.0)? I'm pretty sure the latter would break my unit tests, but the former might work. It would also be more efficient than creating a cmp function.

ralsina · 2014-05-07T20:12:12Z

I think the first option you give would make it work pretty much like python 2 :-)

SethMMorton · 2014-05-07T20:14:27Z

Another option might be to make the following transformations: "12ha" => ("", 12.0, "ha") and "ha12" => ("ha", 12.0). This way there will always be a string as the first element, and the empty string will mean it always goes first. It would also avoid the problem of having to append an element to the front of the tuple in the most common case, which is that a non-digit is at the beginning of the string.

I'll give both options a try, and see which is most efficient. Thanks for all your input!

SethMMorton · 2014-05-08T03:08:38Z

OK, I have uploaded a new version of natsort (3.2.0, commit e69383f) that passes the unit tests that I have created to stress the unorderable types issue. Thanks for all the suggestions! I was definitely going to try to solve this in a much more convoluted manner before this discussion.

Can you please try this out and let me know if it solves your problems?

Kwpolska · 2014-05-08T12:15:18Z

It looks sane:

>>> natsort.natsorted(['a2', 'a1', 'a20', 'a11', 'a', '1', '11', '2'])
['1', '2', '11', 'a', 'a1', 'a2', 'a11', 'a20']

The same output is produced by natsort 3.1.1 in Python 2.7. All seems well.

ralsina · 2014-05-08T14:15:42Z

@SethMMorton cool, thanks for fixing this!

fake-name · 2014-06-20T14:07:23Z

I'm sorry to say this issue is not fixed, at least for certain pathological cases:

For the minimal test-case:

import natsort

def go():
    print("Natsort version:", natsort.__version__)
    data = ['c1 -', 'c1-1']
    data = natsort.natsorted(data)


if __name__ == "__main__":
    go()

The output is:


D:\NekoProfile\Desktop>python test.py
Natsort version: 3.2.0
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    go()
  File "test.py", line 8, in go
    data = natsort.natsorted(data)
  File "C:\Python34\lib\site-packages\natsort\natsort.py", line 258, in natsorted
    return sorted(seq, key=lambda x: natsort_key(key(x),
TypeError: unorderable types: float() < str()

Under python 2.7:


D:\NekoProfile\Desktop>py -2 test.py
('Natsort version:', u'3.2.0')

SethMMorton · 2014-06-20T15:57:16Z

I don't have an opportunity to fix now (I'm at work and don't have access to python3), but I know what is going on and will be able to fix it (the previous fix only solved the problem if it occurred at the front of the string, but this issue is in the middle of the string). Having said that, in your case you might reconsider what it is that you are trying to sort, and consider using one of the natsort options to tell it to interpret your data differently.

As you well know because you read the documentation, natsort assumes by default that the '-' and '+' are part of a number if they precede a number. So, when parsing each string, natsort will convert 'c1 -' to ('c', 1, ' -') and 'c1-1' to ('c', 1, -1). Here is the origin of your TypeError, because the third element in the first tuple is a str, and in the second tuple is an int. To get rid of this issue, you should used signed=False option to natsort; this will disable the interpretation of '-' and '+' as part of the number, and will solve your problem in the short term; it will convert 'c1 -' to ('c', 1, ' -') and 'c1-1' to ('c', 1, '-', 1).

The reason I suggest that the problem is less the TypeError and more of an input data problem is because even if we remove the TypeError, you will still get results you don't expect. Consider the following:

>>> a = ['c1 - 4', 'c1 -', 'c1-5', 'c1-1']
>>> natsort.natsorted(a)
['c1-5', 'c1-1', 'c1 -', 'c1 - 4']

I'll wager that you would not have wanted the '5' at the front of the list, but because it is being interpreted as '-5' it gets placed at the front. What happens if we use signed=False?

>>> natsort.natsorted(a, signed=False)
['c1 -', 'c1 - 4', 'c1-1', 'c1-5']

The '5' is correctly placed at the back, but why is the '4' in front of the '1'? Here, it is because you have ' - ' preceding the '4', but only '-' in front of '1' and '5'. Lexicographically, ' ' comes before '-', so the spaces get placed in front.

In summary, I will fix natsort so that the behavior on python3.x mimics that on python2.x, but I would reconsider the data that you are trying to sort and possibly pre-process it so that you are comparing apples to apples when the sorting occurs.

SethMMorton · 2014-06-21T04:46:44Z

I have made the fix. Using your test file, I get the following output:

[Seth@Seths-MacBook-Air natsort]$ python test.py 
('Natsort version:', u'3.2.1')
[Seth@Seths-MacBook-Air natsort]$ python3 test.py 
Natsort version: 3.2.1

Unittests have been added to test this functionality.

fake-name · 2014-06-22T10:19:22Z

Cool. To be honest, for certain edge-cases like this, I don't care too much if the sort is correct, just that it actually sorts.

In this case I'm sorting file-names for presentation to the user, and if they have to deal with a slightly awkward sort order because they're using bizarre ordering, that's fine.

fake-name · 2014-06-23T11:05:11Z

I'm not sure if you're interested, but I wrote a little fuzz-tester to stress test natsorted:

import string
import natsort
import random
import traceback
random.seed()

def go():
    loops = 0
    while 1:
        stLen = random.randint(3,20)
        str1 = random.sample(string.printable,stLen)
        str2 = str1.copy()
        for dummy_x in range(random.randint(1,5)):
            str2[random.randint(0,stLen-1)] = random.sample(string.printable, 1)[0]

        str1 = "".join(str1)
        str2 = "".join(str2)

        try:
            natsort.natsorted([str1, str2])
        except KeyboardInterrupt:
            print("Breaking")
            break
        except:
            traceback.print_exc()
            print("Str1 '%s'" % str1)
            print("Str2 '%s'" % str2)
            break

        loops += 1
        if loops % 5000 == 0:
            print("Looped", loops)


if __name__ == "__main__":
    go()

It basically generates a random string, permutes a copy of the string, then tries to sort the two strings.

3 million sorted items, and no crashes yet. It's not ideal, but it should probably turn up edge-cases like the one I encountered eventually.

SethMMorton · 2014-06-23T15:10:35Z

I really like that. As of right now, I use doctest to unit test natsort, and things are starting to get a little out of hand. I've been meaning to switch to a legit unit test framework at some point, and with the inclusion of a stress-tester like this natsort could really be rock-solid.

I really appreciate you writing this and letting me know... I can sleep better knowing that it hasn't failed after about 3 million items. I can add it to a more proper test suite by just copying your example, or if you want to make a pull request that's cool too.

fake-name · 2014-06-23T15:39:59Z

I actually left it running for a while, and it made it to ~35 million items before I got bored and stopped it.

It doesn't seem like it's really worth bothering pull requesting for, just feel free to copy & paste it wherever.

I'm not sure how one would integrate a actual random-fuzzer into a unit-testing system. They're kind of intrinsically non-repeatable, so they're not guaranteed to catch issues. I guess you could set it up so it just runs a number of iterations. You would need some mechanism to log the string that triggers failure, though.

The fix for issue #7 (preventing the "unorderable types" TypeError between numbers and strings on Python 3) can raise an "unorderable types" TypeError between str and bytes on Python 3 if LOCALE is enabled. This is because the strxfrm function converts str to bytes when performing the string collation. This is because the empty string that is inserted for numbers is of the str type but did not get sent to strxfrm and thus did not get converted to bytes. The fix was to add either an empty bytes string or str string depending on if LOCALE is enabled or not. This fixes #22.

Kwpolska mentioned this issue May 7, 2014

TypeError: unorderable types: str() < float() getnikola/nikola#1275

Closed

SethMMorton added bug labels May 7, 2014

Kwpolska closed this as completed May 8, 2014

SethMMorton reopened this Jun 20, 2014

SethMMorton closed this as completed Jun 22, 2014

SethMMorton mentioned this issue Apr 2, 2015

Python3 "unorderable types" error with LOCALE and PyICU installed. #22

Closed

hholzgra mentioned this issue Mar 30, 2018

Sorting arabic, farsi or hebrew numbers is not "natural" #52

Closed

SethMMorton mentioned this issue Jan 28, 2020

Performance compared to other simpler algorithm, and keygen improvement #110

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TypeError when trying to sort a string of a number and a string #7

TypeError when trying to sort a string of a number and a string #7

Kwpolska commented May 7, 2014

SethMMorton commented May 7, 2014

Kwpolska commented May 7, 2014

ralsina commented May 7, 2014

SethMMorton commented May 7, 2014

ralsina commented May 7, 2014

SethMMorton commented May 7, 2014

ralsina commented May 7, 2014

SethMMorton commented May 7, 2014

SethMMorton commented May 8, 2014

Kwpolska commented May 8, 2014

ralsina commented May 8, 2014

fake-name commented Jun 20, 2014

SethMMorton commented Jun 20, 2014

SethMMorton commented Jun 21, 2014

fake-name commented Jun 22, 2014

fake-name commented Jun 23, 2014

SethMMorton commented Jun 23, 2014

fake-name commented Jun 23, 2014

TypeError when trying to sort a string of a number and a string #7

TypeError when trying to sort a string of a number and a string #7

Comments

Kwpolska commented May 7, 2014

SethMMorton commented May 7, 2014

Kwpolska commented May 7, 2014

ralsina commented May 7, 2014

SethMMorton commented May 7, 2014

ralsina commented May 7, 2014

SethMMorton commented May 7, 2014

ralsina commented May 7, 2014

SethMMorton commented May 7, 2014

SethMMorton commented May 8, 2014

Kwpolska commented May 8, 2014

ralsina commented May 8, 2014

fake-name commented Jun 20, 2014

SethMMorton commented Jun 20, 2014

SethMMorton commented Jun 21, 2014

fake-name commented Jun 22, 2014

fake-name commented Jun 23, 2014

SethMMorton commented Jun 23, 2014

fake-name commented Jun 23, 2014