
Unordered-containers sometimes much, much slower than hashmap due to collisions. #121

Closed
augustss opened this issue Feb 12, 2016 · 8 comments

@augustss

I tried switching from hashmap to u-c for doing CSE in our compiler. I'm not sure how much slower it is, because I didn't have the patience to wait for the compiler to terminate with u-c. With hashmap the entire compilation takes 302s, with u-c I interrupted it after 1800s.

The problem is that it's difficult to produce a good hash value. The CSE basically compares lambda expressions, but modulo alpha conversion. This means that you have to do something very expensive when you encounter bound variables, so instead I just give up at that point. As a result the hash values are poor and we get many, many collisions.
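
To make this concrete, here is a minimal sketch (a hypothetical `Expr` type, not the compiler's actual code) of what "giving up" at binders can look like: an alpha-invariant hash cannot use the binder's name, and tracking bound occurrences through the body is the expensive part, so the `Lam` case below hashes nothing but its constructor tag and all lambdas collide.

```haskell
import Data.Hashable (Hashable (..))

-- Hypothetical expression type, for illustration only; the real keys
-- are compared modulo alpha conversion.
data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  deriving Show

instance Hashable Expr where
  hashWithSalt s (Var x)   = s `hashWithSalt` (0 :: Int) `hashWithSalt` x
  hashWithSalt s (App f a) = s `hashWithSalt` (1 :: Int) `hashWithSalt` f `hashWithSalt` a
  -- The hash must not depend on the binder's name, and walking the body
  -- while tracking bound variables is the expensive part, so give up:
  -- every lambda gets the same hash and lands in the same bucket.
  hashWithSalt s (Lam _ _) = s `hashWithSalt` (2 :: Int)
```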

With hashmap it degrades gracefully because it uses Data.Map for collisions, whereas u-c uses linear search.

I'm not sure what could reasonably be done; perhaps use linear search up to some small number of collisions (~10) and then switch to an ordered map. Or perhaps just have a warning in the documentation.
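
For what it's worth, the hybrid could look roughly like the sketch below. This is purely illustrative (`Bucket`, `threshold` and the rest are made up); it is not how unordered-containers represents collisions today.

```haskell
import qualified Data.Map.Strict as M

-- A collision bucket that stays a plain association list while small and
-- switches to an ordered map once it grows past a threshold.
data Bucket k v
  = Small [(k, v)]        -- linear search while the bucket is small
  | Big (M.Map k v)       -- ordered map once collisions pile up

threshold :: Int
threshold = 10

insertBucket :: Ord k => k -> v -> Bucket k v -> Bucket k v
insertBucket k v (Small kvs)
  | length kvs < threshold = Small ((k, v) : filter ((/= k) . fst) kvs)
  | otherwise              = Big (M.insert k v (M.fromList kvs))
insertBucket k v (Big m)   = Big (M.insert k v m)

lookupBucket :: Ord k => k -> Bucket k v -> Maybe v
lookupBucket k (Small kvs) = lookup k kvs
lookupBucket k (Big m)     = M.lookup k m
```

The obvious costs are that `Big` buckets reintroduce the per-node overhead of a `Data.Map`, and that the keys now also need an `Ord` instance, which HashMap does not otherwise require.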

@tibbe
Collaborator

tibbe commented Feb 13, 2016

I'm not sure if there's anything to be done. For example, you'd get this problem if you used a traditional, imperative hash table too.

This is different from the other case we're discussing, where the default Hashable instances lead to bad behavior for certain distributions of keys (i.e. those that create collisions in many of the least significant bits). The latter we could try to address by changing the hash function in Hashable. Here there's not much we can do. Using e.g. a Map for collision handling is too memory intensive (u-c already uses much more memory than its imperative cousins, something I'm trying to address).

I'm in favor of closing this, unless we have an idea of how to improve the situation without hurting the more common case of having decent hash functions (too much).

@tibbe
Collaborator

tibbe commented Feb 13, 2016

One question: if it's hard (expensive) to come up with a good hash function for lambdas, isn't it similarly hard (expensive) to come up with a good comparison function for use in the ordered map?

@augustss
Author

Yes, comparing lambdas is quite expensive. Comparison keeps an alpha conversion map as it recurses over the two expressions.

But for comparison the cost is (I think) more incremental, since it is only paid when lambdas are actually compared (which might not happen), whereas hashing happens up front. But I might experiment with making a good hash function.
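
The comparison in question might look roughly like this sketch (the `Expr` type is a toy stand-in, not the real AST): the renaming state, here two maps plus a depth counter, is threaded through the recursion and only paid for when two terms are actually compared.

```haskell
import qualified Data.Map.Strict as M

-- Toy expression type standing in for the real AST.
data Expr = Var String | App Expr Expr | Lam String Expr
  deriving Show

-- Equality modulo alpha conversion: each binder is mapped to the depth at
-- which it was introduced, separately for each term, and two bound
-- variables match when they refer to the same depth.
alphaEq :: Expr -> Expr -> Bool
alphaEq = go (0 :: Int) M.empty M.empty
  where
    go _ envL envR (Var x) (Var y) =
      case (M.lookup x envL, M.lookup y envR) of
        (Just i,  Just j)  -> i == j      -- both bound: same binder
        (Nothing, Nothing) -> x == y      -- both free: names must agree
        _                  -> False       -- one bound, one free
    go d envL envR (App f a) (App g b) =
      go d envL envR f g && go d envL envR a b
    go d envL envR (Lam x e) (Lam y f) =
      go (d + 1) (M.insert x d envL) (M.insert y d envR) e f
    go _ _ _ _ _ = False
```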

You can close this ticket if you want, but I think the docs should contain a warning (which all hash tables need) about excessive collisions. And I think it's nice that we have hashmap which has the behavior I need (which is a strange hybrid, I admit).

@ethercrow
Contributor

> Using e.g. a Map for collision handling is too memory intensive

What about using a sorted array? Insertions and deletions will still be bad, but lookups should be faster and size will stay the same.
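
For illustration, lookup in such a bucket is an ordinary binary search over a flat array of key/value pairs. The sketch below uses Data.Vector for convenience; it is not unordered-containers' internal array type.

```haskell
import qualified Data.Vector as V

-- Keys in the bucket are kept sorted, so lookup is O(log n) by binary
-- search while the bucket itself stays a flat array.
lookupSorted :: Ord k => k -> V.Vector (k, v) -> Maybe v
lookupSorted k xs = go 0 (V.length xs)
  where
    go lo hi
      | lo >= hi  = Nothing
      | otherwise =
          let mid     = (lo + hi) `div` 2
              (k', v) = xs V.! mid
          in case compare k k' of
               LT -> go lo mid
               EQ -> Just v
               GT -> go (mid + 1) hi
```

Insertion and deletion still have to copy the array to keep it sorted, which is the part that stays bad.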

@treeowl
Collaborator

treeowl commented Aug 8, 2017

@augustss, I would think a more natural approach might be to convert the bound variables to De Bruijn indices and use the converted lambdas as keys, either for a HashMap or a generalized trie.
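
A rough sketch of that idea (toy types, not anything from the compiler in question): rewrite bound variables to indices before using the term as a key, so that alpha-equivalent lambdas become structurally equal and both hashing and equality can be purely structural.

```haskell
import Data.List (elemIndex)

data Expr = Var String | App Expr Expr | Lam String Expr

data DB
  = DFree String   -- free variable, kept by name
  | DBound Int     -- bound variable, as distance to its binder
  | DApp DB DB
  | DLam DB
  deriving (Eq, Show)

-- Convert to De Bruijn form by tracking the stack of enclosing binders.
toDeBruijn :: Expr -> DB
toDeBruijn = go []
  where
    go env (Var x)   = maybe (DFree x) DBound (elemIndex x env)
    go env (App f a) = DApp (go env f) (go env a)
    go env (Lam x e) = DLam (go (x : env) e)
```

A `Hashable` instance for `DB` can then be written structurally, with no special treatment of binders.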

@augustss
Author

augustss commented Aug 9, 2017

I'm not a fan of de Bruijn indices. They have their own set of problems.

@sjakobi
Member

sjakobi commented Jun 2, 2020

If I understand it correctly, #217 attempts to fix the underlying issue here. There's a bit of info on the approach in the change to docs/developer-guide.md.

@sjakobi
Member

sjakobi commented Jun 19, 2020

> If I understand it correctly, #217 attempts to fix the underlying issue here.

That was apparently a misunderstanding of mine.

I quote from an email from @tibbe:

This is an issue with specialization not happening (we think). GHC's specialization/inlining is very brittle and u-c relies on every dictionary being specialized away. This is btw another reason I'm worried about using primitive. It's another layer of even more INLINE pragmas that can fail to fire in the right way.

At this stage I'm not sure what could be done about this issue within u-c. Is this perhaps a pitfall worth documenting?
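
If it were documented, the pitfall could be illustrated with something like this (a guess at the kind of code affected, not a confirmed reproduction of this issue): a helper that stays polymorphic in the key passes the Eq/Hashable dictionaries at runtime, and adding INLINABLE plus a SPECIALIZE pragma for the concrete key type is one way to make sure they are compiled away.

```haskell
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Polymorphic helper: if GHC does not specialize it, every HashMap
-- operation here goes through the Eq/Hashable dictionaries at runtime.
countOccurrences :: (Eq k, Hashable k) => [k] -> HM.HashMap k Int
countOccurrences = foldr (\k -> HM.insertWith (+) k 1) HM.empty
{-# INLINABLE countOccurrences #-}

-- Ask GHC for a dictionary-free copy at the key type actually used.
{-# SPECIALIZE countOccurrences :: [Int] -> HM.HashMap Int Int #-}
```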

Otherwise I'd tend to simply close this issue.

@sjakobi sjakobi closed this as completed Feb 17, 2022