
Unordered-containers sometimes much, much slower than hashmap due to collisions. #121

Closed
augustss opened this issue Feb 12, 2016 · 8 comments

@augustss

I tried switching from hashmap to u-c for doing CSE in our compiler. I'm not sure how much slower it is, because I didn't have the patience to wait for the compiler to terminate with u-c. With hashmap the entire compilation takes 302s, with u-c I interrupted it after 1800s.

The problem is that it's difficult to produce a good hash value. The CSE basically compares lambda expressions, but modulo alpha conversion. This means that you have to do something very expensive when you encounter bound variables, so instead I just give up at that point. As a result the hash values are poor and we get many, many collisions.
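
To make this concrete, here is a minimal sketch (a hypothetical `Expr` type, not the compiler's actual code) of what "giving up" at binders can look like: an alpha-invariant hash cannot use the binder's name, and tracking bound occurrences through the body is the expensive part, so the `Lam` case below hashes nothing but its constructor tag and all lambdas collide.

```haskell
import Data.Hashable (Hashable (..))

-- Hypothetical expression type, for illustration only; the real keys
-- are compared modulo alpha conversion.
data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  deriving Show

instance Hashable Expr where
  hashWithSalt s (Var x)   = s `hashWithSalt` (0 :: Int) `hashWithSalt` x
  hashWithSalt s (App f a) = s `hashWithSalt` (1 :: Int) `hashWithSalt` f `hashWithSalt` a
  -- The hash must not depend on the binder's name, and walking the body
  -- while tracking bound variables is the expensive part, so give up:
  -- every lambda gets the same hash and lands in the same bucket.
  hashWithSalt s (Lam _ _) = s `hashWithSalt` (2 :: Int)
```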

With hashmap it degrades gracefully because it uses Data.Map for collisions, whereas u-c uses linear search.

I'm not sure what could reasonably be done; perhaps use linear search up to some small number of collisions (~10) and then switch to an ordered map. Or perhaps just have a warning in the documentation.
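
For what it's worth, the hybrid could look roughly like the sketch below. This is purely illustrative (`Bucket`, `threshold` and the rest are made up); it is not how unordered-containers represents collisions today.

```haskell
import qualified Data.Map.Strict as M

-- A collision bucket that stays a plain association list while small and
-- switches to an ordered map once it grows past a threshold.
data Bucket k v
  = Small [(k, v)]        -- linear search while the bucket is small
  | Big (M.Map k v)       -- ordered map once collisions pile up

threshold :: Int
threshold = 10

insertBucket :: Ord k => k -> v -> Bucket k v -> Bucket k v
insertBucket k v (Small kvs)
  | length kvs < threshold = Small ((k, v) : filter ((/= k) . fst) kvs)
  | otherwise              = Big (M.insert k v (M.fromList kvs))
insertBucket k v (Big m)   = Big (M.insert k v m)

lookupBucket :: Ord k => k -> Bucket k v -> Maybe v
lookupBucket k (Small kvs) = lookup k kvs
lookupBucket k (Big m)     = M.lookup k m
```

The obvious costs are that `Big` buckets reintroduce the per-node overhead of a `Data.Map`, and that the keys now also need an `Ord` instance, which HashMap does not otherwise require.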

@tibbe
Collaborator

tibbe commented Feb 13, 2016

I'm not sure if there's anything to be done. For example, you'd get this problem if you used a traditional, imperative hash table too.

This is different from the other case we're discussing, where the default Hashable instances lead to bad behavior for certain distributions of keys (i.e. those that create collisions in many of the least significant bits). The latter we could try to address by changing the hash function in Hashable. Here there's not much we can do. Using e.g. a Map for collision handling is too memory intensive (u-c already uses much more memory than its imperative cousins, something I'm trying to address).

I'm in favor of closing this, unless we have an idea of how to improve the situation without hurting the more common case of having decent hash functions (too much).

@tibbe
Collaborator

tibbe commented Feb 13, 2016

One question: if it's hard (expensive) to come up with a good hash function for lambdas, isn't it similarly hard (expensive) to come up with a good comparison function for use in the ordered map?

@augustss
Author

Yes, comparing lambdas is quite expensive. Comparison keeps an alpha conversion map as it recurses over the two expressions.

But for comparison the cost is (I think) more incremental, since it is only paid when lambdas are actually compared (which might not happen), whereas hashing happens up front. But I might experiment with making a good hash function.
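
The comparison in question might look roughly like this sketch (the `Expr` type is a toy stand-in, not the real AST): the renaming state, here two maps plus a depth counter, is threaded through the recursion and only paid for when two terms are actually compared.

```haskell
import qualified Data.Map.Strict as M

-- Toy expression type standing in for the real AST.
data Expr = Var String | App Expr Expr | Lam String Expr
  deriving Show

-- Equality modulo alpha conversion: each binder is mapped to the depth at
-- which it was introduced, separately for each term, and two bound
-- variables match when they refer to the same depth.
alphaEq :: Expr -> Expr -> Bool
alphaEq = go (0 :: Int) M.empty M.empty
  where
    go _ envL envR (Var x) (Var y) =
      case (M.lookup x envL, M.lookup y envR) of
        (Just i,  Just j)  -> i == j      -- both bound: same binder
        (Nothing, Nothing) -> x == y      -- both free: names must agree
        _                  -> False       -- one bound, one free
    go d envL envR (App f a) (App g b) =
      go d envL envR f g && go d envL envR a b
    go d envL envR (Lam x e) (Lam y f) =
      go (d + 1) (M.insert x d envL) (M.insert y d envR) e f
    go _ _ _ _ _ = False
```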

You can close this ticket if you want, but I think the docs should contain a warning (which all hash tables need) about excessive collisions. And I think it's nice that we have hashmap which has the behavior I need (which is a strange hybrid, I admit).

@ethercrow
Contributor

> Using e.g. a Map for collision handling is too memory intensive

What about using a sorted array? Insertions and deletions will still be bad, but lookups should be faster and size will stay the same.
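
For illustration, lookup in such a bucket is an ordinary binary search over a flat array of key/value pairs. The sketch below uses Data.Vector for convenience; it is not unordered-containers' internal array type.

```haskell
import qualified Data.Vector as V

-- Keys in the bucket are kept sorted, so lookup is O(log n) by binary
-- search while the bucket itself stays a flat array.
lookupSorted :: Ord k => k -> V.Vector (k, v) -> Maybe v
lookupSorted k xs = go 0 (V.length xs)
  where
    go lo hi
      | lo >= hi  = Nothing
      | otherwise =
          let mid     = (lo + hi) `div` 2
              (k', v) = xs V.! mid
          in case compare k k' of
               LT -> go lo mid
               EQ -> Just v
               GT -> go (mid + 1) hi
```

Insertion and deletion still have to copy the array to keep it sorted, which is the part that stays bad.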

@treeowl
Collaborator

treeowl commented Aug 8, 2017

@augustss, I would think a more natural approach might be to convert the bound variables to De Bruijn indices and use the converted lambdas as keys, either for a HashMap or a generalized trie.
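
A rough sketch of that idea (toy types, not anything from the compiler in question): rewrite bound variables to indices before using the term as a key, so that alpha-equivalent lambdas become structurally equal and both hashing and equality can be purely structural.

```haskell
import Data.List (elemIndex)

data Expr = Var String | App Expr Expr | Lam String Expr

data DB
  = DFree String   -- free variable, kept by name
  | DBound Int     -- bound variable, as distance to its binder
  | DApp DB DB
  | DLam DB
  deriving (Eq, Show)

-- Convert to De Bruijn form by tracking the stack of enclosing binders.
toDeBruijn :: Expr -> DB
toDeBruijn = go []
  where
    go env (Var x)   = maybe (DFree x) DBound (elemIndex x env)
    go env (App f a) = DApp (go env f) (go env a)
    go env (Lam x e) = DLam (go (x : env) e)
```

A `Hashable` instance for `DB` can then be written structurally, with no special treatment of binders.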

@augustss
Author

augustss commented Aug 9, 2017

I'm not a fan of de Bruijn indices. They have their own set of problems.

@sjakobi
Member

sjakobi commented Jun 2, 2020

If I understand it correctly, #217 attempts to fix the underlying issue here. There's a bit of info on the approach in the change to docs/developer-guide.md.

@sjakobi
Member

sjakobi commented Jun 19, 2020

> If I understand it correctly, #217 attempts to fix the underlying issue here.

That was apparently a misunderstanding of mine.

I quote from an email from @tibbe:

This is an issue with specialization not happening (we think). GHC's specialization/inlining is very brittle and u-c relies on every dictionary being specialized away. This is btw another reason I'm worried about using primitive. It's another layer of even more INLINE pragmas that can fail to fire in the right way.

At this stage I'm not sure what could be done about this issue within u-c. Is this perhaps a pitfall worth documenting?
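
If it were documented, the pitfall could be illustrated with something like this (a guess at the kind of code affected, not a confirmed reproduction of this issue): a helper that stays polymorphic in the key passes the Eq/Hashable dictionaries at runtime, and adding INLINABLE plus a SPECIALIZE pragma for the concrete key type is one way to make sure they are compiled away.

```haskell
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Polymorphic helper: if GHC does not specialize it, every HashMap
-- operation here goes through the Eq/Hashable dictionaries at runtime.
countOccurrences :: (Eq k, Hashable k) => [k] -> HM.HashMap k Int
countOccurrences = foldr (\k -> HM.insertWith (+) k 1) HM.empty
{-# INLINABLE countOccurrences #-}

-- Ask GHC for a dictionary-free copy at the key type actually used.
{-# SPECIALIZE countOccurrences :: [Int] -> HM.HashMap Int Int #-}
```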

Otherwise I'd tend to simply close this issue.

@sjakobi sjakobi closed this as completed Feb 17, 2022