
bucket size power of 2 limitation #7

Open
voutcn opened this issue Mar 18, 2015 · 8 comments

Comments

@voutcn

voutcn commented Mar 18, 2015

Hi all,

If I am correct, rehashing always doubles the number of buckets. In practice, if the number of items is slightly larger than a power of two, the load factor will be only ~50%, which is a waste of memory. I would like to ask whether rehashing could grow the table by a smaller factor, say 1.5 or 1.3, which would be more memory-friendly. Thanks.

@manugoyal
Contributor

Hi voutcn,

We use powers of 2 to size the table since it's a fairly standard way to grow vectors and other resizable arrays, and because a power-of-two sized table has some convenient properties for computing hashes in our model.

I'm not really sure resizing by a smaller factor would fix your problem, because say we sized by 1.5x. Then if the number of items happened to be slightly larger than a power of 1.5, the load factor would still only be ~50%. Do you have a specific use case where you need a large table with slightly more elements than a power of 2? Also remember that it is the number of buckets that is a power of 2, not the number of elements. By default, each bucket has 8 items, so we can actually store 2^n * 8 values for each power-of-two sizing of the table.
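
A minimal sketch of the sizing and indexing described above, assuming 2^n buckets with 8 slots each as in the comment (the names are illustrative only, not libcuckoo's actual API):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative constants/names following the description above:
// 2^hashpower buckets, 8 slots per bucket, capacity = 2^hashpower * 8.
constexpr std::size_t kSlotsPerBucket = 8;

inline std::size_t num_buckets(std::size_t hashpower) {
    return std::size_t{1} << hashpower;
}

inline std::size_t capacity(std::size_t hashpower) {
    return num_buckets(hashpower) * kSlotsPerBucket;
}

// The power-of-two bucket count is what makes this a single AND
// rather than an integer modulo.
inline std::size_t bucket_index(std::size_t hash, std::size_t hashpower) {
    return hash & (num_buckets(hashpower) - 1);
}
```

Doubling the table is then just incrementing hashpower; e.g. hashpower = 20 gives 2^20 buckets and about 8.4 million slots.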

@dave-andersen
Member

I've actually been working on an implementation of fractional resizing for a couple of days. It's not done yet, and I likely won't have it finished until next week, but I will have something reasonably soon to use to evaluate it.

The trade-off is that it will be slower for insertion and access, but it will have lower memory usage. Maybe a 10 or 15 percent slowdown.


@manugoyal
Contributor

That's interesting. Is the slowdown related to the hashing and bucket indexing?

-Manu

@apc999
Member

apc999 commented Mar 18, 2015

Hi Dinghua,

Growing the table size by a smaller factor (rather than 2x every time) is certainly an interesting and attractive feature. Actually, I have been told by different people that this could be useful in certain practical applications. For libcuckoo, we made this choice (growing the table size by 2x) to take advantage of some tricks that speed up table growth, such as copying buckets with memcpy and cleaning up the table in the background (as opposed to locking the table and rehashing each item). So this is a trade-off. But if you can pull it off with some clever trick, I am very happy to learn :)

Best,

Bin
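
To make the memcpy/background-cleanup point concrete, below is a small check of the property that 2x growth relies on, assuming mask-based indexing as in the earlier sketch (this illustrates the general idea, not libcuckoo's actual migration code): when the bucket count doubles, each item either keeps its bucket index or moves to index + old_count, so the old bucket array can be copied wholesale and items relocated lazily.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

int main() {
    // When buckets double from n to 2n (n a power of two), the new index
    // hash & (2n - 1) is either the old index hash & (n - 1) or the old
    // index plus n, depending on one extra hash bit. A fractional growth
    // factor has no such relationship, so every item must be fully rehashed.
    const std::size_t old_n = std::size_t{1} << 10;
    for (std::uint64_t hash = 0; hash < (std::uint64_t{1} << 20); ++hash) {
        std::size_t old_idx = hash & (old_n - 1);
        std::size_t new_idx = hash & (2 * old_n - 1);
        assert(new_idx == old_idx || new_idx == old_idx + old_n);
    }
    return 0;
}
```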


@dave-andersen
Member

Yeah - we have to move to mod instead of mask to compute the indexes.

I use a fast computation of the mod for a fixed divisor, but it's still slower than a mask.
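
For a non-power-of-two bucket count, the index has to come from a modulo (or an equivalent reduction) instead of a mask. A rough sketch of the options follows; the division-free variant shown is the generic multiply-shift range reduction, used here only as an example of a fast fixed-divisor reduction, not necessarily the exact technique meant above:

```cpp
#include <cstdint>

// Power-of-two bucket count: a single AND.
inline std::uint32_t index_mask(std::uint32_t hash, std::uint32_t buckets_pow2) {
    return hash & (buckets_pow2 - 1);
}

// Arbitrary bucket count: an integer modulo, typically several times
// slower than the mask on most hardware.
inline std::uint32_t index_mod(std::uint32_t hash, std::uint32_t buckets) {
    return hash % buckets;
}

// Division-free alternative for an arbitrary bucket count: one multiply
// and a shift map the 32-bit hash roughly uniformly into [0, buckets).
// Note this is not bit-for-bit the same mapping as '%'.
inline std::uint32_t index_fastrange(std::uint32_t hash, std::uint32_t buckets) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(hash) * buckets) >> 32);
}
```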


@voutcn
Author

voutcn commented Mar 19, 2015

Thank you all.

@manugoyal Theoretically, resizing by, say, 1.5x should raise the lower bound on the load factor. If the number of elements happens to be slightly larger than a power of 1.5, the load factor will be ~1/1.5 = 2/3 (see the sketch after this comment).
@apc999 I understand that powers of 2 have many good properties, but I didn't go deep enough to study those tricks :)
@dave-andersen Looking forward to your work!
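
A tiny illustration of that arithmetic (not from the thread): right after a resize by factor g, the element count is roughly the old capacity, so the worst-case load factor is about 1/g.

```cpp
#include <cstdio>

int main() {
    // Worst-case load factor immediately after growing by factor g is ~1/g:
    // ~50% for 2x, ~67% for 1.5x, ~77% for 1.3x.
    const double factors[] = {2.0, 1.5, 1.3};
    for (double g : factors) {
        std::printf("growth %.1fx -> worst-case load factor ~%.0f%%\n",
                    g, 100.0 / g);
    }
    return 0;
}
```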

@dave-andersen
Member

Following up on this - sorry for the very long delay. I now know how to do this in a reasonably efficient way, but it will take a somewhat complex implementation path to get this version of the table to handle it cleanly. Will start a design doc...

@univerone


Hi David, may I ask if there is any update on this?
