
bucket size power of 2 limitation #7

Open
voutcn opened this issue Mar 18, 2015 · 8 comments

Comments

@voutcn

voutcn commented Mar 18, 2015

Hi all,

If I am correct, rehashing always doubles the number of buckets. In practice, if the number of items is slightly larger than a power of two, the load factor will be only ~50%, which is a waste of memory. I would like to ask whether rehashing could grow the table by a smaller factor, say 1.5 or 1.3, which would be more memory-friendly. Thanks.

@manugoyal
Contributor

Hi voutcn,

We use powers of 2 to size the table since it's a fairly standard way to grow vectors and other resizable arrays, and because a power-of-two sized table has some convenient properties for computing hashes in our model.

I'm not really sure resizing by a smaller factor would fix your problem, because say we sized by 1.5x. Then if the number of items happened to be slightly larger than a power of 1.5, the load factor would still only be ~50%. Do you have a specific use case where you need a large table with slightly more elements than a power of 2? Also remember that it is the number of buckets that is a power of 2, not the number of elements. By default, each bucket has 8 items, so we can actually store 2^n * 8 values for each power-of-two sizing of the table.
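
A minimal sketch of the sizing and indexing described above, assuming 2^n buckets with 8 slots each as in the comment (the names are illustrative only, not libcuckoo's actual API):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative constants/names following the description above:
// 2^hashpower buckets, 8 slots per bucket, capacity = 2^hashpower * 8.
constexpr std::size_t kSlotsPerBucket = 8;

inline std::size_t num_buckets(std::size_t hashpower) {
    return std::size_t{1} << hashpower;
}

inline std::size_t capacity(std::size_t hashpower) {
    return num_buckets(hashpower) * kSlotsPerBucket;
}

// The power-of-two bucket count is what makes this a single AND
// rather than an integer modulo.
inline std::size_t bucket_index(std::size_t hash, std::size_t hashpower) {
    return hash & (num_buckets(hashpower) - 1);
}
```

Doubling the table is then just incrementing hashpower; e.g. hashpower = 20 gives 2^20 buckets and about 8.4 million slots.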

@dave-andersen
Member

I've actually been working on an implementation of fractional resizing for a couple of days. It's not done yet, and I likely won't have it finished until next week, but I will have something reasonably soon to use to evaluate it.

The trade-off is that it will be slower for insertion and access, but it will have lower memory usage. Maybe a 10 or 15 percent slowdown.


@manugoyal
Contributor

That's interesting. Is the slowdown related to the hashing and bucket indexing?

-Manu

@apc999
Member

apc999 commented Mar 18, 2015

Hi Dinghua,

Growing the table size by a smaller factor (rather than 2x every time) is certainly an interesting and attractive feature. Actually, I have been told by different people that this could be useful in certain practical applications. For libcuckoo, we made this choice (growing the table size by 2x) to take advantage of some tricks that speed up table growth, such as copying buckets with memcpy and cleaning up the table in the background (as opposed to locking the table and rehashing each item). So this is a trade-off. But if you can pull it off with some clever trick, I am very happy to learn :)

Best,

Bin
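
To make the memcpy/background-cleanup point concrete, below is a small check of the property that 2x growth relies on, assuming mask-based indexing as in the earlier sketch (this illustrates the general idea, not libcuckoo's actual migration code): when the bucket count doubles, each item either keeps its bucket index or moves to index + old_count, so the old bucket array can be copied wholesale and items relocated lazily.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

int main() {
    // When buckets double from n to 2n (n a power of two), the new index
    // hash & (2n - 1) is either the old index hash & (n - 1) or the old
    // index plus n, depending on one extra hash bit. A fractional growth
    // factor has no such relationship, so every item must be fully rehashed.
    const std::size_t old_n = std::size_t{1} << 10;
    for (std::uint64_t hash = 0; hash < (std::uint64_t{1} << 20); ++hash) {
        std::size_t old_idx = hash & (old_n - 1);
        std::size_t new_idx = hash & (2 * old_n - 1);
        assert(new_idx == old_idx || new_idx == old_idx + old_n);
    }
    return 0;
}
```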


@dave-andersen
Member

Yeah - we have to move to mod instead of mask to compute the indexes.

I use a fast computation of the mod for a fixed divisor, but it's still slower than a mask.
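
For a non-power-of-two bucket count, the index has to come from a modulo (or an equivalent reduction) instead of a mask. A rough sketch of the options follows; the division-free variant shown is the generic multiply-shift range reduction, used here only as an example of a fast fixed-divisor reduction, not necessarily the exact technique meant above:

```cpp
#include <cstdint>

// Power-of-two bucket count: a single AND.
inline std::uint32_t index_mask(std::uint32_t hash, std::uint32_t buckets_pow2) {
    return hash & (buckets_pow2 - 1);
}

// Arbitrary bucket count: an integer modulo, typically several times
// slower than the mask on most hardware.
inline std::uint32_t index_mod(std::uint32_t hash, std::uint32_t buckets) {
    return hash % buckets;
}

// Division-free alternative for an arbitrary bucket count: one multiply
// and a shift map the 32-bit hash roughly uniformly into [0, buckets).
// Note this is not bit-for-bit the same mapping as '%'.
inline std::uint32_t index_fastrange(std::uint32_t hash, std::uint32_t buckets) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(hash) * buckets) >> 32);
}
```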


@voutcn
Author

voutcn commented Mar 19, 2015

Thank you all.

@manugoyal Theoretically, resizing by, say, 1.5x should raise the lower bound on the load factor. If the number of elements happens to be slightly larger than a power of 1.5, the load factor will be ~1/1.5 = 2/3 (see the sketch after this comment).
@apc999 I understand that powers of 2 have many good properties, but I didn't go deep enough to study those tricks :)
@dave-andersen Looking forward to your work!
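
A tiny illustration of that arithmetic (not from the thread): right after a resize by factor g, the element count is roughly the old capacity, so the worst-case load factor is about 1/g.

```cpp
#include <cstdio>

int main() {
    // Worst-case load factor immediately after growing by factor g is ~1/g:
    // ~50% for 2x, ~67% for 1.5x, ~77% for 1.3x.
    const double factors[] = {2.0, 1.5, 1.3};
    for (double g : factors) {
        std::printf("growth %.1fx -> worst-case load factor ~%.0f%%\n",
                    g, 100.0 / g);
    }
    return 0;
}
```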

@dave-andersen
Member

Following up on this - sorry for the very long delay. I now know how to do this in a reasonably efficient way, but it will take a somewhat complex implementation path to get this version of the table to handle it cleanly. Will start a design doc...

@univerone


Hi David, may I ask if there is any update on this?
