fix: row_id range fix for index training on gpu #1663
Conversation
ACTION NEEDED: Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Thanks for doing this, @jerryyifei!
I've left a couple of comments.
python/python/lance/sampler.py
Outdated
  buf.extend(
      dataset.take(
-         list(range(offset, offset + chunk_sample_size)),
+         list(range(max(i, offset), offset + chunk_sample_size)),
Why change here?
python/python/lance/sampler.py
Outdated
          columns=columns,
      ).to_batches()
  )
  if idx % 50 == 0:
-     logging.info("Sampled at offset=%s, len=%s", offset, chunk_sample_size)
+     logging.info("Sampled at offset=%s, len=%s", max(i, offset), chunk_sample_size)
Ditto
Thanks for making a PR! I think there's a simpler change that would do a better job of preserving the probability distribution.
python/python/lance/sampler.py
Outdated
- offset = i + np.random.randint(0, chunk_size - chunk_sample_size)
+ offset = min(total_records - chunk_sample_size, i + np.random.randint(0, chunk_size - chunk_sample_size))
Instead of altering the outputs of sampling, why not fix the inputs? I think all you have to do is change chunk_size in the last iteration to be the remaining size (total_records - i) here:

    local_size = min(chunk_size, total_records - i)
That would also produce a more uniform distribution in the last chunk.
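To make the last-iteration arithmetic concrete, here is a tiny sketch of that clamp (the numbers are illustrative assumptions of mine, not values from the PR):

    total_records, chunk_size, chunk_sample_size = 1_000, 300, 100  # illustrative

    i = 900  # last iteration of range(0, total_records, chunk_size)
    local_size = min(chunk_size, total_records - i)  # min(300, 100) = 100

    # A sample window placed anywhere inside [i, i + local_size) stays in
    # range, whereas the unclamped chunk_size allowed windows up to i + 300.
    assert i + local_size <= total_records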
    num_sampled += chunk_sample_size

    # If we are at the last chunk, we may not have enough records to sample.
    local_size = min(chunk_size, total_records - i)
    local_sample_size = min(chunk_sample_size, local_size)

    if local_sample_size < local_size:
        # Add more randomness within each chunk, if there is room.
        offset = i + np.random.randint(0, local_size - local_sample_size)
    else:
        offset = i
How about if we simplify like this?
-   num_sampled += chunk_sample_size
-   # If we are at the last chunk, we may not have enough records to sample.
-   local_size = min(chunk_size, total_records - i)
-   local_sample_size = min(chunk_sample_size, local_size)
-   if local_sample_size < local_size:
-       # Add more randomness within each chunk, if there is room.
-       offset = i + np.random.randint(0, local_size - local_sample_size)
-   else:
-       offset = i
+   # If we are at the last chunk, we may not have enough records to sample.
+   local_sample_size = min(chunk_sample_size, total_records - i)
+   num_sampled += local_sample_size
+   offset = i + np.random.randint(0, local_sample_size)
That looks like a totally different computation for offset? The offset needs to be constrained such that offset + local_sample_size <= local_size, otherwise we can get duplicate rows sampled.
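A worst-case example (illustrative numbers of my own) makes the spill concrete:

    i, local_size, local_sample_size = 0, 100, 60  # illustrative numbers

    # Worst case of the proposed formula: randint can return local_sample_size - 1.
    worst_offset = i + (local_sample_size - 1)      # i + 59
    window_end = worst_offset + local_sample_size   # i + 119
    print(window_end <= i + local_size)             # False: the window spills
                                                    # into rows the next chunk may sample again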
I'm thinking:

    local_size = min(chunk_size, total_records - i)
    local_sample_size = min(chunk_sample_size, local_size)

combining:

    local_sample_size = min(chunk_sample_size, min(chunk_size, total_records - i))

simplifying, because chunk_size > chunk_sample_size:

    local_sample_size = min(chunk_sample_size, total_records - i)

local_sample_size <= local_size is always true because local_sample_size = min(chunk_sample_size, local_size), so we can just say offset = i + np.random.randint(0, local_sample_size), which I find more readable.
I hope I didn't miss something obvious :).
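The min() identity itself does check out by brute force (a throwaway verification of mine, not part of the PR); the disagreement that follows is only about the bound passed to np.random.randint:

    # Exhaustive check of: min(css, min(cs, r)) == min(css, r) whenever cs > css.
    for cs in range(2, 20):            # chunk_size
        for css in range(1, cs):       # chunk_sample_size, with cs > css
            for r in range(1, 25):     # total_records - i
                assert min(css, min(cs, r)) == min(css, r)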
I think the change is good either way, just looking to get rid of an extra variable.
I still don't understand why offset = i + np.random.randint(0, local_sample_size) is valid.
We are trying to find a random offset such that a window of size local_sample_size fits inside the window of size local_size. Hence the invariant I mentioned:

    offset + local_sample_size <= local_size

    ◄────────────── local_size ──────────────►
    ┌──────┬────────────────┬────────────────┐
    │      │                │                │
    │i     │i + offset      │i + offset + lss│i + local_size
    │      │                │                │
    └──────┴────────────────┴────────────────┘
    ◄─────► ◄──────────────►
     offset  local_sample_size
Oh I see! The offset would have to be:

    offset = i + np.random.randint(0, max(0, chunk_size - local_sample_size))

This is fine though :)
Thanks for your help @jerryyifei!
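One side note, my own observation rather than something raised in the thread: np.random.randint(low, high) raises a ValueError when low >= high, so even a max(0, ...) bound leaves the zero-width case to handle, which is presumably why the version shown earlier in the thread branches on local_sample_size < local_size instead:

    import numpy as np

    try:
        np.random.randint(0, 0)  # zero-width range
    except ValueError as err:
        print(err)  # numpy rejects low >= high rather than returning 0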
Thanks to @wjones127 and @rok for helping merge! Sorry, I was away on a trip and didn't reply to this thread in time.
The original sampling logic could produce a row_id range that extends past the total record count. This diff makes sure the row_ids stay within range.
Fixes #1662
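For context, a minimal reproduction of the out-of-range failure mode might look like this (illustrative numbers of mine; the real loop lives in python/python/lance/sampler.py):

    import numpy as np

    total_records = 1_050                  # illustrative numbers
    chunk_size, chunk_sample_size = 500, 100

    i = 1_000                              # last chunk of range(0, 1_050, 500)
    offset = i + np.random.randint(0, chunk_size - chunk_sample_size)
    row_ids = list(range(offset, offset + chunk_sample_size))

    # offset >= 1_000, so the window always reaches at least row 1_099,
    # well past total_records - 1 = 1_049; dataset.take() would be handed
    # row ids that do not exist.
    print(max(row_ids) >= total_records)   # True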