fix: row_id range fix for index training on gpu #1663

Merged 3 commits into lancedb:main on Nov 29, 2023

Conversation

@jerryyifei (Contributor) commented Nov 26, 2023

The original sampling logic may create a row_id range that runs past the total record count.
This diff makes sure the row_ids stay within range.

Fixes #1662
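
For illustration, here is a minimal sketch of how the pre-fix draw can produce row_ids past the end of the dataset. The numbers are hypothetical; the variable names follow the sampler snippets quoted in the review below.

    import numpy as np

    total_records = 1000      # hypothetical dataset length
    chunk_size = 256
    chunk_sample_size = 64

    for i in range(0, total_records, chunk_size):
        # Pre-fix draw: ignores how many records actually remain after i.
        offset = i + np.random.randint(0, chunk_size - chunk_sample_size)
        if offset + chunk_sample_size > total_records:
            # In the last chunk (i = 768) offset can reach 959, so the requested
            # row_ids run up to 1022, past the last valid id 999.
            print("out-of-range row_ids:", offset, "to", offset + chunk_sample_size - 1)

Taking such row_ids from the dataset is presumably what surfaces as the out-of-range error reported in #1662.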


ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error, please inspect the "PR Title Check" action.

@rok linked an issue on Nov 27, 2023 that may be closed by this pull request.
@rok changed the title from "[Fix 1662] row_id range fix for index training on gpu" to "fix: row_id range fix for index training on gpu" on Nov 27, 2023.
@rok (Contributor) left a comment:

Thanks for doing this, @jerryyifei!
I've left a couple of comments.

python/python/lance/sampler.py (outdated, resolved)

        buf.extend(
            dataset.take(
-                list(range(offset, offset + chunk_sample_size)),
+                list(range(max(i, offset), offset + chunk_sample_size)),
@rok (Contributor):

Why change here?

                columns=columns,
            ).to_batches()
        )
        if idx % 50 == 0:
-            logging.info("Sampled at offset=%s, len=%s", offset, chunk_sample_size)
+            logging.info("Sampled at offset=%s, len=%s", max(i, offset), chunk_sample_size)
@rok (Contributor):

Ditto.

@wjones127 (Contributor) left a comment:

Thanks for making a PR! I think there's a simpler change that would do a better job of preserving the probability distribution.

-        offset = i + np.random.randint(0, chunk_size - chunk_sample_size)
+        offset = min(total_records - chunk_sample_size, i + np.random.randint(0, chunk_size - chunk_sample_size))
@wjones127 (Contributor):

Instead of altering the outputs of sampling, why not fix the inputs? I think all you have to do is change chunk_size in the last iteration to be the remaining size (total_records - i) here.

local_size = min(chunk_size, total_records - i)

Contributor:

That would also produce a more uniform distribution in the last chunk.
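
As a quick worked example of that clamp (hypothetical numbers, not from the PR): with 1000 records, a chunk size of 256, and a per-chunk sample of 64, the last chunk starts at i = 768 with only 232 records left.

    total_records, chunk_size, chunk_sample_size = 1000, 256, 64
    i = 768                                                   # start of the last chunk
    local_size = min(chunk_size, total_records - i)           # 232 instead of 256
    local_sample_size = min(chunk_sample_size, local_size)    # still 64
    # offset is then drawn from [i, i + local_size - local_sample_size),
    # so offset + local_sample_size never exceeds total_records.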

Comment on lines +78 to +88
        num_sampled += chunk_sample_size

        # If we are at the last chunk, we may not have enough records to sample.
        local_size = min(chunk_size, total_records - i)
        local_sample_size = min(chunk_sample_size, local_size)

        if local_sample_size < local_size:
            # Add more randomness within each chunk, if there is room.
            offset = i + np.random.randint(0, local_size - local_sample_size)
        else:
            offset = i
@rok (Contributor):

How about if we simplify like this?

Suggested change:

-        num_sampled += chunk_sample_size
-
-        # If we are at the last chunk, we may not have enough records to sample.
-        local_size = min(chunk_size, total_records - i)
-        local_sample_size = min(chunk_sample_size, local_size)
-
-        if local_sample_size < local_size:
-            # Add more randomness within each chunk, if there is room.
-            offset = i + np.random.randint(0, local_size - local_sample_size)
-        else:
-            offset = i
+        # If we are at the last chunk, we may not have enough records to sample.
+        local_sample_size = min(chunk_sample_size, total_records - i)
+        num_sampled += local_sample_size
+        offset = i + np.random.randint(0, local_sample_size)

@wjones127 (Contributor):

That looks like a totally different computation for offset? Offset needs to be constrained such that offset + local_sample_size <= local_size, otherwise we can get duplicate rows sampled.

@rok (Contributor) commented Nov 29, 2023:

I'm thinking:

        local_size = min(chunk_size, total_records - i)
        local_sample_size = min(chunk_sample_size, local_size)

combining:

        local_sample_size = min(chunk_sample_size, min(chunk_size, total_records - i)) 

simplify:

        # because chunk_size > chunk_sample_size
        local_sample_size = min(chunk_sample_size, total_records - i)

local_sample_size <= local_size is always true because local_sample_size = min(chunk_sample_size, local_size), so we can just say offset = i + np.random.randint(0, local_sample_size), which I find more readable.
I hope I didn't miss something obvious :).

@rok (Contributor):

I think the change is good either way, just looking to get rid of an extra variable.

@wjones127 (Contributor):

I still don't understand why

offset = i + np.random.randint(0, local_sample_size)

is valid.

We are trying to find a random offset such that a window of size local_sample_size is inside of the window of size local_size. Hence the invariant I mentioned:

offset + local_sample_size <= local_size
 ◄────────────── local_size ──────────────►

 ┌──────┬────────────────┬────────────────┐
 │      │                │                │
 │i     │i + offset      │i + offset + lss│i + local_size
 │      │                │                │
 └──────┴────────────────┴────────────────┘

 ◄──┬───► ◄──────┬───────►
    │            │
 offset          │
         local_sample_size
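
For illustration, a small numeric check of that invariant (hypothetical values; rel_offset here is the offset measured from i, as in the diagram):

    import numpy as np

    # hypothetical last chunk: 100 records remain, and 64 of them are sampled
    local_size, local_sample_size = 100, 64

    # Simplified draw: rel_offset can reach 63, so the sampled window can end
    # at 127 > local_size, spilling past the chunk (duplicate or out-of-range rows).
    rel_offset = np.random.randint(0, local_sample_size)
    print(rel_offset + local_sample_size <= local_size)       # sometimes False

    # Constrained draw keeps the window inside the chunk:
    rel_offset = np.random.randint(0, local_size - local_sample_size)
    assert rel_offset + local_sample_size <= local_size       # always holds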

@rok (Contributor):

Oh I see! The offset would have to be:

offset = i + np.random.randint(0, max(0, chunk_size - local_sample_size))

This is fine though :)

@wjones127 merged commit 8cc3fda into lancedb:main on Nov 29, 2023. 11 checks passed.
@wjones127 (Contributor):
Thanks for your help @jerryyifei!

@jerryyifei (Contributor, Author):

Thanks to @wjones127 and @rok for helping merge this! Sorry, I was away on a trip and didn't reply to this thread in time.

Successfully merging this pull request may close these issues.

[bug] Using gpu to train index would raise row_id out of range issue.