fix: row_id range fix for index training on gpu #1663
Conversation
ACTION NEEDED: Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Thanks for doing this, @jerryyifei!
I've left a couple of comments.
python/python/lance/sampler.py
Outdated
  buf.extend(
      dataset.take(
-         list(range(offset, offset + chunk_sample_size)),
+         list(range(max(i, offset), offset + chunk_sample_size)),
Why change here?
python/python/lance/sampler.py
Outdated
          columns=columns,
      ).to_batches()
  )
  if idx % 50 == 0:
-     logging.info("Sampled at offset=%s, len=%s", offset, chunk_sample_size)
+     logging.info("Sampled at offset=%s, len=%s", max(i, offset), chunk_sample_size)
Ditto
Thanks for making a PR! I think there's a simpler change that would do a better job of preserving the probability distribution.
python/python/lance/sampler.py
Outdated
- offset = i + np.random.randint(0, chunk_size - chunk_sample_size)
+ offset = min(total_records - chunk_sample_size, i + np.random.randint(0, chunk_size - chunk_sample_size))
Instead of altering the outputs of sampling, why not fix the inputs? I think all you have to do is change chunk_size in the last iteration to be the remaining size (total_records - i) here:

    local_size = min(chunk_size, total_records - i)
That would also produce a more uniform distribution in the last chunk.
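To make the last-iteration arithmetic concrete, here is a tiny sketch of that clamp (the numbers are illustrative assumptions of mine, not values from the PR):

    total_records, chunk_size, chunk_sample_size = 1_000, 300, 100  # illustrative

    i = 900  # last iteration of range(0, total_records, chunk_size)
    local_size = min(chunk_size, total_records - i)  # min(300, 100) = 100

    # A sample window placed anywhere inside [i, i + local_size) stays in
    # range, whereas the unclamped chunk_size allowed windows up to i + 300.
    assert i + local_size <= total_records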
    num_sampled += chunk_sample_size

    # If we are at the last chunk, we may not have enough records to sample.
    local_size = min(chunk_size, total_records - i)
    local_sample_size = min(chunk_sample_size, local_size)

    if local_sample_size < local_size:
        # Add more randomness within each chunk, if there is room.
        offset = i + np.random.randint(0, local_size - local_sample_size)
    else:
        offset = i
How about if we simplify like this?
-   num_sampled += chunk_sample_size
-   # If we are at the last chunk, we may not have enough records to sample.
-   local_size = min(chunk_size, total_records - i)
-   local_sample_size = min(chunk_sample_size, local_size)
-   if local_sample_size < local_size:
-       # Add more randomness within each chunk, if there is room.
-       offset = i + np.random.randint(0, local_size - local_sample_size)
-   else:
-       offset = i
+   # If we are at the last chunk, we may not have enough records to sample.
+   local_sample_size = min(chunk_sample_size, total_records - i)
+   num_sampled += local_sample_size
+   offset = i + np.random.randint(0, local_sample_size)
That looks like a totally different computation for offset? The offset needs to be constrained such that offset + local_sample_size <= local_size, otherwise we can get duplicate rows sampled.
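A worst-case example (illustrative numbers of my own) makes the spill concrete:

    i, local_size, local_sample_size = 0, 100, 60  # illustrative numbers

    # Worst case of the proposed formula: randint can return local_sample_size - 1.
    worst_offset = i + (local_sample_size - 1)      # i + 59
    window_end = worst_offset + local_sample_size   # i + 119
    print(window_end <= i + local_size)             # False: the window spills
                                                    # into rows the next chunk may sample again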
I'm thinking:

    local_size = min(chunk_size, total_records - i)
    local_sample_size = min(chunk_sample_size, local_size)

combining:

    local_sample_size = min(chunk_sample_size, min(chunk_size, total_records - i))

simplifying, because chunk_size > chunk_sample_size:

    local_sample_size = min(chunk_sample_size, total_records - i)

local_sample_size <= local_size is always true because local_sample_size = min(chunk_sample_size, local_size), so we can just say offset = i + np.random.randint(0, local_sample_size), which I find more readable.
I hope I didn't miss something obvious :).
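The min() identity itself does check out by brute force (a throwaway verification of mine, not part of the PR); the disagreement that follows is only about the bound passed to np.random.randint:

    # Exhaustive check of: min(css, min(cs, r)) == min(css, r) whenever cs > css.
    for cs in range(2, 20):            # chunk_size
        for css in range(1, cs):       # chunk_sample_size, with cs > css
            for r in range(1, 25):     # total_records - i
                assert min(css, min(cs, r)) == min(css, r)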
I think the change is good either way, just looking to get rid of an extra variable.
I still don't understand why offset = i + np.random.randint(0, local_sample_size) is valid.
We are trying to find a random offset such that a window of size local_sample_size fits inside the window of size local_size. Hence the invariant I mentioned:

    offset + local_sample_size <= local_size

    ◄────────────── local_size ──────────────►
    ┌──────┬────────────────┬────────────────┐
    │      │                │                │
    │i     │i + offset      │i + offset + lss│i + local_size
    │      │                │                │
    └──────┴────────────────┴────────────────┘
    ◄─────► ◄──────────────►
     offset  local_sample_size
Oh I see! The offset would have to be:

    offset = i + np.random.randint(0, max(0, chunk_size - local_sample_size))

This is fine though :)
Thanks for your help @jerryyifei!
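One side note, my own observation rather than something raised in the thread: np.random.randint(low, high) raises a ValueError when low >= high, so even a max(0, ...) bound leaves the zero-width case to handle, which is presumably why the version shown earlier in the thread branches on local_sample_size < local_size instead:

    import numpy as np

    try:
        np.random.randint(0, 0)  # zero-width range
    except ValueError as err:
        print(err)  # numpy rejects low >= high rather than returning 0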
Thanks to @wjones127 and @rok for helping merge! Sorry, I was away on a trip and didn't reply to this thread in time.
The original sampling logic could produce a row_id range that extends past the total record count. This diff makes sure the row_ids stay within range.
Fixes #1662
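For context, a minimal reproduction of the out-of-range failure mode might look like this (illustrative numbers of mine; the real loop lives in python/python/lance/sampler.py):

    import numpy as np

    total_records = 1_050                  # illustrative numbers
    chunk_size, chunk_sample_size = 500, 100

    i = 1_000                              # last chunk of range(0, 1_050, 500)
    offset = i + np.random.randint(0, chunk_size - chunk_sample_size)
    row_ids = list(range(offset, offset + chunk_sample_size))

    # offset >= 1_000, so the window always reaches at least row 1_099,
    # well past total_records - 1 = 1_049; dataset.take() would be handed
    # row ids that do not exist.
    print(max(row_ids) >= total_records)   # True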