refactor: Improve efficiency of `.scan` (and `.load`) if `limit` is set #2434
Conversation
By never requesting more samples than required.

I've taken it a little further than we originally discussed, i.e. I've also improved the performance if
Codecov Report

Patch coverage:

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #2434   +/-   ##
===========================================
+ Coverage    91.86%   92.00%   +0.14%
===========================================
  Files          161      162       +1
  Lines         7869     7896      +27
===========================================
+ Hits          7229     7265      +36
+ Misses         640      631       -9
===========================================
```

Flags with carried forward coverage won't be shown.

☔ View full report at Codecov.
Good work, @tomaarsen !!!
Hi, @frascuchon @tomaarsen, currently

I'm in favour of

I'm in favor of

Yes, for me it can be the PR where/when/if we actually start using the

I think we're best off keeping that separate from this PR for now, but we should work on that soon. Do you want to raise an issue, @davidberenstein1957? (i.e. one issue with two parts: rename

Yes, I prefer to keep PRs as simple as possible. We can tackle the

UPDATE: it's the same thing that @tomaarsen wrote one hour ago 😄

Great minds think alike ;)
Closes #2454

Hello!

## Pull Request overview

* Extend my previous work to expose a `batch_size` parameter to `rg.load` (and `rg.scan`, `_load_records_new_fashion`).
* Update the docstrings for the public methods.
* The actual implementation is very simple due to my earlier work in #2434:
  * `batch_size = self.DEFAULT_SCAN_SIZE` -> `batch_size = batch_size or self.DEFAULT_SCAN_SIZE`
* Add an (untested!) warning if `batch_size` is supplied with an old server version.
  * (We don't have compatibility tests for old server versions.)
* I've simplified `test_scan_efficient_limiting` with `batch_size` rather than using a mocker to update `DEFAULT_SCAN_SIZE`.
* I've also extended this test to test `rg.load` in addition to `active_api().datasets.scan`.

## Note

Unlike previously discussed with @davidberenstein1957, I have *not* added a warning like so:
https://github.com/argilla-io/argilla/blob/f5834a5408051bf1fa60d42aede6b325dc07fdbd/src/argilla/client/client.py#L338-L343

The server limits the maximum batch size to 500 for loading, so there is no need to have a warning when using a batch size of over 5000.

## Discussion

1. Should I include in the docstring that the default batch size is 250? That would be "hardcoded" into the docs, so if we ever change the default (`self.DEFAULT_SCAN_SIZE`), then we would have to remember to update the docs too.
2. Alternatively, should we deprecate `self.DEFAULT_SCAN_SIZE` and enforce that `batch_size` must be set for `datasets.scan`, with a default size of 250 everywhere? Then people can see the default batch size in the signature.

I think we should do option 2. Let me know how you feel.

---

**Type of change**

- [x] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

Modified and updated a test, ran all tests.
**Checklist**

- [x] I have merged the original branch into my forked branch
- [x] I added relevant documentation
- [x] follows the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works

---

- Tom Aarsen
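The one-line change described above (`batch_size = batch_size or self.DEFAULT_SCAN_SIZE`) is a common defaulting pattern. A minimal sketch, with a hypothetical class and an assumed default of 250 (the real value lives in argilla's client code):

```python
class Datasets:
    # Assumed default value; illustrative only.
    DEFAULT_SCAN_SIZE = 250

    def scan(self, batch_size=None):
        # A user-supplied batch_size wins; otherwise fall back to the default.
        batch_size = batch_size or self.DEFAULT_SCAN_SIZE
        return batch_size  # stand-in for the actual request logic

print(Datasets().scan())                # falls back to DEFAULT_SCAN_SIZE
print(Datasets().scan(batch_size=100))  # explicit value wins
```

One caveat of the `or` idiom: a `batch_size` of `0` is falsy and silently becomes the default, which is harmless here since a zero-sized batch is meaningless.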
# [1.4.0](v1.3.1...v1.4.0) (2023-03-09)

### Features

* `configure_dataset` accepts a workspace as argument ([#2503](#2503)) ([29c9ee3](29c9ee3))
* Add `active_client` function to main argilla module ([#2387](#2387)) ([4e623d4](4e623d4)), closes [#2183](#2183)
* Add text2text support for prepare for training spark nlp ([#2466](#2466)) ([21efb83](21efb83)), closes [#2465](#2465) [#2482](#2482)
* Allow passing workspace as client param for `rg.log` or `rg.load` ([#2425](#2425)) ([b3b897a](b3b897a)), closes [#2059](#2059)
* Bulk annotation improvement ([#2437](#2437)) ([3fce915](3fce915)), closes [#2264](#2264)
* Deprecate `chunk_size` in favor of `batch_size` for `rg.log` ([#2455](#2455)) ([3ebea76](3ebea76)), closes [#2453](#2453)
* Expose `batch_size` parameter for `rg.load` ([#2460](#2460)) ([e25be3e](e25be3e)), closes [#2454](#2454) [#2434](#2434)
* Extend shortcuts to include alphabet for token classification ([#2339](#2339)) ([4a92b35](4a92b35))

### Bug Fixes

* added flexible app redirect to docs page ([#2428](#2428)) ([5600301](5600301)), closes [#2377](#2377)
* added regex match to set workspace method ([#2427](#2427)) ([d789fa1](d789fa1)), closes [#2388]
* error when loading record with empty string query ([#2429](#2429)) ([fc71c3b](fc71c3b)), closes [#2400](#2400) [#2303](#2303)
* Remove extra-action dropdown state after navigation ([#2479](#2479)) ([9328994](9328994)), closes [#2158](#2158)

### Documentation

* Add AutoTrain to readme ([7199780](7199780))
* Add migration to label schema section ([#2435](#2435)) ([d57a1e5](d57a1e5)), closes [#2003](#2003)
* Adds zero+few shot tutorial with SetFit ([#2409](#2409)) ([6c679ad](6c679ad))
* Update readme with quickstart section and new links to guides ([#2333](#2333)) ([91a77ad](91a77ad))

## As always, thanks to our amazing contributors!

- Documentation update: adding missing n (#2362) by @Gnonpi
- feat: Extend shortcuts to include alphabet for token classification (#2339) by @cceyda
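The "Deprecate `chunk_size` in favor of `batch_size`" entry suggests the usual parameter-deprecation pattern. A hypothetical sketch, not argilla's actual implementation (function name, default, and warning category are assumptions):

```python
import warnings

def log(records, batch_size=None, chunk_size=None):
    """Hypothetical sketch of deprecating `chunk_size` in favor of `batch_size`."""
    if chunk_size is not None:
        warnings.warn(
            "`chunk_size` is deprecated and will be removed in a future "
            "version; use `batch_size` instead.",
            FutureWarning,
            stacklevel=2,
        )
        batch_size = chunk_size  # old parameter still honored for now
    batch_size = batch_size or 100  # assumed default
    return batch_size  # stand-in for the actual logging logic
```

Callers using the old name keep working but see a warning, giving them a release cycle to migrate.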
Hello!

## Pull Request overview

* Refactor `Datasets.scan` to never request more samples than required.
* Introduce a `batch_size` variable, which is now set to `self.DEFAULT_SCAN_SIZE`. This parameter can easily be exposed in the future for `batch_size` support, as discussed previously in Slack etc.

## Details
The core of the changes is that the URL is now built on-the-fly at every request, rather than just once. For each request, the limit for that request is computed using `min(limit, batch_size)`, and `limit` is decremented, allowing it to represent the "remaining samples to fetch". When `limit` reaches 0, i.e. 0 more samples to fetch, we return from `scan`.

I've also added a check to ensure that the `limit` passed to `scan` can't be negative.
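The loop described above can be sketched as follows; `fetch` is a hypothetical stand-in for the HTTP request, and the defaults are illustrative rather than argilla's actual internals:

```python
def scan(fetch, limit=None, batch_size=250):
    """Yield records, never requesting more than `limit` samples in total.

    `fetch(n)` is assumed to return up to `n` records from the server.
    """
    if limit is not None and limit < 0:
        raise ValueError("`limit` must be non-negative")
    while limit is None or limit > 0:
        # Each request asks for at most min(limit, batch_size) records.
        request_size = batch_size if limit is None else min(limit, batch_size)
        batch = fetch(request_size)
        yield from batch
        if limit is not None:
            limit -= len(batch)  # `limit` now holds the samples still to fetch
        if len(batch) < request_size:  # server has no more records
            return
```

For example, with `batch_size=10` and `limit=23`, this issues three requests of sizes 10, 10, and 3.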
## Tests

For tests, I've created a `test_scan_efficient_limiting` function which verifies the new and improved behaviour. It contains two monkeypatches:

* `DEFAULT_SCAN_SIZE` is set to 10. Because our dataset in this test only has 100 samples, we want to ensure that we can't just sample the entire dataset with one request.
* The `http_client.post` method is monkeypatched to allow us to track the calls to the server.

We test the following scenarios:

* `limit=23` -> 3 requests: for 10, 10 and 3 samples.
* `limit=20` -> 2 requests: for 10, 10 samples.
* `limit=6` -> 1 request: for 6 samples.

There's also a test to cover the new `ValueError` if `limit` < 0.
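The call-tracking monkeypatch described above can be sketched like this; `HttpClient`, its `post` method, and the request path are illustrative stand-ins, not argilla's real internals:

```python
from unittest import mock

class HttpClient:
    def post(self, path, json=None):
        # Stand-in for a real HTTP call to the server.
        return {"records": []}

client = HttpClient()
original_post = client.post
payloads = []

def tracking_post(path, json=None):
    # Record every request body, then defer to the original method.
    payloads.append(json)
    return original_post(path, json=json)

with mock.patch.object(client, "post", side_effect=tracking_post) as spy:
    # Simulate what scan() would send for limit=23 with DEFAULT_SCAN_SIZE=10.
    for size in (10, 10, 3):
        client.post("/records/search", json={"limit": size})

assert spy.call_count == 3
assert [p["limit"] for p in payloads] == [10, 10, 3]
```

Wrapping the original method instead of replacing it outright keeps the real behaviour while letting the test assert on the number and shape of requests.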
## Effects

Consider the following script:

On this PR, this outputs:

On the `develop` branch, this outputs:

This can be repeated for `rg.load(..., limit=1)`, as that relies on `.scan` under the hood.

Note that this is the most extreme example of performance gains. The performance increases in almost all scenarios if `limit` is set, but the effects are generally not this big. Going from fetching 250 samples 8 times to fetching 250 samples 7 times and 173 once doesn't have as big of an impact.

## Type of change

## How Has This Been Tested

See above.

## Checklist
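The closing comparison (eight full batches versus seven full batches plus one partial one, i.e. a limit of 1923 with batches of 250) can be checked with a quick request-plan calculation; this helper is illustrative only:

```python
def request_plan(limit, batch_size=250):
    # Sizes of the requests a limit-aware scan would issue.
    sizes = []
    while limit > 0:
        sizes.append(min(limit, batch_size))
        limit -= sizes[-1]
    return sizes

print(request_plan(1923))               # seven batches of 250, then one of 173
print(request_plan(23, batch_size=10))  # the limit=23 test scenario
```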