
refactor: Improve efficiency of .scan (and .load) if limit is set #2434

Merged

Conversation

tomaarsen
Contributor

Hello!

Pull Request overview

  • Refactor Datasets.scan to never request more samples than required.
  • Implements a batch_size variable which is now set to self.DEFAULT_SCAN_SIZE. This parameter can easily be exposed in the future for batch_size support, as discussed previously in Slack etc.

Details

The core of the changes is that the URL is now built on-the-fly at every request, rather than just once. For each request, the limit for that request is computed using min(limit, batch_size), and limit is decremented, allowing it to represent a "remaining samples to fetch". When limit reaches 0, i.e. 0 more samples to fetch, we return from scan.

I've also added a check to ensure that the limit passed to scan can't be negative.
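The batching logic described above can be sketched in a few lines. This is an illustrative standalone version, not the actual Argilla internals; `fetch_batch` is a hypothetical stand-in for one POST request to the server.

```python
def scan(fetch_batch, limit=None, batch_size=250):
    """Yield at most `limit` records, never requesting more than needed.

    `fetch_batch(n)` stands in for one request to the server asking
    for up to `n` records.
    """
    if limit is not None and limit < 0:
        raise ValueError(f"limit must be non-negative, got {limit}")
    while limit is None or limit > 0:
        # Never ask the server for more than we still need.
        request_size = batch_size if limit is None else min(limit, batch_size)
        batch = fetch_batch(request_size)
        yield from batch
        if limit is not None:
            limit -= len(batch)  # `limit` now tracks the remaining samples
        if len(batch) < request_size:  # dataset exhausted
            return
```

With `batch_size=10` and `limit=23`, this issues requests of sizes 10, 10 and 3, matching the scenarios covered in the tests.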

Tests

For tests, I've created a test_scan_efficient_limiting function which verifies the new and improved behaviour. It contains two monkeypatches:

  1. DEFAULT_SCAN_SIZE is set to 10. Because our dataset in this test only has 100 samples, we want to ensure that we can't just sample the entire dataset with one request.
  2. The http_client.post method is monkeypatched to allow us to track the calls to the server.

We test the following scenarios:

  • limit=23 -> 3 requests: for 10, 10 and 3 samples.
  • limit=20 -> 2 requests: for 10, 10 samples.
  • limit=6 -> 1 request: for 6 samples.

There's also a test to cover the new ValueError if limit < 0.
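The shape of that test can be illustrated with a self-contained sketch. The fake client and helper below are hypothetical stand-ins for the real `http_client` and test fixtures, showing the monkeypatch-style call tracking:

```python
DEFAULT_SCAN_SIZE = 10  # monkeypatched down from the real default for the test


class FakeClient:
    """Stands in for the real http_client; serves a 100-sample dataset."""

    def __init__(self):
        self.remaining = 100

    def post(self, limit):
        served = min(limit, self.remaining)
        self.remaining -= served
        return [{"id": i} for i in range(served)]


def scan_with_tracking(limit):
    """Run a scan against the fake server, recording each request size."""
    client = FakeClient()
    calls = []
    real_post = client.post

    def tracking_post(n):  # monkeypatch no. 2: record each call to the server
        calls.append(n)
        return real_post(n)

    client.post = tracking_post

    fetched = []
    while limit > 0:
        request = min(limit, DEFAULT_SCAN_SIZE)
        batch = client.post(request)
        fetched.extend(batch)
        limit -= len(batch)
        if len(batch) < request:
            break
    return calls, len(fetched)


assert scan_with_tracking(23) == ([10, 10, 3], 23)
assert scan_with_tracking(20) == ([10, 10], 20)
assert scan_with_tracking(6) == ([6], 6)
```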

Effects

Consider the following script:

import argilla as rg
import time

def test_scan_records(fields):
    client = rg.active_client()

    data = client.datasets.scan(
        name="banking77-topics-setfit",
        projection=fields,
        limit=1,  # <- Note, just one sample
    )
    start_t = time.time()
    list(data)
    print(f"load time for 1 sample with fields={fields}: {time.time() - start_t:.4f}s")

test_scan_records(set())
test_scan_records({"text"})
test_scan_records({"tokens"})

On this PR, this outputs:

load time for 1 sample with fields=set(): 0.0774s
load time for 1 sample with fields={'text'}: 0.0646s
load time for 1 sample with fields={'tokens'}: 0.0669s

On the develop branch, this outputs:

load time for 1 sample with fields=set(): 0.1355s
load time for 1 sample with fields={'text'}: 0.1347s
load time for 1 sample with fields={'tokens'}: 0.1173s

This can be repeated for rg.load(..., limit=1), as that relies on .scan under the hood.

Note that this is the most extreme example of performance gains. The performance increases in almost all scenarios if limit is set, but the effects are generally not this big. Going from fetching 250 samples 8 times to fetching 250 samples 7 times and 173 once doesn't have as big of an impact.
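The arithmetic behind that last example can be made concrete. A hypothetical comparison, assuming a limit of 1923 samples (7 × 250 + 173) with the default batch size of 250:

```python
import math


def requests_before(limit, batch_size=250):
    """Old behaviour: every request asks for a full batch."""
    return [batch_size] * math.ceil(limit / batch_size)


def requests_after(limit, batch_size=250):
    """New behaviour: the last request only asks for what is left."""
    sizes = []
    while limit > 0:
        sizes.append(min(limit, batch_size))
        limit -= sizes[-1]
    return sizes


print(requests_before(1923))  # eight requests of 250 -> 2000 samples requested
print(requests_after(1923))   # seven requests of 250, then one of 173 -> exactly 1923
```

Both variants make eight requests here, so the saving is only the 77 over-fetched samples of the final batch, rather than an entire request.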


Type of change

  • Refactor (change restructuring the codebase without changing functionality)

How Has This Been Tested

See above.

Checklist

  • I have merged the original branch into my forked branch
  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works

  • Tom Aarsen

@tomaarsen
Contributor Author

I've taken it a little further than we originally discussed, i.e. I've also improved the performance if limit > DEFAULT_SCAN_SIZE. However, no new parameters (e.g. batch_size) are introduced yet, as I imagine there should be a discussion on how to introduce that throughout Argilla as a whole.

@codecov

codecov bot commented Feb 28, 2023

Codecov Report

Patch coverage: 98.20% and project coverage change: +0.14 🎉

Comparison is base (5a8bb28) 91.86% compared to head (e7c356f) 92.00%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2434      +/-   ##
===========================================
+ Coverage    91.86%   92.00%   +0.14%     
===========================================
  Files          161      162       +1     
  Lines         7869     7896      +27     
===========================================
+ Hits          7229     7265      +36     
+ Misses         640      631       -9     
| Flag | Coverage Δ |
|---|---|
| pytest | 92.00% <98.20%> (+0.14%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| src/argilla/__init__.py | 86.66% <ø> (ø) |
| src/argilla/server/daos/backend/search/model.py | 100.00% <ø> (ø) |
| src/argilla/server/daos/datasets.py | 95.45% <ø> (-0.06%) ⬇️ |
| ...ver/services/tasks/text_classification/__init__.py | 100.00% <ø> (ø) |
| ...gilla/labeling/text_classification/label_errors.py | 90.12% <83.33%> (-0.13%) ⬇️ |
| src/argilla/client/datasets.py | 74.72% <93.33%> (+0.09%) ⬆️ |
| src/argilla/utils/dependency.py | 96.15% <96.15%> (ø) |
| src/argilla/client/__init__.py | 100.00% <100.00%> (ø) |
| src/argilla/client/apis/datasets.py | 97.24% <100.00%> (+1.05%) ⬆️ |
| src/argilla/client/client.py | 87.13% <100.00%> (-0.42%) ⬇️ |
| ... and 18 more | |


@frascuchon frascuchon added this to the v1.4.0 milestone Feb 28, 2023
Member

@frascuchon frascuchon left a comment


Good work, @tomaarsen !!!

@davidberenstein1957
Member

Hi, @frascuchon @tomaarsen, currently rg.log uses chunk_size. Reviewing this discussion, I am in favour of deprecating this variable and moving over to the more intuitive batch_size. If you don't agree, we should also use chunk_size within this PR.

@tomaarsen
Contributor Author

I'm in favour of batch_size over chunk_size. I think (in a future PR?) we should expose batch_size in both functions.

@dvsrepo
Member

dvsrepo commented Feb 28, 2023

I'm in favor of batch_size too

@davidberenstein1957
Member

Yes, for me it can be the PR where/when/if we actually start using the batch_size within rg.load

@tomaarsen
Contributor Author

I think we're best off keeping that separate from this PR for now, but we should work on that soon. Do you want to raise an issue, @davidberenstein1957? (i.e. one issue with two parts: rename chunk_size to batch_size for rg.log and introduce batch_size for rg.load)

@frascuchon
Member

frascuchon commented Feb 28, 2023

Yes, I prefer to keep PRs as simple as possible. We can tackle the batch_size in 2 separate PRs:

  • Deprecate the current chunk_size in favour of batch_size for rg.load
  • Add a batch_size argument to rg.load to load data (also adding this new argument to the datasets.scan method)

UPDATE: this is the same as what @tomaarsen wrote an hour ago 😄

@tomaarsen
Contributor Author

Great minds think alike ;)

@frascuchon frascuchon merged commit f5834a5 into argilla-io:develop Mar 1, 2023
@tomaarsen tomaarsen deleted the refactor/prevent_unnecessary_search branch March 1, 2023 13:04
frascuchon pushed a commit that referenced this pull request Mar 6, 2023
Closes #2454 

Hello!

## Pull Request overview
* Extend my previous work to expose a `batch_size` parameter to `rg.load` (and `rg.scan`, `_load_records_new_fashion`).
  * Update the docstrings for the public methods.
* The actual implementation is very simple due to my earlier work in #2434:
  * `batch_size = self.DEFAULT_SCAN_SIZE` -> `batch_size = batch_size or self.DEFAULT_SCAN_SIZE`
* Add an (untested!) warning if `batch_size` is supplied with an old server version.
  * (We don't have compatibility tests for old server versions)
* I've simplified `test_scan_efficient_limiting` with `batch_size` rather than using a mocker to update `DEFAULT_SCAN_SIZE`.
* I've also extended this test to cover `rg.load` in addition to `active_api().datasets.scan`.

## Note
Contrary to what was previously discussed with @davidberenstein1957, I have *not* added a warning like so:

https://github.com/argilla-io/argilla/blob/f5834a5408051bf1fa60d42aede6b325dc07fdbd/src/argilla/client/client.py#L338-L343

The server limits the maximum batch size to 500 for loading, so there is no need to have a warning when using a batch size of over 5000.

## Discussion
1. Should I include in the docstring that the default batch size is 250? That would be "hardcoded" into the docs, so if we ever change the default (`self.DEFAULT_SCAN_SIZE`), then we would have to remember to update the docs too.
2. Alternatively, should we deprecate `self.DEFAULT_SCAN_SIZE` and enforce that `batch_size` must be set for `datasets.scan`, with a default size of 250 everywhere? Then people can see the default batch size in the signature.

I think we should do option 2. Let me know how you feel.

---

**Type of change**

- [x] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

Modified and updated a test, ran all tests.

**Checklist**

- [x] I have merged the original branch into my forked branch
- [x] I added relevant documentation
- [x] My code follows the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works

---

- Tom Aarsen
@frascuchon frascuchon mentioned this pull request Mar 8, 2023
frascuchon added a commit that referenced this pull request Mar 9, 2023
# [1.4.0](v1.3.1...v1.4.0) (2023-03-09)

### Features

* `configure_dataset` accepts a workspace as argument ([#2503](#2503)) ([29c9ee3](29c9ee3))
* Add `active_client` function to main argilla module ([#2387](#2387)) ([4e623d4](4e623d4)), closes [#2183](#2183)
* Add text2text support for prepare for training spark nlp ([#2466](#2466)) ([21efb83](21efb83)), closes [#2465](#2465) [#2482](#2482)
* Allow passing workspace as client param for `rg.log` or `rg.load` ([#2425](#2425)) ([b3b897a](b3b897a)), closes [#2059](#2059)
* Bulk annotation improvement ([#2437](#2437)) ([3fce915](3fce915)), closes [#2264](#2264)
* Deprecate `chunk_size` in favor of `batch_size` for `rg.log` ([#2455](#2455)) ([3ebea76](3ebea76)), closes [#2453](#2453)
* Expose `batch_size` parameter for `rg.load` ([#2460](#2460)) ([e25be3e](e25be3e)), closes [#2454](#2454) [#2434](#2434)
* Extend shortcuts to include alphabet for token classification ([#2339](#2339)) ([4a92b35](4a92b35))


### Bug Fixes

* added flexible app redirect to docs page ([#2428](#2428)) ([5600301](5600301)), closes [#2377](#2377)
* added regex match to set workspace method ([#2427](#2427)) ([d789fa1](d789fa1)), closes [#2388]
* error when loading record with empty string query ([#2429](#2429)) ([fc71c3b](fc71c3b)), closes [#2400](#2400) [#2303](#2303)
* Remove extra-action dropdown state after navigation ([#2479](#2479)) ([9328994](9328994)), closes [#2158](#2158)


### Documentation

* Add AutoTrain to readme ([7199780](7199780))
* Add migration to label schema section ([#2435](#2435)) ([d57a1e5](d57a1e5)), closes [#2003](#2003)
* Adds zero+few shot tutorial with SetFit ([#2409](#2409)) ([6c679ad](6c679ad))
* Update readme with quickstart section and new links to guides ([#2333](#2333)) ([91a77ad](91a77ad))


## As always, thanks to our amazing contributors!
- Documentation update: adding missing n (#2362) by @Gnonpi
- feat: Extend shortcuts to include alphabet for token classification
(#2339) by @cceyda