
refactor: Improve efficiency of .scan (and .load) if limit is set #2434

Merged

Conversation

tomaarsen
Contributor

Hello!

Pull Request overview

  • Refactor Datasets.scan to never request more samples than required.
  • Implements a batch_size variable which is now set to self.DEFAULT_SCAN_SIZE. This parameter can easily be exposed in the future for batch_size support, as discussed previously in Slack etc.

Details

The core of the changes is that the URL is now built on-the-fly at every request, rather than just once. For each request, the limit for that request is computed using min(limit, batch_size), and limit is decremented, allowing it to represent a "remaining samples to fetch". When limit reaches 0, i.e. 0 more samples to fetch, we return from scan.

I've also added a check to ensure that the limit passed to scan can't be negative.
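The batching logic described above can be sketched in a few lines. This is an illustrative standalone version, not the actual Argilla internals; `fetch_batch` is a hypothetical stand-in for one POST request to the server.

```python
def scan(fetch_batch, limit=None, batch_size=250):
    """Yield at most `limit` records, never requesting more than needed.

    `fetch_batch(n)` stands in for one request to the server asking
    for up to `n` records.
    """
    if limit is not None and limit < 0:
        raise ValueError(f"limit must be non-negative, got {limit}")
    while limit is None or limit > 0:
        # Never ask the server for more than we still need.
        request_size = batch_size if limit is None else min(limit, batch_size)
        batch = fetch_batch(request_size)
        yield from batch
        if limit is not None:
            limit -= len(batch)  # `limit` now tracks the remaining samples
        if len(batch) < request_size:  # dataset exhausted
            return
```

With `batch_size=10` and `limit=23`, this issues requests of sizes 10, 10 and 3, matching the scenarios covered in the tests.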

Tests

For tests, I've created a test_scan_efficient_limiting function which verifies the new and improved behaviour. It contains two monkeypatches:

  1. DEFAULT_SCAN_SIZE is set to 10. Because our dataset in this test only has 100 samples, we want to ensure that we can't just sample the entire dataset with one request.
  2. The http_client.post method is monkeypatched to allow us to track the calls to the server.

We test the following scenarios:

  • limit=23 -> 3 requests: for 10, 10 and 3 samples.
  • limit=20 -> 2 requests: for 10, 10 samples.
  • limit=6 -> 1 request: for 6 samples.

There's also a test to cover the new ValueError if limit < 0.
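The shape of that test can be illustrated with a self-contained sketch. The fake client and helper below are hypothetical stand-ins for the real `http_client` and test fixtures, showing the monkeypatch-style call tracking:

```python
DEFAULT_SCAN_SIZE = 10  # monkeypatched down from the real default for the test


class FakeClient:
    """Stands in for the real http_client; serves a 100-sample dataset."""

    def __init__(self):
        self.remaining = 100

    def post(self, limit):
        served = min(limit, self.remaining)
        self.remaining -= served
        return [{"id": i} for i in range(served)]


def scan_with_tracking(limit):
    """Run a scan against the fake server, recording each request size."""
    client = FakeClient()
    calls = []
    real_post = client.post

    def tracking_post(n):  # monkeypatch no. 2: record each call to the server
        calls.append(n)
        return real_post(n)

    client.post = tracking_post

    fetched = []
    while limit > 0:
        request = min(limit, DEFAULT_SCAN_SIZE)
        batch = client.post(request)
        fetched.extend(batch)
        limit -= len(batch)
        if len(batch) < request:
            break
    return calls, len(fetched)


assert scan_with_tracking(23) == ([10, 10, 3], 23)
assert scan_with_tracking(20) == ([10, 10], 20)
assert scan_with_tracking(6) == ([6], 6)
```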

Effects

Consider the following script:

import argilla as rg
import time

def test_scan_records(fields):
    client = rg.active_client()

    data = client.datasets.scan(
        name="banking77-topics-setfit",
        projection=fields,
        limit=1,  # <- Note, just one sample
    )
    start_t = time.time()
    list(data)
    print(f"load time for 1 sample with fields={fields}: {time.time() - start_t:.4f}s")

test_scan_records(set())
test_scan_records({"text"})
test_scan_records({"tokens"})

On this PR, this outputs:

load time for 1 sample with fields=set(): 0.0774s
load time for 1 sample with fields={'text'}: 0.0646s
load time for 1 sample with fields={'tokens'}: 0.0669s

On the develop branch, this outputs:

load time for 1 sample with fields=set(): 0.1355s
load time for 1 sample with fields={'text'}: 0.1347s
load time for 1 sample with fields={'tokens'}: 0.1173s

This can be repeated for rg.load(..., limit=1), as that relies on .scan under the hood.

Note that this is the most extreme example of performance gains. The performance increases in almost all scenarios if limit is set, but the effects are generally not this big. Going from fetching 250 samples 8 times to fetching 250 samples 7 times and 173 once doesn't have as big of an impact.
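The arithmetic behind that last example can be made concrete. A hypothetical comparison, assuming a limit of 1923 samples (7 × 250 + 173) with the default batch size of 250:

```python
import math


def requests_before(limit, batch_size=250):
    """Old behaviour: every request asks for a full batch."""
    return [batch_size] * math.ceil(limit / batch_size)


def requests_after(limit, batch_size=250):
    """New behaviour: the last request only asks for what is left."""
    sizes = []
    while limit > 0:
        sizes.append(min(limit, batch_size))
        limit -= sizes[-1]
    return sizes


print(requests_before(1923))  # eight requests of 250 -> 2000 samples requested
print(requests_after(1923))   # seven requests of 250, then one of 173 -> exactly 1923
```

Both variants make eight requests here, so the saving is only the 77 over-fetched samples of the final batch, rather than an entire request.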


Type of change

  • Refactor (change restructuring the codebase without changing functionality)

How Has This Been Tested

See above.

Checklist

  • I have merged the original branch into my forked branch
  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works

  • Tom Aarsen

@tomaarsen
Contributor Author

I've taken it a little further than we originally discussed, i.e. I've also improved the performance if limit > DEFAULT_SCAN_SIZE. However, no new parameters (e.g. batch_size) are introduced yet, as I imagine there should be a discussion on how to introduce that throughout Argilla as a whole.

@codecov

codecov bot commented Feb 28, 2023

Codecov Report

Patch coverage: 98.20% and project coverage change: +0.14 🎉

Comparison is base (5a8bb28) 91.86% compared to head (e7c356f) 92.00%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2434      +/-   ##
===========================================
+ Coverage    91.86%   92.00%   +0.14%     
===========================================
  Files          161      162       +1     
  Lines         7869     7896      +27     
===========================================
+ Hits          7229     7265      +36     
+ Misses         640      631       -9     
| Flag | Coverage Δ |
|---|---|
| pytest | 92.00% <98.20%> (+0.14%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| src/argilla/__init__.py | 86.66% <ø> (ø) |
| src/argilla/server/daos/backend/search/model.py | 100.00% <ø> (ø) |
| src/argilla/server/daos/datasets.py | 95.45% <ø> (-0.06%) ⬇️ |
| ...ver/services/tasks/text_classification/__init__.py | 100.00% <ø> (ø) |
| ...gilla/labeling/text_classification/label_errors.py | 90.12% <83.33%> (-0.13%) ⬇️ |
| src/argilla/client/datasets.py | 74.72% <93.33%> (+0.09%) ⬆️ |
| src/argilla/utils/dependency.py | 96.15% <96.15%> (ø) |
| src/argilla/client/__init__.py | 100.00% <100.00%> (ø) |
| src/argilla/client/apis/datasets.py | 97.24% <100.00%> (+1.05%) ⬆️ |
| src/argilla/client/client.py | 87.13% <100.00%> (-0.42%) ⬇️ |
| ... and 18 more | |


@frascuchon frascuchon added this to the v1.4.0 milestone Feb 28, 2023
Member

@frascuchon frascuchon left a comment


Good work, @tomaarsen !!!

@davidberenstein1957
Member

Hi, @frascuchon @tomaarsen, currently rg.log uses chunk_size. Reviewing this discussion, I am in favour of deprecating this variable and moving over to the more intuitive batch_size. If you don't agree, we should also use chunk_size within this PR.

@tomaarsen
Contributor Author

I'm in favour of batch_size over chunk_size. I think (in a future PR?) we should expose batch_size in both functions.

@dvsrepo
Member

dvsrepo commented Feb 28, 2023

I'm in favor of batch_size too

@davidberenstein1957
Member

Yes, for me it can be the PR where/when/if we actually start using the batch_size within rg.load

@tomaarsen
Contributor Author

I think we're best off keeping that separate from this PR for now, but we should work on that soon. Do you want to raise an issue, @davidberenstein1957? (i.e. one issue with two parts: rename chunk_size to batch_size for rg.log and introduce batch_size for rg.load)

@frascuchon
Member

frascuchon commented Feb 28, 2023

Yes, I prefer to keep PRs as simple as possible. We can tackle the batch_size in 2 separate PRs:

  • Deprecate the current chunk_size in favour of batch_size for rg.load
  • Add a batch_size argument to rg.load to load data (also adding this new argument to the datasets.scan method)

UPDATE: this is the same as what @tomaarsen wrote an hour ago 😄

@tomaarsen
Contributor Author

Great minds think alike ;)

@frascuchon frascuchon merged commit f5834a5 into argilla-io:develop Mar 1, 2023
@tomaarsen tomaarsen deleted the refactor/prevent_unnecessary_search branch March 1, 2023 13:04
frascuchon pushed a commit that referenced this pull request Mar 6, 2023
Closes #2454 

Hello!

## Pull Request overview
* Extend my previous work to expose a `batch_size` parameter to `rg.load` (and `rg.scan`, `_load_records_new_fashion`).
  * Update the docstrings for the public methods.
* The actual implementation is very simple due to my earlier work in #2434:
  * `batch_size = self.DEFAULT_SCAN_SIZE` -> `batch_size = batch_size or self.DEFAULT_SCAN_SIZE`
* Add an (untested!) warning if `batch_size` is supplied with an old server version.
  * (We don't have compatibility tests for old server versions)
* I've simplified `test_scan_efficient_limiting` with `batch_size` rather than using a mocker to update `DEFAULT_SCAN_SIZE`.
* I've also extended this test to cover `rg.load` in addition to `active_api().datasets.scan`.

## Note
Contrary to what was previously discussed with @davidberenstein1957, I have *not* added a warning like so:

https://github.com/argilla-io/argilla/blob/f5834a5408051bf1fa60d42aede6b325dc07fdbd/src/argilla/client/client.py#L338-L343

The server limits the maximum batch size to 500 for loading, so there is no need to have a warning when using a batch size of over 5000.

## Discussion
1. Should I include in the docstring that the default batch size is 250? That would be "hardcoded" into the docs, so if we ever change the default (`self.DEFAULT_SCAN_SIZE`), then we would have to remember to update the docs too.
2. Alternatively, should we deprecate `self.DEFAULT_SCAN_SIZE` and enforce that `batch_size` must be set for `datasets.scan`, with a default size of 250 everywhere? Then people can see the default batch size in the signature.

I think we should do option 2. Let me know how you feel.

---

**Type of change**

- [x] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

Modified and updated a test, ran all tests.

**Checklist**

- [x] I have merged the original branch into my forked branch
- [x] I added relevant documentation
- [x] My code follows the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works

---

- Tom Aarsen
@frascuchon frascuchon mentioned this pull request Mar 8, 2023
frascuchon added a commit that referenced this pull request Mar 9, 2023
# [1.4.0](v1.3.1...v1.4.0) (2023-03-09)

### Features

* `configure_dataset` accepts a workspace as argument ([#2503](#2503)) ([29c9ee3](29c9ee3))
* Add `active_client` function to main argilla module ([#2387](#2387)) ([4e623d4](4e623d4)), closes [#2183](#2183)
* Add text2text support for prepare for training spark nlp ([#2466](#2466)) ([21efb83](21efb83)), closes [#2465](#2465) [#2482](#2482)
* Allow passing workspace as client param for `rg.log` or `rg.load` ([#2425](#2425)) ([b3b897a](b3b897a)), closes [#2059](#2059)
* Bulk annotation improvement ([#2437](#2437)) ([3fce915](3fce915)), closes [#2264](#2264)
* Deprecate `chunk_size` in favor of `batch_size` for `rg.log` ([#2455](#2455)) ([3ebea76](3ebea76)), closes [#2453](#2453)
* Expose `batch_size` parameter for `rg.load` ([#2460](#2460)) ([e25be3e](e25be3e)), closes [#2454](#2454) [#2434](#2434)
* Extend shortcuts to include alphabet for token classification ([#2339](#2339)) ([4a92b35](4a92b35))


### Bug Fixes

* added flexible app redirect to docs page ([#2428](#2428)) ([5600301](5600301)), closes [#2377](#2377)
* added regex match to set workspace method ([#2427](#2427)) ([d789fa1](d789fa1)), closes [#2388]
* error when loading record with empty string query ([#2429](#2429)) ([fc71c3b](fc71c3b)), closes [#2400](#2400) [#2303](#2303)
* Remove extra-action dropdown state after navigation ([#2479](#2479)) ([9328994](9328994)), closes [#2158](#2158)


### Documentation

* Add AutoTrain to readme ([7199780](7199780))
* Add migration to label schema section ([#2435](#2435)) ([d57a1e5](d57a1e5)), closes [#2003](#2003)
* Adds zero+few shot tutorial with SetFit ([#2409](#2409)) ([6c679ad](6c679ad))
* Update readme with quickstart section and new links to guides ([#2333](#2333)) ([91a77ad](91a77ad))


## As always, thanks to our amazing contributors!
- Documentation update: adding missing n (#2362) by @Gnonpi
- feat: Extend shortcuts to include alphabet for token classification
(#2339) by @cceyda