
Introduce a batch_size argument for rg.load #2454

Closed
tomaarsen opened this issue Mar 2, 2023 · 0 comments · Fixed by #2460
Is your feature request related to a problem? Please describe.
I'd like to be able to specify a batch_size argument in rg.load to have more influence over potential timeouts.

Describe the solution you'd like
I'd like to be able to do:

data = rg.load(name, batch_size=64)

Describe alternatives you've considered
Sticking with the current behaviour, i.e. using a batch size of 250 at all times.

Additional context
See #2434 (comment)
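The requested call shape boils down to a configurable page size for fetching records. A minimal sketch of the idea (here `iter_batches` and the module-level `DEFAULT_SCAN_SIZE` are illustrative stand-ins, not the actual argilla client API):

```python
DEFAULT_SCAN_SIZE = 250  # the fixed batch size the issue wants to make configurable

def iter_batches(records, batch_size=None):
    """Yield records in pages, as a configurable rg.load could do internally."""
    size = batch_size or DEFAULT_SCAN_SIZE
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

With `batch_size=64`, 200 records would be fetched in four smaller requests (64, 64, 64, 8) rather than one 250-record page, giving the caller control over request size and thus timeout risk.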

@tomaarsen tomaarsen added the type: enhancement Indicates new feature requests label Mar 2, 2023
@tomaarsen tomaarsen self-assigned this Mar 2, 2023
@frascuchon frascuchon added this to the v1.4.0 milestone Mar 2, 2023
frascuchon pushed a commit that referenced this issue Mar 6, 2023
Closes #2454 

Hello!

## Pull Request overview
* Extend my previous work to expose a `batch_size` parameter to `rg.load` (and `rg.scan`, `_load_records_new_fashion`).
  * Update the docstrings for the public methods.
* The actual implementation is very simple due to my earlier work in #2434:
  * `batch_size = self.DEFAULT_SCAN_SIZE` -> `batch_size = batch_size or self.DEFAULT_SCAN_SIZE`
* Add an (untested!) warning if `batch_size` is supplied with an old server version.
  * (We don't have compatibility tests for old server versions.)
* I've simplified `test_scan_efficient_limiting` with `batch_size` rather than using a mocker to update `DEFAULT_SCAN_SIZE`.
* I've also extended this test to test `rg.load` in addition to `active_api().datasets.scan`.

## Note
Unlike previously discussed with @davidberenstein1957, I have *not*
added a warning like so:

https://github.com/argilla-io/argilla/blob/f5834a5408051bf1fa60d42aede6b325dc07fdbd/src/argilla/client/client.py#L338-L343

The server caps the batch size at 500 for loading, so there is no need for a warning when a batch size over 5000 is used.

## Discussion
1. Should I include in the docstring that the default batch size is 250?
That would be "hardcoded" into the docs, so if we ever change the
default (`self.DEFAULT_SCAN_SIZE`), then we would have to remember to
update the docs too.
2. Alternatively, should we deprecate `self.DEFAULT_SCAN_SIZE` and require that `batch_size` always be set for `datasets.scan`, with a default of 250 in every public signature? Then the default batch size would be visible in the signature itself.

I think we should do option 2. Let me know how you feel.
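Option 2 amounts to moving the default into the public signature, along these lines (a hypothetical signature with a stub body, not the merged code):

```python
def scan(dataset: str, batch_size: int = 250):
    """With the default in the signature, generated docs stay in sync with it."""
    # Illustrative stub: report which page size would be requested per call.
    return {"dataset": dataset, "batch_size": batch_size}
```

The trade-off in question 1 then disappears: the docstring no longer needs to hardcode `250`, because introspection and API docs read it straight from the signature.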

---

**Type of change**

- [x] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

Modified and updated a test, ran all tests.

**Checklist**

- [x] I have merged the original branch into my forked branch
- [x] I added relevant documentation
- [x] My code follows the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works

---

- Tom Aarsen
frascuchon added a commit that referenced this issue Mar 9, 2023
# [1.4.0](v1.3.1...v1.4.0)
(2023-03-09)

### Features

* `configure_dataset` accepts a workspace as argument
([#2503](#2503))
([29c9ee3](29c9ee3)),
* Add `active_client` function to main argilla module
([#2387](#2387))
([4e623d4](4e623d4)),
closes [#2183](#2183)
* Add text2text support for prepare for training spark nlp
([#2466](#2466))
([21efb83](21efb83)),
closes [#2465](#2465)
[#2482](#2482)
* Allow passing workspace as client param for `rg.log` or `rg.load`
([#2425](#2425))
([b3b897a](b3b897a)),
closes [#2059](#2059)
* Bulk annotation improvement
([#2437](#2437))
([3fce915](3fce915)),
closes [#2264](#2264)
* Deprecate `chunk_size` in favor of `batch_size` for `rg.log`
([#2455](#2455))
([3ebea76](3ebea76)),
closes [#2453](#2453)
* Expose `batch_size` parameter for `rg.load`
([#2460](#2460))
([e25be3e](e25be3e)),
closes [#2454](#2454)
[#2434](#2434)
* Extend shortcuts to include alphabet for token classification
([#2339](#2339))
([4a92b35](4a92b35))


### Bug Fixes

* added flexible app redirect to docs page
([#2428](#2428))
([5600301](5600301)),
closes [#2377](#2377)
* added regex match to set workspace method
([#2427](#2427))
([d789fa1](d789fa1)),
closes [#2388](#2388)
* error when loading record with empty string query
([#2429](#2429))
([fc71c3b](fc71c3b)),
closes [#2400](#2400)
[#2303](#2303)
* Remove extra-action dropdown state after navigation
([#2479](#2479))
([9328994](9328994)),
closes [#2158](#2158)


### Documentation

* Add AutoTrain to readme
([7199780](7199780))
* Add migration to label schema section
([#2435](#2435))
([d57a1e5](d57a1e5)),
closes [#2003](#2003)
[#2003](#2003)
* Adds zero+few shot tutorial with SetFit
([#2409](#2409))
([6c679ad](6c679ad))
* Update readme with quickstart section and new links to guides
([#2333](#2333))
([91a77ad](91a77ad))


## As always, thanks to our amazing contributors!
- Documentation update: adding missing n (#2362) by @Gnonpi
- feat: Extend shortcuts to include alphabet for token classification
(#2339) by @cceyda