# Introduce a `batch_size` argument for `rg.load` (#2454)
This was referenced Mar 2, 2023.
frascuchon pushed a commit that referenced this issue on Mar 6, 2023:
Closes #2454

Hello!

## Pull Request overview
* Extend my previous work to expose a `batch_size` parameter on `rg.load` (and `rg.scan`, `_load_records_new_fashion`).
* Update the docstrings for the public methods.
* The actual implementation is very simple due to my earlier work in #2434:
  * `batch_size = self.DEFAULT_SCAN_SIZE` -> `batch_size = batch_size or self.DEFAULT_SCAN_SIZE`
* Add an (untested!) warning if `batch_size` is supplied with an old server version.
  * (We don't have compatibility tests for old server versions.)
* I've simplified `test_scan_efficient_limiting` with `batch_size` rather than using a mocker to update `DEFAULT_SCAN_SIZE`.
* I've also extended this test to cover `rg.load` in addition to `active_api().datasets.scan`.

## Note
Unlike previously discussed with @davidberenstein1957, I have *not* added a warning like so:
https://github.com/argilla-io/argilla/blob/f5834a5408051bf1fa60d42aede6b325dc07fdbd/src/argilla/client/client.py#L338-L343

The server limits the maximum batch size to 500 for loading, so there is no need for a warning when using a batch size of over 5000.

## Discussion
1. Should I include in the docstring that the default batch size is 250? That would be "hardcoded" into the docs, so if we ever change the default (`self.DEFAULT_SCAN_SIZE`), we would have to remember to update the docs too.
2. Alternatively, should we deprecate `self.DEFAULT_SCAN_SIZE` and enforce that `batch_size` must be set for `datasets.scan`, with a default of 250 everywhere? Then people could see the default batch size in the signature.

I think we should go with option 2. Let me know how you feel.

---

**Type of change**
- [x] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**
Modified and updated a test, ran all tests.

**Checklist**
- [x] I have merged the original branch into my forked branch
- [x] I added relevant documentation
- [x] My code follows the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works

---
- Tom Aarsen
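The one-line change described above, together with the server-side cap mentioned in the Note, can be sketched as a self-contained helper. The names and the client-side clamp are illustrative only; in argilla itself the 500-record cap is enforced by the server, not by the client.

```python
DEFAULT_SCAN_SIZE = 250  # the client's default page size, per the PR
MAX_SERVER_BATCH_SIZE = 500  # the server-side cap mentioned in the Note


def resolve_batch_size(batch_size=None):
    """Fall back to the default when no batch_size is given, then
    model the server's cap by clamping oversized requests."""
    effective = batch_size or DEFAULT_SCAN_SIZE
    return min(effective, MAX_SERVER_BATCH_SIZE)


print(resolve_batch_size())      # 250: the default applies
print(resolve_batch_size(100))   # 100: an explicit value is respected
print(resolve_batch_size(5000))  # 500: oversized requests are capped
```

Because the server already bounds the batch size, no client-side warning for very large values is needed, which is the rationale the PR gives for omitting one.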
Merged. frascuchon added a commit that referenced this issue on Mar 9, 2023:
# [1.4.0](v1.3.1...v1.4.0) (2023-03-09)

### Features
* `configure_dataset` accepts a workspace as argument ([#2503](#2503)) ([29c9ee3](29c9ee3))
* Add `active_client` function to main argilla module ([#2387](#2387)) ([4e623d4](4e623d4)), closes [#2183](#2183)
* Add text2text support for prepare for training spark nlp ([#2466](#2466)) ([21efb83](21efb83)), closes [#2465](#2465) [#2482](#2482)
* Allow passing workspace as client param for `rg.log` or `rg.load` ([#2425](#2425)) ([b3b897a](b3b897a)), closes [#2059](#2059)
* Bulk annotation improvement ([#2437](#2437)) ([3fce915](3fce915)), closes [#2264](#2264)
* Deprecate `chunk_size` in favor of `batch_size` for `rg.log` ([#2455](#2455)) ([3ebea76](3ebea76)), closes [#2453](#2453)
* Expose `batch_size` parameter for `rg.load` ([#2460](#2460)) ([e25be3e](e25be3e)), closes [#2454](#2454) [#2434](#2434)
* Extend shortcuts to include alphabet for token classification ([#2339](#2339)) ([4a92b35](4a92b35))

### Bug Fixes
* Added flexible app redirect to docs page ([#2428](#2428)) ([5600301](5600301)), closes [#2377](#2377)
* Added regex match to set workspace method ([#2427](#2427)) ([d789fa1](d789fa1)), closes [#2388](#2388)
* Error when loading record with empty string query ([#2429](#2429)) ([fc71c3b](fc71c3b)), closes [#2400](#2400) [#2303](#2303)
* Remove extra-action dropdown state after navigation ([#2479](#2479)) ([9328994](9328994)), closes [#2158](#2158)

### Documentation
* Add AutoTrain to readme ([7199780](7199780))
* Add migration to label schema section ([#2435](#2435)) ([d57a1e5](d57a1e5)), closes [#2003](#2003)
* Adds zero+few shot tutorial with SetFit ([#2409](#2409)) ([6c679ad](6c679ad))
* Update readme with quickstart section and new links to guides ([#2333](#2333)) ([91a77ad](91a77ad))

As always, thanks to our amazing contributors!
- Documentation update: adding missing n (#2362) by @Gnonpi
- feat: Extend shortcuts to include alphabet for token classification (#2339) by @cceyda
**Is your feature request related to a problem? Please describe.**
I'd like to be able to specify a `batch_size` argument in `rg.load` to have more influence over potential timeouts.

**Describe the solution you'd like**
I'd like to be able to do:
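A minimal, self-contained sketch of the requested behaviour: an explicit `batch_size` controlling how many records each request fetches, falling back to the current default when omitted. The `scan` helper below is a stand-in for illustration, not argilla's actual API.

```python
DEFAULT_SCAN_SIZE = 250  # the current hard-coded batch size


def scan(num_records, batch_size=None):
    """Yield record indices page by page; batch_size falls back to the default."""
    size = batch_size or DEFAULT_SCAN_SIZE
    for start in range(0, num_records, size):
        yield list(range(start, min(start + size, num_records)))


# 600 records arrive as pages of 250/250/100 by default...
print([len(page) for page in scan(600)])                 # [250, 250, 100]
# ...but an explicit batch_size=500 changes that to 500/100.
print([len(page) for page in scan(600, batch_size=500)])  # [500, 100]
```

Larger pages mean fewer round trips; smaller pages keep each individual request short, which is what gives the caller leverage over timeouts.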
**Describe alternatives you've considered**
Sticking with the current behaviour, i.e. using a batch size of 250 at all times.

**Additional context**
See #2434 (comment)