[ENHANCEMENT] [REFACTOR] optimise and refactor SDK ingestion methods #5107

burtenshaw · 2024-06-25T18:20:43Z

This PR refactors the ingestion flow in DatasetRecords by introducing a new class, IngestedRecordMapper, in its own module.

This PR supports mapping incoming columns/keys to dataset attributes in these two ways:

Supports tuple values in the mapping parameter of the log method, so that users can map one incoming column to multiple attributes by specifying them as a tuple.
Refactors the _ingest_records methods so that mapping is performed once, before the ingestion loop, instead of during it.
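
A minimal sketch of both points, assuming the mapping is a plain dict whose values are either a string or a tuple of strings. The helpers `normalise_mapping` and `apply_mapping` and the column names are hypothetical and only illustrate the idea; they are not the SDK's implementation.

```python
from typing import Dict, Tuple, Union

# Illustrative only: hypothetical helpers, not the SDK's code.
Mapping = Dict[str, Union[str, Tuple[str, ...]]]


def normalise_mapping(mapping: Mapping) -> Dict[str, Tuple[str, ...]]:
    """Resolve the mapping once so every source key points to a tuple of attributes."""
    return {
        source: (targets,) if isinstance(targets, str) else tuple(targets)
        for source, targets in mapping.items()
    }


def apply_mapping(row: dict, resolved: Dict[str, Tuple[str, ...]]) -> dict:
    """Copy each mapped source value to all of its target attributes."""
    mapped = {}
    for source, targets in resolved.items():
        if source in row:
            for target in targets:
                mapped[target] = row[source]
    return mapped


# The mapping is resolved once, before the loop, instead of once per record.
resolved = normalise_mapping(
    {"my_prompt": ("prompt_field", "prompt_question"), "answer": "label"}
)
rows = [{"my_prompt": "Hello", "answer": "greeting"}]
mapped_rows = [apply_mapping(row, resolved) for row in rows]
```

Here the single `my_prompt` column feeds two attributes, while `answer` keeps the plain string form.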

This PR also optimises the log method so that it takes less time and is easier to work with:

Uses tqdm to report progress.
Raises an exception to report bad records.
Iterates over the mapping rather than the data.
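
The three points above could look roughly like the following sketch. It assumes a flat string-to-string mapping and treats a missing key as an error; `RecordIngestionError` and `ingest` are hypothetical names, not the SDK's actual classes or methods.

```python
from typing import Dict, Iterable, List

try:
    from tqdm import tqdm
except ImportError:  # keep the sketch runnable when tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable


class RecordIngestionError(ValueError):
    """Hypothetical exception that carries the offending record and its index."""

    def __init__(self, index: int, record: object, reason: str) -> None:
        super().__init__(f"Record {index} could not be ingested: {reason}")
        self.index = index
        self.record = record


def ingest(rows: Iterable[dict], mapping: Dict[str, str]) -> List[dict]:
    """Build mapped records, looping over the mapping instead of each row's keys."""
    ingested = []
    for index, row in enumerate(tqdm(rows, desc="Ingesting records")):
        if not isinstance(row, dict):
            raise RecordIngestionError(index, row, "expected a dict")
        record = {}
        # Only mapped keys are visited, so unmapped columns cost nothing.
        for source, target in mapping.items():
            if source not in row:
                raise RecordIngestionError(index, row, f"missing key '{source}'")
            record[target] = row[source]
        ingested.append(record)
    return ingested


records = ingest([{"text": "a", "extra": 1}], {"text": "prompt"})
```

Raising a dedicated exception that holds the record makes it possible to catch the error and inspect `err.record` rather than parse a log message.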

Type of change

Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

tests have been modified, deprecated, and updated to support changes in the ingestion flow

Checklist

I added relevant documentation
follows the style guidelines of this project
I did a self-review of my code
I made corresponding changes to the documentation
I confirm My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

nataliaElv · 2024-06-26T07:35:33Z

I'd rather have the records ingested and pushed in batches and have an easy way to identify those that threw an error, fix them and try to import those again.
Otherwise it can take ages until I see any records in my dataset.
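
One way to sketch this batched flow, assuming a `push` callable that uploads a whole batch atomically and raises on failure. `push_in_batches`, `batched`, and `fake_push` are hypothetical names for illustration and are not part of the SDK.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def batched(records: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]


def push_in_batches(
    records: Sequence[dict],
    push: Callable[[Sequence[dict]], None],
    batch_size: int = 256,
) -> List[Tuple[dict, Exception]]:
    """Push records batch by batch and return the records that failed."""
    failed: List[Tuple[dict, Exception]] = []
    for batch in batched(records, batch_size):
        try:
            push(batch)
        except Exception:
            # Assumes a failed batch pushed nothing: retry one record at a
            # time so the valid records land and the bad ones are isolated.
            for record in batch:
                try:
                    push([record])
                except Exception as error:
                    failed.append((record, error))
    return failed


# Stand-in for a real upload call, used only to exercise the sketch.
pushed: List[dict] = []


def fake_push(batch: Sequence[dict]) -> None:
    if any("text" not in record for record in batch):
        raise ValueError("missing 'text'")
    pushed.extend(batch)


failed = push_in_batches([{"text": "a"}, {"id": 2}, {"text": "c"}], fake_push, batch_size=2)
```

The returned list pairs each failed record with its error, so the records can be fixed and passed back to the same function.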

davidberenstein1957

Left some initial comments.

argilla/src/argilla/records/_dataset_records.py

Reviewing and improving records.log

Instead of:

<img width="1335" alt="Captura de pantalla 2024-06-26 a las 12 48 14" src="https://github.com/argilla-io/argilla/assets/2518789/02283f4c-fe6a-464f-96b3-36853e6c7622">

for 50 records, records.log can log 1000:

<img width="870" alt="Captura de pantalla 2024-06-26 a las 12 48 57" src="https://github.com/argilla-io/argilla/assets/2518789/d20f0469-0b33-427e-aa12-b4b7e1d40cd1">

argilla/docs/how_to_guides/record.md

argilla/src/argilla/records/_dataset_records.py

argilla/src/argilla/records/_mapping.py

…g dict (#5151)

Closes #<issue_number>

**Type of change**

- Refactor (change restructuring the codebase without changing functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- follows the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

burtenshaw added 7 commits June 24, 2024 20:29

test: update tests for refactored mapping method

10965d3

refactor: introduce independent mapping method and move logic to before ingestion loop

a416a2f

docs: update all doc strings in dataset records

35db9f6

chore: improve typing and docs on type

eae088b

docs: wrong method in records api reference

4490d11

feat: add exception for record ingestion

b5b3396

refactor: improve explainability and readability in ingestion code with refactor and tqdm and exceptions

ffeb0b0

burtenshaw changed the title ~~[FEAT] map incoming columns to multiple dataset attributes~~ [ENHANCEMENT] optimise SDK log method and support mapping incoming columns to multiple dataset attributes Jun 26, 2024

burtenshaw requested review from frascuchon, nataliaElv and davidberenstein1957 June 26, 2024 07:12

pre-commit-ci bot and others added 2 commits June 26, 2024 07:13

[pre-commit.ci] auto fixes from pre-commit.com hooks

594283e

for more information, see https://pre-commit.ci

enhancement: move mapping out of record loop

16f14d1

davidberenstein1957 reviewed Jun 26, 2024

frascuchon reviewed Jun 26, 2024

argilla/src/argilla/records/_dataset_records.py (outdated, resolved)

frascuchon reviewed Jun 26, 2024

argilla/src/argilla/records/_dataset_records.py (outdated, resolved)

frascuchon reviewed Jun 26, 2024

argilla/src/argilla/records/_dataset_records.py (outdated, resolved)

frascuchon and others added 7 commits June 27, 2024 10:43

Merge branch 'spike/mapping-to-tuple' of https://github.com/argilla-io/argilla into spike/mapping-to-tuple

5f06e20

enhancement: use just one progress bar

05df51a

chore: update typing of mapping

863dde2

fix: move render mapping into infer record method

bf9e864

[pre-commit.ci] auto fixes from pre-commit.com hooks

07aa249

for more information, see https://pre-commit.ci

fix: align add records parameters with render function

8a6d484

frascuchon mentioned this pull request Jun 28, 2024

[REFACTOR] argilla: Avoid autofetch when accessing settings #5130

Closed

burtenshaw marked this pull request as draft June 28, 2024 12:33

burtenshaw added 2 commits July 2, 2024 15:21

feat: implement ingestion mapping as class

0b623fd

feat: use ingestion mapping class in dataset records not dataset records

14faccf

burtenshaw force-pushed the spike/mapping-to-tuple branch from 06764aa to 14faccf Compare July 2, 2024 13:23

burtenshaw marked this pull request as ready for review July 2, 2024 15:21

burtenshaw and others added 6 commits July 2, 2024 17:23

chore: tidy imports

8889c0a

Merge branch 'spike/mapping-to-tuple' of https://github.com/argilla-io/argilla into spike/mapping-to-tuple

7ad5075

docs: update mapping parameters in how to guides

63e0f7b

test: broaden suggestion mapping in test

ecbdd4e

feat: extract dot notation with regex not string splitting

99235b2

[pre-commit.ci] auto fixes from pre-commit.com hooks

3ca8932

for more information, see https://pre-commit.ci