Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENHANCEMENT] [REFACTOR] optimise and refactor SDK ingestion methods #5107

Merged
merged 37 commits into from
Jul 4, 2024

Conversation

burtenshaw
Copy link
Contributor

@burtenshaw burtenshaw commented Jun 25, 2024

This PR refactors refactors the ingestion flow in DatasetRecords by implementing a new class and module IngestedRecordMapper

This PR supports mapping incoming columns/keys to dataset attributes in these two ways:

  • supports tuple values in the mapping parameter of the log method so that user can specify the two attributes as a tuple.
  • refactors in the _ingest_records methods so that mapping is performed once before the ingestion loop instead of during.

This PR also optimises the log method so that it takes less time and is easier to work with:

  • uses tqdm to log status
  • uses exception to show bad records
  • iterates over the map not the data
Screenshot 2024-06-26 at 09 09 11
  • Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

  • tests have been modified, deprecated, and updated to support changes in the ingestion flow

Checklist

  • I added relevant documentation
  • follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • I confirm My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@burtenshaw burtenshaw changed the title [FEAT] map incoming columns to multiple dataset attributes [ENHANCEMENT] optimise SDK log method and support mapping incoming columns to multiple dataset attributes Jun 26, 2024
@nataliaElv
Copy link
Member

I'd rather have the records ingested and pushed in batches and have an easy way to identify those that threw an error, fix them and try to import those again.
Otherwise it can take ages until I see any records in my dataset.
Captura de pantalla 2024-06-26 a las 9 23 42

Copy link
Member

@davidberenstein1957 davidberenstein1957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some initial comments.

argilla/src/argilla/records/_dataset_records.py Outdated Show resolved Hide resolved
argilla/src/argilla/records/_dataset_records.py Outdated Show resolved Hide resolved
argilla/src/argilla/records/_dataset_records.py Outdated Show resolved Hide resolved
argilla/src/argilla/records/_dataset_records.py Outdated Show resolved Hide resolved
@burtenshaw burtenshaw marked this pull request as ready for review July 2, 2024 15:21
burtenshaw and others added 11 commits July 3, 2024 13:43
…g dict (#5151)

<!-- Please include a summary of the changes and the related issue.
Please also include relevant motivation and context. List any
dependencies that are required for this change. -->

Closes #<issue_number>

**Type of change**
<!-- Please delete options that are not relevant. Remember to title the
PR according to the type of change -->

- Refactor (change restructuring the codebase without changing
functionality)

**How Has This Been Tested**
<!-- Please add some reference about how your feature has been tested.
-->

**Checklist**
<!-- Please go over the list and make sure you've taken everything into
account -->

- I added relevant documentation
- follows the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
@frascuchon frascuchon merged commit 8c808f5 into develop Jul 4, 2024
7 checks passed
@frascuchon frascuchon deleted the spike/mapping-to-tuple branch July 4, 2024 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants