[FEATURE] Review record flatten process #4936

frascuchon · 2024-05-24T13:32:52Z

notes:

user id -> username
response status?
"text" -> "fields.text" ?
"metadata_key" -> "metadata.metadata_key"?
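The key mappings floated in these notes could be sketched roughly as follows. This is a minimal illustration only, not the SDK's implementation: `flatten_record` is a hypothetical helper, and the dotted `fields.` / `metadata.` prefixes are simply the naming proposed above.

```python
from typing import Any, Dict, Tuple


def flatten_record(
    record: Dict[str, Any],
    sections: Tuple[str, ...] = ("fields", "metadata"),
) -> Dict[str, Any]:
    """Hypothetical helper: lift nested sections into dotted top-level keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if key in sections and isinstance(value, dict):
            # "text" inside "fields" becomes "fields.text", and so on.
            for inner_key, inner_value in value.items():
                flat[f"{key}.{inner_key}"] = inner_value
        else:
            # Everything else (e.g. "id") is copied through unchanged.
            flat[key] = value
    return flat


nested = {"id": "1", "fields": {"text": "Hello"}, "metadata": {"metadata_key": "v"}}
flat = flatten_record(nested)
```

Prefixing each key with its section name avoids collisions when a field and a metadata property share the same name.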

frascuchon · 2024-06-04T09:38:06Z

I think we should double-check the generated structure of to_json or to_list so we're not missing anything.

burtenshaw · 2024-06-11T08:55:14Z

Current nested structure

next(iter(dataset.records(with_responses=True))).to_dict()

{'id': UUID('25b8ca21-d0fb-4135-815d-0887393007b8'),
 'fields': {'post': 'Another clown in favour of more tax in this country. Blows my mind people can be this stupid.'},
 'metadata': {},
 'suggestions': {'is_toxic': {'value': '1', 'score': None, 'agent': None},
  'toxic_spans': {'value': [{'label': 'insult',
     'start': 86,
     'end': 92,
     'score': 0.6666666666666666},
    {'label': 'insult', 'start': 8, 'end': 13, 'score': 0.6666666666666666}],
   'score': None,
   'agent': None}},
 'responses': defaultdict(list, {}),
 'vectors': {},
 '_server_id': '4836f966-0a7f-4d6e-b017-398270265c95'}

Current Flattened structure

from pprint import pprint

pprint(dataset.records.to_list(flatten=False))

[{'_server_id': '0f3fd360-7776-464e-82a2-9654da527212',
  'fields': {'question': 'What is the capital of France?'},
  'id': '1',
  'metadata': {},
  'responses': defaultdict(<class 'list'>, {}),
  'suggestions': {'answer': {'agent': None, 'score': None, 'value': 'F'}},
  'vectors': {}},
 {'_server_id': '5aa81ed0-3cd2-4208-bae0-b8fb0ba8fd1d',
  'fields': {'question': 'What is the capital of Germany?'},
  'id': '2',
  'metadata': {},
  'responses': defaultdict(<class 'list'>, {}),
  'suggestions': {'answer': {'agent': None, 'score': None, 'value': 'Berlin'}},
  'vectors': {}}]

sdiazlor · 2024-06-18T08:22:35Z

@frascuchon @burtenshaw This is the issue I mentioned regarding the to_list flattening and the responses not being added. #5042

frascuchon · 2024-06-28T13:20:18Z

Also, @MoritzLaurer found errors when creating HF datasets from partially annotated records. We need to review this to:

- Support exporting a list of partially annotated records
- Simplify how response values are read (the generated columns currently include the user.id)
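As a rough sketch of the first point, and assuming the errors come from records ending up with different sets of columns (the thread does not state the cause), a flattened export could be normalized so every record carries the same keys. `normalize_records` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Dict, List


def normalize_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: give every flattened record the same set of keys."""
    columns: List[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    # Columns missing from a record are filled with None.
    return [{column: record.get(column) for column in columns} for record in records]


partially_annotated = [
    {"id": "1", "question": "What is the capital of France?", "answer.responses": ["Paris"]},
    {"id": "2", "question": "What is the capital of Germany?"},
]
normalized = normalize_records(partially_annotated)
```

Filling the gaps with `None` keeps each column aligned across records, which is generally what tabular consumers expect.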

…st (#5137)

This PR changes the structure generated by `to_list(flatten=True)` to simplify reading responses. The response content is split into values and users, so no user ID is defined as part of the column name.

The result for the following record:

```python
record = rg.Record(
    fields={"field": "The field"},
    metadata={"key": "value"},
    responses=[
        rg.Response(question_name="q1", value="value", user_id=user_a),
        rg.Response(question_name="q2", value="value", user_id=user_a),
        rg.Response(question_name="q2", value="value", user_id=user_b),
        rg.Response(question_name="q1", value="value", user_id=user_c),
    ],
    suggestions=[
        rg.Suggestion(question_name="q1", value="value", score=0.1, agent="test"),
        rg.Suggestion(question_name="q2", value="value", score=0.9),
    ],
)
```

is:

```python
{
    "id": <record_id>,
    "_server_id": None,
    "field": "The field",
    "key": "value",
    "q1.responses": ["value", "value"],
    "q1.responses.users": [str(user_a), str(user_c)],
    "q2.responses": ["value", "value"],
    "q2.responses.users": [str(user_a), str(user_b)],
    "q1.suggestion": "value",
    "q1.suggestion.score": 0.1,
    "q1.suggestion.agent": "test",
    "q2.suggestion": "value",
    "q2.suggestion.score": 0.9,
    "q2.suggestion.agent": None,
}
```

Refs #4936

**Type of change**

- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- follows the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: burtenshaw <[email protected]>
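The split of responses into values and users described above can be illustrated with plain Python. This is only a sketch under stated assumptions: `flatten_responses` is a hypothetical helper, responses are modeled as `(question_name, value, user_id)` tuples rather than `rg.Response` objects, and the actual code in the PR may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def flatten_responses(responses: List[Tuple[str, Any, str]]) -> Dict[str, List[Any]]:
    """Hypothetical helper: group response values and user ids per question."""
    flat: Dict[str, List[Any]] = defaultdict(list)
    for question_name, value, user_id in responses:
        # Values and users go into parallel lists, so no user id
        # ever appears in a column name.
        flat[f"{question_name}.responses"].append(value)
        flat[f"{question_name}.responses.users"].append(str(user_id))
    return dict(flat)


responses = [
    ("q1", "value", "user_a"),
    ("q2", "value", "user_a"),
    ("q2", "value", "user_b"),
    ("q1", "value", "user_c"),
]
flat = flatten_responses(responses)
```

Because the two lists for a question are appended in lockstep, the value at index `i` of `q1.responses` belongs to the user at index `i` of `q1.responses.users`.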

nataliaElv transferred this issue from argilla-io/argilla-python Jun 4, 2024

nataliaElv added this to the v2.0.0 milestone Jun 4, 2024

burtenshaw self-assigned this Jun 10, 2024

nataliaElv added the area: python sdk label Jun 11, 2024

frascuchon self-assigned this Jul 1, 2024

This was referenced Jul 1, 2024

[ENHANCEMENT] argilla: simplify structure for flatten records to list #5137

Merged

[BUGFIX] argilla: normalize records when exporting flatten #5138

Merged

frascuchon closed this as completed in #5138 Jul 3, 2024

frascuchon closed this as completed in 237034e Jul 3, 2024
