[FEATURE] Review record flatten process #4936

Closed
frascuchon opened this issue May 24, 2024 · 4 comments · Fixed by #5138
Labels
area: python sdk Indicates that an issue or pull request is related to the Python SDK

@frascuchon
Member

notes:

  • user id -> username
  • response status?
  • "text" -> "fields.text" ?
  • "metadata_key" -> "metadata.metadata_key"?
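
To make the last two notes concrete, here is a minimal sketch in plain Python of what prefixed keys could look like. The `flatten_with_prefix` helper and the sample record are hypothetical and only illustrate the naming question; they are not the SDK's implementation.

```python
from typing import Any


def flatten_with_prefix(
    record: dict[str, Any],
    sections: tuple[str, ...] = ("fields", "metadata"),
) -> dict[str, Any]:
    """Hypothetical helper: prefix nested keys with their section name."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        if key in sections and isinstance(value, dict):
            # "fields": {"text": ...} becomes "fields.text": ...
            for inner_key, inner_value in value.items():
                flat[f"{key}.{inner_key}"] = inner_value
        else:
            # Top-level scalars such as "id" are copied unchanged
            flat[key] = value
    return flat


example = {
    "id": "1",
    "fields": {"text": "hello"},
    "metadata": {"metadata_key": "v"},
}
flattened = flatten_with_prefix(example)
```

Prefixing avoids collisions when a field and a metadata property share a name, at the cost of longer column names.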
@frascuchon
Member Author

I think we should double-check the generated structure of to_json or to_list so we're not missing anything.

@nataliaElv nataliaElv transferred this issue from argilla-io/argilla-python Jun 4, 2024
@nataliaElv nataliaElv added this to the v2.0.0 milestone Jun 4, 2024
@burtenshaw burtenshaw self-assigned this Jun 10, 2024
@nataliaElv nataliaElv added the area: python sdk Indicates that an issue or pull request is related to the Python SDK label Jun 11, 2024
@burtenshaw
Contributor

burtenshaw commented Jun 11, 2024

Current nested structure

next(iter(dataset.records(with_responses=True))).to_dict()
{'id': UUID('25b8ca21-d0fb-4135-815d-0887393007b8'),
 'fields': {'post': 'Another clown in favour of more tax in this country. Blows my mind people can be this stupid.'},
 'metadata': {},
 'suggestions': {'is_toxic': {'value': '1', 'score': None, 'agent': None},
  'toxic_spans': {'value': [{'label': 'insult',
     'start': 86,
     'end': 92,
     'score': 0.6666666666666666},
    {'label': 'insult', 'start': 8, 'end': 13, 'score': 0.6666666666666666}],
   'score': None,
   'agent': None}},
 'responses': defaultdict(list, {}),
 'vectors': {},
 '_server_id': '4836f966-0a7f-4d6e-b017-398270265c95'}

Current Flattened structure

from pprint import pprint

pprint(dataset.records.to_list(flatten=False))
[{'_server_id': '0f3fd360-7776-464e-82a2-9654da527212',
  'fields': {'question': 'What is the capital of France?'},
  'id': '1',
  'metadata': {},
  'responses': defaultdict(<class 'list'>, {}),
  'suggestions': {'answer': {'agent': None, 'score': None, 'value': 'F'}},
  'vectors': {}},
 {'_server_id': '5aa81ed0-3cd2-4208-bae0-b8fb0ba8fd1d',
  'fields': {'question': 'What is the capital of Germany?'},
  'id': '2',
  'metadata': {},
  'responses': defaultdict(<class 'list'>, {}),
  'suggestions': {'answer': {'agent': None, 'score': None, 'value': 'Berlin'}},
  'vectors': {}}]

@sdiazlor
Contributor

@frascuchon @burtenshaw This is the issue I mentioned regarding the to_list flattening and the responses not being added. #5042

@frascuchon
Member Author

Also, @MoritzLaurer found errors when creating HF datasets from partially annotated records. We need to review this to:

  1. Support exporting a list of partially annotated records
  2. Simplify how response values are read (the generated columns currently include the user.id)
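
A rough illustration of why per-user column names are awkward for partially annotated records: each row ends up with a different set of keys. The column name `q1.response.user-a` and the `align_rows` helper below are assumptions made for this sketch, not the SDK's actual naming or API.

```python
from typing import Any


def align_rows(rows: list[dict[str, Any]]) -> tuple[list[str], list[dict[str, Any]]]:
    """Hypothetical helper: pad every row to the union of all columns."""
    columns = sorted({key for row in rows for key in row})
    # Missing values become None so every row has the same schema
    aligned = [{column: row.get(column) for column in columns} for row in rows]
    return columns, aligned


rows = [
    {"id": "1", "q1.response.user-a": "yes"},  # annotated by one user
    {"id": "2"},  # not annotated yet
]
columns, aligned = align_rows(rows)
```

With user IDs in the column names, the number of columns grows with the number of annotators, which is part of what point 2 aims to avoid.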

@frascuchon frascuchon self-assigned this Jul 1, 2024
frascuchon added a commit that referenced this issue Jul 3, 2024
…st (#5137)

This PR changes the structure generated by `to_list(flatten=True)` to
simplify reading responses. The response content is split into values
and users, so the user ID is no longer part of the column name.

The result for the following record:

```python
from uuid import uuid4

import argilla as rg

# Placeholder user IDs, for illustration only
user_a, user_b, user_c = uuid4(), uuid4(), uuid4()

record = rg.Record(
    fields={"field": "The field"},
    metadata={"key": "value"},
    responses=[
        rg.Response(question_name="q1", value="value", user_id=user_a),
        rg.Response(question_name="q2", value="value", user_id=user_a),
        rg.Response(question_name="q2", value="value", user_id=user_b),
        rg.Response(question_name="q1", value="value", user_id=user_c),
    ],
    suggestions=[
        rg.Suggestion(question_name="q1", value="value", score=0.1, agent="test"),
        rg.Suggestion(question_name="q2", value="value", score=0.9),
    ],
)
```
is:
```python
{
    "id": <record_id>,
    "_server_id": None,
    "field": "The field",
    "key": "value",
    "q1.responses": ["value", "value"],
    "q1.responses.users": [str(user_a), str(user_c)],
    "q2.responses": ["value", "value"],
    "q2.responses.users": [str(user_a), str(user_b)],
    "q1.suggestion": "value",
    "q1.suggestion.score": 0.1,
    "q1.suggestion.agent": "test",
    "q2.suggestion": "value",
    "q2.suggestion.score": 0.9,
    "q2.suggestion.agent": None,
}
```
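
As a sketch of how a consumer might read this structure, the values and users lists can be paired by position. The `responses_by_user` helper and the string user IDs below are hypothetical; the sketch only assumes the two lists are aligned index by index, as the example above suggests.

```python
from typing import Any


def responses_by_user(row: dict[str, Any], question_name: str) -> dict[str, Any]:
    """Hypothetical helper: map each user ID to that user's response value."""
    values = row.get(f"{question_name}.responses") or []
    users = row.get(f"{question_name}.responses.users") or []
    # Assumes values[i] was submitted by users[i]
    return dict(zip(users, values))


row = {
    "q1.responses": ["value", "value"],
    "q1.responses.users": ["user-a", "user-c"],
    "q2.responses": ["value", "value"],
    "q2.responses.users": ["user-a", "user-b"],
}
q1_answers = responses_by_user(row, "q1")
unanswered = responses_by_user(row, "q3")  # question with no responses
```

Because the column names no longer depend on who answered, every record shares the same keys even when some questions have no responses.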

Refs #4936

**Type of change**

- Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- follows the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: burtenshaw <[email protected]>