feat: adding utils module and functions #4121

Merged
Changes from 23 commits
30 commits
bc96798
to_html and highlight_tokens
sdiazlor Nov 2, 2023
2fca7e5
reorganization of the module
sdiazlor Nov 6, 2023
eebb43a
organization
sdiazlor Nov 6, 2023
4a29941
new_changes
sdiazlor Nov 7, 2023
e821ebb
definition image/audio/video to html
sdiazlor Nov 7, 2023
780eea7
create_token_highlights rm None
sdiazlor Nov 7, 2023
31eb5f2
making-most-of-markdown-tutorial
sdiazlor Nov 13, 2023
cdafe10
updating tutorial and adding photos
sdiazlor Nov 14, 2023
f07adb7
markdown reference in the documentation
sdiazlor Nov 14, 2023
f86c539
fix private HF reference
sdiazlor Nov 16, 2023
d8410e9
metadata reference in fields and questions
sdiazlor Nov 16, 2023
3b785ef
update utils and utils_test
sdiazlor Nov 20, 2023
eff602b
logic: assig_ records (include info for docs)
sdiazlor Nov 20, 2023
3bec76e
Merge branch 'develop' into feat/4030-feature-add-a-utils-module-to-t…
sdiazlor Nov 20, 2023
8a25e8a
add .py extension
sdiazlor Nov 20, 2023
650aa0a
Merge remote-tracking branch 'origin/develop' into feat/4030-feature-…
sdiazlor Nov 22, 2023
6f1544b
renaming to avoid circular import
sdiazlor Nov 23, 2023
1b4cd97
mv in dev to /dataset as helpers and test_helpers
sdiazlor Nov 23, 2023
09f7f98
update init file
sdiazlor Nov 23, 2023
528b9d8
update assignment
sdiazlor Nov 23, 2023
b7a89ef
adding the tests for assignment
sdiazlor Nov 23, 2023
d88b98d
adding warnings markdown
sdiazlor Nov 23, 2023
157ebdb
update changelog
sdiazlor Nov 23, 2023
2f0b1a2
reference to create_token_highlights
sdiazlor Nov 27, 2023
e86e161
image create_token_highlights
sdiazlor Nov 27, 2023
c026b40
update changelog
sdiazlor Nov 27, 2023
df6624e
update image-support
sdiazlor Nov 27, 2023
bfbb5b0
update assign_records
sdiazlor Nov 27, 2023
aa06a8d
update according to comments
sdiazlor Nov 27, 2023
33158df
Update assign_records.md
davidberenstein1957 Nov 27, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -22,6 +22,7 @@ These are the section headers that we use:
- Added `get_model_kwargs`, `get_trainer_kwargs`, `get_trainer_model`, `get_trainer_tokenizer` and `get_trainer` -methods to the `ArgillaTrainer` to improve interoperability across frameworks. ([#4214](https://github.com/argilla-io/argilla/pull/4214)).
- Added additional formatting checks to the `ArgillaTrainer` to allow for better interoperability of `defaults` and `formatting_func` usage. ([#4214](https://github.com/argilla-io/argilla/pull/4214)).
- Added a warning to the `update_config`-method of `ArgillaTrainer` to emphasize if the `kwargs` were updated correctly. ([#4214](https://github.com/argilla-io/argilla/pull/4214)).
- Added `argilla.client.feedback.utils` module with `html_utils` and `assignments`. ([#4121](https://github.com/argilla-io/argilla/pull/4121))

### Fixed

12 changes: 10 additions & 2 deletions docs/_source/practical_guides/create_dataset.md
@@ -51,7 +51,11 @@ You can define the fields using the Python SDK providing the following arguments
- `name`: The name of the field, as it will be seen internally.
- `title` (optional): The name of the field, as it will be displayed in the UI. Defaults to the `name` value, but capitalized.
- `required` (optional): Whether the field is required or not. Defaults to `True`. Note that at least one field must be required.
- `use_markdown`(optional): Specify whether you want markdown rendered in the UI. Defaults to `False`.
- `use_markdown` (optional): Specify whether you want markdown rendered in the UI. Defaults to `False`. If you set it to `True`, you will be able to use all the Markdown features for text formatting and embedded multimedia content. To delve further into the details, please refer to this [tutorial](/tutorials/notebooks/making-most-of-markdown.ipynb).

```{note}
Multimedia in Markdown is here, but it's still in the experimental phase. As we navigate the early stages, there are limits on file sizes due to Elasticsearch constraints, and the visualization and loading times may vary depending on your browser. We're on the case to improve this and welcome your feedback and suggestions!
```

```python
fields = [
@@ -86,7 +90,11 @@ The following arguments apply to specific question types:
- `values`: In the `RatingQuestion` this will be any list of unique integers that represent the options that annotators can choose from. These values must be defined in the range [1, 10]. In the `RankingQuestion`, values will be a list of strings with the options they will need to rank. If you'd like the text of the options to be different in the UI and internally, you can pass a dictionary instead where the key is the internal name and the value is the text to display in the UI.
- `labels`: In `LabelQuestion` and `MultiLabelQuestion` this is a list of strings with the options for these questions. If you'd like the text of the labels to be different in the UI and internally, you can pass a dictionary instead where the key is the internal name and the value the text to display in the UI.
- `visible_labels` (optional): In `LabelQuestion` and `MultiLabelQuestion` this is the number of labels that will be visible in the UI. By default, the UI will show 20 labels and collapse the rest. Set your preferred number to change this limit or set `visible_labels=None` to show all options.
- `use_markdown` (optional): In `TextQuestion` define whether the field should render markdown text. Defaults to `False`.
- `use_markdown` (optional): In `TextQuestion` define whether the field should render markdown text. Defaults to `False`. If you set it to `True`, you will be able to use all the Markdown features for text formatting and embedded multimedia content. To delve further into the details, please refer to this [tutorial](/tutorials/notebooks/making-most-of-markdown.ipynb).

```{note}
Multimedia in Markdown is here, but it's still in the experimental phase. As we navigate the early stages, there are limits on file sizes due to Elasticsearch constraints, and the visualization and loading times may vary depending on your browser. We're on the case to improve this and welcome your feedback and suggestions!
```

Check out the following tabs to learn how to set up questions according to their type:

1,083 changes: 1,083 additions & 0 deletions docs/_source/tutorials/notebooks/making-most-of-markdown.ipynb

Large diffs are not rendered by default.

39 changes: 39 additions & 0 deletions src/argilla/client/feedback/utils/__init__.py
@@ -0,0 +1,39 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from argilla.client.feedback.utils.assignment import (
    assign_records,
    assign_records_to_groups,
    assign_records_to_individuals,
    assign_workspaces,
    check_user,
    check_workspace,
)
from argilla.client.feedback.utils.html_utils import (
    audio_to_html,
    create_token_highlights,
    image_to_html,
    media_to_html,
    video_to_html,
)

__all__ = [
    "audio_to_html",
    "video_to_html",
    "image_to_html",
    "create_token_highlights",
    "assign_records",
    "assign_workspaces",
]
262 changes: 262 additions & 0 deletions src/argilla/client/feedback/utils/assignment.py
@@ -0,0 +1,262 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import random
import warnings
from collections import defaultdict
from typing import Any, Dict, List, Union

from rich.progress import Progress

from argilla.client.users import User
from argilla.client.workspaces import Workspace


def check_user(user_to_check: Union[str, User]) -> User:
    """
    Helper function to check if the input is a User object. If it's a string, it attempts to retrieve the User object.
    If the User does not exist, it creates a new User with a default password and role.

    Args:
        user_to_check: a user object or a string that represents a username

    Returns:
        The User object corresponding to the input.
    """
    if isinstance(user_to_check, User):
        user = user_to_check
    else:
        try:
            user = User.from_name(user_to_check)
        except ValueError:
            user = User.create(username=user_to_check, first_name=user_to_check, password="12345678", role="annotator")
            warnings.warn(
                f"The user {user.username} was created with a default password. We recommend you to change it for security reasons.",
                UserWarning,
            )
    return user


def check_workspace(workspace_to_check: str) -> Workspace:
    """
    Helper function to check if the workspace exists. If it does not exist, it creates a new one.

    Args:
        workspace_to_check: a workspace string name

    Returns:
        The Workspace object corresponding to the input.
    """
    try:
        workspace = Workspace.from_name(workspace_to_check)
    except:
        workspace = Workspace.create(workspace_to_check)
    return workspace


def assign_records_to_groups(
    groups: Dict[str, List[Any]], records: List[Any], overlap: int, shuffle: bool = True
) -> Dict[str, Dict[str, Any]]:
    """
    Assign records to predefined groups with controlled overlap (for the groups) and optional shuffle. All members of the same group will annotate the same records.

    Args:
        groups: A dictionary where keys are group names and values are lists of users names or objects.
        records: A list of records to be assigned.
        overlap: The number of times each record is assigned to consecutive groups (0 for no overlap).
        shuffle: If True, shuffle the records before assignment. Defaults to True.

    Returns:
        A dictionary where each key is a group and its value is another dictionary, which maps usernames to their respective assigned records.

    Raises:
        ValueError: If `overlap` is higher than the number of groups or negative.
    """
    if overlap < 0 or overlap >= len(groups.keys()):

Review comment: Same comments as below about overlap as a percentage and the number per user.

        raise ValueError("Overlap must be less than the number of groups and must not be negative.")

    if len(records) < len(groups.keys()):
        warnings.warn(
            f"The number of groups is higher than the number of records. Some users will not be assigned any records.",
            UserWarning,
        )

    if shuffle:
        random.shuffle(records)

    assignments = {}
    assignments_grouped = {}
    group_names = list(groups.keys())
    num_groups = len(group_names)
    overlap = 1 if overlap == 0 else overlap

    group_records = defaultdict(list)
    with Progress() as progress:
        task = progress.add_task("[green]Processing records...", total=len(records))

        for idx, record in enumerate(records):
            for offset in range(overlap):
                group_index = (idx + offset) % num_groups
                group_name = group_names[group_index]
                group_records[group_name].append(record)

            progress.update(task, advance=1)

    for group, users in groups.items():
        users = [check_user(user) for user in users]
        for user in users:
            assignments[user] = group_records[group]

        assignments_grouped[group] = {user.username: assignments.get(user, []) for user in users}

    return assignments_grouped
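
The round-robin step in `assign_records_to_groups` can be illustrated without an Argilla server. The following is a minimal sketch, not the code from this PR: `round_robin_with_overlap`, the group names and the record labels are hypothetical, and plain strings stand in for `User` objects and records. It assumes the same index rule as the diff, `(idx + offset) % num_groups`, and the same treatment of `overlap=0` as a single copy.

```python
from collections import defaultdict
from typing import Any, Dict, List


def round_robin_with_overlap(names: List[str], records: List[Any], overlap: int) -> Dict[str, List[Any]]:
    """Give each record to `overlap` consecutive names, starting at idx % len(names)."""
    copies = max(overlap, 1)  # as in the diff, overlap=0 behaves like overlap=1
    assigned: Dict[str, List[Any]] = defaultdict(list)
    for idx, record in enumerate(records):
        for offset in range(copies):
            assigned[names[(idx + offset) % len(names)]].append(record)
    return dict(assigned)


# Hypothetical groups and records, used only for illustration.
groups = {"group_a": ["alice", "bob"], "group_b": ["carol"], "group_c": ["dave"]}
records = ["r0", "r1", "r2", "r3"]

per_group = round_robin_with_overlap(list(groups), records, overlap=2)

# Every member of a group receives the group's full record list.
per_user = {
    group: {member: per_group.get(group, []) for member in members}
    for group, members in groups.items()
}
print(per_user)
```

With `overlap=2` each record lands in two neighbouring groups, and members of the same group share one list, which matches the docstring's statement that all members of a group annotate the same records.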


def assign_records_to_individuals(
    users: List[Any], records: List[Any], overlap: int, shuffle: bool = True
) -> Dict[str, List[Any]]:
    """
    Assign records to users with controlled overlap and optional shuffle.

    Args:
        users: A list of user objects, each with a 'username' attribute.
        records: A list of record objects to be assigned to users.
        overlap: The number of times each record is assigned to consecutive users (0 for no overlap).
        shuffle: If True, the records list will be shuffled before assignment. Defaults to True.

    Returns:
        A dictionary where keys are usernames and values are lists of assigned records.

    Raises:
        ValueError: If `overlap` is higher than the number of users or negative.
    """
    if overlap < 0 or overlap >= len(users):

Review comment: I would say overlap is a percentage between 0 and 1, right? Do you think it might be easy to add this? @nataliaElv what do you think?

Reply (Member): I guess you may want to have 2 or 3 annotators giving responses to one record, so this could be greater than 1? But let me know if I'm misunderstanding what the function is doing here.

Reply: You are right @nataliaElv, I thought wrongly about this.

        raise ValueError("Overlap must be less than the number of users and must not be negative.")

    if len(records) < len(users):
        warnings.warn(
            f"The number of users is higher than the number of records. Some users will not be assigned any records.",
            UserWarning,
        )

    if shuffle:
        random.shuffle(records)

    users = [check_user(user) for user in users]
    assignments = {user.username: [] for user in users}

    num_users = len(users)
    overlap = 1 if overlap == 0 else overlap

    with Progress() as progress:
        task = progress.add_task("[green]Processing records...", total=len(records))

        for idx, record in enumerate(records):
            for offset in range(overlap):
                user_index = (idx + offset) % num_users
                user = users[user_index].username
                assignments[user].append(record)

            progress.update(task, advance=1)

    return assignments
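
Two details of `assign_records_to_individuals` are easy to miss: the check rejects an `overlap` equal to the number of users (not only a higher one), and `random.shuffle(records)` reorders the caller's list in place. The sketch below is a standalone variation, not the PR's implementation: `assign_round_robin` and its `seed` parameter are hypothetical, usernames are plain strings, and it deliberately shuffles a copy via `random.Random.sample` so the input list is left untouched.

```python
import random
from typing import Any, Dict, List, Optional


def assign_round_robin(
    usernames: List[str], records: List[Any], overlap: int, shuffle: bool = True, seed: Optional[int] = None
) -> Dict[str, List[Any]]:
    """Round-robin assignment that validates overlap and never mutates `records`."""
    # Same bounds as the diff: overlap must satisfy 0 <= overlap < number of users.
    if overlap < 0 or overlap >= len(usernames):
        raise ValueError("Overlap must be less than the number of users and must not be negative.")

    # sample() returns a new shuffled list; the diff uses random.shuffle, which works in place.
    ordered = random.Random(seed).sample(records, len(records)) if shuffle else list(records)

    copies = max(overlap, 1)  # overlap=0 and overlap=1 both yield one copy per record
    assignments: Dict[str, List[Any]] = {name: [] for name in usernames}
    for idx, record in enumerate(ordered):
        for offset in range(copies):
            assignments[usernames[(idx + offset) % len(usernames)]].append(record)
    return assignments


records = ["r0", "r1", "r2", "r3", "r4"]
result = assign_round_robin(["alice", "bob", "carol"], records, overlap=2, seed=0)
print({name: len(items) for name, items in result.items()})
```

Whether to copy or shuffle in place is a design choice; copying costs one extra list but keeps the caller's record order stable across repeated calls.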


def assign_records(
    users: Union[Dict[str, List[Any]], List[Any]], records: List[Any], overlap: int, shuffle: bool = True
) -> Union[Dict[str, List[Any]], Dict[str, Dict[str, Any]]]:
    """
    Assign records to either groups or individuals, with controlled overlap and optional shuffle.

    Args:
        users: Either a dictionary of groups or a list of individual user objects.
        records: A list of record objects to be assigned.
        overlap: The number of times each record is assigned to consecutive users or groups (0 for no overlap).
        shuffle: If True, the records list will be shuffled before assignment. Defaults to True.

    Returns:
        A dictionary where each key is a group and its value is another dictionary, which maps usernames to their respective assigned records.
        Or a dictionary where keys are usernames and values are lists of assigned records.

    Examples:
        >>> from argilla.client.feedback.utils import assign_records
        >>> individual_assignments = assign_records([user1, user2, user3], records, 0, False)
        >>> group_assignments = assign_records({group1: [user1, user2], group2: [user3]}, records, 1, False)

    """
    if isinstance(users, dict):
        return assign_records_to_groups(users, records, overlap, shuffle)
    elif isinstance(users, list):
        return assign_records_to_individuals(users, records, overlap, shuffle)
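
Because `assign_records` returns one of two shapes (username to records for a list input, group to username to records for a dict input), downstream code may want a single view. The helper below is a hypothetical sketch that is not part of this PR; `flatten_assignments` and the sample data are made up for illustration, and it only assumes the two return shapes described in the docstring.

```python
from typing import Any, Dict, List


def flatten_assignments(assignments: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Return a username -> records mapping for either return shape."""
    flat: Dict[str, List[Any]] = {}
    for key, value in assignments.items():
        if isinstance(value, dict):
            # Group shape: key is a group name, value maps usernames to records.
            for username, user_records in value.items():
                flat.setdefault(username, []).extend(user_records)
        else:
            # Individual shape: key is already a username, value is the record list.
            flat.setdefault(key, []).extend(value)
    return flat


# Hypothetical outputs in the two shapes.
individual_shape = {"alice": ["r0", "r2"], "bob": ["r1"]}
group_shape = {"group_a": {"alice": ["r0"], "bob": ["r0"]}, "group_b": {"carol": ["r1"]}}

print(flatten_assignments(individual_shape))
print(flatten_assignments(group_shape))
```

Note that, as written in the diff, `assign_records` implicitly returns `None` when `users` is neither a dict nor a list, so a caller may want to guard against that before flattening.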


def assign_workspaces(
    assignments: Union[Dict[str, List[Any]], Dict[str, Dict[str, Any]]], workspace_type: str
) -> Dict[str, List[Any]]:
    """
    Assign workspaces (and create them if needed) to either groups or individuals.

    Args:
        assignments: Either a dictionary of groups or a dictionary of users.
        workspace_type: Either 'group' (each group in a workspace), 'group_personal' (each member in a workspace) or 'individual' (each person in a workspace).

    Returns:
        A dictionary where each key is a workspace name and its value is a list of user names.

    Examples:
        >>> from argilla.client.feedback.utils import assign_workspaces
        >>> wk_assignments = assign_workspaces(group_assignments, "group")
        >>> wk_assignments = assign_workspaces(group_assignments, "group_personal")
        >>> wk_assignments = assign_workspaces(individual_assignments, "individual")

    """
    wk_assignments = {}

    for group, users in assignments.items():
        if workspace_type == "group":
            workspace_name = group
            user_ids = [check_user(user).id for user in users.keys()]

        elif workspace_type == "group_personal":
            for user in users.keys():
                workspace_name = user
                user_ids = [check_user(user).id]
                workspace = check_workspace(workspace_name)

                for user_id in user_ids:
                    try:
                        workspace.add_user(user_id)
                    except:
                        pass

                wk_assignments[workspace_name] = [User.from_id(user).username for user in workspace.users]

            continue

        elif workspace_type == "individual":
            workspace_name = group
            user_ids = [check_user(group).id]

        workspace = check_workspace(workspace_name)

        for user_id in user_ids:
            try:
                workspace.add_user(user_id)
            except:
                pass

        wk_assignments[workspace_name] = [User.from_id(user).username for user in workspace.users]

    return wk_assignments
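
The three `workspace_type` modes differ only in how workspace names and members are derived from the assignments. The sketch below computes that mapping as plain data and makes no Argilla calls; `plan_workspaces` and the sample names are hypothetical. Unlike the function in the diff, it lists only the users it would add (the diff reports every user found in the workspace afterwards), and it raises on an unknown mode, which is an assumption about desirable behaviour rather than something the PR does.

```python
from typing import Any, Dict, List


def plan_workspaces(assignments: Dict[str, Any], workspace_type: str) -> Dict[str, List[str]]:
    """Map workspace names to the usernames that would be added to them."""
    plan: Dict[str, List[str]] = {}
    for key, value in assignments.items():
        if workspace_type == "group":
            # One workspace per group, shared by all of its members.
            plan[key] = list(value)
        elif workspace_type == "group_personal":
            # One workspace per member of each group, named after the member.
            for username in value:
                plan[username] = [username]
        elif workspace_type == "individual":
            # Assignments are keyed by username; one workspace per user.
            plan[key] = [key]
        else:
            raise ValueError(f"Unknown workspace_type: {workspace_type!r}")
    return plan


# Hypothetical assignments in the two shapes returned by assign_records.
group_assignments = {"group_a": {"alice": ["r0"], "bob": ["r0"]}, "group_b": {"carol": ["r1"]}}
individual_assignments = {"alice": ["r0"], "bob": ["r1"]}

print(plan_workspaces(group_assignments, "group"))
print(plan_workspaces(group_assignments, "group_personal"))
print(plan_workspaces(individual_assignments, "individual"))
```

Separating the naming plan from the server calls makes it possible to review which workspaces would be created before any user or workspace is touched.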