Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding utils module and functions #4121

Merged

Conversation

sdiazlor
Copy link
Contributor

@sdiazlor sdiazlor commented Nov 2, 2023

Description

Adding a new utils module :

  • submodule: HTML (media_to_html and create_tokens_highlights)
  • submodule: records (assign records)

Closes #4030
Closes #4003
Closes #3803
Closes #3928
Closes #4031

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • New feature (non-breaking change which adds functionality)
  • Refactor (change restructuring the codebase without changing functionality)
  • Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

  • Test A
  • Test B

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@sdiazlor sdiazlor self-assigned this Nov 2, 2023
@sdiazlor sdiazlor linked an issue Nov 2, 2023 that may be closed by this pull request
@sdiazlor sdiazlor marked this pull request as draft November 2, 2023 16:56
Copy link

github-actions bot commented Nov 2, 2023

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4121-ki24f765kq-no.a.run.app

Copy link
Member

@davidberenstein1957 davidberenstein1957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/argilla/client/feedback/utils/html.py Outdated Show resolved Hide resolved


def create_token_highlights(
tokens: List[str], weights: List[float], c_map: Optional[Union[str, Callable]] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tokens: List[str], weights: List[float], c_map: Optional[Union[str, Callable]] = None
tokens: List[str], weights: List[float], c_map: Optional[Union[str, Callable]] = "viridis"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdiazlor we would also need to make some changes later in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidberenstein1957 yeah, of course, let me know to talk more about the approach

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdiazlor we just need to remove the if None check we use lateron

}


def media_to_html(media_type: str, path: str, file_type: Optional[str] = None) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, could we also potentially allow for passing a byte string directly instead of passing a path? Perhaps we can check to allow for a path or non-b64encoded string.

}


def media_to_html(media_type: str, media_source: Union[str, bytes], file_type: Optional[str] = None) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is bytes defined?

]

# Get the color map if set to None or not indicated
if c_map is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this part

@davidberenstein1957
Copy link
Member

Hi @sdiazlor, I think we can just use a small or a subset of the examples to showcase the point without requiring users to download a lot of content from the HF datasets for the docs/_source/tutorials/notebooks/making-most-of-markdown.ipynb. Also, could you include some other reference to multi-modal support to the documentation and potentially some task_templates for for_image_classification/for_audio_classification/multi_model_classification or multi_modal_to_text (not really sure about the naming that covers the use case most explicitly)?

Perhaps we can add a reference to this section to the tutorial about making most of markdown? https://docs.argilla.io/en/latest/practical_guides/create_dataset.html#define-fields

@sdiazlor sdiazlor linked an issue Nov 14, 2023 that may be closed by this pull request
@davidberenstein1957
Copy link
Member

@sdiazlor, this looks great :)

@davidberenstein1957
Copy link
Member

@sdiazlor , can you also add the same md reference in the markdown section for both questions and fields?

@sdiazlor
Copy link
Contributor Author

@davidberenstein1957 I added in assignment.py the whole assign_records logic (including the part for the docs), let me know your comments

# Step 3: push dataset/s


def assign_workspaces(assignments, workspace_type):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thionk we should add some docstrings and type hinting here

for group, users in assignments.items():
if workspace_type == "group":
workspace_name = group
user_ids = [User.from_name(username).id for username in users.keys()]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also check if the user exists

assignments = {user.username: [] for user in users}
num_users = len(users)

for idx, record in enumerate(records):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use a rich progress bar here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns:
A dictionary where keys are usernames and values are lists of assigned records.
"""
if overlap < 0 or overlap >= len(users):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say overlap is a percentage between 0 and 1 right? Do you think it might be easy to add this? @nataliaElv what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you may want to have 2 or 3 annotators giving responses to one record, so this could be greater than 1? But let me know if I'm misunderstanding what the function is doing here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right @nataliaElv, I thought wrongly about this.l

num_groups = len(group_names)

group_records = defaultdict(list)
for idx, record in enumerate(records):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add a rich progress bar here?

Raises:
ValueError: If `overlap` is higher than the number of groups or negative.
"""
if overlap < 0 or overlap >= len(groups.keys()):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as below about opverlap in percentage and # per users

@sdiazlor sdiazlor marked this pull request as ready for review November 24, 2023 11:13
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. area: python sdk Indicates that an issue or pull request is related to the Python SDK language: python Pull requests or issues that update Python code team: backend Indicates that the issue or pull request is owned by the backend team type: enhancement Indicates new feature requests labels Nov 24, 2023
Copy link

codecov bot commented Nov 24, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6630d7b) 90.13% compared to head (bfbb5b0) 64.93%.
Report is 489 commits behind head on develop.

❗ Current head bfbb5b0 differs from pull request most recent head 33158df. Consider uploading reports for the commit 33158df to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #4121       +/-   ##
============================================
- Coverage    90.13%   64.93%   -25.21%     
============================================
  Files          233      324       +91     
  Lines        12493    18708     +6215     
============================================
+ Hits         11261    12148      +887     
- Misses        1232     6560     +5328     
Flag Coverage Δ
pytest ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@davidberenstein1957 davidberenstein1957 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sdiazlor, this looks great and is almost ready to be merged. Can you also update the docs w.r.t.

https://docs.argilla.io/en/latest/practical_guides/assign_records.html
https://docs.argilla.io/en/latest/getting_started/installation/configurations/image_support.html

Additionally, token highlights have not been referenced anywhere and might be mentioned somewhere in the docs or tutorial. Where do you think it should fit?

Also the changelog can be slightly extended :)

@sdiazlor
Copy link
Contributor Author

Hi @sdiazlor, this looks great and is almost ready to be merged. Can you also update the docs w.r.t.

https://docs.argilla.io/en/latest/practical_guides/assign_records.html https://docs.argilla.io/en/latest/getting_started/installation/configurations/image_support.html

Additionally, token highlights have not been referenced anywhere and might be mentioned somewhere in the docs or tutorial. Where do you think it should fit?

Also the changelog can be slightly extended :)

Regarding the token highlights, I'll mention them in the tutorial, I think that maybe add them when referencing to markdown in create a dataset is too much info as we also mention media support

```{note}
If you haven't done so already, decide on the settings of the project (the `fields`, `questions` and `guidelines`) as detailed in the [Create a Feedback Dataset guide](/practical_guides/create_dataset.md#feedback-dataset) and set those as variables.
from argilla.client.feedback.utils import assign_records
assignments = assign_records(users, records, 1, True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can use the explicit variable names users=users, records=records etc.

```
:::
from argilla.client.feedback.utils import assign_workspaces
wk_assignments = assign_workspaces(assignments, "individual")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assignments like it is use below

fields = [...]
questions = [...]
guidelines = "..."
# assign_workspace(assignments, workspace_type): to check the workspaces

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you can also use explicit variable name

assign_workspace(
    assignments=assignments, 
    workspace_type="group"
)

you can add additional context about the workspace types outside of the code block

@sdiazlor
Copy link
Contributor Author

@davidberenstein1957 I updated the last changes

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 27, 2023
@davidberenstein1957 davidberenstein1957 merged commit 3c78e5a into develop Nov 27, 2023
2 checks passed
@davidberenstein1957 davidberenstein1957 deleted the feat/4030-feature-add-a-utils-module-to-the-package branch November 27, 2023 16:38
davidberenstein1957 added a commit that referenced this pull request Nov 29, 2023
<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description
Adding a new utils module :
* submodule: HTML (media_to_html and create_tokens_highlights)
* submodule: records (assign records)

Closes #4030 
Closes #4003 
Closes #3803 
Closes #3928
Closes #4031

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactor (change restructuring the codebase without changing
functionality)
- [ ] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

- [ ] Test A
- [ ] Test B

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: David Berenstein <[email protected]>
leiyre pushed a commit that referenced this pull request Nov 29, 2023
* develop: (30 commits)
  chore: increase dev version release to 1.21.0
  fix: responses and suggestions filter QA (#4337)
  feat: delete suggestion from record on search engine (#4336)
  feat: update suggestion from record on search engine (#4339)
  bug: fix bug and update test (#4341)
  fix: preserve `TextClassificationSettings.label_schema` order (#4332)
  Update issue templates
  feat: 🚀 support for filtering and sorting by responses and suggestions (#4160)
  fix: handling errors for non-existing endpoints (#4325)
  feat: adding utils module and functions (#4121)
  Update labels in github workflows (#4315)
  fix: correct unification implementation for `RankingQuestionStrategy` (#4295)
  fix: update to solve the error of integration tests in CI (#4314)
  docs: revisit install process (#4261)
  feat: increase timeout minutes for python tests (#4307)
  docs: docs export dataset does not apply coloring for code snippets (#4296)
  docs: update final section of the rag haystack blog post (#4294)
  feat: add multi_modal templates and update vector setting (#4283)
  feat: better logging bar for FeedbackDataset (#4267)
  refactor: ArgillaTrainer for unified variable usage (#4214)
  ...

# Conflicts:
#	frontend/v1/infrastructure/repositories/RecordRepository.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: python sdk Indicates that an issue or pull request is related to the Python SDK language: python Pull requests or issues that update Python code lgtm This PR has been approved by a maintainer size:XXL This PR changes 1000+ lines, ignoring generated files. team: backend Indicates that the issue or pull request is owned by the backend team type: enhancement Indicates new feature requests
Projects
None yet
3 participants