feat: Allow exporting data for SFT, Reward Modelling (related to RLHF), DPO, rename TrainingTaskMapping #3467

tomaarsen · 2023-07-27T11:31:52Z

Resolves #3379, resolves #3377
Closes #3522

Hello!

Pull Request overview

Prepare data for SFT, RM, DPO in TRL.
Rename TrainingTaskMapping to TrainingTask and task_mapping to task.

Description

Prepare data

from typing import Any, Dict

from argilla.feedback import TrainingTask

def formatting_func(sample: Dict[str, Any]):
    # Elided: `template` is assumed to be a format string defined here.
    ...
    yield template.format(
        prompt=sample["prompt"],
        response=sample["response"],
    )

task = TrainingTask.for_supervised_fine_tuning(formatting_func=formatting_func)
ds = fds_dataset.prepare_for_training(framework="trl", task=task)
# -> ds has "text" and "id" columns

Compatible with SFTTrainer.

task = TrainingTask.for_reward_modelling(chosen_rejected_func=chosen_rejected_func)
ds = fds_dataset.prepare_for_training(framework="trl", task=task)
# -> ds has "chosen" and "rejected" columns

Nearly compatible with RewardTrainer.
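
As a rough sketch of what a chosen_rejected_func might look like: the sample layout below (two text fields named "response-1" and "response-2", plus a "preference" question whose responses are dicts with a "value" key) is purely hypothetical, and returning a plain tuple is an assumption as well. Adapt the keys to the fields and questions of your own dataset.

```python
from typing import Any, Dict, Optional, Tuple


def chosen_rejected_func(sample: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return a (chosen, rejected) pair, or None to skip the record.

    Assumption: "response-1", "response-2" and "preference" are hypothetical
    names; they are not taken from any real dataset schema.
    """
    responses = sample.get("preference") or []
    if not responses:
        return None  # no annotation yet, so export nothing for this record
    preferred = responses[0]["value"]
    if preferred == "response-1":
        return sample["response-1"], sample["response-2"]
    if preferred == "response-2":
        return sample["response-2"], sample["response-1"]
    return None  # unknown label, skip the record
```

Returning None for unannotated records keeps them out of the exported "chosen" and "rejected" columns.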

task = TrainingTask.for_direct_preference_optimization(prompt_chosen_rejected_func=prompt_chosen_rejected_func)
ds = fds_dataset.prepare_for_training(framework="trl", task=task)
# -> ds has "prompt", "chosen" and "rejected" columns

Compatible with DPOTrainer.
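
A minimal sketch of a prompt_chosen_rejected_func that turns one ranked record into several preference pairs. The "ranked_responses" key (a list of texts ordered from best to worst) is a hypothetical field invented for this example; only the "prompt" key mirrors the snippets above.

```python
from itertools import combinations
from typing import Any, Dict, Iterator, Tuple


def prompt_chosen_rejected_func(sample: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (prompt, chosen, rejected) for every pair in a ranking.

    Assumption: sample["ranked_responses"] is a hypothetical list of texts
    ordered from best to worst.
    """
    ranked = sample.get("ranked_responses") or []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked, 2):
        yield sample["prompt"], better, worse
```

With three ranked texts this yields three pairs from a single record; a record without a ranking yields nothing.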

Details

I implement this by calling dataset.format_as("datasets") and then passing each sample (a simple dictionary) from this dataset to the function that the user provides. This user-provided function can return None, return one sample, return a list of samples, or yield samples. This allows users to export multiple training samples from a single Argilla record, e.g. when multiple annotators provided useful corrections, or when the annotated record justifies 3 "chosen", "rejected" pairs because there is a ranking between 3 texts.
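
The handling of the four return styles could be sketched roughly as follows. This is an illustration of the idea, not the code in this PR; iter_outputs is a hypothetical helper name, and treating a tuple as a single sample is an assumption.

```python
import inspect
from typing import Any, Callable, Dict, Iterable, Iterator


def iter_outputs(
    func: Callable[[Dict[str, Any]], Any],
    samples: Iterable[Dict[str, Any]],
) -> Iterator[Any]:
    """Flatten whatever the user function produces into a stream of samples."""
    for sample in samples:
        result = func(sample)
        if result is None:
            continue  # the user chose to skip this record
        if inspect.isgenerator(result) or isinstance(result, list):
            yield from result  # several training samples from one record
        else:
            yield result  # exactly one training sample (a str, tuple, ...)
```

Tuples are deliberately not flattened here, so a (chosen, rejected) pair stays a single sample.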

Rename

TrainingTaskMapping is now TrainingTask, because the "mapping" part is unintuitive to users. The same goes for renaming task_mapping to task. Note: code that passes task_mapping=... will now fail. I can make this deprecation softer, but then I would have to make task optional, which I would rather not do.
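
For reference, the softer deprecation mentioned above could look something like this sketch. It is not what this PR implements: the function below is a hypothetical stand-in that only shows the argument handling, and it illustrates why task would have to become optional.

```python
import warnings
from typing import Any, Optional


def prepare_for_training(
    framework: str,
    task: Optional[Any] = None,
    task_mapping: Optional[Any] = None,
) -> Any:
    """Hypothetical stand-in: resolves `task` and returns it."""
    if task_mapping is not None:
        warnings.warn(
            "`task_mapping` is deprecated, use `task` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if task is None:
            task = task_mapping  # fall back to the old keyword
    if task is None:
        raise ValueError("`task` is required.")
    return task
```

The cost is visible in the signature: task gets a None default it does not really want, and the missing-argument check moves from the interpreter into the function body.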

TODO:

Add TRL to ArgillaTrainer, allowing:

task = TrainingTask.for_supervised_fine_tuning(
    formatting_func=formatting_func
) # or any other task from this PR
trainer = ArgillaTrainer(
    dataset=fds_dataset,
    task=task,
    framework="trl",
)
trainer.train()

Consider renaming FeedbackDataset.prepare_for_training to FeedbackDataset.export.
New tests
Add documentation

Type of change

New feature

How Has This Been Tested

Not finished yet.

Checklist

I added relevant documentation
follows the style guidelines of this project
I did a self-review of my code
I made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I filled out the contributor form (see text above)
I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

Tom Aarsen

No deprecation strategy in place here, people's code might fail if they use task_mapping=...

…integration/trl_sft

codecov · 2023-07-27T11:53:15Z

Codecov Report

Patch coverage has no change and project coverage change: +0.47% 🎉

Comparison is base (6630d7b) 90.13% compared to head (d88e93b) 90.61%.
Report is 197 commits behind head on develop.

❗ Current head d88e93b differs from pull request most recent head 50b68f2. Consider uploading reports for the commit 50b68f2 to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3467      +/-   ##
===========================================
+ Coverage    90.13%   90.61%   +0.47%     
===========================================
  Files          233      262      +29     
  Lines        12493    14137    +1644     
===========================================
+ Hits         11261    12810    +1549     
- Misses        1232     1327      +95

Flag	Coverage Δ
pytest	`?`


see 186 files with indirect coverage changes


…integration/trl_sft

This is easier to parse for users writing formatting functions etc.

src/argilla/client/feedback/integrations/huggingface/dataset.py

tomaarsen · 2023-08-03T15:14:01Z

The CI is failing with this message. I've seen this before I think, is it expected?

Perhaps you've seen it before @gabrielmbmb?

…argilla into integration/trl_sft

…integration/trl_sft

tomaarsen · 2023-08-25T09:10:21Z

~~TODO: Changelog~~

Edit: Done

But keep it in prepare_for_training

Co-authored-by: Alvaro Bartolome <[email protected]>

In favour of only having them in the type hints

davidberenstein1957 · 2023-08-27T11:05:58Z

@tomaarsen I restructured the docs. Could you add some final docs to docs/trl-docs-revisited?

# Description

I updated the docs for the `ArgillaTrainer` and the new tasks.

Closes #<issue_number>

**Type of change**

(Remember to title the PR according to the type of change)

- [X] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes.)

- [X] `sphinx-autobuild` (read [Developer Documentation](https://docs.argilla.io/en/latest/community/developer_docs.html#building-the-documentation) for more details)

**Checklist**

- [X] I added relevant documentation
- [X] I followed the style guidelines of this project
- [X] I did a self-review of my code
- [X] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [X] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

---------

Co-authored-by: Tom Aarsen <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>

…ing_func by changing tuple type check tests: fix test_prepare_for_training_sft by proper pydantic typing

@tomaarsen

…asetMixin` (argilla-io#3539)

# Description

This PR addresses the feature mentioned by @tomaarsen at argilla-io#3467, to basically export the `responses` for the existing questions in the `FeedbackDataset` when calling `push_to_huggingface` in a row-based format instead of using the `Sequence` from 🤗`datasets`. This makes the dataset from the HuggingFace Hub more readable, and also easier to use with other frameworks and/or libraries.

```diff
- {"user_id": ["A", "B"], "value": [1, 2], "status": ["C", "D"]}
+ [{"user_id": "A", "value": 1, "status": "C"}, {"user_id": "B", "value": 2, "status": "D"}]
```

Additionally, this PR also ensure that the backwards compatibility is preserved with the previous versions, and assumes the new format as the default one when calling `format_as("datasets")`.

Finally, this PR also solves an issue reported by @nataliaElv recently that was affecting the `suggestions` when calling `FeedbackDataset.from_argilla`, as those were just kept when there were `responses`, otherwise, a `continue` statement was being called so the `suggestions` were completely ignored.

**Type of change**

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And ideally, reference `tests`)

- [X] Ran the following script and tested different combinations as the connection to HuggingFace is mocked, but we should create a fake/testing user at some point to avoid overloading `argilla-io`

```python
import argilla as rg

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(
            name="prompt",
            required=True,
        ),
    ],
    questions=[
        rg.TextQuestion(
            name="response-edit",
            title="Add or edit the response if necessary",
            required=True,
        ),
    ],
)
dataset.add_records(
    rg.FeedbackRecord(
        fields={
            "prompt": "This is the prompt!",
        },
        suggestions=[
            {
                "question_name": "response-edit",
                "value": "This is the suggestion!"
            }
        ],
    )
)
dataset.push_to_huggingface("<REPO_ID>")
dataset = rg.FeedbackDataset.from_huggingface("<REPO_ID>")
assert dataset.records[0].suggestions is not None
```

**Checklist**

- [ ] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [X] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martin <[email protected]>
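
To make the row-based format from the diff above concrete, here is a minimal sketch of the conversion from a column-oriented dict to a list of per-response rows. The helper name columns_to_rows is hypothetical and is not a function from argilla or 🤗`datasets`; it assumes all columns have the same length.

```python
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn {"key": [v1, v2]} columns into [{"key": v1}, {"key": v2}] rows."""
    keys = list(columns)
    # zip(*values) walks all columns in lockstep, one row at a time.
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


rows = columns_to_rows(
    {"user_id": ["A", "B"], "value": [1, 2], "status": ["C", "D"]}
)
```

Each element of rows then describes one response on its own, which is what makes the exported dataset easier to read.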

@tomaarsen

* chore: bump version to `1.14.0`

* chore: add `argilla-office-hours` Calendly link back

* fix: `push_to_argilla` to return `RemoteFeedbackDataset` without re-assigning class (argilla-io#3508)

# Description

This PR adds a `warnings.warn` message to let the users know that the returned object from `FeedbackDataset.push_to_argilla` needs to be handled, otherwise the dataset instance will remain local, as `push_to_argilla` with no arguments won't do anything.

So on, before we had to `push_to_argilla` a new `FeedbackDataset` with name and/or workspaces, and then we could `push_to_argilla` with no arguments to push the updates, but we no longer need to do that, as now we can re-use the returned `FeedbackDataset` from `push_to_argilla` to have a `FeedbackDataset` fully integrated with Argilla.

```diff
  import argilla as rg
  dataset = rg.FeedbackDataset(...)
- dataset.push_to_argilla(name="my-dataset", workspace="my-workspace")
+ remote_dataset = dataset.push_to_argilla(name="my-dataset", workspace="my-workspace")
  dataset.add_records(...)
- dataset.push_to_argilla()
```

**Type of change**

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] Breaking change (fix or feature that would cause existing functionality to not work as expected)

**How Has This Been Tested**

- [X] Catch returned object in `push_to_argilla` and handle `DeprecationWarning`

**Checklist**

- [ ] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [X] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* feat: add `delete` method to `FeedbackDataset` in Argilla (argilla-io#3512)

# Description

This PR adds a `delete` method for `RemoteFeedbackDataset` which is a `FeedbackDataset` that has been pushed to Argilla. So on, the `delete` method deletes a dataset in Argilla, but it's just available for `owner` users and for `admin` users with that dataset within their workspace, otherwise the method won't work and will raise a `PermissionError`.

So on, now both `owner` and `admin` users can delete a `FeedbackDataset` from Argilla as:

```python
import argilla as rg

rg.init(...)

dataset = FeedbackDataset.from_argilla(...)
dataset.delete()
```

Or alternatively

```python
import argilla as rg

rg.init(...)

dataset = FeedbackDataset(...)
remote_dataset = dataset.push_to_argilla(...)
remote_dataset.delete()
```

Closes argilla-io#3413

**Type of change**

- [X] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [X] Add integration tests for `RemoteFeedbackDataset.delete` including every role

**Checklist**

- [ ] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [X] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [X] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

* fix: `publish_dataset` to check `required=True` on at least one field/question (argilla-io#3511)

# Description

This PR fixes a bug with the `PUT /api/v1/datasets/{dataset_id}/publish` endpoint, as there was a missing check on the `required` flag for each field/question which was allowing to publish `FeedbackTask` datasets with no required fields and/or questions.

**Type of change**

- [X] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

- [X] Update existing unit/integration tests
- [X] Add unit/integration tests with `required=True` and `required=False`

**Checklist**

- [ ] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [X] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martin <[email protected]>

* docs: fix link to ASGI middleware tutorial (argilla-io#3518)

# Description

Fix broken link to ASGI middleware tutorial.

**Type of change**

- [x] Documentation update

**How Has This Been Tested**

N/A

**Checklist**

- [ ] I added relevant documentation
- [x] follows the style guidelines of this project
- [x] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

* fix: `thread_ts` is not always present (argilla-io#3538)

# Description

The `slack-post-credentials` action is failing from time to time because not all the messages have the `thread_ts` key. This PR fixes this issue using the `ts` key instead of the `thread_ts`.

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And ideally, reference `tests`)

The deployment job got executed fine for this PR: https://github.com/argilla-io/argilla/actions/runs/5818199084/job/15774977297?pr=3538

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

* fix: restore `suggestions` and `responses` as rows in `HuggingFaceDatasetMixin` (argilla-io#3539)

# Description

This PR addresses the feature mentioned by @tomaarsen at argilla-io#3467, to basically export the `responses` for the existing questions in the `FeedbackDataset` when calling `push_to_huggingface` in a row-based format instead of using the `Sequence` from 🤗`datasets`. This makes the dataset from the HuggingFace Hub more readable, and also easier to use with other frameworks and/or libraries.

```diff
- {"user_id": ["A", "B"], "value": [1, 2], "status": ["C", "D"]}
+ [{"user_id": "A", "value": 1, "status": "C"}, {"user_id": "B", "value": 2, "status": "D"}]
```

Additionally, this PR also ensure that the backwards compatibility is preserved with the previous versions, and assumes the new format as the default one when calling `format_as("datasets")`.

Finally, this PR also solves an issue reported by @nataliaElv recently that was affecting the `suggestions` when calling `FeedbackDataset.from_argilla`, as those were just kept when there were `responses`, otherwise, a `continue` statement was being called so the `suggestions` were completely ignored.

**Type of change**

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And ideally, reference `tests`)

- [X] Ran the following script and tested different combinations as the connection to HuggingFace is mocked, but we should create a fake/testing user at some point to avoid overloading `argilla-io`

```python
import argilla as rg

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(
            name="prompt",
            required=True,
        ),
    ],
    questions=[
        rg.TextQuestion(
            name="response-edit",
            title="Add or edit the response if necessary",
            required=True,
        ),
    ],
)
dataset.add_records(
    rg.FeedbackRecord(
        fields={
            "prompt": "This is the prompt!",
        },
        suggestions=[
            {
                "question_name": "response-edit",
                "value": "This is the suggestion!"
            }
        ],
    )
)
dataset.push_to_huggingface("<REPO_ID>")
dataset = rg.FeedbackDataset.from_huggingface("<REPO_ID>")
assert dataset.records[0].suggestions is not None
```

**Checklist**

- [ ] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [X] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martin <[email protected]>

* docs: Update dataset page (argilla-io#3535)

# Description

New page to showcase different workflows to update and make changes to an existing Argilla dataset.

Closes argilla-io#3534

**Type of change**

(Remember to title the PR according to the type of change)

- [x] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes.)

- [x] `sphinx-autobuild` (read [Developer Documentation](https://docs.argilla.io/en/latest/community/developer_docs.html#building-the-documentation) for more details)

**Checklist**

- [ ] I added relevant documentation
- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

---------

Co-authored-by: Alvaro Bartolome <[email protected]>

* docs: align docs with `FeedbackDataset` refactor using `tab-set` (argilla-io#3531)

This PR adds the `tab-set` for the outdated code-blocks with the upcoming release of Argilla v1.14.0 so as to showcase both the previous and the new code-blocks for users that are still using Argilla v1.14.0 or lower.

Additionally, the style has been fixed in some documents while reviewing the outdated code-blocks, some file paths references have been fixed to point to the HTML files via relative paths, and some minor details.

Finally, a `warning` has also been included to let the users know that Argilla v1.14.0 won't work for the moment with the `ArgillaCallbackHandler` in `LangChain`.

**Type of change**

- [X] Documentation update

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>

* fix: Argilla not working behind proxy (argilla-io#3543)

# Description

This PR updates the URL in which the Argilla App is mounted to be `"/"`, as it's not required to change the URL of the server because the proxy will know what rewrite has to be done. In addition, the entrypoint scripts of both images have been updated to add the `--root-path $ARGILLA_BASE_URL` option to the `uvicorn` command if `ARGILLA_BASE_URL` env variable has been set.

More info: [FastAPI - Behind a proxy](https://fastapi.tiangolo.com/advanced/behind-a-proxy/#behind-a-proxy).

Closes argilla-io#3542

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

- [x] `localhost/argilla` load in a web browser and can connect using `rg.init` with an NGINX local setup:

<details>
<summary>Local Nginx</summary>

`docker-compose.yaml`:

```yaml
version: '3.8'

services:
  argilla:
    image: argilla/argilla-quickstart:pr-3543
    environment:
      ARGILLA_BASE_URL: /argilla
      LOAD_DATASETS: none
    ports:
      - 6900:6900

  nginx:
    image: nginx:latest
    ports:
      - 80:80
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
```

`nginx.conf`:

```
events {}

http {
    server {
        listen 80;
        server_name your_server_name_or_ip;

        location /argilla/ {
            proxy_pass http://argilla:6900/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```

</details>

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

* fix: execute build docker only if build python (argilla-io#3547)

# Description

This PR updates the `package.yaml` workflow to **only** execute the `build_server_docker_image` job if the `build_python_package` has been executed and it's execution has succeeded (as the first job needs the python package artifact generated by the second one)

Closes argilla-io#3544

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

N/A

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

* chore: bump frontend version to `1.14.0`

* fix: `{FieldSchema, QuestionSchema}.name` attribute didn't have regex validation (argilla-io#3550)

# Description

This PR adds the regex validation to the `{FieldSchema, QuestionSchema}.name` attribute. This regex validation is already present in the server side, adding it to the client will allow to raise a `ValidationError` before calling the server.

Closes argilla-io#3548

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

Added new unit tests covering the described issue above.

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

---------

Co-authored-by: Alvaro Bartolome <[email protected]>

---------

Co-authored-by: gabrielmbmb <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>
Co-authored-by: Gabriel Martin <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>

@tomaarsen

* chore: bump version to `1.14.0` * chore: add `argilla-office-hours` Calendly link back * fix: `push_to_argilla` to return `RemoteFeedbackDataset` without re-assigning class (argilla-io#3508) # Description This PR adds a `warnings.warn` message to let the users know that the returned object from `FeedbackDataset.push_to_argilla` needs to be handled, otherwise the dataset instance will remain local, as `push_to_argilla` with no arguments won't do anything. So on, before we had to `push_to_argilla` a new `FeedbackDataset` with name and/or workspaces, and then we could `push_to_argilla` with no arguments to push the updates, but we no longer need to do that, as now we can re-use the returned `FeedbackDataset` from `push_to_argilla` to have a `FeedbackDataset` fully integrated with Argilla. ```diff import argilla as rg dataset = rg.FeedbackDataset(...) - dataset.push_to_argilla(name="my-dataset", workspace="my-workspace") + remote_dataset = dataset.push_to_argilla(name="my-dataset", workspace="my-workspace") dataset.add_records(...) 
- dataset.push_to_argilla() ``` **Type of change** - [X] Bug fix (non-breaking change which fixes an issue) - [X] Breaking change (fix or feature that would cause existing functionality to not work as expected) **How Has This Been Tested** - [X] Catch returned object in `push_to_argilla` and handle `DeprecationWarning` **Checklist** - [ ] I added relevant documentation - [X] follows the style guidelines of this project - [X] I did a self-review of my code - [ ] I made corresponding changes to the documentation - [ ] My changes generate no new warnings - [X] I have added tests that prove my fix is effective or that my feature works - [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above) - [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/) --------- * feat: add `delete` method to `FeedbackDataset` in Argilla (argilla-io#3512) # Description This PR adds a `delete` method for `RemoteFeedbackDataset` which is a `FeedbackDataset` that has been pushed to Argilla. So on, the `delete` method deletes a dataset in Argilla, but it's just available for `owner` users and for `admin` users with that dataset within their workspace, otherwise the method won't work and will raise a `PermissionError`. So on, now both `owner` and `admin` users can delete a `FeedbackDataset` from Argilla as: ```python import argilla as rg rg.init(...) dataset = FeedbackDataset.from_argilla(...) dataset.delete() ``` Or alternatively ```python import argilla as rg rg.init(...) dataset = FeedbackDataset(...) remote_dataset = dataset.push_to_argilla(...) 
remote_dataset.delete() ``` Closes argilla-io#3413 **Type of change** - [X] New feature (non-breaking change which adds functionality) **How Has This Been Tested** - [X] Add integration tests for `RemoteFeedbackDataset.delete` including every role **Checklist** - [ ] I added relevant documentation - [X] follows the style guidelines of this project - [X] I did a self-review of my code - [ ] I made corresponding changes to the documentation - [X] My changes generate no new warnings - [X] I have added tests that prove my fix is effective or that my feature works - [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above) - [X] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/) * fix: `publish_dataset` to check `required=True` on at least one field/question (argilla-io#3511) # Description This PR fixes a bug with the `PUT /api/v1/datasets/{dataset_id}/publish` endpoint, as there was a missing check on the `required` flag for each field/question which was allowing to publish `FeedbackTask` datasets with no required fields and/or questions. **Type of change** - [X] Bug fix (non-breaking change which fixes an issue) **How Has This Been Tested** - [X] Update existing unit/integration tests - [X] Add unit/integration tests with `required=True` and `required=False` **Checklist** - [ ] I added relevant documentation - [X] follows the style guidelines of this project - [X] I did a self-review of my code - [ ] I made corresponding changes to the documentation - [ ] My changes generate no new warnings - [X] I have added tests that prove my fix is effective or that my feature works - [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above) - [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/) --------- * docs: fix link to ASGI middleware tutorial (argilla-io#3518) # Description Fix broken link to ASGI middleware tutorial. 
**Type of change**

- [x] Documentation update

**How Has This Been Tested**

N/A

**Checklist**

- [ ] I added relevant documentation
- [x] follows the style guidelines of this project
- [x] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

* fix: `thread_ts` is not always present (argilla-io#3538)

# Description

The `slack-post-credentials` action is failing from time to time because not all the messages have the `thread_ts` key. This PR fixes this issue using the `ts` key instead of the `thread_ts`.

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And ideally, reference `tests`)

The deployment job got executed fine for this PR: https://github.com/argilla-io/argilla/actions/runs/5818199084/job/15774977297?pr=3538

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

* fix: restore `suggestions` and `responses` as rows in `HuggingFaceDatasetMixin` (argilla-io#3539)

# Description

This PR addresses the feature mentioned by @tomaarsen at argilla-io#3467, to basically export the `responses` for the existing questions in the `FeedbackDataset` when calling `push_to_huggingface` in a row-based format instead of using the `Sequence` from 🤗`datasets`.
This makes the dataset from the HuggingFace Hub more readable, and also easier to use with other frameworks and/or libraries.

```diff
- {"user_id": ["A", "B"], "value": [1, 2], "status": ["C", "D"]}
+ [{"user_id": "A", "value": 1, "status": "C"}, {"user_id": "B", "value": 2, "status": "D"}]
```

Additionally, this PR also ensures that backwards compatibility with the previous versions is preserved, and assumes the new format as the default one when calling `format_as("datasets")`.

Finally, this PR also solves an issue reported by @nataliaElv recently that was affecting the `suggestions` when calling `FeedbackDataset.from_argilla`, as those were just kept when there were `responses`, otherwise, a `continue` statement was being called so the `suggestions` were completely ignored.

**Type of change**

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And ideally, reference `tests`)

- [X] Ran the following script and tested different combinations as the connection to HuggingFace is mocked, but we should create a fake/testing user at some point to avoid overloading `argilla-io`

```python
import argilla as rg

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(
            name="prompt",
            required=True,
        ),
    ],
    questions=[
        rg.TextQuestion(
            name="response-edit",
            title="Add or edit the response if necessary",
            required=True,
        ),
    ],
)

dataset.add_records(
    rg.FeedbackRecord(
        fields={
            "prompt": "This is the prompt!",
        },
        suggestions=[
            {
                "question_name": "response-edit",
                "value": "This is the suggestion!"
            }
        ],
    )
)

dataset.push_to_huggingface("<REPO_ID>")

dataset = rg.FeedbackDataset.from_huggingface("<REPO_ID>")
assert dataset.records[0].suggestions is not None
```

**Checklist**

- [ ] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [X] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

* docs: Update dataset page (argilla-io#3535)

# Description

New page to showcase different workflows to update and make changes to an existing Argilla dataset.

Closes argilla-io#3534

**Type of change**

(Remember to title the PR according to the type of change)

- [x] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes.)

- [x] `sphinx-autobuild` (read [Developer Documentation](https://docs.argilla.io/en/latest/community/developer_docs.html#building-the-documentation) for more details)

**Checklist**

- [ ] I added relevant documentation
- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

---------

* docs: align docs with `FeedbackDataset` refactor using `tab-set` (argilla-io#3531)

This PR adds the `tab-set` for the outdated code-blocks with the upcoming release of Argilla v1.14.0 so as to showcase both the previous and the new code-blocks for users that are still using Argilla v1.14.0 or lower.
Additionally, the style has been fixed in some documents while reviewing the outdated code-blocks, some file path references have been fixed to point to the HTML files via relative paths, and some minor details have been corrected.

Finally, a `warning` has also been included to let the users know that Argilla v1.14.0 won't work for the moment with the `ArgillaCallbackHandler` in `LangChain`.

**Type of change**

- [X] Documentation update

---------

* fix: Argilla not working behind proxy (argilla-io#3543)

# Description

This PR updates the URL in which the Argilla App is mounted to be `"/"`, as it's not required to change the URL of the server because the proxy will know what rewrite has to be done. In addition, the entrypoint scripts of both images have been updated to add the `--root-path $ARGILLA_BASE_URL` option to the `uvicorn` command if `ARGILLA_BASE_URL` env variable has been set.

More info: [FastAPI - Behind a proxy](https://fastapi.tiangolo.com/advanced/behind-a-proxy/#behind-a-proxy).

Closes argilla-io#3542

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

- [x] `localhost/argilla` load in a web browser and can connect using `rg.init` with an NGINX local setup:

<details>
<summary>Local Nginx</summary>

`docker-compose.yaml`:

```yaml
version: '3.8'

services:
  argilla:
    image: argilla/argilla-quickstart:pr-3543
    environment:
      ARGILLA_BASE_URL: /argilla
      LOAD_DATASETS: none
    ports:
      - 6900:6900

  nginx:
    image: nginx:latest
    ports:
      - 80:80
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
```

`nginx.conf`:

```
events {}

http {
    server {
        listen 80;
        server_name your_server_name_or_ip;

        location /argilla/ {
            proxy_pass http://argilla:6900/;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```

</details>

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

* fix: execute build docker only if build python (argilla-io#3547)

# Description

This PR updates the `package.yaml` workflow to **only** execute the `build_server_docker_image` job if the `build_python_package` job has been executed and its execution has succeeded (as the first job needs the python package artifact generated by the second one).

Closes argilla-io#3544

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

N/A

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

* chore: bump frontend version to `1.14.0`

* fix: `{FieldSchema, QuestionSchema}.name` attribute didn't have regex validation (argilla-io#3550)

# Description

This PR adds the regex validation to the `{FieldSchema, QuestionSchema}.name` attribute. This regex validation is already present on the server side; adding it to the client allows raising a `ValidationError` before calling the server.

Closes argilla-io#3548

**Type of change**

- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested**

Added new unit tests covering the described issue above.

**Checklist**

- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See https://keepachangelog.com/)

---------

---------

Co-authored-by: gabrielmbmb <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>
Co-authored-by: Gabriel Martin <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
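
For reference, the row-based layout described in the `HuggingFaceDatasetMixin` fix (argilla-io#3539) in the commit message above can be illustrated with a small standalone sketch. This is not the code from that PR: `columns_to_rows` is a hypothetical helper name, and it only shows how the old column-style dictionary from the diff maps onto the new list of per-response dictionaries.

```python
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Convert {"key": [v1, v2]} into [{"key": v1}, {"key": v2}].

    Hypothetical helper for illustration only, not part of Argilla.
    """
    if not columns:
        return []
    # Every column must describe the same number of responses.
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    keys = list(columns)
    # zip(*values) walks the columns in parallel, one response per step.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


old_format = {"user_id": ["A", "B"], "value": [1, 2], "status": ["C", "D"]}
new_format = columns_to_rows(old_format)
print(new_format)
```

Each element of the result is one complete response, which is what makes the exported dataset easier to read than parallel lists.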

tomaarsen added 13 commits July 17, 2023 08:34

Prevent error if no annotated data exists

85a4096

Set 'id' as integers instead of string 'None'

38313a4

Fix broken docstrings

82b1f83

Add RankingStrategy and Unification to __init__

b088adf

Allow preparing data for Supervised Finetuning w. TRL

0ed02ba

Use formatting_func instead for preparing SFT

20a922b

Rename 'TrainingTaskMapping...' to 'TrainingTask...'

0dc961c

Rename 'TrainingTaskMapping' to 'TrainingTask' in docs

e7277c2

Rename task_mapping parameter to task

661d6c5

No deprecation strategy in place here, people's code might fail if they use task_mapping=...

Update incorrect task type hint

23b2919

Merge branch 'develop' of https://github.com/argilla-io/argilla into …

1851f9a

…integration/trl_sft

Prepare data for reward modelling

f25aedc

Support preparing data for DPO

beed960
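
To make the reward modelling and DPO preparation commits above more concrete, here is a minimal sketch of a user-provided `chosen_rejected_func` for the case mentioned in the PR description, where a ranking between 3 texts justifies 3 "chosen", "rejected" pairs. The sample layout is an assumption: `ranked_responses` is a hypothetical key holding texts ordered best-first, whereas real samples come from `dataset.format_as("datasets")` and will have different keys. The pairs are assumed to be `(chosen, rejected)` tuples.

```python
from itertools import combinations
from typing import Any, Dict, Iterator, Tuple


def chosen_rejected_func(sample: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield one (chosen, rejected) pair per pair of ranked texts.

    Assumes a hypothetical "ranked_responses" key, ordered best-first.
    """
    ranked = sample.get("ranked_responses") or []
    # combinations() keeps the input order, so the first element of each
    # pair is always ranked above the second one.
    for better, worse in combinations(ranked, 2):
        yield better, worse


sample = {"ranked_responses": ["best answer", "okay answer", "poor answer"]}
pairs = list(chosen_rejected_func(sample))
for chosen, rejected in pairs:
    print(f"chosen={chosen!r} rejected={rejected!r}")
```

A function of this shape would be passed as `chosen_rejected_func=` to `TrainingTask.for_reward_modelling(...)`, as in the PR description. A sample with fewer than two ranked texts simply yields nothing.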

tomaarsen added 12 commits July 28, 2023 12:55

Set up initial ArgillaTRLTrainer skeleton

01b1e63

Implement SFTTrainer, RewardTrainer, DPOTrainer

ecb77df

Allow updating configs, add __repr__

607d9eb

Prevent predict from being used

8463184

Merge branch 'develop' of https://github.com/argilla-io/argilla into …

df501c1

…integration/trl_sft

When there's no annotations, use a dictionary rather than empty

ca0836e

This is easier to parse for users writing formatting functions etc.

Add initial tests for preparing data for the 3 TRL tasks

97b850a

Prevent crash when using RM/DPO with the old Mapping

9db4811

Add train_size to all TRL tests

d4da17c

Add deprecation tests

9d76552

Add extra cases: returning None, one sample, multiple samples

212a15f

Add ArgillaTrainer tests

732f838
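
The extra cases covered in 212a15f (returning `None`, one sample, or multiple samples) follow the PR description, which says the user function may return `None`, one sample, a list of samples, or yield samples. A rough sketch of how those return shapes could be normalised into a single list is shown below. `collect_samples` is a hypothetical helper for illustration and not the implementation in this PR, and the `"text"` and `"corrections"` keys are made up for the example.

```python
import inspect
from typing import Any, Callable, Dict, List


def collect_samples(func: Callable[[Dict[str, Any]], Any], sample: Dict[str, Any]) -> List[Any]:
    """Call a user function and return its output as a flat list of samples."""
    result = func(sample)
    if result is None:
        # The user chose to skip this record.
        return []
    if inspect.isgenerator(result) or isinstance(result, list):
        # Several training samples from a single record.
        return list(result)
    # Exactly one training sample (e.g. a string or a tuple).
    return [result]


def skip_unannotated(sample):
    return sample["text"] if sample.get("text") else None


def one_per_correction(sample):
    for correction in sample.get("corrections", []):
        yield f"{sample['text']} -> {correction}"


print(collect_samples(skip_unannotated, {"text": ""}))
print(collect_samples(skip_unannotated, {"text": "hello"}))
print(collect_samples(one_per_correction, {"text": "helo", "corrections": ["hello", "help"]}))
```

Treating tuples as a single sample keeps `(chosen, rejected)` pairs intact, while lists and generators are expanded into multiple samples.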

tomaarsen commented Aug 3, 2023

src/argilla/client/feedback/integrations/huggingface/dataset.py (outdated)

tomaarsen added 2 commits August 3, 2023 17:32

Add trl to dev deps

90bade2

merge 'develop' into integration/trl_sft

6c95420

tomaarsen added 3 commits August 25, 2023 11:02

Add Callable type hint

e677447

Merge branch 'feat/integration_trl' of https://github.com/argilla-io/…

9920156

…argilla into integration/trl_sft

Merge branch 'develop' of https://github.com/argilla-io/argilla into …

fcfe57b

…integration/trl_sft

tomaarsen and others added 6 commits August 25, 2023 11:12

Add integration deps to pyproject.toml

f357c53

Remove fetch_records from the ArgillaTrainer

2b576df

But keep it in prepare_for_training

Reintroduce spacy tests

c612fc4

Also require transformers and torch

4398e6c

Co-authored-by: Alvaro Bartolome <[email protected]>

Add Optional where needed

adc8a10

Co-authored-by: Alvaro Bartolome <[email protected]>

Add missing None on init type hint

d0160da

Co-authored-by: Alvaro Bartolome <[email protected]>

This was referenced Aug 25, 2023

[FEATURE] ArgillaTrainer - allow passing initialized model & tokenizer #3631

Closed

[FEATURE] ArgillaTrainer - allow passing device parameter #3632

Closed

tomaarsen and others added 4 commits August 25, 2023 12:27

Use gpt2-medium as the default instead of gpt2

7ce2152

Co-authored-by: Alvaro Bartolome <[email protected]>

Add type hints for _format_data

56f3ad5

Remove type hints from docstrings

1eff208

In favour of only having them in the type hints

Remove unnecessary mark.usefixtures

7e2c28e

This was referenced Aug 25, 2023

[FEATURE] ArgillaTrainer - add push_to_hub method #3633

Closed

[FEATURE] ArgillaTrainer - Automatic model card generation on save #3634

Closed

tomaarsen added 2 commits August 25, 2023 16:50

Rename modelling to modeling

5ac300f

Add changelog entry

d88e93b

tomaarsen mentioned this pull request Aug 25, 2023

[FEATURE] ArgillaTrainer - Allow for generative predictions #3635

Closed

davidberenstein1957 and others added 3 commits August 28, 2023 15:04

tests: added tests to test for wrong output formatting_func

bff2d1d

tests: fix test_prepare_for_training_text_classification_with_formatt…

50b68f2

…ing_func by changing tuple type check tests: fix test_prepare_for_training_sft by proper pydantic typing

davidberenstein1957 merged commit 76d9a4b into argilla-io:develop Aug 28, 2023
4 of 15 checks passed

artikandri mentioned this pull request Oct 30, 2023

Feat: Updated main branch with additional features (#18) CLARIN-PL/argilla#20

Merged


tomaarsen commented Jul 27, 2023 (edited by davidberenstein1957)

codecov bot commented Jul 27, 2023 (edited)

tomaarsen commented Aug 3, 2023 (edited)

tomaarsen commented Aug 25, 2023 (edited)

davidberenstein1957 commented Aug 27, 2023
