docs: add end2end example on creating a basic text-classification dataset #4208

Merged
merged 66 commits into from
Nov 29, 2023

Contributor

@plaguss plaguss commented Nov 13, 2023

Description

This PR includes two features towards issue #4178.

  • A tutorial for the creation of a FeedbackDataset for text-classification.
  • A new script to run the notebooks automatically, via the end2end.yml workflow.

Closes #4179 and #4220

Type of change

(Remember to title the PR according to the type of change)

  • Documentation update

How Has This Been Tested

(Please describe the tests that you ran to verify your changes.)

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@plaguss
Contributor Author

plaguss commented Nov 15, 2023

Example running locally (with elasticsearch and argilla quickstart images):

argilla on  docs/end2end-text-classification [!?] via 🐍 v3.10.13 (.venv) on ☁️  (us-east-1) python scripts/end2end_examples.py --api-key admin.apikey
Executing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:19<00:00,  2.33cell/s]
✅  text-classification-create-dataset
Removed output notebook: output-notebook
Removed output folder: output_notebooks

And an example forcing the process to fail with an error in a cell:

python scripts/end2end_examples.py
/home/agustin/github_repos/argilla-io/argilla/.venv/lib/python3.10/site-packages/nbformat/__init__.py:93: MissingIDFieldWarning: Code cell is missing an id field, this will become a hard error in future nbformat versions. You may want to use `normalize()` on your notebooks before validations (available since nbformat 5.1.4). Previous versions of nbformat are fixing this issue transparently, and will stop doing so in the future.
  validate(nb)
Executing:  75%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                        | 3/4 [00:01<00:00,  1.86cell/s]
❌  test_notebook
Traceback (most recent call last):
  File "/home/agustin/github_repos/argilla-io/argilla/scripts/end2end_examples.py", line 65, in <module>
    main()
  File "/home/agustin/github_repos/argilla-io/argilla/scripts/end2end_examples.py", line 56, in main
    example.run()
  File "/home/agustin/github_repos/argilla-io/argilla/scripts/end2end_examples.py", line 39, in run
    raise e from None
  File "/home/agustin/github_repos/argilla-io/argilla/scripts/end2end_examples.py", line 35, in run
    papermill.execute_notebook(str(self.src_filename), str(self.dst_filename), parameters=self.parameters)
  File "/home/agustin/github_repos/argilla-io/argilla/.venv/lib/python3.10/site-packages/papermill/execute.py", line 134, in execute_notebook
    raise_for_execution_errors(nb, output_path)
  File "/home/agustin/github_repos/argilla-io/argilla/.venv/lib/python3.10/site-packages/papermill/execute.py", line 241, in raise_for_execution_errors
    raise error
papermill.exceptions.PapermillExecutionError: 
---------------------------------------------------------------------------
Exception encountered at "In [3]":
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 assert 1==2

AssertionError: 
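
For readers following the traceback, the `run` frames above suggest roughly the following shape. This is a minimal sketch, not the actual contents of `scripts/end2end_examples.py`: the class name `ExampleNotebook` and the injectable `executor` argument are assumptions made for illustration, while the `papermill.execute_notebook(...)` call and the `raise e from None` line are taken from the traceback.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict


def _papermill_executor(src: str, dst: str, parameters: Dict[str, Any]) -> None:
    # Imported lazily so the sketch can be loaded without papermill installed.
    import papermill

    papermill.execute_notebook(src, dst, parameters=parameters)


@dataclass
class ExampleNotebook:  # hypothetical name, not taken from the script
    src_filename: Path
    dst_filename: Path
    parameters: Dict[str, Any] = field(default_factory=dict)
    # Assumption: the executor is injectable so the runner can be tested in isolation.
    executor: Callable[[str, str, Dict[str, Any]], None] = _papermill_executor

    def run(self) -> None:
        try:
            self.executor(str(self.src_filename), str(self.dst_filename), self.parameters)
        except Exception as e:
            # Report the failing notebook, then re-raise without the chained context.
            print(f"❌  {self.src_filename.stem}")
            raise e from None
        print(f"✅  {self.src_filename.stem}")
```

Re-raising after printing the marker keeps the process exit code non-zero, which is what lets a CI step fail when any cell fails.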

run: |
echo "ARGILLA_SEARCH_ENGINE=opensearch" >> "$GITHUB_ENV"
echo "Configure opensearch engine"
- name: Run end2end examples 📈

Since this runs every time, do you think we can filter it a bit and only run it when there are changes to src or examples.py? Also, perhaps we can use a subset of the datasets and/or set up a persistent cache for the datasets, configured the same way as our "Cache pip 👜" step?

This might be relevant in other places we download 'datasets' for our cache.

Contributor Author

Will update that

"hf_token": hf_token,
}

examples = [

Perhaps it is better to do this with glob and select everything in our folder as examples?

Contributor Author

I wasn't sure whether we could add other files here, but I think that's better

Contributor Author

@davidberenstein1957 I think we would like to have these run in a specific order. For that, should we name them using something that can be sorted? (Maybe just a number at the end works fine.)
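
The glob-plus-ordering idea discussed in this thread could be sketched as follows. This is only an illustration under stated assumptions: the helper name `collect_notebooks` and the example file names are hypothetical, and the convention of a trailing number in the file name is the one floated above, not something confirmed in the merged script.

```python
import re
from pathlib import Path
from typing import List, Tuple


def collect_notebooks(folder: Path) -> List[Path]:
    """Return every notebook in `folder`, ordered by the number at the end of its name."""

    def sort_key(path: Path) -> Tuple[int, int, str]:
        match = re.search(r"(\d+)$", path.stem)
        if match:
            # Compare numerically so that 10 sorts after 2.
            return (0, int(match.group(1)), path.stem)
        # Notebooks without a trailing number run last, in name order.
        return (1, 0, path.stem)

    return sorted(folder.glob("*.ipynb"), key=sort_key)
```

Sorting on the parsed integer rather than on the raw file name avoids the usual lexicographic surprise where `train-10` would run before `annotate-2`.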

@plaguss
Contributor Author

plaguss commented Nov 16, 2023

@gabrielmbmb I requested your review to know your opinion on the workflow, but feel free to skip the remaining content of the PR

@plaguss plaguss force-pushed the docs/end2end-text-classification branch from 7be9cd5 to a6fc67a November 16, 2023 10:15
@dvsrepo
Member

dvsrepo commented Nov 16, 2023

@plaguss @davidberenstein1957 if the plan is to run this as part of CI/CD or periodically or for testing/QA purposes please make sure we DON'T track any telemetry as this will affect our understanding of real usage/errors, etc.

@plaguss
Contributor Author

plaguss commented Nov 16, 2023

@plaguss @davidberenstein1957 if the plan is to run this as part of CI/CD or periodically or for testing/QA purposes please make sure we DON'T track any telemetry as this will affect our understanding of real usage/errors, etc.

Sure @dvsrepo, that should be taken into account in the workflow:

        env:
          ARGILLA_ENABLE_TELEMETRY: 0
        run: |
          pip install -e .
          pip install papermill
          python scripts/end2end_examples.py
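
When running the script outside the workflow (for example locally), the same setting can be applied from Python. The sketch below is an assumption-laden illustration: the helper names `build_env` and `run_examples` are hypothetical, and only the `ARGILLA_ENABLE_TELEMETRY` variable and the script path come from the snippet above.

```python
import os
import subprocess
import sys
from typing import Dict, Optional


def build_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the given (or current) environment and switch telemetry off."""
    env = dict(os.environ if base is None else base)
    # Same variable and value as in the workflow step above.
    env["ARGILLA_ENABLE_TELEMETRY"] = "0"
    return env


def run_examples(script: str = "scripts/end2end_examples.py") -> int:
    """Run the examples script in a child process with telemetry disabled."""
    result = subprocess.run([sys.executable, script], env=build_env(), check=False)
    return result.returncode
```

Passing a copied environment to the child process keeps the caller's own environment untouched.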

@dvsrepo
Member

dvsrepo commented Nov 16, 2023

Perfect, @plaguss!

@plaguss plaguss marked this pull request as ready for review November 16, 2023 17:01

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4208-ki24f765kq-no.a.run.app

sdiazlor and others added 6 commits November 29, 2023 09:11
…ion dataset (#4342)

<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

Please include a summary of the changes and the related issue. Please
also include relevant motivation and context. List any dependencies that
are required for this change.

Closes #4184 

**Type of change**

(Remember to title the PR according to the type of change)

- [ ] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes.)

- [x] `sphinx-autobuild` (read [Developer
Documentation](https://docs.argilla.io/en/latest/community/developer_docs.html#building-the-documentation)
for more details)

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

# Description

Please include a summary of the changes and the related issue. Please
also include relevant motivation and context. List any dependencies that
are required for this change.

Closes #4187 

**Type of change**

(Remember to title the PR according to the type of change)

- [ ] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes.)

- [ ] `sphinx-autobuild` (read [Developer
Documentation](https://docs.argilla.io/en/latest/community/developer_docs.html#building-the-documentation)
for more details)

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)
…tion dataset (#4350)

# Description

Please include a summary of the changes and the related issue. Please
also include relevant motivation and context. List any dependencies that
are required for this change.

Closes #4185 

**Type of change**

(Remember to title the PR according to the type of change)

- [x] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes.)

- [x] `sphinx-autobuild` (read [Developer
Documentation](https://docs.argilla.io/en/latest/community/developer_docs.html#building-the-documentation)
for more details)

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)
docs: changed to spacy training instead of trf
@davidberenstein1957 davidberenstein1957 merged commit cf7f67a into develop Nov 29, 2023
2 of 3 checks passed
@davidberenstein1957 davidberenstein1957 deleted the docs/end2end-text-classification branch November 29, 2023 16:16
leiyre pushed a commit that referenced this pull request Dec 5, 2023
* develop: (41 commits)
  chore: update dev version
  chore: update CHANGELOG.md before release v1.20.0 (#4357)
  docs: temporal update to indicate persistent storage (#4355)
  docs: add suggestions and responses filters and sorting (#4345)
  docs: add end2end example on creating a basic text-classification dataset (#4208)
  Fix/responses suggestions filter fine tune (#4356)
  Fix/responses suggestions filter fine tune (#4356)
  fix: Accept draft responses on dataset records creation (#4354)
  Feature/responses operator (#4352)
  Feature/responses operator (#4352)
  chore: increase dev version release to 1.21.0
  chore: remove dev suffix for release branch
  fix: responses and suggestions filter QA (#4337)
  feat: delete suggestion from record on search engine (#4336)
  feat: update suggestion from record on search engine (#4339)
  bug: fix bug and update test (#4341)
  fix: preserve `TextClassificationSettings.label_schema` order (#4332)
  Update issue templates
  feat: 🚀 support for filtering and sorting by responses and suggestions (#4160)
  fix: handling errors for non-existing endpoints (#4325)
  ...

# Conflicts:
#	frontend/v1/domain/entities/question/Question.ts
#	frontend/v1/domain/entities/record/Record.ts
leiyre pushed a commit that referenced this pull request Dec 12, 2023
* develop: (21 commits)
  ✨ Fix error handling in axios plugin for 401 (#4362)
  docs: Change `telemetry` section in tutorials to directly executable cells (#4399)
  docs: add faq files (#4363)
  fix: pinning `pytest-asyncio` to version `0.21.1` to avoid problems running unit tests on GitHub workflows (#4395)
  docs: add making most of markdown to tutorial page (#4376)
  Fixing typo in Fine Tuning LLMs Practical Guides (#4392)
  Token Classification epochs parameter trainer changed (#4393)
  docs: align practical guidescreate datasethtml with end2end examples structure (#4375)
  docs: hugging face space url (#4379)
  docs: extend using proxy section (#4368)
  chore: update dev version
  chore: update CHANGELOG.md before release v1.20.0 (#4357)
  docs: temporal update to indicate persistent storage (#4355)
  docs: add suggestions and responses filters and sorting (#4345)
  docs: add end2end example on creating a basic text-classification dataset (#4208)
  Fix/responses suggestions filter fine tune (#4356)
  Fix/responses suggestions filter fine tune (#4356)
  fix: Accept draft responses on dataset records creation (#4354)
  Feature/responses operator (#4352)
  Feature/responses operator (#4352)
  ...
Labels
size:XXL This PR changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[DOCS] add end2end example on creating a basic text-classification dataset
6 participants