Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOCS] add migration notebook to docs format #5002

Merged
merged 9 commits into from
Jun 20, 2024

Conversation

burtenshaw
Copy link
Contributor

@burtenshaw burtenshaw commented Jun 12, 2024

This PR transfer the notebook guide by @frascuchon and @nataliaElv over to the documentation format, which makes it easier to follow.

You can find the page here: http://localhost:8000/argilla-python/guides/how_to_guides/migrate_from_legacy_datasets/

n.b. the page is not available via the menu. @davidberenstein1957 we should figure out navigation presentation for this page.

The guide looks like this:

image

@burtenshaw burtenshaw changed the title docs: add migration notebook to docs format [DOCS] add migration notebook to docs format Jun 12, 2024
@burtenshaw burtenshaw marked this pull request as ready for review June 12, 2024 11:47
@davidberenstein1957
Copy link
Member

Hi @burtenshaw, looks good. Adding it to guides makes sense to me too.

I have not taken the time to look at the migration guides yet but I think we might be able to make the migration a lot more user-friendly by providing code that maps everything to the Datasets using the 2.0 SDK automatically.

TextClassification -> TextField ("name=text") | MultiLabel|LabelQuestion ("name=label") + metadata configs
TokenClassification -> TextField ("name=text") | SpanQuestion ("name=span") + metadata configs
Text2Text -> TextField ("name=text") | TextQuestion ("name=text") + metadata configs

It takes us some more time to figure out the logic and mappings but will save a lot of community members a lot of time and effort too. @frascuchon @burtenshaw WDYT?

@nataliaElv
Copy link
Member

The document says that Feedback datasets do not need to be migrated, but if you have a server with legacy+feedback datasets I assume that one would like to move all of them over to the new server, right? @burtenshaw

We may also want to discuss what to do with the migration of predictions to suggestions in multi-label text classification datasets. If I'm not mistaken, with the current code all labels would be selected as suggestions (as in legacy datasets you can have a prediction for each label). Maybe we can add a threshold that the user can modify so only labels where score > threshold are added to the suggestions?

@davidberenstein1957
Copy link
Member

The document says that Feedback datasets do not need to be migrated, but if you have a server with legacy+feedback datasets I assume that one would like to move all of them over to the new server, right? @burtenshaw

@nataliaElv I don't think the FeedbackDataset need to be moved. The underlying server logic is the same but the way of interacting with the server through the new SDK as API is different.

@davidberenstein1957
Copy link
Member

We may also want to discuss what to do with the migration of predictions to suggestions in multi-label text classification datasets. If I'm not mistaken, with the current code all labels would be selected as suggestions (as in legacy datasets you can have a prediction for each label). Maybe we can add a threshold that the user can modify so only labels where score > threshold are added to the suggestions?

@nataliaElv Good catch. I think we need to provide a warning and on let the user choose to add the score processing mechanism, which might be as simple as choosing the highest or lowest one.

@burtenshaw
Copy link
Contributor Author

TextClassification -> TextField ("name=text") | MultiLabel|LabelQuestion ("name=label") + metadata configs TokenClassification -> TextField ("name=text") | SpanQuestion ("name=span") + metadata configs Text2Text -> TextField ("name=text") | TextQuestion ("name=text") + metadata configs

This is a nice suggestion but I think it depends on use, and the metrics are declining on legacy datasets.

Personally, I would implement this based on community engagement with this guide.

@burtenshaw
Copy link
Contributor Author

The document says that Feedback datasets do not need to be migrated, but if you have a server with legacy+feedback datasets I assume that one would like to move all of them over to the new server, right? @burtenshaw

@nataliaElv I don't think the FeedbackDataset need to be moved. The underlying server logic is the same but the way of interacting with the server through the new SDK as API is different.

@nataliaElv Is on to something here. The guide proposes that the v2 server is 'their new server'. So from a user persepctive, they would want their datasets migrated. I will add the a snippet that shows them how to client.datasets on all other datasets.

@burtenshaw
Copy link
Contributor Author

We may also want to discuss what to do with the migration of predictions to suggestions in multi-label text classification datasets. If I'm not mistaken, with the current code all labels would be selected as suggestions (as in legacy datasets you can have a prediction for each label). Maybe we can add a threshold that the user can modify so only labels where score > threshold are added to the suggestions?

@nataliaElv Good catch. I think we need to provide a warning and on let the user choose to add the score processing mechanism, which might be as simple as choosing the highest or lowest one.

I would add this to the tab with multilabel text classification as a note box.

@frascuchon frascuchon force-pushed the docs/migration-guide-in-docs-format branch from 4452759 to fd979dd Compare June 13, 2024 10:45
@frascuchon frascuchon changed the base branch from docs/migration-notebook to feat/v2.0.0 June 13, 2024 10:45
Base automatically changed from feat/v2.0.0 to develop June 19, 2024 13:46
Copy link

codecov bot commented Jun 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.93%. Comparing base (8e9f42b) to head (9c0768e).
Report is 39 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #5002       +/-   ##
============================================
+ Coverage    61.13%   91.93%   +30.80%     
============================================
  Files          328      135      -193     
  Lines        17670     5818    -11852     
============================================
- Hits         10802     5349     -5453     
+ Misses        6868      469     -6399     
Flag Coverage Δ
argilla ?
argilla-server 91.93% <ø> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 261 to 286
## Migrating Feedback Datasets on your Argilla 1.* server

As mentioned above, `FeedbackDataset`'s are compatible with Argilla V2 and do not need to be reformatted. However, you may want to migrate your feedback datasets to the new server so that you can deprecate your Argilla 1.* server. Here is a guide on how to migrate your feedback datasets:

```python
import argilla as rg

# Initialize the API with an Argilla server less than 2.0
old_client = rg.Argilla(old_server_api_url, old_server_api_key)
new_client = rg.Argilla(new_server_api_url, new_server_api_key)

dataset_name = "feedback-dataset"
old_dataset = old_client.datasets(name=dataset_name)
new_dataset = new_client.datasets.add(old_dataset)

# Load the records from the old server
new_dataset.records.log(
old_dataset.records(
with_responses=True, # (1)
with_suggestions=True,
with_vectors=True,
)
)
```

1. The `with_responses`, `with_suggestions`, `with_vectors`, and `with_metadata` flags are used to load the records with the responses, suggestions, vectors, and metadata respectively.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scenario has several problems.

  1. We cannot share resources between clients, or even create new resources from other resources. Those actions will modify the internal state of the original resource, so old_dataset settings will change when the new_dataset is created
  2. Copy responses from one server to another requires a revision of response users. User IDs won't exist in the target server so we should provide a user mapping in the same way we do for legacy datasets.

Not sure how common would be this scenario. But I propose to skip this for now and think a bit more about how the SDK should work for this. cc @burtenshaw

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, suggestions are linked to questions by a question ID. When we do:

new_dataset.records.log(
    list(old_dataset.records(with_responses=False, with_suggestions=True, with_vectors=True)),
)

an error will be raised since the error contains wrong question info for the provided dataset.

@frascuchon frascuchon merged commit 39779c1 into develop Jun 20, 2024
5 of 7 checks passed
@frascuchon frascuchon deleted the docs/migration-guide-in-docs-format branch June 20, 2024 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants