Curate UltraFeedback dataset's overall_score #7

Open · dvsrepo opened this issue Nov 27, 2023 · 3 comments
dvsrepo commented Nov 27, 2023

Based on our curation efforts, we spotted a bug in the overall_score of the UltraFeedback AI critique scores. TL;DR: responses getting the lowest score (1 or less) end up with a high score (10, or 8.0, or 7.5, who knows!). Our initial work with notus shows that by using something other than the overall score, we can train a better model.

In this task, we want to really clean up the original dataset to make sure others build on an error-free dataset. I have myself curated a few hundred examples (sorting by chosen score = 10), and most of the responses getting a 10 are totally useless according to the rationale (the natural-language explanation).

The objective is as follows:

  1. Using this dataset, take the column best_overall_score_response, get the critique text, and run it through a very simple sentiment analysis (I suggest starting with TextBlob's because it's really fast and the rationales are very expressive when the response is really bad); see the sketch after this list.
  2. Add this sentiment score to the dataset in a new column, best_overall_score_response_critique_sentiment.
  3. Based on this new dataset, let's try to find those examples that get a high overall_score but a bad sentiment.
  4. Iterate as much as we can to really narrow down those problematic cases. I'd strongly suggest using the Argilla UI with sorting and filters to adjust quickly.
  5. Once we know the problematic cases, we have several choices; the best I can think of is reducing their overall_score (dividing by 10 :-) ) in the completions object.
  6. Now that we have a clean dataset, we can use it to experiment further (compare rating vs. critique, etc.) and, most importantly, share it with the community so people build on a clean version!
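
A minimal sketch of steps 1–2, assuming the dataset loads as a Hugging Face dataset and that the critique text sits inside the best_overall_score_response field (the repo id and field layout below are placeholders, not the actual ones):

from datasets import load_dataset
from textblob import TextBlob

ds = load_dataset("argilla/ultrafeedback-curated", split="train")  # placeholder repo id

def add_critique_sentiment(example):
    # the critique is assumed to live under the best_overall_score_response field
    critique = example["best_overall_score_response"]["critique"]
    # TextBlob polarity lies in [-1, 1]; scathing rationales fall well below 0
    example["best_overall_score_response_critique_sentiment"] = TextBlob(critique).sentiment.polarity
    return example

ds = ds.map(add_critique_sentiment)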

More details about the initial analysis are in the dataset readme.

Please keep us posted as you start and iterate!


plaguss commented Nov 29, 2023

Hi @dvsrepo! I have updated the dataset with the sentiment. I finally decided to use the distilbert-base-uncased-finetuned-sst-2-english model; it's more accurate, and it's easier to detect the bad sentiment with it. The result from the pipeline is transformed: the sentiment score is in the range [0, 1], and for the NEGATIVE label I set the final score to 1 - negative score.

Also, I uploaded the new dataset as ultrafeedback-curator-v2.

Some initial thoughts:

  • Filtering by metadata and going to record 27450, we see a question without an answer. In this case the sentiment is not clear; the records with sentiment around 0.5 deserve a closer look.
  • Filtering for score >= 10 and sentiment < 0.3 yields around 1.8K records; most of these should be updated (see the sketch below).
  • Filtering for score in [8, 10) and sentiment < 0.001: the answers are not perfect, but they seem mostly OK.
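
A hedged sketch of the second filter, assuming the curated dataset exposes best_overall_score and sentiment as flat numeric columns (the repo id and column names are assumptions):

from datasets import load_dataset

ds = load_dataset("plaguss/ultrafeedback-curator-v2", split="train")  # assumed repo id

# keep records whose overall score is suspiciously high while the critique reads negative
suspicious = ds.filter(lambda r: r["best_overall_score"] >= 10 and r["sentiment"] < 0.3)
print(len(suspicious))  # ~1.8K records according to the counts above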

I'll try to find some more rules and write them here before applying them and uploading the dataset again.


dvsrepo commented Nov 29, 2023

Cool @plaguss, how are you calculating the sentiment score in the NEGATIVE-label case (the 1 - score transform)? I think the model is not very reliable there, see: https://argilla-ultrafeedback-curator.hf.space/dataset/bcb66f4a-50ba-4707-a2ec-95c0bc6c0780/annotation-mode?_page=1&_status=pending&_metadata=sentiment%3A%7B%22ge%22%3A0.00019,%22le%22%3A0.99019%7D%2Bbest_overall_score%3A%7B%22ge%22%3A10,%22le%22%3A10%7D&_sort=metadata.sentiment%3Adesc which has a high score while the rationale is not highly positive.

I think using a DL model for the sentiment is a good idea; you could also try adding the label with the highest score as a metadata term.
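
For instance, a minimal sketch of keeping both the winning label and the transformed score (the metadata keys here are illustrative, not an established schema):

from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis", truncation=True, max_length=512)

result = sentiment_pipeline("The response ignores the instruction entirely.")[0]
# store both the winning label and the transformed score as record metadata
metadata = {
    "sentiment_label": result["label"],  # "POSITIVE" or "NEGATIVE"
    "sentiment": result["score"] if result["label"] == "POSITIVE" else 1 - result["score"],
}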


plaguss commented Nov 29, 2023

Oh, there seems to be an error with that one (and many of them...), I hadn't noticed that, sorry.

This is how the sentiment score is computed:

from transformers import pipeline

# distilbert-base-uncased-finetuned-sst-2-english is the default checkpoint for this task
tokenizer_kwargs = {"padding": True, "truncation": True, "max_length": 512}
sentiment_pipeline = pipeline("sentiment-analysis", **tokenizer_kwargs)

critique = "Your response didn't follow the instruction. You were supposed to act as a customer service chatbot for Wells Fargo, ask for customer information in JSON format, and then start the chat session with a customer. However, you provided no output. Try to follow the instruction more closely next time, ensuring you ask for the necessary information and address the customer appropriately. Remember, if you're unsure of an answer, it's okay to admit it. Honesty is crucial in customer service."
sentiment = sentiment_pipeline(critique)[0]
# sentiment
# {'label': 'POSITIVE', 'score': 0.9900312423706055}
sentiment_score = sentiment["score"] if sentiment["label"] == "POSITIVE" else 1 - sentiment["score"]

For the other case:

critique = """Your response didn't follow the instruction at all. The task was to generate a question from a given scientific fact. Instead, you asked if you could assist with other queries. This doesn't provide any value to the user and doesn't meet the requirements of the task. To improve, try to understand the fact and think about what question could be answered by that fact. For example, for the fact "Fertilization process results in the formation of a new cell with a full set of chromosomes.", a possible question could be "What is the result of the fertilization process?"."""
sentiment = sentiment_pipeline(critique)[0]
sentiment_score = sentiment["score"] if sentiment["label"] == "POSITIVE" else 1 - sentiment["score"]
# sentiment_score
# 0.00021225214004516602

I checked some of them and they made sense. These are the per-score correlations with this sentiment:
score: 5, records in the subset: 62670, correlation: 0.3528
score: 6, records in the subset: 61895, correlation: 0.3343
score: 7, records in the subset: 58044, correlation: 0.2457
score: 8, records in the subset: 42703, correlation: -0.0516
score: 9, records in the subset: 9964, correlation: -0.604
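
For reference, a hedged sketch of how per-score correlations like these could be computed, assuming the records sit in a pandas DataFrame with numeric overall_score and sentiment columns plus a second numeric rating column to correlate against (the file path and all column names are assumptions):

import pandas as pd

# one row per record; "overall_score", "sentiment", and "rating" are assumed columns
df = pd.read_parquet("ultrafeedback-curator-v2.parquet")  # assumed local export

for score, group in df.groupby("overall_score"):
    corr = group["sentiment"].corr(group["rating"])  # Pearson by default
    print(f"score: {score}, records in the subset: {len(group)}, correlation: {corr:.4f}")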
