Releases: argilla-io/argilla
v2.4.1
This release includes some bugfixes:
- Fixed redirection problems after users sign in using HF OAuth. (#5635)
- Fixed highlighting of the searched text in text, span and chat fields. (#5678)
- Fixed validation for the rating question when creating a dataset. (#5670)
- Fixed question name based on question type when creating a dataset. (#5670)
- Fixed error so that the _touch_dataset_last_activity_at function no longer updates the dataset's updated_at column. (#5656)
Full Changelog: v2.4.0...v2.4.1
v2.4.0
🔆 Release highlights
Import Hub datasets from the UI
import_hub_dataset.mp4
In this release, we’ve focused our efforts on bringing you a new feature to import datasets from the Hugging Face Hub directly within our UI, making it easier and faster to get started with your AI projects.
To get started, click on the “Import dataset from Hugging Face” button and paste the repo id of the dataset you want to use. Argilla will process the columns of the dataset and map them to Fields or Questions. Then, you can add more questions or remove any unnecessary fields by selecting the “No mapping” option. All the changes you make will be automatically reflected in the preview.
Once you’re happy with the result, simply provide a name for your dataset, select a workspace and (if applicable) a split. Then, Argilla will start importing the dataset.
Note
If your dataset has more than 10k records, Argilla will only import the first 10k at this stage. You can import the rest of the dataset using the Argilla SDK: simply click on the “Import data” button in the dataset and use the code snippet provided (see the sketch after this note).
If you want to make extra changes, like customizing the titles of your fields and questions, don’t worry, you can always go back to the Dataset Settings page after the dataset has been created.
Learn more about this new feature in our docs.
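If you prefer to script the import, a minimal sketch using the SDK's from_hub method could look like this; the repo id and credentials are placeholders, and the exact snippet shown in the UI may differ:
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

# Pull the full dataset (all records) from the Hub; replace the repo id with your own.
dataset = rg.Dataset.from_hub("<hub_repo_id>")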
Deploy an Argilla Space directly from the SDK
If you're working from the SDK and don't want to leave it to start your Argilla server, you can start an Argilla deployment on Spaces with a single line of code:
import argilla as rg
client = rg.Argilla.deploy_on_spaces(api_key="12345678")
Learn more in our docs.
Changelog v2.4.0
- Enhancement/improve-error-messaging-for-role-forbidden by @burtenshaw in #5554
- refactor: add DatasetPublishValidator class by @jfcalvo in #5568
- feat: set CREATOR_USER_ID to avoid difficulties with creation in orga… by @davidberenstein1957 in #5556
- [Refactor] remove name validations for dataset workspaces and usernames by @frascuchon in #5575
- fix: SPACES_CREATOR_USER_ID -> SPACE_CREATOR_USER_ID by @davidberenstein1957 in #5590
- [FIX] Prevent duplicated field text by @leiyre in #5592
- feat: Add basic support to bool features by @frascuchon in #5576
- feat: Add support to other than str values for terms metadata properties by @frascuchon in #5594
- [BUGFIX] argilla server: parse fields for record schemas by @frascuchon in #5600
- correct phrase on docs: "a recod question" -> "a question" by @HeAndres in #5599
- docs: update filter_dataset.md by @eltociear in #5571
- feat: 5108 feature add method to deploy on spaces through huggingface hub by @davidberenstein1957 in #5547
- docs: add quickstart update for deploy on spaces by @davidberenstein1957 in #5550
- Typo: missing comma by @ACMCMC in #5565
- Typo fix by @ACMCMC in #5566
- Fix typo by @ACMCMC in #5567
- [REFACTOR] argilla server: moving all record validators by @frascuchon in #5603
- [BUGFIX] argilla server: Prevent convert ChatFieldValue objects by @frascuchon in #5605
- Introducing Argilla Guru on Gurubase.io by @kursataktas in #5608
- [PERF][IMPROVEMENT] argilla server: improve computation for dataset progress and metrics by @frascuchon in #5618
- [PERF] argilla server: Reduce general transaction time by @frascuchon in #5609
- fix: Prevent compute metrics for draft datasets by @frascuchon in #5624
- Refine German translations and update non-localized UI elements by @paulbauriegel in #5632
- [BUGFIX] Catch None in image feature columns by @burtenshaw in #5626
- feat: added support for with_vectors with query filter in sdk by @bharath97-git in #5638
- perf: Using search engine to compute the total number of records for user metrics by @frascuchon in #5641
- [IMPROVEMENT] feat(helm): add support for default storage class in PVCs by @dme86 in #5628
- Feature - Improve Accessibility for Screenreaders by @paulbauriegel in #5634
- [FEATURE-BRANCH] Argilla direct import from Hub by @jfcalvo in #5572
- fix: remove unnecesary exposed ports for Argilla Docker compose file by @jfcalvo in #5644
- Dataset creation feature final QA by @leiyre in #5646
- [CI] argilla frontend: Remove invalid workflow permissions by @frascuchon in #5647
- [CI] Configure workflow permissions by @frascuchon in #5648
- chore: update changelogs for release 2.4.0 by @jfcalvo in #5650
- chore: small improvement installing dependencies for HF Spaces Dockerfile by @jfcalvo in #5651
- fix: skip helmlint pre-commit hook on CI because helm command is not available by @jfcalvo in #5654
- Import from hub docs by @nataliaElv in #5631
- [RELEASE] 2.4.0 by @frascuchon in #5643
New Contributors
- @HeAndres made their first contribution in #5599
- @ACMCMC made their first contribution in #5565
- @kursataktas made their first contribution in #5608
- @bharath97-git made their first contribution in #5638
- @dme86 made their first contribution in #5628
Full Changelog: v2.3.1...v2.4.0
v2.3.1
v2.3.0
🌟 Release highlights
Custom Fields: the most powerful way to build custom annotation tasks
We heard you. This new type of field gives you full control over how data is presented to annotators.
With custom fields, you can use your own CSS, HTML, and even JavaScript (welcome, interactive fields!). Moreover, you can populate your fields with custom structures like custom_field={"image1": ..., "image_2": ..., etc.}.
Here's an example:
Imagine you want to show two images and a prompt to your users.
With a custom field
With the new custom field, you can configure something like this:
And you can set this up with a few lines of code:
css_template = """
<style>
#container {
display: flex;
flex-direction: column;
font-family: Arial, sans-serif;
}
.prompt {
margin-bottom: 10px;
font-size: 16px;
line-height: 1.4;
color: #333;
background-color: #f8f8f8;
padding: 10px;
border-radius: 5px;
box-shadow: 0 1px 3px rgba(0,0,0,0.1);
}
.image-container {
display: flex;
gap: 10px;
}
.column {
flex: 1;
position: relative;
}
img {
max-width: 100%;
height: auto;
display: block;
}
.image-label {
position: absolute;
top: 10px;
right: 10px;
background-color: rgba(255, 255, 255, 0.7);
color: black;
padding: 5px 10px;
border-radius: 5px;
font-weight: bold;
}
</style>
"""
html_template = """
<div id="container">
<div class="prompt"><strong>Prompt:</strong> {{record.fields.images.prompt}}</div>
<div class="image-container">
<div class="column">
<img src="{{record.fields.images.image_1}}" />
<div class="image-label">Image 1</div>
</div>
<div class="column">
<img src="{{record.fields.images.image_2}}" />
<div class="image-label">Image 2</div>
</div>
</div>
</div>
"""
custom_field = rg.CustomField(
name="images",
template=css_template + html_template,
)
# and then log records like this
rg.Record(
    fields={
        # the nested dict matches the "images" custom field defined above
        "images": {
            "prompt": prompt,
            "image_1": schnell_uri,
            "image_2": dev_uri,
        }
    }
)
Before the custom field
Before this release, you were forced to use two ImageField's and a TextField, which would be displayed sequentially, limiting the ability to compare the images side by side with clear labels, prompt text, etc.
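As a rough sketch, that pre-2.3 setup would have been configured with standard fields along these lines (the field and question names here are illustrative):
import argilla as rg

# Two separate image fields plus a text field, displayed one after another in the UI.
settings = rg.Settings(
    fields=[
        rg.TextField(name="prompt"),
        rg.ImageField(name="image_1"),
        rg.ImageField(name="image_2"),
    ],
    questions=[
        rg.TextQuestion(name="comments"),
    ],
)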
How to get started with custom fields
Here we've shown a basic, presentation-oriented custom field, but you can set up anything you can think of, leveraging JS, HTML, and CSS. Imagination is the limit!
To get started check the docs: https://docs.argilla.io/v2.3/how_to_guides/custom_fields/
Other features
- Support for similarity search from the SDK and other search and filtering improvements (see the sketch after this list).
- New Helm chart deployment configuration.
- Support credentials from colab secrets.
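For the similarity search mentioned above, a minimal sketch could look like the following, assuming a dataset with a vector setting named my_vector and using the SDK's Query and Similar helpers; the vector name and values are placeholders, so check the search docs for the exact options:
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets("my_dataset")

# Retrieve the records whose "my_vector" vector is closest to the given query vector.
similar_records = dataset.records(
    query=rg.Query(
        similar=rg.Similar(name="my_vector", value=[0.1, 0.2, 0.3]),
    ),
)

for record in similar_records:
    print(record.id)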
Other changes and fixes
Changed
- Changed the repr method for SettingsProperties to display the details of all the properties in the Settings object. (#5380)
- Changed error messages when creating datasets with insufficient permissions. (#5540)
Fixed
- Fixed serialization of ChatField when collecting records from the hub and exporting to datasets. (#5554)
- Fixed error when creating default user with existing default workspace. (#5558)
- Fixed the deployment yaml used to create a new Argilla server in K8s. Added USERNAME and PASSWORD to the environment variables of the pod template. (#5434)
- Fixed autofill form on the sign-in page. (#5522)
- Supported copying to clipboard in non-secure contexts. (#5535)
New Contributors
Thanks to
- @bikash119 for Helm chart in #5512
Full Changelog: v2.2.2...v2.3.0
v2.2.2
What's Changed
This is a patch release with certain fixes to the SDK:
Fixed
- Fixed from_hub with unsupported column names. (#5524)
- Fixed from_hub with missing dataset subset configuration value. (#5524)
Changed
- Changed from_hub to only generate fields, not questions, for strings in the dataset. (#5524)
Full Changelog: v2.2.1...v2.2.2
v2.2.1
What's Changed
This is a patch release with certain fixes to the SDK:
- Fixed from_hub errors when column names contain uppercase letters. (#5523)
- Fixed from_hub errors when class feature values contain unlabelled values. (#5523)
- Fixed from_hub errors when loading cached datasets. (#5523)
Full Changelog: v2.2.0...v2.2.1
v2.2.0
🌟 Release highlights
Important
Argilla server 2.2.0 adds support for background jobs. These allow us to run tasks that would otherwise take a long time at request time. For this reason, we now rely on Redis and Python RQ workers.
So, to upgrade your Argilla instance to version 2.2.0, you need an available Redis server. See the Redis get-started documentation or the Argilla server configuration documentation for more information.
If you have deployed the Argilla server using docker-compose.yaml, you should download the docker-compose.yaml file again to pick up the latest changes for setting up Redis and the Argilla workers.
Workers are needed to process Argilla's background jobs. You can run Argilla workers with the following command:
python -m argilla_server worker
ChatField: working with text conversations in Argilla
chat_field.mp4
You can now work with text conversations natively in Argilla using the new ChatField. It is especially designed to make it easier to build datasets for conversational Large Language Models (LLMs), displaying conversational data in the form of a chat.
Here's how you can create a dataset with a ChatField:
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
settings = rg.Settings(
fields=[rg.ChatField(name="chat")],
questions=[...]
)
dataset = rg.Dataset(
name="chat_dataset",
settings=settings,
workspace="my_workspace",
client=client
)
dataset.create()
record = rg.Record(
fields={
"chat": [
{"role": "user", "content": "Hello World, how are you?"},
{"role": "assistant", "content": "I'm doing great, thank you!"}
]
}
)
dataset.records.log([record])
Read more about how to use this new field type here and here.
Adjust task distribution settings
You can now modify task distribution settings at any time, and Argilla will automatically recalculate the completed and pending records. When you update this setting, records will be removed from or added to the pending queues of your team accordingly.
You can make this change in the dataset settings page or using the SDK:
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets("my_dataset")
dataset.settings.distribution.min_submitted = 2
dataset.update()
Track team progress from the SDK
The Argilla SDK now provides a way to retrieve data on annotation progress. This feature allows you to monitor the number of completed and pending records in a dataset and also the number of responses made by each user:
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets("my_dataset")
progress = dataset.progress(with_users_distribution=True)
The expected output looks like this:
{
"total": 100,
"completed": 50,
"pending": 50,
"users": {
"user1": {
"completed": { "submitted": 10, "draft": 5, "discarded": 5},
"pending": { "submitted": 5, "draft": 10, "discarded": 10},
},
"user2": {
"completed": { "submitted": 20, "draft": 10, "discarded": 5},
"pending": { "submitted": 2, "draft": 25, "discarded": 0},
},
...
}
Read more about this feature here.
Automatic settings inference
When you import a dataset using the from_hub method, Argilla will automatically infer the settings, such as the fields and questions, based on the dataset Features. This will save you time and effort when working with datasets from the Hub.
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = rg.Dataset.from_hub("yahma/alpaca-cleaned")
Task templates
We've added pre-built templates for common dataset types, including text classification, ranking, and rating tasks. These templates provide a starting point for your dataset creation, with pre-configured settings. You can use these templates to get started quickly, without having to configure everything from scratch.
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
settings = rg.Settings.for_classification(labels=["positive", "negative"])
dataset = rg.Dataset(
name="my_dataset",
settings=settings,
client=client,
workspace="my_workspace",
)
dataset.create()
Read more about templates here.
Full Changelog: v2.1.0...v2.2.0
v2.1.0
🌟 Release highlights
Image Field
Argilla now supports multimodal datasets with the introduction of a native ImageField. This new type of field allows you to work seamlessly with image data, making it easier to annotate and curate datasets that combine text and images.
Here's an example of a dataset with an image field:
import argilla as rg
client = rg.Argilla(...)
settings = rg.Settings(
fields = [
rg.ImageField(name="image"),
rg.TextField(name="caption")
],
questions = [
rg.LabelQuestion(
name="good_or_bad",
title="Is the caption good or bad",
labels=["good", "bad"]
),
rg.TextQuestion(name="comments")
]
)
dataset = rg.Dataset(name="image_captions", settings=settings)
dataset.create()
record = rg.Record(
fields= {
"image": "https://docs.argilla.io/dev/assets/logo.svg",
"caption": "This is the Argilla logo"
}
)
dataset.records.log([record])
Dark Mode
Does Argilla seem too bright for you? You can now try our new Dark Mode: a theme designed to reduce eye strain and give the app a new, modern look. You can enable Dark Mode under "My Settings".
Spanish Translation
We're committed to making Argilla accessible to a broader audience. With the addition of Spanish translation, we're taking another step towards breaking language barriers and enabling more teams to collaborate on data curation projects.
There's nothing you need to do to enable it: Argilla will automatically switch to Spanish when your browser's main language is set to Spanish. ¡Disfrutadla!
Import any dataset from the Hugging Face Hub
The from_hub method just got a major boost! You can now pass in your own settings, allowing you to use this method with almost any dataset from the Hugging Face Hub, not just Argilla datasets.
Here's how easy it is to import a dataset from the Hub:
import argilla as rg
client = rg.Argilla(...)
settings = rg.Settings(
fields=[
rg.TextField(name="input"),
],
questions=[
rg.TextQuestion(name="output"),
],
)
dataset = rg.Dataset.from_hub(
repo_id="yahma/alpaca-cleaned",
settings=settings,
)
Other Notable Fixes and Improvements
- Adaptable text areas for TextQuestion's, providing a better user experience in the UI.
- Enhanced messaging for empty queues, keeping you informed when no records are available in the UI.
Full Changelog: v2.0.1...v2.1.0
v2.0.1
What's Changed
🧹 Patch release of bug fixes and minor documentation and messaging improvements. Enjoy your summer while we change the world in v2.1.0.
Fixed
- Fixed error when creating optional fields. (#5362)
- Fixed error creating integer and float metadata with visible_for_annotators. (#5364)
- Fixed error when logging records with suggestions or responses for non-existent questions. (#5396 by @maxserras)
- Fixed error from conflicts in testing suite when running tests in parallel. (#5349)
- Fixed error in response model when creating a response with a None value. (#5343)
Changed
- Changed from_hub method to raise an error when a dataset with the same name exists. (#5258)
- Changed log method when ingesting records with no known keys to raise a descriptive error. (#5356)
- Changed code snippets for adding new datasets. (#5395)
Added
- Added Google Analytics to the documentation site. (#5366)
- Added frontend skeletons to progress metrics to optimise load time and improve user experience. (#5391)
- Added documentation in methods in API references for the Python SDK. (#5400)
Full Changelog: v2.0.0...v2.0.1
v2.0.0
🔆 Release highlights
One Dataset to rule them all
The main difference between Argilla 1.x and Argilla 2.x is that we've converted the previous dataset types tailored for specific NLP tasks into a single highly-configurable Dataset class.
With the new Dataset, you can combine multiple fields and question types, so you can adapt the UI for your specific project. This offers you more flexibility, while making Argilla easier to learn and maintain.
Important
If you want to continue using your legacy datasets in Argilla 2.x, you will need to convert them into v2 Dataset's as explained in this migration guide. This includes: DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text.
FeedbackDataset's do not need to be converted as they are already compatible with the Argilla v2 format.
New SDK & documentation
We've redesigned our SDK to adapt it to the new single Dataset and Record classes and, most importantly, to improve the user and developer experience.
The main goal of the new design is to make the SDK easier to use and learn, making it simpler and faster to configure your dataset and get it up and running.
Here's an example of what creating a Dataset looks like:
import argilla as rg
from datasets import load_dataset
# log to the Argilla client
client = rg.Argilla(
api_url="<api_url>",
api_key="<api_key>"
# headers={"Authorization": f"Bearer {HF_TOKEN}"}
)
# configure dataset settings
settings = rg.Settings(
guidelines="Classify the reviews as positive or negative.",
fields=[
rg.TextField(
name="review",
title="Text from the review",
use_markdown=False,
),
],
questions=[
rg.LabelQuestion(
name="my_label",
title="In which category does this article fit?",
labels=["positive", "negative"],
)
],
)
# create the dataset in your Argilla instance
dataset = rg.Dataset(
name="my_first_dataset",
settings=settings,
client=client,
)
dataset.create()
# get some data from the hugging face hub and load the records
data = load_dataset("imdb", split="train[:100]").to_list()
dataset.records.log(records=data, mapping={"text": "review"})
To learn more about this SDK and how it works, check out our revamped documentation: https://argilla-io.github.io/argilla/latest
We built this new documentation site from scratch, applying the Diátaxis framework and UX principles, in the hope of making this version cleaner and the information easier to find.
New UI layout
We have also redesigned part of our UI for Argilla 2.0:
- We've redistributed the information in the Home page.
- Datasets don't have Tasks, but Questions.
- A clearer way to see your team's progress over each dataset.
- Annotation guidelines and your progress are now accessible at all times within the dataset page.
- Dataset pages also have a new flexible layout, so you can change the size of different panels and expand or collapse the guidelines and progress.
- SpanQuestion's are now supported in the bulk view.
Argilla2.mp4
Automatic task distribution
Argilla 2.0 also comes with an automated way to split the task of annotating a dataset among a team. Here's how it works in a nutshell:
- An owner or an admin can set the minimum number of submitted responses expected for each record.
- When a record reaches that threshold, its status changes to complete and it's automatically removed from the pending queue of all team members.
- A dataset is 100% complete when all records have the status complete.
By default, the minimum number of submitted answers is 1, but you can create a dataset with a different value:
settings = rg.Settings(
guidelines="These are some guidelines.",
fields=[
rg.TextField(
name="text",
),
],
questions=[
rg.LabelQuestion(
name="label",
labels=["label_1", "label_2", "label_3"]
),
],
distribution=rg.TaskDistribution(min_submitted=3)
)
You can also change the value of an existing dataset as long as it has no responses. You can do this from the General tab inside the Dataset Settings page in the UI or from the SDK:
import argilla as rg
client = rg.Argilla(...)
dataset = client.datasets("my_dataset")
dataset.settings.distribution.min_submitted = 4
dataset.update()
To learn more, check our guide on how to distribute the annotation task.
Easily deploy in Hugging Face Spaces
We've streamlined the deployment of an Argilla Space in the Hugging Face Hub. Now, there's no need to manage users and passwords. Follow these simple steps to create your Argilla Space:
- Select the Argilla template.
- Choose your hardware and persistent storage options (if you prefer options other than the recommended ones).
- If you are creating a Space inside an organization, enter your Hugging Face Hub username under username to get the owner role.
- Leave password empty if you'd like to use Hugging Face OAuth to sign in to Argilla.
- Select whether the Space will be public or private.
- Create Space! 🎉
Now you and your teammates can simply sign in to Argilla using Hugging Face OAuth!
Learn more about deploying Argilla in Hugging Face Spaces.
spaces_deploy.mp4
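Once the Space is running, you can point the SDK at it like any other Argilla server; a minimal sketch, assuming a public Space reachable at its direct URL (both values below are placeholders):
import argilla as rg

# A deployed Space is reachable at https://<owner>-<space-name>.hf.space
client = rg.Argilla(
    api_url="https://my-org-my-argilla.hf.space",
    api_key="<api_key>",
    # headers={"Authorization": f"Bearer {HF_TOKEN}"}  # only needed for private Spaces
)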
New Contributors
- @bikash119 made their first contribution in #5294
Full Changelog: v1.29.1...v2.0.0