diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0b5e21d9a7..6591a9537c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Enhancements
 - Raise an appropriate error when exporting a datumaro dataset if its subset name contains path separators.
   ()
+- Update docs for transform plugins
+  ()
 
 ### Bug fixes
diff --git a/docs/source/docs/command-reference/context_free/transform.md b/docs/source/docs/command-reference/context_free/transform.md
index 6ecbdbc384..7a63a90c55 100644
--- a/docs/source/docs/command-reference/context_free/transform.md
+++ b/docs/source/docs/command-reference/context_free/transform.md
@@ -101,8 +101,10 @@ Basic dataset item manipulations:
 - [`remove_images`](#remove_images) - Removes specific images
 - [`remove_annotations`](#remove_annotations) - Removes annotations
 - [`remove_attributes`](#remove_attributes) - Removes attributes
-- [`astype_annotations`](#astype_annotations) - Convert annotation type
-- [`pseudo_labeling`](#pseudo_labeling) - Generate pseudo labels for unlabeled data
+- [`astype_annotations`](#astype_annotations) - Transforms annotation types
+- [`pseudo_labeling`](#pseudo_labeling) - Generates pseudo labels for unlabeled data
+- [`correct`](#correct) - Corrects annotation types
+- [`clean`](#clean) - Removes noisy data from tabular datasets
 
 Subset manipulations:
 - [`random_split`](#random_split) - Splits dataset into subsets
@@ -827,19 +829,6 @@ bbox_values_decrement [-h]
 Optional arguments:
 - `-h`, `--help` (flag) - Show this help message and exit
 
-#### `correct`
-
-Correct the dataset from a validation report
-
-Usage:
-```console
-correct [-h] [-r REPORT_PATH]
-```
-
-Optional arguments:
-- `-h`, `--help` (flag) - Show this help message and exit
-- `-r`, `--reports` (str) - A validation report from a 'validate' CLI (default=validation_reports.json)
-
 #### `pseudo_labeling`
 
 Assigns pseudo-labels to items in a dataset based on their similarity to predefined labels. This class is useful for semi-supervised learning when dealing with missing or uncertain labels.
@@ -858,7 +847,6 @@ Attributes:
 Usage:
 ```console
 pseudo_labeling [-h] [--labels LABELS]
-```
 
 Optional arguments:
 - `-h`, `--help` (flag) - Show this help message and exit
@@ -869,3 +857,40 @@ Examples:
 ```console
 datum transform -t pseudo_labeling -- --labels 'label1,label2'
 ```
+
+#### `correct`
+
+Correct the dataset from a validation report
+
+Usage:
+```console
+correct [-h] [-r REPORT_PATH]
+```
+
+Optional arguments:
+- `-h`, `--help` (flag) - Show this help message and exit
+- `-r`, `--reports` (str) - A validation report from a 'validate' CLI (default=validation_reports.json)
+
+#### `clean`
+
+Refines and preprocesses media items in a dataset, focusing on string, numeric, and categorical data. This transform is designed to clean and improve the quality of the data, making it more suitable for analysis and modeling.
+
+The cleaning process includes:
+
+- String Data: Removes unnecessary characters using NLP techniques.
+- Numeric Data: Identifies and handles outliers and missing values.
+- Categorical Data: Cleans and refines categorical information.
+
+Usage:
+```console
+clean [-h]
+```
+
+Optional arguments:
+- `-h`, `--help` (flag) - Show this help message and exit
+
+Examples:
+- Clean and preprocess dataset items
+  ```console
+  datum transform -t clean
+  ```
diff --git a/src/datumaro/plugins/transforms.py b/src/datumaro/plugins/transforms.py
index 62b17288fb..de8bb308c5 100644
--- a/src/datumaro/plugins/transforms.py
+++ b/src/datumaro/plugins/transforms.py
@@ -1351,9 +1351,21 @@ def transform_item(self, item: DatasetItem):
 
 class Correct(Transform, CliPlugin):
     """
-    Correct the dataset from a validation report.
-    A user can should feed into validation_reports.json from validator to correct the dataset.
-    This helps to refine the dataset by rejecting undefined labels, missing annotations, and outliers.
+    This class provides functionality to correct and refine a dataset based on a validation report.|n
+    It processes a validation report (typically in JSON format) to identify and rectify various |n
+    dataset issues, such as undefined labels, missing annotations, outliers, empty labels/captions,|n
+    and unnecessary characters in captions. The correction process includes:|n
+    |n
+    - Adding missing labels and attributes.|n
+    - Removing or adjusting annotations with invalid or anomalous values.|n
+    - Filling in missing labels and captions with appropriate values.|n
+    - Removing unnecessary characters from text-based annotations like captions.|n
+    - Handling outliers by capping values within specified bounds.|n
+    - Updating dataset categories and annotations according to the corrections.|n
+    |n
+    The class is designed to be used as part of a command-line interface (CLI) and can be |n
+    configured with different validation reports. It integrates with the dataset extraction |n
+    process, ensuring that corrections are applied consistently across the dataset.|n
     """
 
     @classmethod
@@ -1749,13 +1761,15 @@ def __iter__(self):
 
 class AstypeAnnotations(ItemTransform):
     """
-    Enables the conversion of annotation types for the categories and individual items within a dataset.|n
+    Converts the types of annotations within a dataset based on a specified mapping.|n
     |n
-    Based on a specified mapping, it transforms the annotation types,|n
-    changing them to 'Label' if they are categorical, and to 'Caption' if they are of type string, float, or integer.|n
+    This transform changes annotations to 'Label' if they are categorical, and to 'Caption'
+    if they are of type string, float, or integer. This is particularly useful when working
+    with tabular data that needs to be converted into a format suitable for specific machine
+    learning tasks.|n
     |n
     Examples:|n
-    - Convert type of `title` annotation|n
+    - Converts the type of a `title` annotation:|n
     .. code-block::