Update docs for transform plugins #1599

Merged 3 commits on Sep 25, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Enhancements
- Raise an appropriate error when exporting a datumaro dataset if its subset name contains path separators.
(<https://github.com/openvinotoolkit/datumaro/pull/1615>)
- Update docs for transform plugins
(<https://github.com/openvinotoolkit/datumaro/pull/1599>)

### Bug fixes

57 changes: 41 additions & 16 deletions docs/source/docs/command-reference/context_free/transform.md
@@ -101,8 +101,10 @@ Basic dataset item manipulations:
- [`remove_images`](#remove_images) - Removes specific images
- [`remove_annotations`](#remove_annotations) - Removes annotations
- [`remove_attributes`](#remove_attributes) - Removes attributes
- [`astype_annotations`](#astype_annotations) - Convert annotation type
- [`pseudo_labeling`](#pseudo_labeling) - Generate pseudo labels for unlabeled data
- [`astype_annotations`](#astype_annotations) - Transforms annotation types
- [`pseudo_labeling`](#pseudo_labeling) - Generates pseudo labels for unlabeled data
- [`correct`](#correct) - Corrects annotation types
- [`clean`](#clean) - Removes noisy data from tabular datasets

Subset manipulations:
- [`random_split`](#random_split) - Splits dataset into subsets
@@ -827,19 +829,6 @@ bbox_values_decrement [-h]
Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit

#### `correct`

Correct the dataset from a validation report

Usage:
```console
correct [-h] [-r REPORT_PATH]
```

Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit
- `-r`, `--reports` (str) - A validation report from a 'validate' CLI (default=validation_reports.json)

#### `pseudo_labeling`

Assigns pseudo-labels to items in a dataset based on their similarity to predefined labels. This class is useful for semi-supervised learning when dealing with missing or uncertain labels.
@@ -858,7 +847,6 @@ Attributes:
Usage:
```console
pseudo_labeling [-h] [--labels LABELS]
```

Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit
@@ -869,3 +857,40 @@ Examples:
```console
datum transform -t pseudo_labeling -- --labels 'label1,label2'
```

#### `correct`

Correct the dataset from a validation report

Usage:
```console
correct [-h] [-r REPORT_PATH]
```

Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit
- `-r`, `--reports` (str) - A validation report from a 'validate' CLI (default=validation_reports.json)
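
As an illustration of how a validation report can drive the correction, here is a minimal Python sketch that groups report entries by anomaly type before any fix is applied. The report layout (`validation_reports`, `anomaly_type`, `item_id`, `subset`) and the helper `group_by_anomaly` are assumptions made for this example; they are not taken from the `correct` implementation, and the file written by `validate` may be shaped differently.

```python
import json
from collections import defaultdict

# Hypothetical report content; the real schema written by `validate` may differ.
SAMPLE_REPORT = """
{
  "validation_reports": [
    {"anomaly_type": "UndefinedLabel", "item_id": "img_001", "subset": "train"},
    {"anomaly_type": "MissingAnnotation", "item_id": "img_002", "subset": "train"},
    {"anomaly_type": "UndefinedLabel", "item_id": "img_003", "subset": "val"}
  ]
}
"""


def group_by_anomaly(report: dict) -> dict:
    """Map each anomaly type to the (subset, item_id) pairs it was reported for."""
    grouped = defaultdict(list)
    for entry in report.get("validation_reports", []):
        grouped[entry["anomaly_type"]].append((entry["subset"], entry["item_id"]))
    return dict(grouped)


report = json.loads(SAMPLE_REPORT)
grouped = group_by_anomaly(report)
for anomaly, items in sorted(grouped.items()):
    print(f"{anomaly}: {len(items)} item(s)")
```

Grouping first makes it straightforward to decide, per anomaly type, whether the affected items should be relabeled, dropped, or left alone.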

#### `clean`

Refines and preprocesses media items in a dataset, focusing on string, numeric, and categorical data. This transform is designed to clean and improve the quality of the data, making it more suitable for analysis and modeling.

The cleaning process includes:

- String Data: Removes unnecessary characters using NLP techniques.
- Numeric Data: Identifies and handles outliers and missing values.
- Categorical Data: Cleans and refines categorical information.

Usage:
```console
clean [-h]
```

Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit

Examples:
- Clean and preprocess dataset items
```console
datum transform -t clean
```
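
The description above does not state how outliers and missing values are handled for numeric data. The sketch below assumes one common approach, median imputation followed by capping at 1.5 times the interquartile range. `clean_numeric` is a hypothetical helper written for this example and is not part of Datumaro.

```python
import statistics
from typing import List, Optional


def clean_numeric(values: List[Optional[float]]) -> List[float]:
    """Fill missing entries with the median, then cap values to the IQR fences."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no observed values")

    median = statistics.median(present)
    if len(present) >= 2:
        # Quartiles of the observed values (default 'exclusive' method).
        q1, _, q3 = statistics.quantiles(present, n=4)
    else:
        q1 = q3 = median

    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Impute first, then clamp every value into [low, high].
    return [min(max(median if v is None else v, low), high) for v in values]


column = [10.0, 11.0, 12.0, None, 13.0, 14.0, 12.0, 11.0, 13.0, 500.0]
cleaned = clean_numeric(column)
print(cleaned)
```

Capping keeps the row count unchanged, which matters for tabular data where dropping a row would also discard its other columns. Whether the `clean` transform caps or removes outliers is not specified here.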
28 changes: 21 additions & 7 deletions src/datumaro/plugins/transforms.py
@@ -1351,9 +1351,21 @@ def transform_item(self, item: DatasetItem):

class Correct(Transform, CliPlugin):
"""
Correct the dataset from a validation report.
A user should feed in validation_reports.json from the validator to correct the dataset.
This helps to refine the dataset by rejecting undefined labels, missing annotations, and outliers.
This class provides functionality to correct and refine a dataset based on a validation report.|n
It processes a validation report (typically in JSON format) to identify and rectify various |n
dataset issues, such as undefined labels, missing annotations, outliers, empty labels/captions,|n
and unnecessary characters in captions. The correction process includes:|n
|n
- Adding missing labels and attributes.|n
- Removing or adjusting annotations with invalid or anomalous values.|n
- Filling in missing labels and captions with appropriate values.|n
- Removing unnecessary characters from text-based annotations like captions.|n
- Handling outliers by capping values within specified bounds.|n
- Updating dataset categories and annotations according to the corrections.|n
|n
The class is designed to be used as part of a command-line interface (CLI) and can be |n
configured with different validation reports. It integrates with the dataset extraction |n
process, ensuring that corrections are applied consistently across the dataset.|n
"""

@classmethod
@@ -1749,13 +1761,15 @@ def __iter__(self):

class AstypeAnnotations(ItemTransform):
"""
Enables the conversion of annotation types for the categories and individual items within a dataset.|n
Converts the types of annotations within a dataset based on a specified mapping.|n
|n
Based on a specified mapping, it transforms the annotation types,|n
changing them to 'Label' if they are categorical, and to 'Caption' if they are of type string, float, or integer.|n
This transform changes annotations to 'Label' if they are categorical, and to 'Caption'
if they are of type string, float, or integer. This is particularly useful when working
with tabular data that needs to be converted into a format suitable for specific machine
learning tasks.|n
|n
Examples:|n
- Convert type of `title` annotation|n
- Converts the type of a `title` annotation:|n

.. code-block::
