Adding support for dynamic embedded document field schemas #1825

Merged
brimoor merged 109 commits into develop from add-dynamic-fields1
Nov 7, 2022

Conversation

brimoor

@brimoor brimoor commented Jun 6, 2022

Change log

  • Added support for declaring dynamic embedded document fields on the dataset's schema via add_sample_field() and add_frame_field()
  • Added support for selecting/excluding embedded document fields via select_fields() and exclude_fields()
  • Added a dynamic=True flag for dataset factory methods that automatically adds every dynamic embedded document attribute encountered to the dataset's schema
  • Added a schema() aggregation that can be used to compute the observed type(s) of arbitrarily nested embedded documents
  • Added get_dynamic_field_schema() and add_dynamic_sample_fields() methods for automatically detecting and declaring dynamic sample fields
  • Added get_dynamic_frame_field_schema() and add_dynamic_frame_fields() methods for detecting and declaring dynamic frame fields
  • Added flat=True option to get_field_schema() and get_frame_field_schema() methods that returns all embedded document fields as top-level keys

Notes

  • The only default behavior that this PR changes is that evaluate_detections() will automatically add the dynamic attributes that it populates to the dataset's schema
  • Dynamic attributes are not declared by default by add_samples(), from_dir(), etc.
  • As a result, there is no decrease in performance in the default case

Example usage

Previously undeclared dynamic attributes can now be declared:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")
fo.pprint(dataset.get_dynamic_field_schema())

# Declare dynamic attributes
dataset.add_dynamic_sample_fields()
fo.pprint(dataset.get_dynamic_field_schema())

# Verify that they exist in the dataset's schema
fo.pprint(dataset.get_field_schema(flat=True))

# Dynamic attributes are available in the App for filtering
session = fo.launch_app(dataset)

# Dynamic attributes are carried over to patches views too
session.view = dataset.to_patches("ground_truth")


Detection evaluation automatically declares the dynamic attributes that it populates:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

dataset.evaluate_detections("predictions", gt_field="ground_truth", eval_key="eval")
fo.pprint(dataset.get_field_schema(flat=True))

session = fo.launch_app(dataset)

# Dynamic attributes can be excluded
# This syntax selects only the default fields on the detections
session.view = dataset.select_fields(
    ["predictions.detections.label", "ground_truth.detections.label"]
)

Design documentation

Any field(s) of your FiftyOne datasets that contain DynamicEmbeddedDocument values can have arbitrary custom attributes added to their instances.

For example, all Label and Metadata classes are dynamic, so you can add custom attributes to them as follows:

# Provide some default attributes
label = fo.Classification(label="cat", confidence=0.98)

# Add custom attributes
label["int"] = 5
label["float"] = 51.0
label["list"] = [1, 2, 3]
label["bool"] = True
label["dict"] = {"key": ["list", "of", "values"]}

By default, dynamic attributes are not included in a dataset's schema, which means that these attributes may contain arbitrary heterogeneous values across the dataset's samples.

However, FiftyOne provides methods that you can use to formally declare custom dynamic attributes, which allows you to enforce type constraints, filter by these custom attributes in the App, and more.

You can use get_dynamic_field_schema() to detect the names and type(s) of any undeclared dynamic embedded document attributes on a dataset:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

print(dataset.get_dynamic_field_schema())
{
    'ground_truth.detections.iscrowd': <fiftyone.core.fields.FloatField>,
    'ground_truth.detections.area': <fiftyone.core.fields.FloatField>,
}

You can then use add_sample_field() to declare a specific dynamic embedded document attribute:

dataset.add_sample_field("ground_truth.detections.iscrowd", fo.FloatField)

or you can use the add_dynamic_sample_fields() method to declare all dynamic embedded document attribute(s) that contain values of a single type:

dataset.add_dynamic_sample_fields()

Pass the add_mixed=True option to add_dynamic_sample_fields() if you wish to declare all dynamic attributes that contain mixed values using a generic Field type.
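
The distinction between single-type and mixed-type attributes can be illustrated with a small standalone sketch. This is not FiftyOne's implementation; it uses plain dicts in place of samples, and the helper names observed_types() and split_by_consistency() are hypothetical:

```python
from collections import defaultdict


def observed_types(samples):
    """Map each dotted attribute path to the set of Python type names seen."""
    types = defaultdict(set)

    def visit(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        elif value is not None:
            types[path].add(type(value).__name__)

    for sample in samples:
        visit(sample, "")

    return dict(types)


def split_by_consistency(types):
    """Separate paths with exactly one observed type from paths with several."""
    single = {p: next(iter(t)) for p, t in types.items() if len(t) == 1}
    mixed = {p: sorted(t) for p, t in types.items() if len(t) > 1}
    return single, mixed


# Plain dicts standing in for samples (illustrative data only)
samples = [
    {"ground_truth": {"label": "cat", "mood": "surly", "age": "bad-value"}},
    {"ground_truth": {"label": "dog", "mood": "happy", "age": 5}},
]

single, mixed = split_by_consistency(observed_types(samples))
print(single)
print(mixed)
```

In this toy model, the paths in single correspond to attributes that could safely receive a concrete type, while the paths in mixed are the ones that would need a generic type.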

You can provide the optional flat=True option to get_field_schema() to retrieve a flattened version of a dataset's schema that includes all embedded document attributes as top-level keys:

print(dataset.get_field_schema(flat=True))
{
    'id': <fiftyone.core.fields.ObjectIdField>,
    'filepath': <fiftyone.core.fields.StringField>,
    'tags': <fiftyone.core.fields.ListField>,
    'metadata': <fiftyone.core.fields.EmbeddedDocumentField>,
    'metadata.size_bytes': <fiftyone.core.fields.IntField>,
    'metadata.mime_type': <fiftyone.core.fields.StringField>,
    'metadata.width': <fiftyone.core.fields.IntField>,
    'metadata.height': <fiftyone.core.fields.IntField>,
    'metadata.num_channels': <fiftyone.core.fields.IntField>,
    'ground_truth': <fiftyone.core.fields.EmbeddedDocumentField>,
    'ground_truth.detections': <fiftyone.core.fields.ListField>,
    'ground_truth.detections.id': <fiftyone.core.fields.ObjectIdField>,
    'ground_truth.detections.tags': <fiftyone.core.fields.ListField>,
    ...
    'ground_truth.detections.iscrowd': <fiftyone.core.fields.FloatField>,
    'ground_truth.detections.area': <fiftyone.core.fields.FloatField>,
    ...
}
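
To make the flattened layout concrete, here is a minimal sketch of how a nested schema can be turned into dotted top-level keys. It is an illustration only, not FiftyOne's code: the nested dict layout with "type" and "fields" keys and the flatten_schema() helper are assumptions made for this example:

```python
def flatten_schema(schema, prefix=""):
    """Return a dict whose keys are dotted paths to every nested field."""
    flat = {}
    for name, spec in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        flat[path] = spec["type"]
        # Recurse into embedded fields, if any
        flat.update(flatten_schema(spec.get("fields", {}), path))
    return flat


# Hypothetical nested schema, with type names as plain strings
nested = {
    "filepath": {"type": "StringField"},
    "ground_truth": {
        "type": "EmbeddedDocumentField",
        "fields": {
            "detections": {
                "type": "ListField",
                "fields": {
                    "label": {"type": "StringField"},
                    "iscrowd": {"type": "FloatField"},
                },
            },
        },
    },
}

flat = flatten_schema(nested)
for path, field_type in flat.items():
    print(path, field_type)
```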

By default, dynamic attributes are not declared on a dataset's schema when samples are added to it:

import fiftyone as fo

sample = fo.Sample(
    filepath="/path/to/image.jpg",
    ground_truth=fo.Detections(
        detections=[
            fo.Detection(
                label="cat",
                bounding_box=[0.1, 0.1, 0.4, 0.4],
                mood="surly",
            ),
            fo.Detection(
                label="dog",
                bounding_box=[0.5, 0.5, 0.4, 0.4],
                mood="happy",
            )
        ]
    )
)

dataset = fo.Dataset()
dataset.add_sample(sample)

schema = dataset.get_field_schema(flat=True)

assert "ground_truth.detections.mood" not in schema

However, methods such as add_samples() and from_dir() accept an optional dynamic=True argument that automatically declares any dynamic embedded document attributes encountered while importing data:

dataset = fo.Dataset()

dataset.add_sample(sample, dynamic=True)
schema = dataset.get_field_schema(flat=True)

assert "ground_truth.detections.mood" in schema

Note that, when declaring dynamic attributes on non-empty datasets, you must ensure that the attribute's type is consistent with any existing values in that field, e.g., by first running get_dynamic_field_schema() to check the existing type(s). Methods like add_sample_field() and add_samples(..., dynamic=True) do not validate the types of newly declared fields against existing field values:

import fiftyone as fo

sample1 = fo.Sample(
    filepath="/path/to/image1.jpg",
    ground_truth=fo.Classification(
        label="cat",
        mood="surly",
        age="bad-value",
    ),
)

sample2 = fo.Sample(
    filepath="/path/to/image2.jpg",
    ground_truth=fo.Classification(
        label="dog",
        mood="happy",
        age=5,
    ),
)

dataset = fo.Dataset()

dataset.add_sample(sample1)

# Either of these is problematic
dataset.add_sample(sample2, dynamic=True)
dataset.add_sample_field("ground_truth.age", fo.IntField)

sample1.reload()  # ValidationError: bad-value could not be converted to int

If you declare a dynamic attribute with a type that is not compatible with existing values in that field, you will need to remove that field from the dataset's schema using remove_dynamic_sample_field() in order for the dataset to be usable again:

# Removes dynamic field from dataset's schema without deleting the values
dataset.remove_dynamic_sample_field("ground_truth.age")
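
The kind of pre-declaration check recommended above can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not part of FiftyOne: samples are modeled as nested dicts and incompatible_values() is a hypothetical helper:

```python
def incompatible_values(docs, path, expected_type):
    """Return the values at dotted `path` that are not of `expected_type`."""
    bad = []
    for doc in docs:
        value = doc
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break  # path is missing on this doc; nothing to check
        if value is not None and not isinstance(value, expected_type):
            bad.append(value)
    return bad


# Plain dicts standing in for existing samples (illustrative data only)
docs = [
    {"ground_truth": {"label": "cat", "mood": "surly", "age": "bad-value"}},
    {"ground_truth": {"label": "dog", "mood": "happy", "age": 5}},
    {"ground_truth": {"label": "bird"}},
]

bad = incompatible_values(docs, "ground_truth.age", int)
if bad:
    print("Do not declare as int; found:", bad)
```

Running a check like this before declaring a type avoids the situation where a declared field conflicts with values already stored.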

You can use select_fields() and exclude_fields() to create views that select/exclude specific dynamic attributes from your dataset and its schema:

dataset.add_sample_field("ground_truth.age", fo.Field)
sample = dataset.first()

assert "ground_truth.age" in dataset.get_field_schema(flat=True)
assert sample.ground_truth.has_field("age")

# Omits the `age` attribute from the `ground_truth` field
view = dataset.exclude_fields("ground_truth.age")
sample = view.first()

assert "ground_truth.age" not in view.get_field_schema(flat=True)
assert not sample.ground_truth.has_field("age")

# Only include `mood` (and default) attributes of the `ground_truth` field
view = dataset.select_fields("ground_truth.mood")
sample = view.first()

assert "ground_truth.age" not in view.get_field_schema(flat=True)
assert not sample.ground_truth.has_field("age")
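
The select/exclude semantics shown above can be mimicked on plain dicts to make the behavior explicit. This is only a sketch under assumptions: exclude_path() and select_attrs() are hypothetical helpers, and treating label as the sole default attribute is a simplification for the example:

```python
import copy


def exclude_path(doc, path):
    """Return a copy of `doc` with the attribute at dotted `path` removed."""
    result = copy.deepcopy(doc)
    *parents, leaf = path.split(".")
    node = result
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # path does not exist; nothing to remove
    node.pop(leaf, None)
    return result


def select_attrs(doc, field, keep, defaults=("label",)):
    """Return a copy of `doc` keeping only `keep` and default attributes of `field`."""
    result = copy.deepcopy(doc)
    allowed = set(keep) | set(defaults)
    result[field] = {k: v for k, v in result[field].items() if k in allowed}
    return result


# Plain dict standing in for a sample (illustrative data only)
doc = {
    "filepath": "/path/to/image2.jpg",
    "ground_truth": {"label": "dog", "mood": "happy", "age": 5},
}

excluded = exclude_path(doc, "ground_truth.age")
selected = select_attrs(doc, "ground_truth", ["mood"])
print(excluded["ground_truth"])
print(selected["ground_truth"])
```

Both helpers return copies, mirroring the way views leave the underlying dataset unchanged.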

@brimoor brimoor added the feature Work on a feature request label Jun 6, 2022
@brimoor brimoor requested a review from a team June 6, 2022 05:12
@brimoor brimoor self-assigned this Jun 6, 2022

@benjaminpkane benjaminpkane left a comment

LGTM

@brimoor brimoor merged commit ed202e6 into develop Nov 7, 2022
@brimoor brimoor deleted the add-dynamic-fields1 branch November 7, 2022 05:59