
Adding a dataset save context #1727

Merged 22 commits into develop Aug 23, 2022
Conversation

@j053y (Contributor) commented May 5, 2022

This PR adds a dataset "save context" that allows for aggregating sample updates into bulk save operations:

# Existing syntax: no batching
for sample in dataset:
    # Edit sample here
    sample.save()

# Syntax 1
for sample in dataset.iter_samples(autosave=True):
    # Edit sample here

# Syntax 2
with dataset.save_context() as context:
    for sample in dataset:
        # Edit sample here
        context.save(sample)

By default, updates are dynamically batched so that database writes occur approximately every 0.2 seconds (the same strategy used by add_samples()), but this can be customized via the optional batch_size kwarg: an integer is interpreted as a fixed number of samples per batch, while a float is interpreted as the target number of seconds between writes.
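To make the time-based batching behavior concrete, here's a minimal sketch of a batcher that flushes on a target interval. This is illustrative only, not FiftyOne's actual implementation; the class and attribute names are invented:

```python
import time


class TimedBatcher:
    """Illustrative sketch: accumulate updates and flush them as a bulk
    write whenever the target interval has elapsed. Hypothetical code,
    not FiftyOne's actual implementation."""

    def __init__(self, target_seconds=0.2):
        self.target_seconds = target_seconds
        self.flushed_batches = []  # stands in for bulk database writes
        self._pending = []
        self._last_flush = time.time()

    def save(self, update):
        self._pending.append(update)
        if time.time() - self._last_flush >= self.target_seconds:
            self.flush()

    def flush(self):
        if self._pending:
            # In FiftyOne, this would be a single bulk write to the database
            self.flushed_batches.append(self._pending)
            self._pending = []

        self._last_flush = time.time()
```

An integer batch_size would instead flush whenever the pending list reaches the threshold, rather than on a timer.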

Example usage

Here's the iter_samples(autosave=True) syntax:

import random as r
import string as s

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10", split="test")

def make_label():
    return "".join(r.choice(s.ascii_letters) for i in range(10))

# No save context
for sample in dataset.iter_samples(progress=True):
    sample.ground_truth.label = make_label()
    sample.save()

# Save in batches of 10
for sample in dataset.iter_samples(progress=True, autosave=True, batch_size=10):
    sample.ground_truth.label = make_label()

# Save every 0.5 seconds
for sample in dataset.iter_samples(progress=True, autosave=True, batch_size=0.5):
    sample.ground_truth.label = make_label()

And here's the save_context() syntax:

# No save context
for sample in dataset.iter_samples(progress=True):
    sample.ground_truth.label = make_label()
    sample.save()

# Save in batches of 10
with dataset.save_context(batch_size=10) as context:
    for sample in dataset.iter_samples(progress=True):
        sample.ground_truth.label = make_label()
        context.save(sample)

# Save every 0.5 seconds
with dataset.save_context(batch_size=0.5) as context:
    for sample in dataset.iter_samples(progress=True):
        sample.ground_truth.label = make_label()
        context.save(sample)

Benchmarking

I created a script to gather some time metrics around different batch sizes.

The script is long-running because it works with some larger datasets and runs each test multiple times. I have already run it and added the results to this Google Sheet.

If we need further speed increases, I suggest we create/optimize a way of computing Document diffs, but in the spirit of incremental progress and keeping tasks small, I think that work should be done elsewhere and not included in this PR.
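As a sketch of what such a diff could look like (purely hypothetical; the helper name and flat-dict document representation are invented for illustration), the idea would be to compare a document's current fields against its last-saved state and write back only what changed:

```python
def document_diff(saved, current):
    """Hypothetical sketch: compare two flat document dicts and return
    the fields to set and the fields to unset, so a save only needs to
    write what actually changed. Not FiftyOne's actual internals."""
    to_set = {k: v for k, v in current.items() if saved.get(k) != v}
    to_unset = [k for k in saved if k not in current]
    return to_set, to_unset


# For example, only "label" would need to be written back here:
# document_diff({"id": 1, "label": "cat"}, {"id": 1, "label": "dog"})
# returns ({"label": "dog"}, [])
```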

import statistics
import time
import uuid

import fiftyone as fo
import fiftyone.core.utils as fou

SAMPLE_COUNTS = [80, 400, 2000, 10000, 50000]
BATCH_SIZES = [1, 10, 25, 50, 100, 1000]
NUMBER_OF_TESTS_PER_BATCH = 10

results = {}
for sample_count in SAMPLE_COUNTS:
    dataset_name = f"samples:n={sample_count}"

    try:
        dataset = fo.load_dataset(dataset_name)
        print(f"Loaded DATASET[{dataset_name}]")
    except ValueError:
        dataset = fo.Dataset(dataset_name, persistent=True)
        print(f"Created DATASET[{dataset_name}]")
        print(f"Adding samples to DATASET[{dataset_name}]...")
        dataset.add_samples(
            [
                fo.Sample(filepath=f"/some/path/dummy{i}.png")
                for i in range(sample_count)
            ]
        )

    results[dataset_name] = {}
    for size in BATCH_SIZES:
        durations = []
        with fou.ProgressBar(
            total=NUMBER_OF_TESTS_PER_BATCH, iters_str="tests"
        ) as pb:
            print(f"Running tests where BATCH SIZE={size}...")

            for i in pb(range(NUMBER_OF_TESTS_PER_BATCH)):
                start = time.time()

                for sample in dataset.iter_samples(autosave=True, batch_size=size):
                    sample[
                        "somefield"
                    ] = f"{dataset_name};size={size};{uuid.uuid4()}"

                durations.append(round(time.time() - start, 4))

        results[dataset_name][size] = durations


for dataset_name, results_by_batch_size in results.items():
    print()
    print(f"DATASET[{dataset_name}] ".ljust(79, "="))
    for batch_size, durations in results_by_batch_size.items():
        print(f"batch_size={batch_size}:")
        print(f" - Durations : {', '.join([str(d) for d in durations])}")
        print(f" - Mean      : {round(statistics.mean(durations), 4)}")
        print(f" - Median    : {round(statistics.median(durations), 4)}")
        print(f" - Min       : {round(min(durations), 4)}")
        print(f" - Max       : {round(max(durations), 4)}")

@j053y added the "enhancement" label May 5, 2022
@j053y requested a review from a team May 5, 2022 15:28
@j053y self-assigned this May 5, 2022
@j053y linked an issue May 5, 2022 that may be closed by this pull request
@brimoor (Contributor) left a comment:

Good start here!

It occurs to me that #1724 should have been more (or less, really) specific: we'd like to support save contexts for DatasetViews too, not just Dataset.

As the examples below demonstrate, users are probably even more likely to be iterating over views than entire datasets:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Example 1
# Since I don't need to read any fields for this operation, I include `select_fields()` to optimize reads.
# This means I'm iterating over a view, but FiftyOne is designed to let me do that without having to care about the distinction
for sample in dataset.select_fields().iter_samples(progress=True, batch_size=10):
    sample["hello"] = "world"
    sample.save()

# Example 2
# Here I'm slicing, so again I'm working with a view
for sample in dataset[:100].iter_samples(progress=True, batch_size=10):
    sample["spam"] = "eggs"
    sample.save()

Also note that, when working with video samples, there are also Frame objects in play, which can be saved individually via frame.save(), at the sample level via sample.frames.save(), or by simply relying on sample.save() to capture all sample- and frame-level changes.

In other words, all of the following will result in the same dataset contents, and the save context would get maximal mileage if it captured Frame.save() and Frames.save() events as well:

import fiftyone as fo

sample = fo.Sample(filepath="video.mp4")
sample.frames[1] = fo.Frame()
sample.frames[2] = fo.Frame()

dataset = fo.Dataset()
dataset.add_sample(sample)

# Option 1: rely on `sample.save()` to capture all sample and frame-level edits
dataset1 = dataset.clone()
for sample in dataset1:
    sample["foo"] = "bar"
    for frame in sample.frames.values():
        frame["foo"] = "bar"

    sample.save()

# Option 2: call `frame.save()` on each frame individually
dataset2 = dataset.clone()
for sample in dataset2:
    for frame in sample.frames.values():
        frame["foo"] = "bar"
        frame.save()

# Option 3: call `sample.frames.save()` to save all frame edits
dataset3 = dataset.clone()
for sample in dataset3:
    for frame in sample.frames.values():
        frame["foo"] = "bar"
    
    sample.frames.save()

@brimoor (Contributor) commented May 13, 2022

I think some unit tests involving the new save context are called for here, to be sure that everything is behaving as expected (e.g., for both image and video datasets).

@brimoor (Contributor) commented May 13, 2022

Okay, one more comment: when adding a feature to the public API, please add documentation for it to the User Guide section of the docs. In this case, the relevant pages are probably:

@j053y force-pushed the 1724-fr-add-a-database-save-context branch from 01a4c7f to 0dfd0f4 May 19, 2022 18:29
@brimoor (Contributor) commented May 31, 2022

@j053y ready for another review pass here?

Convention: we use the "re-request review" option to prompt this.

@j053y requested a review from brimoor June 3, 2022 12:03
@j053y (Contributor, Author) commented Jun 3, 2022

I've taken a new approach to accomplish this: the iter_samples() method now accepts an autosave parameter, which causes any sample or frame edited while iterating over a dataset or view to be automatically saved (in batches).

import fiftyone as fo


DEFAULT_DATASET_NAME = "autosave-dataset"
VIDEO_DATASET_NAME = "autosave-dataset-video"


default_dataset = fo.Dataset(DEFAULT_DATASET_NAME)
default_dataset.add_samples(
    [fo.Sample("some/path/to/file") for _ in range(10)]
)

video_dataset = fo.Dataset(VIDEO_DATASET_NAME)
for _ in range(10):
    sample = fo.Sample(filepath="video.mp4")
    for i in range(1, 11):
        sample.frames[i] = fo.Frame()
    video_dataset.add_sample(sample)

key, value = "hello", "dataset"
for sample in default_dataset.iter_samples(autosave=True):
    sample[key] = value


key, value = "helloselect", "datasetviewselect"
for sample in default_dataset.select_fields().iter_samples(autosave=True):
    sample[key] = value


key, value = "helloindex", "datasetviewindex"
for sample in default_dataset[:100].iter_samples(autosave=True):
    sample[key] = value

key, sample_value, frame_value = "hellovideo", "sample", "frame"
for sample in video_dataset.iter_samples(autosave=True):
    sample[key] = sample_value
    for frame in sample.frames.values():
        frame[key] = frame_value

@brimoor changed the base branch from develop to bugfix/logging August 20, 2022 22:09
@brimoor changed the title from "1724 fr add a database save context" to "Adding a dataset save context" Aug 20, 2022
@brimoor mentioned this pull request Aug 22, 2022
@brimoor changed the base branch from bugfix/logging to develop August 23, 2022 15:22
@brimoor changed the base branch from develop to bugfix/logging August 23, 2022 15:23
@brimoor (Contributor) left a comment:

LGTM

Base automatically changed from bugfix/logging to develop August 23, 2022 20:59
@brimoor merged commit b542188 into develop Aug 23, 2022
@brimoor deleted the 1724-fr-add-a-database-save-context branch August 23, 2022 20:59
Labels: enhancement (Code enhancement)

Linked issue that may be closed by this pull request: [FR] Add a database save context

2 participants