
Adding a dataset save context #1727

Merged 22 commits into develop Aug 23, 2022
Conversation

@j053y (Contributor) commented May 5, 2022

This PR adds a dataset "save context" that allows for aggregating sample updates into bulk save operations:

# Existing syntax: no batching
for sample in dataset:
    # Edit sample here
    sample.save()

# Syntax 1
for sample in dataset.iter_samples(autosave=True):
    # Edit sample here

# Syntax 2
with dataset.save_context() as context:
    for sample in dataset:
        # Edit sample here
        context.save(sample)

By default, updates are dynamically batched so that database writes occur approximately every 0.2 seconds (the same strategy used by add_samples()), but this can be customized via the optional batch_size kwarg: an integer is interpreted as a fixed number of samples per batch, while a float is interpreted as the target number of seconds between writes.
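To make the time-based batching behavior concrete, here's a minimal sketch of a batcher that flushes on a target interval. This is illustrative only, not FiftyOne's actual implementation; the class and attribute names are invented:

```python
import time


class TimedBatcher:
    """Illustrative sketch: accumulate updates and flush them as a bulk
    write whenever the target interval has elapsed. Hypothetical code,
    not FiftyOne's actual implementation."""

    def __init__(self, target_seconds=0.2):
        self.target_seconds = target_seconds
        self.flushed_batches = []  # stands in for bulk database writes
        self._pending = []
        self._last_flush = time.time()

    def save(self, update):
        self._pending.append(update)
        if time.time() - self._last_flush >= self.target_seconds:
            self.flush()

    def flush(self):
        if self._pending:
            # In FiftyOne, this would be a single bulk write to the database
            self.flushed_batches.append(self._pending)
            self._pending = []

        self._last_flush = time.time()
```

An integer batch_size would instead flush whenever the pending list reaches the threshold, rather than on a timer.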

Example usage

Here's the iter_samples(autosave=True) syntax:

import random as r
import string as s

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10", split="test")

def make_label():
    return "".join(r.choice(s.ascii_letters) for i in range(10))

# No save context
for sample in dataset.iter_samples(progress=True):
    sample.ground_truth.label = make_label()
    sample.save()

# Save in batches of 10
for sample in dataset.iter_samples(progress=True, autosave=True, batch_size=10):
    sample.ground_truth.label = make_label()

# Save every 0.5 seconds
for sample in dataset.iter_samples(progress=True, autosave=True, batch_size=0.5):
    sample.ground_truth.label = make_label()

And here's the save_context() syntax:

# No save context
for sample in dataset.iter_samples(progress=True):
    sample.ground_truth.label = make_label()
    sample.save()

# Save in batches of 10
with dataset.save_context(batch_size=10) as context:
    for sample in dataset.iter_samples(progress=True):
        sample.ground_truth.label = make_label()
        context.save(sample)

# Save every 0.5 seconds
with dataset.save_context(batch_size=0.5) as context:
    for sample in dataset.iter_samples(progress=True):
        sample.ground_truth.label = make_label()
        context.save(sample)

Benchmarking

I created a script to gather some time metrics around different batch sizes.

The script is long-running because it works with some larger datasets and runs each test multiple times. I have already run it and added the results to this Google Sheet.

If we need further speed increases, I suggest we create/optimize a way of computing Document diffs, but in the spirit of incremental progress and keeping tasks small, I think that work should be done elsewhere and not included in this PR.
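As a sketch of what such a diff could look like (purely hypothetical; the helper name and flat-dict document representation are invented for illustration), the idea would be to compare a document's current fields against its last-saved state and write back only what changed:

```python
def document_diff(saved, current):
    """Hypothetical sketch: compare two flat document dicts and return
    the fields to set and the fields to unset, so a save only needs to
    write what actually changed. Not FiftyOne's actual internals."""
    to_set = {k: v for k, v in current.items() if saved.get(k) != v}
    to_unset = [k for k in saved if k not in current]
    return to_set, to_unset


# For example, only "label" would need to be written back here:
# document_diff({"id": 1, "label": "cat"}, {"id": 1, "label": "dog"})
# returns ({"label": "dog"}, [])
```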

import statistics
import time
import uuid

import fiftyone as fo
import fiftyone.core.utils as fou

SAMPLE_COUNTS = [80, 400, 2000, 10000, 50000]
BATCH_SIZES = [1, 10, 25, 50, 100, 1000]
NUMBER_OF_TESTS_PER_BATCH = 10

results = {}
for sample_count in SAMPLE_COUNTS:
    dataset_name = f"samples:n={sample_count}"

    try:
        dataset = fo.load_dataset(dataset_name)
        print(f"Loaded DATASET[{dataset_name}]")
    except ValueError:
        dataset = fo.Dataset(dataset_name, persistent=True)
        print(f"Created DATASET[{dataset_name}]")
        print(f"Adding samples to DATASET[{dataset_name}]...")
        dataset.add_samples(
            [
                fo.Sample(filepath=f"/some/path/dummy{i}.png")
                for i in range(sample_count)
            ]
        )

    results[dataset_name] = {}
    for size in BATCH_SIZES:
        durations = []
        with fou.ProgressBar(
            total=NUMBER_OF_TESTS_PER_BATCH, iters_str="tests"
        ) as pb:
            print(f"Running tests where BATCH SIZE={size}...")

            for i in pb(range(NUMBER_OF_TESTS_PER_BATCH)):
                start = time.time()

                for sample in dataset.iter_samples(autosave=True, batch_size=size):
                    sample[
                        "somefield"
                    ] = f"{dataset_name};size={size};{uuid.uuid4()}"

                durations.append(round(time.time() - start, 4))

        results[dataset_name][size] = durations


for dataset_name, results_by_batch_size in results.items():
    print()
    print(f"DATASET[{dataset_name}] ".ljust(79, "="))
    for batch_size, durations in results_by_batch_size.items():
        print(f"batch_size={batch_size}:")
        print(f" - Durations : {', '.join([str(d) for d in durations])}")
        print(f" - Mean      : {round(statistics.mean(durations), 4)}")
        print(f" - Median    : {round(statistics.median(durations), 4)}")
        print(f" - Min       : {round(min(durations), 4)}")
        print(f" - Max       : {round(max(durations), 4)}")

@j053y added the "enhancement" label May 5, 2022
@j053y requested a review from a team May 5, 2022 15:28
@j053y self-assigned this May 5, 2022
@j053y linked an issue May 5, 2022 that may be closed by this pull request
@brimoor (Contributor) left a comment:

Good start here!

It occurs to me that #1724 should have been more (or less, really) specific: we'd like to support save contexts for DatasetViews too, not just Dataset.

As the examples below demonstrate, users are probably even more likely to be iterating over views than entire datasets:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Example 1
# Since I don't need to read any fields for this operation, I include `select_fields()` to optimize reads.
# This means I'm iterating over a view, but FiftyOne is designed to let me do that without having to care about the distinction
for sample in dataset.select_fields().iter_samples(progress=True, batch_size=10):
    sample["hello"] = "world"
    sample.save()

# Example 2
# Here I'm slicing, so again I'm working with a view
for sample in dataset[:100].iter_samples(progress=True, batch_size=10):
    sample["spam"] = "eggs"
    sample.save()

Also note that, when working with video samples, there are also Frame objects in play, which can be saved individually via frame.save(), at the sample level via sample.frames.save(), or by simply relying on sample.save() to capture all sample- and frame-level changes.

In other words, all of the following will result in the same dataset contents, and the save context would get maximal mileage if it captured Frame.save() and Frames.save() events as well:

import fiftyone as fo

sample = fo.Sample(filepath="video.mp4")
sample.frames[1] = fo.Frame()
sample.frames[2] = fo.Frame()

dataset = fo.Dataset()
dataset.add_sample(sample)

# Option 1: rely on `sample.save()` to capture all sample and frame-level edits
dataset1 = dataset.clone()
for sample in dataset1:
    sample["foo"] = "bar"
    for frame in sample.frames.values():
        frame["foo"] = "bar"

    sample.save()

# Option 2: call `frame.save()` on each frame individually
dataset2 = dataset.clone()
for sample in dataset2:
    for frame in sample.frames.values():
        frame["foo"] = "bar"
        frame.save()

# Option 3: call `sample.frames.save()` to save all frame edits
dataset3 = dataset.clone()
for sample in dataset3:
    for frame in sample.frames.values():
        frame["foo"] = "bar"
    
    sample.frames.save()

@brimoor (Contributor) commented May 13, 2022

I think some unit tests involving the new save context are called for here, to be sure that everything is behaving as expected (e.g., for both image and video datasets).

@brimoor (Contributor) commented May 13, 2022

Okay, one more comment: when adding a feature to the public API, please add documentation for it to the User Guide section of the docs. In this case, the relevant pages are probably:

@j053y force-pushed the 1724-fr-add-a-database-save-context branch from 01a4c7f to 0dfd0f4 May 19, 2022 18:29
@brimoor (Contributor) commented May 31, 2022

@j053y ready for another review pass here?

Convention: we use the "re-request review" option to prompt this.

@j053y requested a review from brimoor June 3, 2022 12:03
@j053y (Contributor, Author) commented Jun 3, 2022

I've taken a new approach to accomplish this: the iter_samples() method now accepts an autosave parameter, which causes any sample or frame edited while iterating over a dataset or view to be automatically saved (in batches).

import fiftyone as fo


DEFAULT_DATASET_NAME = "autosave-dataset"
VIDEO_DATASET_NAME = "autosave-dataset-video"


default_dataset = fo.Dataset(DEFAULT_DATASET_NAME)
default_dataset.add_samples(
    [fo.Sample("some/path/to/file") for _ in range(10)]
)

video_dataset = fo.Dataset(VIDEO_DATASET_NAME)
for _ in range(10):
    sample = fo.Sample(filepath="video.mp4")
    for i in range(1, 11):
        sample.frames[i] = fo.Frame()
    video_dataset.add_sample(sample)

key, value = "hello", "dataset"
for sample in default_dataset.iter_samples(autosave=True):
    sample[key] = value


key, value = "helloselect", "datasetviewselect"
for sample in default_dataset.select_fields().iter_samples(autosave=True):
    sample[key] = value


key, value = "helloindex", "datasetviewindex"
for sample in default_dataset[:100].iter_samples(autosave=True):
    sample[key] = value

key, sample_value, frame_value = "hellovideo", "sample", "frame"
for sample in video_dataset.iter_samples(autosave=True):
    sample[key] = sample_value
    for frame in sample.frames.values():
        frame[key] = frame_value

@brimoor changed the base branch from develop to bugfix/logging August 20, 2022 22:09
@brimoor changed the title from "1724 fr add a database save context" to "Adding a dataset save context" Aug 20, 2022
@brimoor mentioned this pull request Aug 22, 2022
@brimoor changed the base branch from bugfix/logging to develop August 23, 2022 15:22
@brimoor changed the base branch from develop to bugfix/logging August 23, 2022 15:23
@brimoor (Contributor) left a comment:

LGTM

Base automatically changed from bugfix/logging to develop August 23, 2022 20:59
@brimoor merged commit b542188 into develop Aug 23, 2022
@brimoor deleted the 1724-fr-add-a-database-save-context branch August 23, 2022 20:59
Labels: enhancement (Code enhancement)

Linked issue that may be closed by this pull request: [FR] Add a database save context

2 participants