Adding a dataset save context #1727
Conversation
Good start here!
It occurs to me that #1724 should have been more (or less, really) specific: we'd like to support save contexts for `DatasetView`s too, not just `Dataset`.
As the examples below demonstrate, users are probably even more likely to be iterating over views than entire datasets:
```python
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Example 1
# Since I don't need to read any fields for this operation, I include
# `select_fields()` to optimize reads. This means I'm iterating over a view,
# but FiftyOne is designed to let me do that without having to care about the
# distinction
for sample in dataset.select_fields().iter_samples(progress=True, batch_size=10):
    sample["hello"] = "world"
    sample.save()

# Example 2
# Here I'm slicing, so again I'm working with a view
for sample in dataset[:100].iter_samples(progress=True, batch_size=10):
    sample["spam"] = "eggs"
    sample.save()
```
Also, note that, when working with video samples, there are also `Frame` objects in play, which can be saved individually via `frame.save()`, at the sample level via `sample.frames.save()`, or by just relying on `sample.save()` to catch all sample- and frame-level changes.
In other words, all of the following will result in the same dataset contents, and the save context would get maximal mileage if it captured `Frame.save()` and `Frames.save()` events as well:
```python
import fiftyone as fo

sample = fo.Sample(filepath="video.mp4")
sample.frames[1] = fo.Frame()
sample.frames[2] = fo.Frame()

dataset = fo.Dataset()
dataset.add_sample(sample)

# Option 1: rely on `sample.save()` to capture all sample and frame-level edits
dataset1 = dataset.clone()
for sample in dataset1:
    sample["foo"] = "bar"
    for frame in sample.frames.values():
        frame["foo"] = "bar"

    sample.save()

# Option 2: call `frame.save()` on each frame individually
dataset2 = dataset.clone()
for sample in dataset2:
    for frame in sample.frames.values():
        frame["foo"] = "bar"
        frame.save()

# Option 3: call `sample.frames.save()` to save all frame edits
dataset3 = dataset.clone()
for sample in dataset3:
    for frame in sample.frames.values():
        frame["foo"] = "bar"

    sample.frames.save()
```
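To illustrate what capturing those frame-level events could look like, here's a minimal plain-Python sketch of a pending batch that collects both sample-level and frame-level edits into one bulk write. All names here are hypothetical; this is not FiftyOne's implementation:

```python
class PendingBatch:
    """Hypothetical sketch: collects sample- and frame-level edits so that
    `sample.save()`, `frame.save()`, and `frames.save()`-style events can
    all funnel into a single bulk write."""

    def __init__(self):
        self.sample_ops = []  # (sample_id, changed_fields)
        self.frame_ops = []   # (sample_id, frame_number, changed_fields)

    def save_sample(self, sample_id, changes):
        # Analogous to capturing a `sample.save()` event
        self.sample_ops.append((sample_id, dict(changes)))

    def save_frame(self, sample_id, frame_number, changes):
        # Analogous to capturing a `frame.save()` event
        self.frame_ops.append((sample_id, frame_number, dict(changes)))

    def save_frames(self, sample_id, frames):
        # Analogous to `sample.frames.save()`: record every frame's edits
        for frame_number, changes in frames.items():
            self.save_frame(sample_id, frame_number, changes)


batch = PendingBatch()
batch.save_sample("s1", {"foo": "bar"})
batch.save_frames("s1", {1: {"foo": "bar"}, 2: {"foo": "bar"}})
```

Whichever of the three option styles the user writes, the recorded operations end up in the same two lists, which is what would let all three produce identical dataset contents through one bulk save.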
I think some unit tests involving the new save context are called for here in order to be sure that everything behaves as expected (e.g., for both image and video datasets).
Okay, one more comment: when adding a feature to the public API, please add documentation for it to the User Guide section of the docs. In this case, the relevant pages are probably:
@j053y ready for another review pass here? Convention: we use the "re-request review" option to prompt this.
I've taken a new approach to accomplish this:

```python
import fiftyone as fo

DEFAULT_DATASET_NAME = "autosave-dataset"
VIDEO_DATASET_NAME = "autosave-dataset-video"

default_dataset = fo.Dataset(DEFAULT_DATASET_NAME)
default_dataset.add_samples(
    [fo.Sample("some/path/to/file") for _ in range(10)]
)

video_dataset = fo.Dataset(VIDEO_DATASET_NAME)
for _ in range(10):
    sample = fo.Sample(filepath="video.mp4")
    for i in range(1, 11):
        sample.frames[i] = fo.Frame()

    video_dataset.add_sample(sample)

key, value = "hello", "dataset"
for sample in default_dataset.iter_samples(autosave=True):
    sample[key] = value

key, value = "helloselect", "datasetviewselect"
for sample in default_dataset.select_fields().iter_samples(autosave=True):
    sample[key] = value

key, value = "helloindex", "datasetviewindex"
for sample in default_dataset[:100].iter_samples(autosave=True):
    sample[key] = value

key, sample_value, frame_value = "hellovideo", "sample", "frame"
for sample in video_dataset.iter_samples(autosave=True):
    sample[key] = sample_value
    for frame in sample.frames.values():
        frame[key] = frame_value
```
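As a rough illustration of the mechanism behind `autosave`, the time-based dynamic batching this PR describes (flushing roughly every 0.2 seconds) could be sketched in plain Python. The class and names below are hypothetical, not FiftyOne's actual internals:

```python
import time


class DynamicBatcher:
    """Hypothetical sketch of time-based dynamic batching: buffered items
    are flushed whenever the elapsed time since the last flush exceeds a
    target interval (e.g. 0.2 seconds)."""

    def __init__(self, flush_fn, target_interval=0.2):
        self.flush_fn = flush_fn  # called with each list of buffered items
        self.target_interval = target_interval
        self._buffer = []
        self._last_flush = time.monotonic()

    def save(self, item):
        self._buffer.append(item)
        if time.monotonic() - self._last_flush >= self.target_interval:
            self.flush()

    def flush(self):
        if self._buffer:
            self.flush_fn(self._buffer)
            self._buffer = []

        self._last_flush = time.monotonic()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.flush()  # write any remaining items on exit


# With a zero interval, every save() triggers an immediate flush
batches = []
with DynamicBatcher(batches.append, target_interval=0.0) as batcher:
    for i in range(5):
        batcher.save(i)
```

The upshot is that batch sizes grow or shrink with per-item cost: slow iterations yield small batches, fast iterations yield large ones, while the database round-trip cadence stays roughly constant.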
Save context tweaks
LGTM
This PR adds a dataset "save context" that allows for aggregating sample updates into bulk save operations.

By default, updates are dynamically batched such that database connections occur every 0.2 seconds (this is the same strategy used by `add_samples()`), but this can be customized via the optional `batch_size` kwarg.

Example usage

Here's the `iter_samples(autosave=True)` syntax, and here's the `save_context()` syntax:

Benchmarking
I created a script to gather some time metrics around different batch sizes.
The script is very long-running because it works with some bigger datasets and runs the tests multiple times. I've already run it and added the results to this Google Sheet.
If we need even more speed increases, I suggest we create/optimize a way of computing `Document` diffs, but in the spirit of incremental progress and keeping tasks small, I think that work should be done elsewhere and not included in this PR.
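For context on what such a diff could look like, here's a minimal sketch (hypothetical names; not an existing FiftyOne API): given two snapshots of a document's fields, compute only the fields that need to be written or removed:

```python
_MISSING = object()  # sentinel so we can distinguish "absent" from None


def document_diff(old, new):
    """Hypothetical sketch: compare two snapshots of a document's fields
    and return (sets, unsets), i.e. the minimal update needed to go from
    `old` to `new`."""
    sets = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    unsets = [k for k in old if k not in new]
    return sets, unsets


sets, unsets = document_diff(
    {"filepath": "video.mp4", "foo": "old", "tmp": 1},
    {"filepath": "video.mp4", "foo": "bar"},
)
```

Writing back only `sets`/`unsets` (e.g. as a `$set`/`$unset`-style update) would avoid re-serializing whole documents on every save.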