
Possibility to subsample when loading the binary? #13

Open
ReHoss opened this issue Mar 10, 2023 · 3 comments
Labels
enhancement New feature or request

Comments


ReHoss commented Mar 10, 2023

Hello,

Is it possible to subsample the event file while loading? What do you recommend if we don't have enough RAM to read the event file, e.g. in a Jupyter notebook?

I know that TensorBoard uses a subsampling strategy.

Thanks for your consideration.

j3soon (Owner) commented Mar 14, 2023

Hi,

I would like to know more details about your use case. What event types are you loading, and how large is the event file? Does your use case require iterating through all events, or does it only need to process certain filtered events?

tbparse is designed to load all events directly into the system memory, and currently does not support subsampling. However, it may be possible to add a feature for pre-filtering the events in the future, given valid use cases.

If you simply want to iterate through the events, you could try the raw iteration method provided by TensorBoard/TensorFlow, as documented here.

ReHoss (Author) commented Mar 14, 2023

From: https://github.com/tensorflow/tensorboard/blob/master/README.md

Is my data being downsampled? Am I really seeing all the data?

TensorBoard uses reservoir sampling to downsample your data so that it can be loaded into RAM. You can modify the number of elements it will keep per tag by using the --samples_per_plugin command line argument (ex: --samples_per_plugin=scalars=500,images=20). See this Stack Overflow question for some more information.

And according to the help command:

--samples_per_plugin: An optional comma separated list of plugin_name=num_samples pairs to explicitly specify how many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard randomly downsamples logged summaries to reasonable values to prevent out-of-memory errors for long running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all samples of that type. For instance, "scalars=500,images=0" keeps 500 scalars and all images. Most users should not need to set this flag. (default: '')
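For context, the reservoir sampling TensorBoard describes keeps a bounded, uniformly random subset of a stream whose total length is unknown in advance. A minimal sketch of the classic Algorithm R in Python (the function name, `k`, and seeded RNG are illustrative, not part of TensorBoard's or tbparse's API):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of at most k items from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k/(i+1), which keeps
            # every item seen so far equally likely to be in the sample.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

With a fixed seed the sample is reproducible, which is exactly the seed-interface concern raised below in this thread.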

For instance, the asker from the StackOverflow thread trains over 20M steps.

I train over 1e6 steps but run 100 experiments. If I log the training score at a fine granularity, I end up with an extremely large DataFrame.

It would be nice to have an option to downsample either randomly (with a seed interface, in that case) or evenly. Ideally, for n training curves, the same time steps would be kept across all of them.
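Even downsampling with identical steps across curves can be expressed as a filter on the step column. A sketch assuming the scalars are in a long-format DataFrame with `step`, `tag`, and `value` columns (the column names and helper are hypothetical, not tbparse API):

```python
import pandas as pd

def downsample_evenly(df, max_points, step_col="step"):
    """Keep at most max_points evenly spaced steps, shared by every tag."""
    steps = sorted(df[step_col].unique())
    stride = max(1, len(steps) // max_points)
    kept = set(steps[::stride][:max_points])
    # Filtering on a shared step set guarantees all curves stay aligned.
    return df[df[step_col].isin(kept)]
```

Because every tag is filtered against the same `kept` set, the n training curves retain identical time steps, as requested.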

Thank you for your consideration,
Best,

j3soon (Owner) commented Mar 19, 2023

Thanks for providing the detailed information. I think reservoir sampling is a useful feature and won't be too hard to implement. However, I'm not sure if we can manually set the RNG seed...

This feature may be implemented by modifying the code here. I'll see if I can add this feature in my free time.

Meanwhile, I suggest loading each experiment individually and downsampling it yourself. You can obtain deterministic results by stacking the downsampled experiments.
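The suggested workaround might look like the sketch below. The per-experiment DataFrames stand in for what loading each run (e.g. with tbparse's SummaryReader) would produce; the helper and column names are illustrative assumptions:

```python
import pandas as pd

def stack_downsampled(experiments, keep_steps):
    """Downsample each experiment to a fixed set of steps, then concatenate.

    experiments: dict mapping run name -> DataFrame with 'step'/'value' columns.
    keep_steps: steps to retain; identical across runs, so results align.
    """
    keep = set(keep_steps)
    frames = []
    for run, df in experiments.items():
        sub = df[df["step"].isin(keep)].copy()
        sub["run"] = run  # remember which experiment each row came from
        frames.append(sub)
    return pd.concat(frames, ignore_index=True)
```

Since `keep_steps` is fixed up front rather than drawn randomly, the stacked result is deterministic across reruns.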

@j3soon j3soon added the enhancement New feature or request label Mar 19, 2023