build_dataset fails when aggregating timestamps into buckets #119

Open
juancq opened this issue Aug 23, 2024 · 0 comments
juancq commented Aug 23, 2024

Branch: dev
build_dataset runs out of memory when aggregating timestamps into buckets.

2024-08-23 13:21:58.081 | DEBUG    | EventStream.data.dataset_base:__init__:479 - Built events and measurements dataframe
2024-08-23 13:21:58.085 | DEBUG    | EventStream.data.dataset_polars:_agg_by_time:642 - Collecting events DF. Not using streaming here as it sometimes causes segfaults.
2024-08-23 13:22:06.849 | DEBUG    | EventStream.data.dataset_polars:_agg_by_time:649 - Aggregating timestamps into buckets
fish: Job 1, 'PYTHONPATH=".;"  python scripts…' terminated by signal SIGKILL (Forced quit)

The relevant code is in _agg_by_time (EventStream/data/dataset_polars.py):

# Bucket each subject's events into windows of length agg_by_time_scale,
# anchoring each subject's first window at their first event.
logger.debug("Aggregating timestamps into buckets")
grouped = self.events_df.sort(["subject_id", "timestamp"], descending=False).group_by_dynamic(
    "timestamp",
    every=self.config.agg_by_time_scale,
    truncate=True,
    closed="left",
    start_by="datapoint",
    by="subject_id",
)
# Collapse each bucket: keep the unique event types and the original event IDs,
# derive a new event_id by hashing the (subject_id, timestamp) pair, and join
# the bucket's event types into a single categorical string.
grouped = (
    grouped.agg(
        pl.col("event_type").unique().sort(),
        pl.col("event_id").unique().alias("old_event_id"),
    )
    .with_columns(
        pl.struct(subject_id=pl.col("subject_id"), timestamp=pl.col("timestamp"))
        .hash(1, 2, 3, 4)
        .alias("event_id")
    )
    .with_columns(
        "event_id",
        pl.col("event_type")
        .list.eval(pl.col("").cast(pl.Utf8))
        .list.join("&")
        .cast(pl.Categorical)
        .alias("event_type"),
    )
)

To replicate, run generate_synthetic_data with n_subjects > 50,000 and then run build_dataset.

The workaround for my dataset (about 5 million subjects) was to use a compute instance with more memory during the build_dataset phase.

This is partly a polars issue. I tried limiting the number of threads and enabling streaming, and neither made a difference.
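
For reference, limiting polars threads and enabling the streaming engine looks roughly like this (a sketch with a placeholder file name, not the exact commands used; POLARS_MAX_THREADS has to be set before polars is first imported):

import os

# Thread limit must be set before polars is imported for it to take effect.
os.environ["POLARS_MAX_THREADS"] = "4"

import polars as pl  # noqa: E402

# "events.parquet" is a placeholder for the events data on disk.
events = pl.scan_parquet("events.parquet")

# Run the query on the streaming engine; as noted above, neither knob
# prevented the OOM here.
result = events.collect(streaming=True)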

A refactor of _agg_by_time would be nice to have, but it is not a must.
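
If someone picks that up, one possible direction is sketched below (untested against this codebase; events_df and the "1h" bucket size stand in for self.events_df and self.config.agg_by_time_scale). Bucketing with dt.truncate plus a plain group_by keeps the whole aggregation lazy and streamable, though it is not exactly equivalent to the current code: windows would be anchored at calendar boundaries rather than at each subject's first event (start_by="datapoint").

import polars as pl

# Sketch of a lower-memory bucketing: dt.truncate + plain group_by instead of
# group_by_dynamic. events_df and "1h" are placeholders.
bucketed = (
    events_df.lazy()
    # Floor each timestamp to its bucket; buckets start at calendar boundaries,
    # unlike start_by="datapoint" in the current code.
    .with_columns(pl.col("timestamp").dt.truncate("1h"))
    .group_by(["subject_id", "timestamp"])
    .agg(
        pl.col("event_type").unique().sort(),
        pl.col("event_id").unique().alias("old_event_id"),
    )
    .with_columns(
        pl.struct(subject_id=pl.col("subject_id"), timestamp=pl.col("timestamp"))
        .hash(1, 2, 3, 4)
        .alias("event_id")
    )
    .collect(streaming=True)  # optional; the current code collects eagerly
)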
