build_dataset fails when aggregating timestamps into buckets #119

Open
juancq opened this issue Aug 23, 2024 · 0 comments
juancq commented Aug 23, 2024

Branch: dev
build_dataset runs out of memory when aggregating timestamps into buckets.

2024-08-23 13:21:58.081 | DEBUG    | EventStream.data.dataset_base:__init__:479 - Built events and measurements dataframe
2024-08-23 13:21:58.085 | DEBUG    | EventStream.data.dataset_polars:_agg_by_time:642 - Collecting events DF. Not using streaming here as it sometimes causes segfaults.
2024-08-23 13:22:06.849 | DEBUG    | EventStream.data.dataset_polars:_agg_by_time:649 - Aggregating timestamps into buckets
fish: Job 1, 'PYTHONPATH=".;"  python scripts…' terminated by signal SIGKILL (Forced quit)

The relevant code is in _agg_by_time (EventStream/data/dataset_polars.py):

# Bucket each subject's events into windows of length agg_by_time_scale,
# anchoring each subject's first window at their first event.
logger.debug("Aggregating timestamps into buckets")
grouped = self.events_df.sort(["subject_id", "timestamp"], descending=False).group_by_dynamic(
    "timestamp",
    every=self.config.agg_by_time_scale,
    truncate=True,
    closed="left",
    start_by="datapoint",
    by="subject_id",
)
# Collapse each bucket: keep the unique event types and the original event IDs,
# derive a new event_id by hashing the (subject_id, timestamp) pair, and join
# the bucket's event types into a single categorical string.
grouped = (
    grouped.agg(
        pl.col("event_type").unique().sort(),
        pl.col("event_id").unique().alias("old_event_id"),
    )
    .with_columns(
        pl.struct(subject_id=pl.col("subject_id"), timestamp=pl.col("timestamp"))
        .hash(1, 2, 3, 4)
        .alias("event_id")
    )
    .with_columns(
        "event_id",
        pl.col("event_type")
        .list.eval(pl.col("").cast(pl.Utf8))
        .list.join("&")
        .cast(pl.Categorical)
        .alias("event_type"),
    )
)

To replicate, run generate_synthetic_data with n_subjects > 50,000 and then run build_dataset.

The workaround for my dataset (about 5 million subjects) was to use a compute instance with more memory during the build_dataset phase.

This is partly a polars issue. I tried limiting the number of threads and enabling streaming, and neither made a difference.
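
For reference, limiting polars threads and enabling the streaming engine looks roughly like this (a sketch with a placeholder file name, not the exact commands used; POLARS_MAX_THREADS has to be set before polars is first imported):

import os

# Thread limit must be set before polars is imported for it to take effect.
os.environ["POLARS_MAX_THREADS"] = "4"

import polars as pl  # noqa: E402

# "events.parquet" is a placeholder for the events data on disk.
events = pl.scan_parquet("events.parquet")

# Run the query on the streaming engine; as noted above, neither knob
# prevented the OOM here.
result = events.collect(streaming=True)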

A refactor of _agg_by_time would be nice to have, but it is not a must.
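
If someone picks that up, one possible direction is sketched below (untested against this codebase; events_df and the "1h" bucket size stand in for self.events_df and self.config.agg_by_time_scale). Bucketing with dt.truncate plus a plain group_by keeps the whole aggregation lazy and streamable, though it is not exactly equivalent to the current code: windows would be anchored at calendar boundaries rather than at each subject's first event (start_by="datapoint").

import polars as pl

# Sketch of a lower-memory bucketing: dt.truncate + plain group_by instead of
# group_by_dynamic. events_df and "1h" are placeholders.
bucketed = (
    events_df.lazy()
    # Floor each timestamp to its bucket; buckets start at calendar boundaries,
    # unlike start_by="datapoint" in the current code.
    .with_columns(pl.col("timestamp").dt.truncate("1h"))
    .group_by(["subject_id", "timestamp"])
    .agg(
        pl.col("event_type").unique().sort(),
        pl.col("event_id").unique().alias("old_event_id"),
    )
    .with_columns(
        pl.struct(subject_id=pl.col("subject_id"), timestamp=pl.col("timestamp"))
        .hash(1, 2, 3, 4)
        .alias("event_id")
    )
    .collect(streaming=True)  # optional; the current code collects eagerly
)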
