Perform Stateful UDF initialization once-per-worker instead of once-per-partition #196

jaychia · 2022-09-16T23:42:26Z

Is your feature request related to a problem? Please describe.
Currently, stateful UDFs are initialized once per execution of the UDF (that is, once per partition) instead of once per worker. This means we cannot amortize the cost of an expensive initialization across the multiple partitions that a single worker processes.

Describe the solution you'd like
Workers should be able to identify stateful UDFs in a given window of execution and run their initializers only once, reusing the initialized UDFs across multiple windows.

Additional context
See code in @udf which hardcodes the initializations of stateful UDFs on a per-UDF call basis:

Daft/daft/udf.py

Lines 73 to 79 in 2496baa

    
    # TODO: The initialization of stateful UDFs is currently done on the execution on every partition here,
    # but should instead be done on a higher level so that state initialization cost can be amortized across partitions.
    try:
        initialized_func = func() if isinstance(func, type) else func
    except:
        logger.error(f"Encountered error when initializing user-defined function {func.__name__}")
        raise
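
To make the ask concrete, here is a minimal sketch of what once-per-worker initialization could look like, assuming each worker is a long-lived Python process. The names `_INITIALIZED_UDFS`, `get_initialized_func`, and `ExpensiveModel` are hypothetical and are not part of Daft; this only illustrates the caching idea, not how the runner should actually wire it up.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)

# Hypothetical per-process cache: one initialized instance per stateful UDF class.
_INITIALIZED_UDFS: Dict[type, Any] = {}


def get_initialized_func(func: Any) -> Callable[..., Any]:
    """Return a callable for `func`, initializing class-based UDFs at most once per process."""
    if not isinstance(func, type):
        # Stateless UDFs are plain functions and need no initialization.
        return func
    if func not in _INITIALIZED_UDFS:
        try:
            _INITIALIZED_UDFS[func] = func()
        except Exception:
            logger.error(f"Encountered error when initializing user-defined function {func.__name__}")
            raise
    return _INITIALIZED_UDFS[func]


class ExpensiveModel:
    """Example stateful UDF that counts how many times it is initialized."""

    init_count = 0

    def __init__(self) -> None:
        ExpensiveModel.init_count += 1

    def __call__(self, values):
        return [v * 2 for v in values]


# Three partitions processed by the same worker process share one initialization.
partitions = [[1, 2], [3, 4], [5]]
results = [get_initialized_func(ExpensiveModel)(p) for p in partitions]
```

With this shape, the initialization cost is paid on the first partition only, and later partitions on the same worker reuse the cached instance. A real implementation would also need to decide when the cached state is torn down.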

jaychia · 2024-02-09T18:55:21Z

Closing as inactive

jaychia · 2024-09-02T23:50:38Z

As of #2677, there are a few remaining todos:

- Add tests and fixes for accounting for init_args and batch_size when running the stateful UDFs
- Have actor pool resource requests deduct from the globally available resource pool to avoid weird issues of starving any running tasks (e.g. not enough memory). Instead, the tasks should pre-emptively throw an error during admission to indicate that there will not be enough resources to run the tasks.

As per offline discussion, we can keep the implementation of actor pools locally simple for now by not attempting to do any smart lazy initialization/teardown of these pools. When the PhysicalPlan runs, all the actor pools in the plan will spin up and we make no guarantees about when they are torn down. These pools deduct from the global available resources (of GPUs and memory) so any subsequent tasks will now have a smaller pool of resources they can pick from.
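
A rough sketch of the accounting described above, assuming a single global pool that tracks GPUs and memory. `GlobalResourcePool`, `ResourceRequest`, and `InsufficientResourcesError` are hypothetical names for illustration only and do not correspond to actual runner classes.

```python
from dataclasses import dataclass


@dataclass
class ResourceRequest:
    """Resources needed by one task or one actor (hypothetical shape)."""

    num_gpus: float = 0.0
    memory_bytes: int = 0


class InsufficientResourcesError(RuntimeError):
    """Raised at admission when a request can never be satisfied."""


class GlobalResourcePool:
    def __init__(self, num_gpus: float, memory_bytes: int) -> None:
        self.available_gpus = num_gpus
        self.available_memory = memory_bytes

    def _fits(self, req: ResourceRequest) -> bool:
        return req.num_gpus <= self.available_gpus and req.memory_bytes <= self.available_memory

    def reserve_for_actor_pool(self, req: ResourceRequest, num_actors: int) -> None:
        """Deduct the whole actor pool's resources up front; they stay reserved while the plan runs."""
        total = ResourceRequest(req.num_gpus * num_actors, req.memory_bytes * num_actors)
        if not self._fits(total):
            raise InsufficientResourcesError("not enough resources to start actor pool")
        self.available_gpus -= total.num_gpus
        self.available_memory -= total.memory_bytes

    def admit_task(self, req: ResourceRequest) -> None:
        """Fail fast during admission instead of letting the task starve."""
        if not self._fits(req):
            raise InsufficientResourcesError("task cannot fit in the remaining resources")


GIB = 1024**3
pool = GlobalResourcePool(num_gpus=2, memory_bytes=8 * GIB)
# Two actors, each holding one GPU and 2 GiB, are reserved when the plan starts.
pool.reserve_for_actor_pool(ResourceRequest(num_gpus=1, memory_bytes=2 * GIB), num_actors=2)
```

After the reservation, a CPU-only task requesting 1 GiB is still admitted, while a task requesting a GPU raises `InsufficientResourcesError` at admission rather than waiting indefinitely. This sketch only checks admission and does not track per-task reservations or release.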

cc @kevinzwang

jaychia · 2024-10-07T18:21:52Z

Closing this as done, though it still needs to make it through the last bits of dogfooding/testing from @kevinzwang to push past the finish line.

jaychia added the inactive label Sep 25, 2023

jaychia closed this as not planned Feb 9, 2024

jaychia reopened this Jul 18, 2024

jaychia added the p0 label and removed the inactive label Jul 18, 2024

jaychia self-assigned this Jul 18, 2024

jaychia mentioned this issue Sep 2, 2024

[FEAT] Add runner logic in PyRunner for ActorPoolProject #2677

Merged

jaychia assigned kevinzwang Sep 3, 2024

jaychia closed this as completed Oct 7, 2024
