Cannot persist dask.dataframes #4492
Comments
cc @ian-r-rose
Thanks for the nice example @dantegd. In #4406 we removed the ability to pass Dask collections to keywords like `workers=` in `client.persist`. For the above example, this means changing

```python
client.persist(objects, workers={o: workers for o in objects})
```

to

```python
with dask.annotate(workers=set(workers)):
    return client.persist(objects)
```

Details:

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd
import dask
from dask.distributed import Client
from dask.distributed import LocalCluster


def persist_across_workers(client, objects, workers=None):
    if workers is None:
        # Default to all workers
        workers = client.has_what().keys()
    with dask.annotate(workers=set(workers)):
        return client.persist(objects)


if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)
    X = np.ones((10000, 20))
    X_df = pd.DataFrame(X)
    X_dist = dd.from_pandas(X_df, npartitions=2)
    X_f = persist_across_workers(client, X_dist)
```
Should we improve the error message here then? Maybe drop this flag? Something else? As it is, this is not obviously unsupported based on the error message given (MsgPack is unable to serialize something).
Yes, that's right @jrbourbeau. @jakirkham I agree that an improved error message would be helpful. At the very least, we could do a better job of ensuring that the shape of the priority/workers/etc. arguments makes sense (i.e., an iterable for workers, a number for priority, an error for a dict of collections).
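A minimal sketch of the kind of shape validation being suggested; the helper name and error messages here are hypothetical, not distributed's actual code:

```python
from collections.abc import Iterable
from numbers import Number

import dask


def validate_persist_annotations(workers=None, priority=None):
    """Hypothetical pre-flight check that raises a clear error instead of
    letting the call fail later with an opaque MsgPack serialization error."""
    if isinstance(workers, dict):
        raise TypeError(
            "Mapping collections to workers (workers={collection: ...}) is no "
            "longer supported; wrap the persist call in "
            "`with dask.annotate(workers=...)` instead."
        )
    if workers is not None:
        if isinstance(workers, str):
            workers = [workers]
        if not isinstance(workers, Iterable):
            raise TypeError("`workers=` should be an iterable of worker addresses")
        if any(dask.is_dask_collection(w) for w in workers):
            raise TypeError("`workers=` must not contain Dask collections")
    if priority is not None and not isinstance(priority, Number):
        raise TypeError("`priority=` should be a number")
```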
@trivialfis you might be interested in this -- I think xgboost may do similar things?
cc @hcho3 (from xgboost as well)
Good to know. Briefly looking through …
@hcho3 is out for a bit, but it sounds like we're good on the xgboost front? @trivialfis, please feel free to hit me up if you need extra hands/eyes on that.
Thanks for checking @jrbourbeau
Fixes the current CI issue dask/distributed#4492

Authors:
- Dante Gama Dessavre (@dantegd)
- Michael Demoret (@mdemoret-nv)

Approvers:
- John Zedlewski (@JohnZed)
- William Hicks (@wphicks)
- @jakirkham
- Corey J. Nolet (@cjnolet)

URL: #3474
Just checking in here: was there anything else we still need to do, or is this OK to close now?
It was proposed that we could improve the error message that is raised (#4492 (comment)). Otherwise, I think things have already been fixed upstream in dask-cudf.
I'm sorry for completely missing this thread until I tried to dig up some old emails...
We do have a line that might be related: https://github.com/dmlc/xgboost/blob/905fdd3e08d91077aada776346c7e49e4ff69334/python-package/xgboost/dask.py#L335. Copying it here:

```python
data = client.persist(data)
```

But we just use the default value for workers, so it should be safe?
If you don't need the workers to be specified, I would not specify them.
Yeah, as John mentioned, you should be fine since you're not explicitly specifying a set of workers.
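For concreteness, a runnable sketch contrasting the safe default with the pattern removed in #4406 and its annotate-based replacement (the variable names are illustrative, not from any of the projects discussed):

```python
import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2))
    ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=2)
    workers = list(client.has_what().keys())

    # Safe: no workers specified, so this keeps working after 2021.2.0.
    persisted = client.persist(ddf)

    # Removed in #4406: collection-keyed workers= dicts now fail with a
    # MsgPack serialization error.
    # persisted = client.persist(ddf, workers={ddf: workers})

    # Supported replacement: annotate the target workers instead.
    with dask.annotate(workers=set(workers)):
        persisted = client.persist(ddf)
```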
@jakirkham @jrbourbeau Thanks for the advice!
What happened:
DataFrame collections like Dask DataFrames or dask-cudf DataFrames cannot be persisted after release 2021.2.0. @wphicks triaged this and found that the issue started appearing after the merge of this PR: #4406
What you expected to happen:
Persist to work (see reproducer)
Minimal Complete Verifiable Example:
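The example itself is missing here; below is a reconstruction from the reproducer discussed in the comments above, using the pre-2021.2.0 collection-keyed `workers=` call that triggers the failure (note that `objects` is assumed to be a list of collections):

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client, LocalCluster


def persist_across_workers(client, objects, workers=None):
    if workers is None:
        # Default to all workers
        workers = client.has_what().keys()
    # Collection-keyed workers= dict: the pattern that breaks after
    # 2021.2.0 (see #4406).
    return client.persist(objects, workers={o: workers for o in objects})


if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)
    X = np.ones((10000, 20))
    X_df = pd.DataFrame(X)
    X_dist = dd.from_pandas(X_df, npartitions=2)
    X_f = persist_across_workers(client, [X_dist])
```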
Output: a MsgPack serialization error (see the discussion above).
Environment:
cc @jakirkham @pentschev @madsbk @wphicks