
Recursive tokenization #3515

Merged: crusaderky merged 7 commits from the recursive_tokenize branch into pydata:master on Nov 13, 2019
Conversation

crusaderky (Contributor):

After misreading the dask documentation (https://docs.dask.org/en/latest/custom-collections.html#deterministic-hashing), I was under the impression that the output of __dask_tokenize__ would be recursively parsed, as happens for __getstate__ or __reduce__. That's not the case: the output of __dask_tokenize__ is simply fed through str(), so it has to be made explicitly recursive!
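For illustration, here is a minimal sketch of the recursive pattern, delegating back to dask's normalize_token so that nested members are tokenized by content rather than by their repr. The class and attribute names below are illustrative, not necessarily those used in the PR:

from dask.base import normalize_token

class Variable:
    # Illustrative container; the real xarray.Variable has far more machinery.
    def __init__(self, dims, data, attrs=None):
        self._dims = dims
        self._data = data
        self._attrs = attrs or {}

    def __dask_tokenize__(self):
        # Returning a plain tuple is not enough: dask only applies str() to it,
        # so a nested object (e.g. a dask or sparse array in self._data) would
        # be tokenized by its repr. Recursing via normalize_token makes the
        # token deterministic and content-aware all the way down.
        return normalize_token((type(self), self._dims, self._data, self._attrs))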

@crusaderky crusaderky self-assigned this Nov 12, 2019
crusaderky (Contributor, Author):

xref pydata/sparse#300

# Test DataArray and Variable
da_a = DataArray(a)
da_b = DataArray(b)
assert dask.base.tokenize(da_a) != dask.base.tokenize(da_b)
Contributor:

Just so I'm clear, these are two different (but equal) xarray objects, so we don't want dask to think that they are the same?

crusaderky (Author):

They aren't equal. da_b[5000] == 2, while da_a[5000] == 1.
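For context, a construction of a and b consistent with these values (not necessarily the PR's exact test code) would be:

import numpy as np

a = np.ones(10000)
b = np.ones(10000)
b[5000] = 2  # a single differing element, so the two tokens must differ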

@@ -856,6 +856,10 @@ def test_dask_token():
import dask

s = sparse.COO.from_numpy(np.array([0, 0, 1, 2]))

Contributor:

xfail this instead?

crusaderky (Author), Nov 13, 2019:

Yeah, that was an option, but it would have completely disabled the test. The test is really about not accidentally invoking self.values (which in turn invokes self._variable._data.__array__()) on NEP18-compatible backends; it is not really about testing sparse specifically, and just conveniently relies on the fact that COO.__array__ raises an exception. A more correct, but also more laborious, test could have created a NEP18-compatible dummy class on the fly.
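A minimal sketch of that more laborious approach, with illustrative names not taken from the PR: a duck array that advertises NEP18 support but raises on coercion to numpy, so any accidental self.values call fails loudly, just as COO.__array__ does here.

import numpy as np

class RaisingDuckArray:
    # Hypothetical NEP18-compatible backend used only for testing.
    def __init__(self, shape, dtype=np.float64):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __array__(self, dtype=None):
        # Any code path that densifies to numpy trips this immediately.
        raise RuntimeError("unexpectedly coerced to a dense numpy array")

    def __array_function__(self, func, types, args, kwargs):
        # Declaring the protocol is enough for this sketch; real backends
        # would dispatch func to their own implementations here.
        return NotImplemented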

@crusaderky crusaderky merged commit e70138b into pydata:master Nov 13, 2019
@crusaderky crusaderky deleted the recursive_tokenize branch November 13, 2019 00:54
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 14, 2019
* upstream/master:
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 17, 2019
* upstream/master: (22 commits)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  ...