Total refactor #27

rabernat · 2020-12-02T02:42:53Z

This is the beginning of a complete rewrite of pangeo forge, learning from our initial steps.

The core of the new code is a Recipe class, which is used like this:

r = recipe.NetCDFtoZarrSequentialRecipe(
    input_urls=netcdf_local_paths,
    sequence_dim="time",
    inputs_per_chunk=1,
    nitems_per_input=daily_xarray_dataset.attrs["items_per_file"],
    target=tmp_target,
    input_cache=tmp_cache,
)

# manual execution of a recipe
r.prepare()
for input_key in r.iter_inputs():
    r.cache_input(input_key)
for chunk_key in r.iter_chunks():
    r.store_chunk(chunk_key)
r.finalize()

The idea is that we will then implement a RecipeExecutor class which runs these recipes. This will work because prepare, cache_input, store_chunk and finalize are not regular methods. They are actually properties that return serializable functions.
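To make the "properties that return functions" idea concrete, here is a minimal sketch. It is not the code in this PR: ToyRecipe, run_serially and the module-level helpers are hypothetical names, and the functions only return strings instead of touching storage. The assumption is that each property hands back a functools.partial over a plain module-level function, which an executor can call or ship elsewhere without needing the recipe object itself.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, Iterable, List


# Plain module-level functions: they carry no reference to the recipe
# instance, only to the arguments bound with functools.partial.
def _prepare(target: str) -> str:
    return f"prepared {target}"


def _store_chunk(target: str, chunk_key: int) -> str:
    return f"stored chunk {chunk_key} in {target}"


def _finalize(target: str) -> str:
    return f"finalized {target}"


@dataclass
class ToyRecipe:
    """Hypothetical stand-in for a recipe; not the class from this PR."""

    target: str
    n_chunks: int

    @property
    def prepare(self) -> Callable[[], str]:
        return partial(_prepare, self.target)

    @property
    def store_chunk(self) -> Callable[[int], str]:
        return partial(_store_chunk, self.target)

    @property
    def finalize(self) -> Callable[[], str]:
        return partial(_finalize, self.target)

    def iter_chunks(self) -> Iterable[int]:
        return range(self.n_chunks)


def run_serially(recipe: ToyRecipe) -> List[str]:
    """Hypothetical executor: calls the returned functions in order."""
    results = [recipe.prepare()]
    store = recipe.store_chunk  # a standalone callable, detached from the recipe
    results.extend(store(key) for key in recipe.iter_chunks())
    results.append(recipe.finalize())
    return results
```

Because the call sites still read as r.prepare() and r.store_chunk(chunk_key), manual execution looks the same as with regular methods, while a different executor could hand the same callables to its own task system.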

There is significant work to do here, but I feel pretty good about where this is going...

rabernat · 2020-12-04T01:58:49Z

Black and flake8 both pass on my local env but are failing in this action. Not what I want to be debugging right now.

jhamman

I'm still catching up here so I've mostly asked questions to improve my understanding. Generally, I'm a fan of the redesign with my only concerns being how we leave space for extension to pipelines that may not use xarray or zarr.

jhamman · 2020-12-09T19:17:26Z

pangeo_forge/__init__.py

@@ -1,6 +1,6 @@
 from pkg_resources import DistributionNotFound, get_distribution

-from pangeo_forge.pipelines import AbstractPipeline
+# from pangeo_forge.pipelines import AbstractPipeline


reminder to remove this.

jhamman · 2020-12-09T19:23:28Z

pangeo_forge/executors/prefect.py

+from ..recipe import DatasetRecipe
+
+
+class PrefectExecutor:


do we want to define a base Executor class? Neither class has an init method in this PR. Are these just wrappers or do we think we'll want to parameterize the executor at some point?

Thinking about this a bit more, I think I'd like to see init and repr methods on all these classes.
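One possible shape for this, as a sketch only: BaseExecutor, SerialExecutor and the max_retries parameter are hypothetical and do not appear in this PR. The point is just to show a shared base class that gives every executor an init and a repr, with the execution strategy left to subclasses.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List


class BaseExecutor(ABC):
    """Hypothetical base class for executors; not part of this PR."""

    def __init__(self, max_retries: int = 0):
        # max_retries is an illustrative parameter, chosen only to show
        # what "parameterizing the executor" could look like.
        self.max_retries = max_retries

    def __repr__(self) -> str:
        return f"{type(self).__name__}(max_retries={self.max_retries})"

    @abstractmethod
    def execute(self, tasks: List[Callable[[], Any]]) -> List[Any]:
        """Run zero-argument tasks and return their results in order."""


class SerialExecutor(BaseExecutor):
    """Runs tasks one after another in the current process."""

    def execute(self, tasks: List[Callable[[], Any]]) -> List[Any]:
        results = []
        for task in tasks:
            for attempt in range(self.max_retries + 1):
                try:
                    results.append(task())
                    break
                except Exception:
                    # Re-raise once the retry budget is used up.
                    if attempt == self.max_retries:
                        raise
        return results
```

With this layout, a Prefect-backed executor would subclass BaseExecutor and only override execute.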

jhamman · 2020-12-09T19:23:47Z

pangeo_forge/executors/python.py

+from functools import partial
+from typing import Callable, Iterable
+
+# from ..types import Pipeline, Stage, Task


reminder to remove this

jhamman · 2020-12-09T19:28:50Z

pangeo_forge/recipe.py

+from .storage import InputCache, Target
+from .utils import chunked_iterable, fix_scalar_attr_encoding
+
+# logger = logging.getLogger(__name__)


was this not working?

It does work. I'll put it back

jhamman · 2020-12-09T19:51:58Z

pangeo_forge/recipe.py

+# Notes about dataclasses:
+# - https://www.python.org/dev/peps/pep-0557/#inheritance
+# - https://stackoverflow.com/questions/51575931/class-inheritance-in-python-3-7-dataclasses
+# This means that, for now, I can't get default arguments to work.


So what do dataclasses give us here?

Dataclasses reduce the amount of boilerplate we have to write / maintain. None of these classes needs init methods.

I believe we can fix the default arguments problem by tweaking the mixin order. This requires me to understand python's method resolution order 🤯.
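For reference, a small sketch of the ordering issue with toy classes (RequiredOptions, DefaultOptions and the two recipe classes are hypothetical, not the mixins in this PR). Dataclass fields are collected by walking the method resolution order in reverse, so the right-most base contributes its fields first. A base that declares defaults therefore has to sit to the left of any base that declares required fields.

```python
from dataclasses import dataclass, fields


@dataclass
class RequiredOptions:
    target: str  # no default


@dataclass
class DefaultOptions:
    inputs_per_chunk: int = 1  # has a default


# MRO: WorkingRecipe, DefaultOptions, RequiredOptions, object.
# Fields are gathered in reverse MRO order: target first, then
# inputs_per_chunk, so the default comes last and the class is valid.
@dataclass
class WorkingRecipe(DefaultOptions, RequiredOptions):
    pass


def build_broken_recipe():
    """Swapping the bases puts the defaulted field first and fails."""

    # Raises TypeError at class creation time: a non-default argument
    # ('target') would follow a default argument ('inputs_per_chunk').
    @dataclass
    class BrokenRecipe(RequiredOptions, DefaultOptions):
        pass

    return BrokenRecipe


recipe = WorkingRecipe(target="memory://demo")
field_order = [f.name for f in fields(WorkingRecipe)]
```

On Python 3.10 and later, declaring the dataclasses with kw_only=True is another way around the ordering constraint, at the cost of requiring keyword arguments.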

jhamman · 2020-12-09T21:26:45Z

pangeo_forge/recipe.py

+
+
+@dataclass
+class StandardSequentialRecipe(


Let's think about a more informative name here. One concern I have is that we'll get too Xarray->Zarr focused. While I'm quite happy to support that workflow as the primary and initial implementation, I want to make sure we leave room for a Rasterio->COG workflow or a Pandas->Parquet workflow.

jhamman · 2020-12-09T21:29:14Z

pangeo_forge/storage.py

+
+
+@dataclass
+class Target:


Maybe this should be called a ZarrTarget?

Maybe MapperTarget? There is nothing zarr specific about it...

jhamman · 2020-12-09T21:31:05Z

pangeo_forge/storage.py

+    def _full_path(self, path):
+        return os.path.join(self.prefix, _hash_path(path))


For what it's worth, I've found that hashing paths like this can make it difficult to debug failed workflows. Maybe you can explain a bit more why we need to hash all paths like this?

I agree. My thinking was: I want there to be a unique mapping between inputs and paths in the cache. Hashing achieves this. The input paths may be URLs with lots of forbidden characters in them. But I'll play with some alternatives that are more readable / debuggable.
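One readable alternative, sketched under assumptions: readable_cache_name and its digest_chars parameter are hypothetical, and this is not what the PR implements. The idea is to keep a short hash prefix to tell inputs apart and append a sanitized copy of the original path so a cache entry can be traced back to its input by eye. A truncated digest is only unique with high probability, not guaranteed.

```python
import hashlib
import re


def readable_cache_name(url: str, digest_chars: int = 8) -> str:
    """Hypothetical alternative to hashing the whole input path.

    Returns "<short digest>-<sanitized url>".
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:digest_chars]
    # Collapse every run of characters outside a conservative set into "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", url).strip("_")
    return f"{digest}-{safe}"


name = readable_cache_name("https://example.com/data/a b.nc?x=1")
```

The sanitized part of name is https_example.com_data_a_b.nc_x_1, which is enough to see which input a cached file came from. Very long URLs might still need truncating to respect filesystem name limits.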

jhamman · 2020-12-09T21:32:13Z

pangeo_forge/recipe.py

+        # do we really want to just delete all encoding?
+        # for v in ds.variables:
+        #    ds[v].encoding = {}
+
+        # TODO: maybe do some chunking here?


can we make these options? I think that yes, we generally want to remove netcdf encoding before writing to zarr. But there are probably cases where this isn't true.

Yes agreed.
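A rough sketch of what such an option could look like. The function name maybe_clear_encoding and the drop_encoding flag are hypothetical; the loop body is the one from the commented-out code above, and the dataset is assumed to behave like an xarray.Dataset (iterable variables, settable ds[name].encoding).

```python
from typing import Any


def maybe_clear_encoding(ds: Any, drop_encoding: bool = True) -> Any:
    """Optionally reset per-variable encoding before writing.

    `ds` is assumed to be dataset-like: `ds.variables` iterates over
    variable names and `ds[name].encoding` is a settable dict.
    """
    if drop_encoding:
        for name in ds.variables:
            ds[name].encoding = {}
    return ds
```

A recipe could expose drop_encoding as a field so that the default removes the netCDF encoding while individual recipes can opt out.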

rabernat · 2020-12-09T22:17:36Z

Thanks a lot Joe! I'll first reply to some of your questions and then update my PR in response to your comments.

rabernat · 2020-12-17T20:56:26Z

In retrospect, I feel like this approach of creating dozens of mixins and multiple inheritance is premature complexification / abstractifiction. I recently read this blog post and felt like it was speaking directly to me! 😆

Now I think that what we should do is create a basic working recipe class that implements the methods executors expect. As we define new recipes that don't fit this mold, we should slowly refactor this class to make it more generalizable (rather than trying to generalize everything from day 1).

Working on this now.

jhamman · 2020-12-18T16:55:39Z

"In retrospect, I feel like this approach of creating dozens of mixins and multiple inheritance is premature complexification / abstractifiction."

Couldn't agree more. I think I went through this same realization a few months ago. I think for now, the executors may be enough of an abstraction to allow us to move forward. If we find that recipes are frequently sharing elements, we can pull those out one by one.

rabernat · 2021-01-04T14:11:09Z

Just a note: in pangeo-data/rechunker#77 I am working on an update to rechunker that intersects with this.

joshmoore · 2021-01-06T16:17:18Z

#27 (review): "my only concerns being how we leave space for extension to pipelines that may not use xarray or zarr."

This was my first impression as well. I've not yet been able to make multiscale representations openable by xarray.

davidbrochart · 2021-01-06T16:45:21Z

@joshmoore what other tool are you thinking about to generate multiscale representations?

joshmoore · 2021-01-07T09:47:52Z

@davidbrochart : really anything producing (largish) images (or generally spatial arrays?) would benefit from a multiscale representation. For other datatypes, I'd defer to other domains whether it's useful or not.

rabernat · 2021-01-18T18:57:01Z

The tests are hanging intermittently. But I think everything is working.

rabernat · 2021-01-22T13:01:45Z

The tests have become so unreliable for this PR. I think it has to do with starting the HTTP server.

rabernat added 15 commits November 22, 2020 17:49

wip: new recipe syntax

36faf3e

messy wip

7ee78f2

made target fixture

f980e8f

made target fixture

0f31141

spaghetti at this point

15e240d

working storage classes

18a895a

recipe working pretty well

301206b

recipe tests pass

a11fdc3

prune old stuff

b1cc65b

big cleanup

35e0c9f

lint and fix tests

8856344

update requirements

b3a42ed

more linting

ec95d9d

added executors

79ab04f

linting and stuff

e6b32a9

testing executors

b993dab

rabernat mentioned this pull request Dec 7, 2020

regular coordination meetings pangeo-forge/roadmap#3

Closed

jhamman reviewed Dec 9, 2020

rabernat added 3 commits December 21, 2020 11:45

major simplification of recipe class

0e150cb

fix precommit again

1ec63eb

finally

a4bf88a

jhamman mentioned this pull request Jan 6, 2021

Feature Request: Hierarchical storage and processing in xarray pydata/xarray#4118

Closed

rabernat added 3 commits January 18, 2021 12:39

cleanup

dbf6b13

add rechunker to CI

c2927be

add rechunker to requirements.txt

9a1d11f

rabernat marked this pull request as ready for review January 18, 2021 18:08

rabernat added 3 commits January 18, 2021 16:55

create ABC for Recipe

1879acb

start working on docs

0fa41ee

writing more docs

4ee634a

charlesbluca mentioned this pull request Jan 20, 2021

Bump documentation #31

Closed

rabernat added 9 commits January 21, 2021 00:42

add tutorial to docs

fa20daf

refactored storage targets

8d66eb9

better target testing

86f6f92

change cannonical recipe execution order

049e692

big update

49519ef

last commit of the night

63e2297

update doc requirements

e0c97b0

use rechunker from github

382663d

fix requirements

57304e8

rabernat merged commit 71c367c into master Jan 22, 2021

This was referenced Jan 22, 2021

[bug] appending to zarr in object store fails #11

Closed

Support for COG outputs/provide functionality for conversion to COG #63

Open

rabernat mentioned this pull request Feb 2, 2021

Need for common vocabulary/visibility of work related to high-level concepts pangeo-forge/roadmap#9

Open

andersy005 deleted the new-recipe-class branch October 21, 2022 00:22
