-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Total refactor #27
Total refactor #27
Conversation
Black and flake8 both pass on my local env but are failing in this action. Not what I want to be debugging right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still catching up here so I've mostly asked questions to improve my understanding. Generally, I'm a fan of the redesign with my only concerns being how we leave space for extension to pipelines that may not use xarray or zarr.
@@ -1,6 +1,6 @@ | |||
from pkg_resources import DistributionNotFound, get_distribution | |||
|
|||
from pangeo_forge.pipelines import AbstractPipeline | |||
# from pangeo_forge.pipelines import AbstractPipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to remove this.
pangeo_forge/executors/prefect.py
Outdated
from ..recipe import DatasetRecipe | ||
|
||
|
||
class PrefectExecutor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to define a base Executor class? Neither class has an init method in this PR. Are these just wrappers or do we think we'll want to parameterize the executor at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a bit more, I think I'd like to see init and repr methods on all these classes.
pangeo_forge/executors/python.py
Outdated
from functools import partial | ||
from typing import Callable, Iterable | ||
|
||
# from ..types import Pipeline, Stage, Task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to remove this
pangeo_forge/recipe.py
Outdated
from .storage import InputCache, Target | ||
from .utils import chunked_iterable, fix_scalar_attr_encoding | ||
|
||
# logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this not working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does work. I'll put it back
pangeo_forge/recipe.py
Outdated
# Notes about dataclasses: | ||
# - https://www.python.org/dev/peps/pep-0557/#inheritance | ||
# - https://stackoverflow.com/questions/51575931/class-inheritance-in-python-3-7-dataclasses | ||
# This means that, for now, I can't get default arguments to work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what do dataclasses give us here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dataclasses reduce the amount of boilerplate we have to write / maintain. None of these classes needs init methods.
I believe we can fix the default arguments problem by tweaking the mixin order. This requires me to understand python's method resolution order 🤯.
pangeo_forge/recipe.py
Outdated
|
||
|
||
@dataclass | ||
class StandardSequentialRecipe( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's think about a more informative name here. Once concern I have is that we'll get to Xarray->Zarr focused. While I'm quite happy to support that workflow as the primary and initial implementation, I want to make sure we leave room for a Rasterio->COG workflow or Pandas->Parquet workflow.
pangeo_forge/storage.py
Outdated
|
||
|
||
@dataclass | ||
class Target: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be called a ZarrTarget
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe MapperTarget
? There is nothing zarr specific about it...
pangeo_forge/storage.py
Outdated
def _full_path(self, path): | ||
return os.path.join(self.prefix, _hash_path(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what its worth, I've found that hashing paths like this can make it difficult to debug failed workflows. Maybe you can explain a bit more why we need to hash all paths like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. My thinking was: I want there to be unique mapping between inputs and paths in the cache. Hashing achieves this. The input paths may be urls with lots of forbidden characters in them. But I'll play with some alternatives that are more readable / debuggable.
pangeo_forge/recipe.py
Outdated
# do we really want to just delete all encoding? | ||
# for v in ds.variables: | ||
# ds[v].encoding = {} | ||
|
||
# TODO: maybe do some chunking here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make these options? I think that yes, we generally want to remove netcdf encoding before writing to zarr. But there are probably cases where this isn't true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes agreed.
Thanks a lot Joe! I'll first reply to some of your questions and then update my PR in response to your comments. |
In retrospect, I feel like this approach of creating dozens of mixins and multiple inheritance is premature complexification / abstractifiction. I recently read this blog post and felt like it was speaking directly to me! 😆 Now I think that what we should do is create a basic working recipe class that implements the methods executors expect. As we define new recipes based that don't fit this mold, we should slowly refactor this class to make it more generalizable (rather than trying to generalize everything from day 1.) Working on this now. |
Couldn't agree more. I think I went through this same realization a few months ago. I think for now, the executors may be enough of an abstraction to allow us to move forward. If we find that recipes are frequently sharing elements, we can pull those out one by one. |
Just a note: in pangeo-data/rechunker#77 I am working on an update to rechunker that intersects with this. |
This was my first impression as well. I've not yet been able to make multiscales representations openable by xarray. |
@joshmoore what other tool are you thinking about to generate multiscale representations? |
@davidbrochart : really anything producing (largish) images (or generally spatial arrays?) would benefit from a multiscale representation. For other datatypes, I'd defer to other domains whether it's useful or not. |
The tests are hanging intermittently. But I think everything is working. |
The tests have become so unreliable for this PR. I think it has to do with starting the HTTP server. |
This is the beginning of a complete rewrite of pangeo forge, learning from our initial steps.
The core of the new code is a
Recipe
class, which is used like this:The idea is that we will then implement a
RecipeExecutor
class which runs these recipes. This will work because the methodsprepare
,cache_input
,store_chunk
andfinalize
are not regular methods. The are actually properties that return serializable functions.There is significant work to do here, but I feel pretty good about where this is going...