-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Need for common vocabulary/visibility of work related to high-level concepts #9
Comments
Thanks for this useful feedback @ciaranevans. I agree things are a bit of a mess right now. I believe that most of the confusion comes from the fact that the recent massive refactoring of this project (pangeo-forge/pangeo-forge-recipes#27) has still not propagated throughout all of the related repos and documentation. That refactor changed many of the terms / definitions. I agree that clarifying these should be a top priority.
This entire page was written before the refactor and is now almost totally obsolete.
In analogy with Conda Forge, we really imagine that the recipe is the specific thing that applies to a specific dataset. The So would it be helpful if we defined:
? They would be related as follows: all recipes are instances of Recipe Classes. I am definitely open to different terminology that you think would be less confusing. |
p.s. "pipeline" is an internal concept--a technical detail about how recipes describe themselves. I think it should not really be part of the user-facing documentation, but it is important for developers to understand. |
Hey @rabernat
I think this is where we need to look into our terminology, I think Recipe is too wide a catch-all. If we're going with the culinary theme (and forgive me for using Wikipedia 🤣 ) then maybe we make something like In the example of @davidbrochart's PR, I'd say then that his recipe.py is a Recipe, what it contains is actually the Procedure (maybe we can call them transformers? - not very culinary) which is just a step in the Recipe.
Doesn't work for things that are 'scripts' that do many things and use one of the Recipe classes. In an OOP world, your recipe would be the 'script' part, you're then using the NetCDFtoZarrProcedure (which is generic in that it just takes an input, regardless of the dataset, and puts it in the output location you specify). You NetCDFtoZarrProcedure would be an implementation of your Procedure interface (which is currently I think.. 😅 |
Please do not draw any conclusions from @davidbrochart's recipe. It does not accurately represent our vision for how this will work. It mixes together the recipe itself and its execution. It was written by David without really discussing or consulting with the broader group. Going forward, we need to work together to define how recipes will be specified, and this discussion is useful for that. (See also pangeo-forge/pangeo-forge-recipes#71).
We don't want to be maintaining "scripts." We want to be maintaining recipes. The idea is that the instantiated recipe object contains everything needed to produce a dataset inside itself. No extra "script" should be necessary. However, some code may need to be written to instantiate the recipe. I imagine that the contents of a recipe.py file for a specific recipe would look something like this url_list = ['http://foo.com/bar1.nc', 'http://foo.com/bar2.nc']
def my_preprocessor(ds):
return ds**2
recipe = NetCDFtoZarrSequentialRecipe(
input_urls=url_list,
preprocess_chunks=my_preprocessor
) This is the code that would actually live in a recipe repe (e.g. Does that make sense to you? Regarding the terminology, have you looked in the the usage of "recipe" in conda forge? That is what we are basing things on here: |
Indeed, in conda-forge, usually everything is contained in a single YAML file, including metadata about the repo and build instructions. In their case, the recipe and that file and mostly synonymous. Even the little example above could be more declarative, but I suppose the best description of code (such as the preprocess function) really is code. |
Good points martin. Where there is no custom code involved, we could easily implement a yaml syntax and translator that would create the recipe, e.g. recipe:
class: NetCDFtoZarrSequentialRecipe
input_urls:
- 'http://foo.com/bar1.nc'
- 'http://foo.com/bar1.nc' I had imagined eventually supporting this, for people who don't want to write python. |
Yeah, that makes more sense to me in how they're used (I guess another ticket to raise to point this intention out) The statement that the I also like the above yaml syntax, if it's a straight in/out job Thanks for the inputs @rabernat @martindurant |
Really helpful to have your feedback. This is all super useful to talk through. The next step is to get this all written down in pangeo-forge/pangeo-forge-recipes#71 |
Hopefully I've not just missed the train overall and gone on a rant 😅 |
@rabernat just so I can 🤓 -up on the project some more, is there a good place to look into how those From a |
Nope! We still have to sort that part out, and it is quite a big task. However, I imagine it would look roughly like this. This code would run inside a github workflow. from recipe_file import recipe
# some parts of the recipe need to be customized by the specific bakery where it will run
# this part needs to be figured out more
bakery = determine_bakery(recipe)
recipe.target target = bakery.target
recipe.input_cache = bakery.input_cache
# convert recipe to a prefect flow
from pangeo_forge.executors import PrefectPipelineExecutor
executor = PrefectPipelineExecutor() # will need more config
plan = executor.pipelines_to_plan(recipe.to_pipelines())
# we now have a prefect flow (plan)
# next step is to submit it to prefect cloud and make sure it gets routed to the correct bakery Some of that could go inside a CLI (see pangeo-forge/pangeo-forge-recipes#43). |
Cool thanks, makes sense to me |
(Sorry for question upon question) - Do we have a test suite/expectation of tests for folks' recipes? |
Could you elaborate what you mean here? The recipe classes are tested in pangeo-forge: https://github.com/pangeo-forge/pangeo-forge/blob/master/tests/test_recipe.py Are you talking about validation of contributed recipes? |
I'd also like to note that that the recipe repo may need an additional metadata file, besides the recipe itself. This would specify things like the maintainer(s), data license, target bakery, etc. |
I am yes, I.E if someone wants to contribute a Recipe what is the expectation on that contribution in terms of testing? Do we expect it to come with unit tests? |
No, I don't think we should require users to write customized tests for their recipe. Because the recipes all follow a fixed pattern, it should be possible to write a generic validator that is run as part of the recipe contribution process. |
Interesting. I feel like we want ensure the data out is what's expected.. especially for custom recipes? Is that what your generic validator would do? For example, if I had dataset X and wanted to generate COGs (with specified parameters) from it, how do we ensure that we have a valid dataset output before we potentially burn a lot of 💰 on a recipe that isn't correct? |
It might 'work' functionally, but it might just be making junk (which is definitely a possibility) |
Yeah that's what I imagined. For example, the validator could do the following:
At this point, the recipe "works" as in does not error. But how do you verify the data are "right"? I see two options (these are not mutually exclusive):
I have a hard time imagining what 1 would actually look like, but I'm definitely open to the idea. What might a formal verification test look like for the COGs you generate? For 2, assuming this is all happening as part of a PR workflow, we could have a bot post to the PR with a link to the validation output data. Then the user could manually inspect the data. |
Well, I'd assume that someone writing a Recipe to take X and make it into Y, would have some knowledge of both and be able to provide at least a unit test clarifying their dataset does indeed look how they want. I think manual inspection puts a huge onus on either Anything manual is a bit of a smell in that sense. For example, there are libraries ( Personally, I believe a contributor should provide a unit test that confirms their recipe outputs data correctly, and doesn't just not error. If I was hosting a bakery, I'd rather not be spending lots of money hosting dud datasets. I think an expectation of provided unit-tests in the grand scheme of things is to be expected for contributions to OSS projects? |
Otherwise, Bakeries will need to manually check each recipe before they say 'yes, you can run your recipe in our bakery' |
This is a useful discussion and I really appreciate you view. I would distinguish between a general OSS contributor and a Pangeo Forge recipe contributor. Our goal here is to really democratize the production of ARCO data in the cloud. I imagine we will have contributions from data managers and individual scientists who don't really know python at all (thus the potential yaml syntax). We want to make it easy for them. Requiring writing a custom python unit test for all recipes will significantly raise the bar for contribution.
I agree that it's important to have validation for stuff like this. But I wonder if this always needs to be done at the individual recipe level. Assuming you are using a standard recipe class, and that class itself is well tested (accurately propagates data and metadata), can we not trust that it will work as expected? I see this sort of individual-recipe testing as especially important when the recipe has lots of custom processing code. I am not opposed to the idea of recipe tests. But I do wonder whether they are always needed, especially in simple cases. |
👍🏽 I'm also in favor of making the recipe tests optional and maybe providing some user guides (in form of examples) of how one could go about writing custom tests for their recipes if need be. |
I think anything that is run, should have some form of coverage. I am pretty 'adamant' about that. Standard But contributing a new Recipe class, that needs unit tests at the class implementation level, no ifs/buts. I guess this is another case of the jump from recipe to Recipe Class. A recipe that just imports a provided class will need a lot less testing. But if i'm contributing a In the culinary vocab, you should be able to prove your dataset bakes correctly, before handing it to the bakery. That should reside with the recipe, not the bakery. |
Absolutely! I would never disagree with that. Our goal is to have 100% test coverage in But some ambiguity creeps in if we allow recipes to define custom recipe classes e.g. class CiaransMagicalRecipe(NetCDFtoZarrSequentialRecipe):
...
recipe = CiaransMagicalRecipe() We should either disallow this completely or else definitely require complete test coverage. |
I get what @ciaranevans is saying, in the context of End Users submitting recipes.
This is all separate from code contributions of functions and classes that become the methods used. All of these should have unit tests that CI/CD runs on commits/PRs to the code repos. |
@rabernat with all the above discussions, I think that maybe in the next co-ordination meeting we should try and do a whiteboard session (we've used Miro here at DevSeed before, maybe we could do guest invites then export the resultant whiteboard for documentation) to outline a 'steel thread' through A example that touches bases on all our high level use-cases, that we can then use as both a way to test and work out our assumptions and to act as a 'golden' example of what's expected. |
I have a certain example in my head that I have been using. I just submitted it as a draft PR in pangeo-forge/staged-recipes#20. |
In 5aa5aca I pushed an update to the roadmap based on this discussion. Sorry for not PR-ing, but I wanted to get an update to coincide with my talk tomorrow at the ECMWF Virtual workshop on Weather and climate in the cloud. We can continue to iterate this after this next coordination meeting. |
TL;DR:
We need to ensure our vocabulary is consistent throughout projects and that all planned/current work is visible. This is to ensure that development is done towards the right issues and people on-boarding onto any
pangeo-forge
projects can easily identify the current state of work.The long bit, time to read
Hey folks, please let me know if this is better placed somewhere else. Just doing a little dump of my thoughts after going through the repos and project boards.
I've gone through the following:
pangeo-forge/pangeo-forge
,pangeo-forge/roadmap
,pangeo-forge/staged-recipes
, and https://pangeo-forge.readthedocs.io/en/latest/. Whilst on-boarding and getting my head aroundpangeo-forge
I've noticed a pretty big gap/discrepancy in how we're talking about Recipes as a whole.This gap (we'll call it that) is mainly down to what a Recipe actually is, I'll try to outline where I see the confusion:
Docs
The documentation says that a Recipe
defines how to transform data in one format / location into another format / location
- which makes complete sense 💯What confuses the picture (for me at least) currently, is going to Recipes and seeing that a Recipe is defined by
instantiating a class that inherits from pangeo_forge.recipe.BaseRecipe
(which still makes sense), but in Recipe Tutorials the documentation here is more 'how to use a Recipe', rather than write one. I think a first 'quick win' could be to rename this to 'Using a Recipe'.I should note, Recipe Execution covers the 'Using a Recipe', so maybe the contents of the former, would be better in here?
I think the most confusion for me currently comes from Contributing a Recipe which mentions Pipelines (which aren't currently mentioned on the actual Recipe pages). My understanding from the meetings I've been in and what I've seen of refactoring etc, leads me to think that the idea of a Pipeline is now not necessarily related to a Recipe, a Recipe should just relate to taking an input, transforming it in some manner, then serialising it somewhere else. I've assumed that a Recipe, once written, shouldn't need knowledge of Prefect etc for contribution to
pangeo-forge
.Whilst I appreciate that Recipes does clearly state
The Recipe API is still in flux and may change. Make sure the version of the documentation you are reading matches your installed version of pangeo_forge.
- I think this should be on both thepangeo-forge
repo and thestaged-recipes
repo, as I think folks might be implementing the wrong/an old solution?Repos
Staged Recipes looks like it could do with some TLC, I'm sure this is known, but it should probably reside in an Issue, which I've not seen yet. It might be an Issue, I just didn't notice it on the Project Boards I've looked at. It still mentions making Pipelines (what are those?) and also introduces a
metal.yaml
file too, that's new to me! (If I'm playing devils advocate of someone who's just going by documentation).Not to beat a dead 🐴 but example/pipeline.py is a great example of where my confusion comes in. No where has the mention of a Recipe or something that looks like it implements
pangeo_forge.recipe.BaseRecipe
, to me (If I was to look at contributing) this then causes a real blocker, as I don't know which Recipe is the source of truth.The BaseRecipe class is where I see the source of truth for what a Recipe currently is,
NetCDFtoZarrSequentialRecipe
is a clear example of the implementation of this and I'd assume this is what any contributed Recipe would look like. If that's the case, I think we need to make it immediately obvious that this area is in flux, even if it's a 'Don't try to contribute yet, we're just lining up stuff'.I noticed (and sorry for the name drop 😅) that @davidbrochart has started working on a GPM IMERG Recipe - This to me, is not a Recipe, it's a script/orchestration of a already available Recipe. This seems like it falls into another category of contributions/work we need.
Concluding thoughts
Firstly, a disclaimer - I'm pretty damn new to
pangeo-forge
so might just be oblivious to something that's key 🤣. This is also by no means a rant/attack on the work done so far! I'm super excited to see where this goes! ❤️To wrap up, I think that we need to:
Is what I'm doing pangeo-forge related?
If yes, it's a ticket. No matter how big/small, we need to keep track of what is happeningpangeo-forge
is trying to be, I envision the definition to be as memorable as all of our favourite songs 🎵Please do interrogate this, tell me I'm wrong etc. Looking forward to what folks think!
cc @pangeo-forge/dev-team
The text was updated successfully, but these errors were encountered: