Total refactor #27

Merged
merged 34 commits into from
Jan 22, 2021
36faf3e
wip: new recipe syntax
rabernat Nov 22, 2020
7ee78f2
messy wip
rabernat Nov 29, 2020
f980e8f
made target fixture
rabernat Nov 29, 2020
0f31141
made target fixture
rabernat Nov 29, 2020
15e240d
spaghetti at this point
rabernat Nov 30, 2020
18a895a
working storage classes
rabernat Dec 1, 2020
301206b
recipe working pretty well
rabernat Dec 1, 2020
a11fdc3
recipe tests pass
rabernat Dec 2, 2020
b1cc65b
prune old stuff
rabernat Dec 2, 2020
35e0c9f
big cleanup
rabernat Dec 2, 2020
8856344
lint and fix tests
rabernat Dec 2, 2020
b3a42ed
update requirements
rabernat Dec 2, 2020
ec95d9d
more linting
rabernat Dec 2, 2020
79ab04f
added executors
rabernat Dec 2, 2020
e6b32a9
linting and stuff
rabernat Dec 4, 2020
b993dab
testing executors
rabernat Dec 4, 2020
0e150cb
major simplification of recipe class
rabernat Dec 21, 2020
1ec63eb
fix precommit again
rabernat Dec 21, 2020
a4bf88a
finally
rabernat Dec 22, 2020
dbf6b13
cleanup
rabernat Jan 18, 2021
c2927be
add rechunker to CI
rabernat Jan 18, 2021
9a1d11f
add rechunker to requirements.txt
rabernat Jan 18, 2021
1879acb
create ABC for Recipe
rabernat Jan 18, 2021
0fa41ee
start working on docs
rabernat Jan 19, 2021
4ee634a
writing more docs
rabernat Jan 19, 2021
fa20daf
add tutorial to docs
rabernat Jan 21, 2021
8d66eb9
refactored storage targets
rabernat Jan 21, 2021
86f6f92
better target testing
rabernat Jan 21, 2021
049e692
change cannonical recipe execution order
rabernat Jan 21, 2021
49519ef
big update
rabernat Jan 21, 2021
63e2297
last commit of the night
rabernat Jan 22, 2021
e0c97b0
update doc requirements
rabernat Jan 22, 2021
382663d
use rechunker from github
rabernat Jan 22, 2021
57304e8
fix requirements
rabernat Jan 22, 2021
2 changes: 1 addition & 1 deletion .github/workflows/main.yaml
@@ -29,7 +29,7 @@ jobs:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{ hashFiles('ci/py${{ matrix.python-version }}.yml') }}
- name: setup miniconda
uses: goanpeca/setup-miniconda@v1
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: pangeo-forge
environment-file: ci/py${{ matrix.python-version }}.yml
6 changes: 1 addition & 5 deletions .pre-commit-config.yaml
@@ -26,16 +26,12 @@ repos:
rev: v2.2.0
hooks:
- id: seed-isort-config

- repo: https://github.com/pre-commit/mirrors-isort
rev: v5.2.0
hooks:
- id: isort

- repo: https://github.com/deathbeds/prenotebook
rev: f5bdb72a400f1a56fe88109936c83aa12cc349fa
hooks:
- id: prenotebook

- repo: https://github.com/myint/rstcheck
rev: master
hooks:
5 changes: 4 additions & 1 deletion ci/py3.7.yml
@@ -25,4 +25,7 @@ dependencies:
- scipy
- setuptools
- toolz
- zarr
- xarray>=0.16.2
- zarr>=2.6.0
- pip:
- git+https://github.com/rabernat/rechunker.git@refactor-executors
5 changes: 4 additions & 1 deletion ci/py3.8.yml
@@ -25,4 +25,7 @@ dependencies:
- scipy
- setuptools
- toolz
- zarr
- xarray>=0.16.2
- zarr>=2.6.0
- pip:
- git+https://github.com/rabernat/rechunker.git@refactor-executors
27 changes: 27 additions & 0 deletions docs/_static/custom.css
@@ -0,0 +1,27 @@
/* Put your custom CSS here */
@import url('http://fonts.cdnfonts.com/css/panton-black-caps');

h1 {
font-family: "Panton Black Caps", sans-serif;
color: #003B71 !important;
}

h2 {
font-family: "Panton Light Caps", sans-serif;
color: #003B71 !important;
}

a {
color: #5eb130 !important;
}



/* Fixing up some pygments and code-styling CSS for accessibility */
code { font-size: 100%; color: #e50051; }
pre { font-family: monospace; }

/* .highlight { font-size: 125%; } */
.highlight .c1 { color: #e50051; }
.highlight .si { color: #e50051; }
.highlight .nn { color: #e50051; }
Binary file added docs/_static/pangeo-forge-logo-blue.png
50 changes: 50 additions & 0 deletions docs/api.md
@@ -0,0 +1,50 @@
# API Reference


## Storage

```{eval-rst}
.. autoclass:: pangeo_forge.storage.FSSpecTarget
    :members:
```

```{eval-rst}
.. autoclass:: pangeo_forge.storage.FlatFSSpecTarget
    :members:
    :show-inheritance:
```

```{eval-rst}
.. autoclass:: pangeo_forge.storage.CacheFSSpecTarget
    :members:
    :show-inheritance:
```

## Recipes

```{eval-rst}
.. autoclass:: pangeo_forge.recipe.BaseRecipe
    :members:
```

```{eval-rst}
.. autoclass:: pangeo_forge.recipe.NetCDFtoZarrSequentialRecipe
    :show-inheritance:
```

## Executors

```{eval-rst}
.. autoclass:: pangeo_forge.executors.PythonPipelineExecutor
    :members:
```

```{eval-rst}
.. autoclass:: pangeo_forge.executors.DaskPipelineExecutor
    :members:
```

```{eval-rst}
.. autoclass:: pangeo_forge.executors.PrefectPipelineExecutor
    :members:
```
1 change: 1 addition & 0 deletions docs/bakeries.md
@@ -0,0 +1 @@
# Bakeries
23 changes: 17 additions & 6 deletions docs/conf.py
@@ -1,25 +1,36 @@
import sphinx_pangeo_theme # noqa
# import sphinx_pangeo_theme # noqa
import sphinx_book_theme # noqa

# -- Project information -----------------------------------------------------

project = "pangeo-forge"
project = "Pangeo Forge"
copyright = "2020, Pangeo Community"
author = "Pangeo Community"


# -- General configuration ---------------------------------------------------

extensions = [
"myst_parser",
"myst_nb",
"sphinx.ext.autodoc",
# "numpydoc",
"sphinx_autodoc_typehints",
"sphinx_copybutton",
]

templates_path = ["_templates"]
exclude_patterns = []
exclude_patterns = ["_build", "**.ipynb_checkpoints"]
master_doc = "index"

# we always have to manually run the notebooks because they are slow / expensive
jupyter_execute_notebooks = "off"

# -- Options for HTML output -------------------------------------------------

html_theme = "pangeo"
html_theme = "sphinx_book_theme"
html_logo = "_static/pangeo-forge-logo-blue.png"
html_static_path = ["_static"]

myst_heading_anchors = 2
html_css_files = [
"custom.css",
]
64 changes: 0 additions & 64 deletions docs/design.md

This file was deleted.

82 changes: 82 additions & 0 deletions docs/execution.md
@@ -0,0 +1,82 @@
# Recipe Execution

There are many different types of Pangeo Forge recipes.
However, **all recipes are executed the same way**!
This is a key part of the Pangeo Forge design.

Once you have created a recipe object (see {doc}`recipes`) you have two
options for executing it. In the subsequent code, we will assume that a
recipe has already been initialized in the variable `recipe`.

## Manual Execution

A recipe can be executed manually, step by step, in serial, from a notebook
or interactive interpreter. The ability to manually step through a recipe
is very important for developing and debugging complex recipes.
There are four stages of recipe execution.

### Stage 1: Cache Inputs

Recipes may define files that have to be cached locally before the subsequent
steps can proceed. The common use case here is for files that have to be
extracted from a slow FTP server. Here is how to cache the inputs.

```{code-block} python
for input_name in recipe.iter_inputs():
    recipe.cache_input(input_name)
```

If the recipe doesn't do input caching, nothing will happen here.

### Stage 2: Prepare Target

Once the inputs have been cached, we can get the target ready.
Preparing the target for writing is done as follows:

```{code-block} python
recipe.prepare_target()
```

For example, for Zarr targets, this sets up the Zarr group with the necessary
arrays and metadata.

### Stage 3: Store Chunks

This is the step where the bulk of the work happens.

```{code-block} python
for chunk in recipe.iter_chunks():
    recipe.store_chunk(chunk)
```

### Stage 4: Finalize Target

If there is any cleanup or consolidation to be done, it happens here.

```{code-block} python
recipe.finalize_target()
```

For example, consolidating Zarr metadata happens in the finalize step.
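
The four stages can be sketched end to end with a stand-in recipe.
`ToyRecipe` and `run_manually` below are hypothetical and are not part of Pangeo Forge.
The class only mimics the method names used on this page and records each call in a list, so the execution order is visible.

```python
# Hypothetical stand-in for a recipe; NOT the pangeo_forge implementation.
class ToyRecipe:
    def __init__(self, inputs, chunks):
        self._inputs = list(inputs)
        self._chunks = list(chunks)
        self.log = []  # records (method name, argument) in call order

    def iter_inputs(self):
        yield from self._inputs

    def cache_input(self, input_name):
        self.log.append(("cache_input", input_name))

    def prepare_target(self):
        self.log.append(("prepare_target", None))

    def iter_chunks(self):
        yield from self._chunks

    def store_chunk(self, chunk):
        self.log.append(("store_chunk", chunk))

    def finalize_target(self):
        self.log.append(("finalize_target", None))


def run_manually(recipe):
    """Run the four stages serially, in the order described on this page."""
    for input_name in recipe.iter_inputs():  # Stage 1: cache inputs
        recipe.cache_input(input_name)
    recipe.prepare_target()                  # Stage 2: prepare target
    for chunk in recipe.iter_chunks():       # Stage 3: store chunks
        recipe.store_chunk(chunk)
    recipe.finalize_target()                 # Stage 4: finalize target


recipe = ToyRecipe(inputs=["a.nc", "b.nc"], chunks=[0, 1])
run_manually(recipe)
stage_order = [name for name, _ in recipe.log]
```

Because every recipe is meant to expose the same stages, a helper like `run_manually` should work unchanged for any recipe type, which is the point of the uniform execution model.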

## Execution by Executors

Very large recipes cannot feasibly be executed manually, one step at a time, in serial.
To support distributed parallel execution, Pangeo Forge borrows the
[Executors framework from Rechunker](https://rechunker.readthedocs.io/en/latest/executors.html).

There are currently three executors implemented.
- {class}`pangeo_forge.executors.PythonPipelineExecutor`: a reference executor
using simple python
- {class}`pangeo_forge.executors.DaskPipelineExecutor`: distributed executor using Dask
- {class}`pangeo_forge.executors.PrefectPipelineExecutor`: distributed executor using Prefect

To use an executor, the recipe must first be transformed into a `Pipeline` object.
The full process looks like this:

```{code-block} python
pipeline = recipe.to_pipelines()
executor = PrefectPipelineExecutor()
plan = executor.pipelines_to_plan(pipeline)
executor.execute_plan(plan) # actually runs the recipe
```
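
To make the pipeline idea more concrete, here is a minimal pure-Python sketch.
The `Stage` tuple and the `run_serially` function are assumptions made for illustration: they loosely mirror the idea of an ordered list of stages, each optionally mapped over a set of arguments, and are not the Rechunker or Pangeo Forge classes.

```python
from typing import Any, Callable, Iterable, NamedTuple, Optional


class Stage(NamedTuple):
    # Hypothetical: a function plus optional arguments to map it over.
    func: Callable[..., Any]
    map_args: Optional[Iterable[Any]] = None


def run_serially(stages):
    """Reference-style executor: run each stage in order, one call at a time."""
    for stage in stages:
        if stage.map_args is None:
            stage.func()
        else:
            for arg in stage.map_args:
                stage.func(arg)


calls = []
stages = [
    Stage(lambda name: calls.append(("cache_input", name)), ["a.nc", "b.nc"]),
    Stage(lambda: calls.append(("prepare_target",))),
    Stage(lambda chunk: calls.append(("store_chunk", chunk)), [0, 1]),
    Stage(lambda: calls.append(("finalize_target",))),
]
run_serially(stages)
```

In a layout like this, the stages act as synchronization points, so a distributed executor could in principle run the mapped calls within one stage in parallel while still keeping the stages themselves in order.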
40 changes: 38 additions & 2 deletions docs/index.md
@@ -1,9 +1,45 @@
# pangeo-forge
# Pangeo Forge

Pangeo Forge is an open source tool for data Extraction, Transformation, and Loading (ETL).
The goal of Pangeo Forge is to make it easy to extract data from traditional data
repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) format.

Pangeo Forge is inspired by [Conda Forge](https://conda-forge.org/), a
community-led collection of recipes for building conda packages.
We hope that Pangeo Forge can play the same role for datasets.

## Recipes

The most important concept in Pangeo Forge is a ``recipe``.
A recipe defines how to transform data in one format / location into another format / location.
The primary way people contribute to Pangeo Forge is by writing / maintaining recipes.
Recipes developed by the community are stored in GitHub repositories.
For information about how recipes work see {doc}`recipes`.
The {doc}`tutorials/index` provide deep dives into how to develop and debug Pangeo Forge recipes.

## Recipe Execution

There are several different ways to execute recipes.
See {doc}`execution` for details.

## Bakeries

Bakeries are cloud-based environments for executing recipes.
Each Bakery is coupled to one or more cloud storage buckets where the ARCO data is stored.
Bakeries use [Prefect](https://prefect.io/) to orchestrate the various steps
of the recipe.
For more information, see {doc}`bakeries`.


```{toctree}
:maxdepth: 2
:caption: Contents

recipes
tutorials/index
execution
bakeries
contribute
design
api

```