
Commit
Add dynamic client recommendations, costs, and future areas sections (#21)

* Update Zarr version axis labels

* Update recommendations

* Lint notebooks

* Add section on costs

* Add future areas section

* Update description

* Update methods
maxrjones authored Sep 3, 2023
1 parent a33b68b commit ea5cffc
Showing 14 changed files with 484 additions and 55 deletions.
2 changes: 1 addition & 1 deletion _quarto.yml
@@ -52,7 +52,7 @@ website:
- approaches/dynamic-client/e2e-results-pixels-per-tile.ipynb
- approaches/dynamic-client/e2e-results-aws-region.ipynb
- approaches/dynamic-client/recommendations.qmd
- approaches/dynamic-client/costs.qmd
- approaches/dynamic-client/costs.ipynb
- approaches/dynamic-client/future-areas.qmd


6 changes: 3 additions & 3 deletions approaches/dynamic-client/benchmarking-methodology.qmd
@@ -23,7 +23,7 @@ CarbonPlan's [benchmark-maps](https://github.com/carbonplan/benchmark-maps) repo

The benchmarking script takes the following steps:

1. Launch chromium browser
2. Create a new page
3. Start chromium tracing
4. Navigate to web mapping application
@@ -52,11 +52,11 @@ playwright install
Once the environment is set up, you can run the benchmarks with the following command:

```bash
python main.py --dataset 1MB-chunks --zarr-version v2 --action zoom_in --zoom-level 4
carbonplan_benchmarks --dataset pyramids-v2-3857-True-128-1-0-0-f4-0-0-0-gzipL1-100 --action zoom_in --zoom-level 4
```

In addition, `main.sh` in the [benchmark-maps](https://github.com/carbonplan/benchmark-maps) repository is a script for running multiple iterations of the benchmarks across multiple datasets and Zarr versions.
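To make the first four steps of the benchmarking script concrete, here is a minimal sketch using Playwright's sync API; the application URL and trace path are placeholders, not the values used by benchmark-maps:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # 1. Launch chromium browser
    page = browser.new_page()  # 2. Create a new page
    # 3. Start chromium tracing (CDP tracing, so Chromium-only)
    browser.start_tracing(page=page, path="trace.json", screenshots=True)
    # 4. Navigate to the web mapping application (placeholder URL)
    page.goto("http://localhost:5173")
    page.wait_for_load_state("networkidle")
    browser.stop_tracing()
    browser.close()
```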

### End-to-End Benchmarks: Processing

Each benchmark yields a metadata file and trace record. The `carbonplan_benchmarks` Python package provides utilities for analyzing and visualizing these outputs.
Each benchmark yields a metadata file and trace record. The `carbonplan_benchmarks` Python package provides utilities for analyzing and visualizing these outputs. For each interaction (e.g., loading the page, zooming in), we extracted information about the requests (e.g., duration, URL, encoded data length) and frames (e.g., duration, status), and calculated the amount of time before rendering was complete. Note that these metrics capture the time to render the entire page; the time to render the first part of the data, which strongly influences the user experience, would be much shorter.
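As a rough sketch of that processing step, the snippet below pairs `ResourceSendRequest` and `ResourceFinish` events from a Chromium trace to recover per-request durations and encoded data lengths; the event names follow Chromium's `devtools.timeline` trace category, but the exact logic in `carbonplan_benchmarks` may differ:

```python
import json

import pandas as pd

with open("trace.json") as f:
    events = json.load(f)["traceEvents"]

# Index request-start events by request id
sends = {
    e["args"]["data"]["requestId"]: e
    for e in events
    if e.get("name") == "ResourceSendRequest"
}

# Join each completion event back to its start event
rows = []
for e in events:
    if e.get("name") != "ResourceFinish":
        continue
    rid = e["args"]["data"]["requestId"]
    if rid in sends:
        rows.append(
            {
                "url": sends[rid]["args"]["data"]["url"],
                "encoded_data_length": e["args"]["data"]["encodedDataLength"],
                "duration_ms": (e["ts"] - sends[rid]["ts"]) / 1e3,  # ts is in microseconds
            }
        )

requests = pd.DataFrame(rows)
print(requests.sort_values("duration_ms", ascending=False).head())
```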
71 changes: 71 additions & 0 deletions approaches/dynamic-client/cost_widgets.py
@@ -0,0 +1,71 @@
import numpy as np
import panel as pn


def calculate_level_size(*, level, pixels_per_tile, extra_dimension_length, data_dtype):
    """
    Calculate the uncompressed size for a given zoom level in GB.
    """
    # Data variable: (pixels_per_tile * 2**level)**2 spatial pixels for each
    # element along the extra (e.g., time) dimension, converted from bytes to GB
    data_size = (
        (pixels_per_tile * 2**level) ** 2
        * extra_dimension_length
        * data_dtype.itemsize
        * 1e-9
    )
    # Spatial coordinates: x and y arrays, each with pixels_per_tile * 2**level values
    spatial_coords_size = (
        (pixels_per_tile * 2**level * 2) * data_dtype.itemsize * 1e-9
    )
    # Coordinate array for the extra dimension
    extra_coords_size = extra_dimension_length * data_dtype.itemsize * 1e-9
    return data_size + spatial_coords_size + extra_coords_size


def calculate_pyramid_cost(
    *,
    number_of_zoom_levels,
    pixels_per_tile,
    extra_dimension_length,
    data_dtype,
    data_compression_ratio,
    price_per_GB,
):
    """
    Calculate the monthly storage cost for a pyramid with a given number of
    zoom levels, based on the compressed size and the price per GB.
    """
    data_dtype = np.dtype(data_dtype)
    # Sum the uncompressed sizes of all zoom levels
    pyramid_size = 0
    for level in range(number_of_zoom_levels):
        pyramid_level_size = calculate_level_size(
            level=level,
            pixels_per_tile=pixels_per_tile,
            extra_dimension_length=extra_dimension_length,
            data_dtype=data_dtype,
        )
        pyramid_size += pyramid_level_size
    # Apply the compression ratio, then convert GB to a monthly cost
    pyramid_cost = pyramid_size / data_compression_ratio * price_per_GB
    return f"Pyramid cost: ${pyramid_cost:.2f}/month"


# Define widgets for panel app
extra_dim_widget = pn.widgets.IntSlider(
name="Time dimension length", start=365, end=3650, step=365, value=730
)
pixels_widget = pn.widgets.DiscreteSlider(
name="Pixels per tile", options=[128, 256, 512], value=128
)
zoom_level_widget = pn.widgets.IntSlider(
name="Number of zoom levels", start=1, end=8, step=1, value=4
)
compression_widget = pn.widgets.IntSlider(
name="Data compression ratio", start=5, end=9, step=2, value=7
)
dtype_widget = pn.widgets.Select(
name="Data type", options=["float16", "float32", "float64"], value="float32"
)
price_widget = pn.widgets.FloatSlider(
    name="Storage pricing ($ per GB per month)",
    start=0.02,
    end=0.03,
    step=0.005,
    value=0.02,
)
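As a sketch of how these pieces might be wired together, `pn.bind` can connect the widgets to `calculate_pyramid_cost` so the cost estimate re-renders on every slider change; the layout below is illustrative, not the exact code in `costs.ipynb`, and assumes the definitions from `cost_widgets.py` above are in scope:

```python
import panel as pn

pn.extension()

cost = pn.bind(
    calculate_pyramid_cost,
    number_of_zoom_levels=zoom_level_widget,
    pixels_per_tile=pixels_widget,
    extra_dimension_length=extra_dim_widget,
    data_dtype=dtype_widget,
    data_compression_ratio=compression_widget,
    price_per_GB=price_widget,
)

app = pn.Column(
    zoom_level_widget,
    pixels_widget,
    extra_dim_widget,
    dtype_widget,
    compression_widget,
    price_widget,
    cost,  # updates whenever any widget value changes
)
app.servable()  # serve with `panel serve cost_widgets.py`
```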
348 changes: 348 additions & 0 deletions approaches/dynamic-client/costs.ipynb

Large diffs are not rendered by default.

5 changes: 0 additions & 5 deletions approaches/dynamic-client/costs.qmd

This file was deleted.

6 changes: 2 additions & 4 deletions approaches/dynamic-client/e2e-results-aws-region.ipynb
@@ -57,10 +57,8 @@
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
@@ -236,7 +234,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the action duration. Datasets with larger chunk sizes take longer to render. The AWS region does not have a noticeable impact on rendering time."
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the time to render the data. Datasets with larger chunk sizes take longer to render. The AWS region does not have a noticeable impact on rendering time."
]
},
{
4 changes: 1 addition & 3 deletions approaches/dynamic-client/e2e-results-pixels-per-tile.ipynb
@@ -57,10 +57,8 @@
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
5 changes: 2 additions & 3 deletions approaches/dynamic-client/e2e-results-projection.ipynb
@@ -56,11 +56,10 @@
}
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
6 changes: 2 additions & 4 deletions approaches/dynamic-client/e2e-results-shard-size.ipynb
@@ -57,10 +57,8 @@
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
@@ -220,7 +218,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the action duration. Datasets with larger chunk sizes take longer to render. The shard size does not have a noticeable impact on rendering time."
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the time to render. Datasets with larger chunk sizes take longer to render. The shard size does not have a noticeable impact on rendering time."
]
},
{
approaches/dynamic-client/e2e-results-zarr-version-shards.ipynb
@@ -56,11 +56,10 @@
}
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
@@ -714,7 +713,7 @@
],
"source": [
"model = smf.ols(\n",
" \"duration ~ actual_chunk_size * C(zoom) + C(zarr_version) * C(zoom) + actual_chunk_size * C(zarr_version)\",\n",
" \"duration ~ actual_chunk_size * C(zoom) + C(zarr_version) * C(zoom) + actual_chunk_size * C(zarr_version)\", # noqa\n",
" data=df,\n",
").fit()\n",
"model.summary()"
47 changes: 23 additions & 24 deletions approaches/dynamic-client/e2e-results-zarr-version.ipynb

Large diffs are not rendered by default.

2 changes: 0 additions & 2 deletions approaches/dynamic-client/e2e-results.ipynb
@@ -75,8 +75,6 @@
],
"source": [
"import carbonplan_benchmarks.analysis as cba\n",
"import holoviews as hv\n",
"import hvplot.pandas\n",
"import pandas as pd\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
10 changes: 9 additions & 1 deletion approaches/dynamic-client/future-areas.qmd
@@ -2,4 +2,12 @@
title: Future Areas
---

COMING SOON.
The benchmarking results and recommendations detailed in this cookbook motivate several possible avenues of further exploration:

1. Performance improvements for ndpyramid to reduce post-processing costs.
2. Integration between ndpyramid and pangeo-forge-recipes to support generating pyramids during dataset creation.
3. Extending the dynamic client approach to visualize datasets without pyramids.
4. Formalizing the representation of pyramids and spatial overviews through a Zarr Enhancement Proposal (ZEP) on multiscales.
5. Extending the dynamic client approach to visualize COGs using Kerchunk.
6. Generalizing the dynamic client approach as a plug-in for general solutions like Mapbox or deck.gl.
7. Supporting and benchmarking additional compression algorithms, data types, and bitrounding.
20 changes: 19 additions & 1 deletion approaches/dynamic-client/recommendations.qmd
@@ -2,4 +2,22 @@
title: Recommendations
---

COMING SOON.
Here we provide recommendations for producing pyramids for performant Zarr visualization on the web. These recommendations are based on the [end-to-end benchmarking results](e2e-results.ipynb) for the [dynamic client](../dynamic-client.qmd) approach. These benchmarks consider the use case of rendering data on a map, but we also discuss how time series visualization could factor into these decisions. Since we eventually aim to remove the pyramid requirement for the dynamic client approach, it is worth noting that many of these recommendations should still apply when rendering raw data on a web map; in that case, however, it would be much more important to also consider the performance implications for scientific computational workflows.

## Zarr Version

The end-to-end benchmarking results showed that [V2 and V3 data are comparable in performance](e2e-results-zarr-version.ipynb). Therefore, we recommend adopting the Zarr V3 specification if your preferred Zarr implementation includes the approved version of the [Zarr V3 spec](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html). At the time of writing, the [Zarrita Python library](https://github.com/scalableminds/zarrita) has implemented the approved Zarr V3 spec but is not recommended for production use. The [Zarr Python library](https://github.com/zarr-developers/zarr-python) is undergoing a [refactor](https://github.com/zarr-developers/zarr-python/discussions/1480) to bring the library up to date with the V3 spec.

## Number of pixels per tile (spatial chunking)

The number of pixels per tile (i.e., spatial chunking) must be a multiple of 16 and is generally 128, 256, or 512. The end-to-end benchmarks tested 128 and 256 pixels per tile and showed that the number of pixels per tile [does not impact rendering performance at a given chunk size](e2e-results-pixels-per-tile.ipynb). However, including more pixels per tile reduces the extent of other dimensions (e.g., time) that fits in a chunk of a given size. Therefore, it would be worth considering fewer pixels per tile (e.g., 128) if visualizing time series is an important use case. By contrast, if only spatial rendering is important and more detail at coarser zoom levels is desired, you may consider increasing the number of pixels per tile to 256. If conformance with "de facto" standards is particularly important, note that 256 pixels is [commonly used](https://wiki.openstreetmap.org/wiki/Zoom_levels) as the spatial width for tiles. Increasing the number of pixels per tile should not increase total storage costs, as the number of zoom levels before reaching full resolution would be correspondingly smaller.
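A back-of-the-envelope sketch of that trade-off, assuming float32 data and a roughly 1 MB uncompressed chunk target:

```python
import numpy as np

target_chunk_bytes = 1e6  # ~1 MB uncompressed chunk
itemsize = np.dtype("float32").itemsize

for pixels_per_tile in (128, 256):
    time_steps = target_chunk_bytes / (pixels_per_tile**2 * itemsize)
    print(f"{pixels_per_tile} pixels per tile -> ~{time_steps:.0f} time steps per chunk")
# 128 pixels per tile -> ~15 time steps per chunk
# 256 pixels per tile -> ~4 time steps per chunk
```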

## Chunk size (non-spatial chunking)

The chunk size was the [strongest driver](e2e-results-zarr-version.ipynb) of the total time required to render datasets. For optimal rendering performance, we recommend targeting chunk sizes under 1 MB for the uncompressed data.
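A quick sketch for checking a chunk layout against that target, assuming (time, y, x) chunks:

```python
import numpy as np


def uncompressed_chunk_mb(*, pixels_per_tile, extra_dimension_length, dtype):
    """Uncompressed size of a (time, y, x) chunk in MB."""
    nbytes = pixels_per_tile**2 * extra_dimension_length * np.dtype(dtype).itemsize
    return nbytes / 1e6


# A 128x128 spatial chunk with 15 time steps of float32 data
print(uncompressed_chunk_mb(pixels_per_tile=128, extra_dimension_length=15, dtype="float32"))
# 0.98304 -> just under the 1 MB target
```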

## Zarr V3 sharding extension

The end-to-end benchmarks showed that the time to render was [slower for sharded V3 datasets relative to V2 datasets](e2e-results-zarr-version-shards.ipynb) for zoom levels greater than 0. However, a primary benefit of the sharding extension is that it allows the same dataset to be accessed via large shards for analysis and smaller chunks for visualization. Further, a single file can store many chunks, which benefits applications that rely on file-based operations. Given these benefits, we recommend leveraging the sharding specification once it has been [reviewed and accepted](https://github.com/zarr-developers/zarr-specs/issues/254) as a [Zarr Enhancement Proposal](https://zarr.dev/zeps/active/ZEP0000.html#how-does-a-zep-become-accepted) (ZEP) and your preferred Zarr implementation includes the approved sharding extension. The voting process for this ZEP is expected to end on October 31, 2023. We found that the [shard size does not impact the time to render](e2e-results-shard-size.ipynb) and therefore recommend a follow-up study on optimal shard structures for computational workflows.

We expect that the performance difference between sharded and non-sharded datasets could be minimized by future optimizations in the loading library, such as through the concatenation of range requests for adjacent chunks at higher zoom levels.
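To illustrate that optimization, here is a sketch of merging byte ranges for chunks stored back-to-back within a shard, so several adjacent chunks can be fetched with one ranged GET; the shard-index details are simplified here:

```python
def merge_ranges(chunks):
    """Merge (offset, length) byte ranges that are contiguous in a shard."""
    merged = []
    for offset, length in sorted(chunks):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            merged[-1][1] += length  # extend the previous range
        else:
            merged.append([offset, length])
    return [tuple(r) for r in merged]


# Four adjacent 64 KB chunks collapse into a single request
print(merge_ranges([(0, 65536), (65536, 65536), (131072, 65536), (196608, 65536)]))
# [(0, 262144)]
```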
