
Commit
Add dynamic client recommendations, costs, and future areas sections (#21)

* Update Zarr version axis labels

* Update recommendations

* Lint notebooks

* Add section on costs

* Add future areas section

* Update description

* Update methods
maxrjones authored Sep 3, 2023
1 parent a33b68b commit ea5cffc
Showing 14 changed files with 484 additions and 55 deletions.
2 changes: 1 addition & 1 deletion _quarto.yml
@@ -52,7 +52,7 @@ website:
- approaches/dynamic-client/e2e-results-pixels-per-tile.ipynb
- approaches/dynamic-client/e2e-results-aws-region.ipynb
- approaches/dynamic-client/recommendations.qmd
- approaches/dynamic-client/costs.qmd
- approaches/dynamic-client/costs.ipynb
- approaches/dynamic-client/future-areas.qmd


6 changes: 3 additions & 3 deletions approaches/dynamic-client/benchmarking-methodology.qmd
@@ -23,7 +23,7 @@ CarbonPlan's [benchmark-maps](https://github.com/carbonplan/benchmark-maps) repo

The benchmarking script takes the following steps:

1. Launch chromium browser
2. Create a new page
3. Start chromium tracing
4. Navigate to web mapping application
@@ -52,11 +52,11 @@ playwright install
Once the environment is set up, you can run the benchmarks with the following command:

```bash
python main.py --dataset 1MB-chunks --zarr-version v2 --action zoom_in --zoom-level 4
carbonplan_benchmarks --dataset pyramids-v2-3857-True-128-1-0-0-f4-0-0-0-gzipL1-100 --action zoom_in --zoom-level 4
```

In addition, `main.sh` in the [benchmark-maps](https://github.com/carbonplan/benchmark-maps) repository is a script for running multiple iterations of the benchmarks across multiple datasets and Zarr versions.
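To make the first four steps of the benchmarking script concrete, here is a minimal sketch using Playwright's sync API; the application URL and trace path are placeholders, not the values used by benchmark-maps:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # 1. Launch chromium browser
    page = browser.new_page()  # 2. Create a new page
    # 3. Start chromium tracing (CDP tracing, so Chromium-only)
    browser.start_tracing(page=page, path="trace.json", screenshots=True)
    # 4. Navigate to the web mapping application (placeholder URL)
    page.goto("http://localhost:5173")
    page.wait_for_load_state("networkidle")
    browser.stop_tracing()
    browser.close()
```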

### End-to-End Benchmarks: Processing

Each benchmark yields a metadata file and trace record. The `carbonplan_benchmarks` Python package provides utilities for analyzing and visualizing these outputs.
Each benchmark yields a metadata file and trace record. The `carbonplan_benchmarks` Python package provides utilities for analyzing and visualizing these outputs. For each interaction (e.g., loading the page, zooming in), we extracted information about the requests (e.g., duration, URL, encoded data length) and frames (e.g., duration, status), and calculated the amount of time before rendering was complete. Note that these metrics capture the time to render the entire page; the time to render the first part of the data, which strongly influences the user experience, would be much shorter.
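As a rough sketch of that processing step, the snippet below pairs `ResourceSendRequest` and `ResourceFinish` events from a Chromium trace to recover per-request durations and encoded data lengths; the event names follow Chromium's `devtools.timeline` trace category, but the exact logic in `carbonplan_benchmarks` may differ:

```python
import json

import pandas as pd

with open("trace.json") as f:
    events = json.load(f)["traceEvents"]

# Index request-start events by request id
sends = {
    e["args"]["data"]["requestId"]: e
    for e in events
    if e.get("name") == "ResourceSendRequest"
}

# Join each completion event back to its start event
rows = []
for e in events:
    if e.get("name") != "ResourceFinish":
        continue
    rid = e["args"]["data"]["requestId"]
    if rid in sends:
        rows.append(
            {
                "url": sends[rid]["args"]["data"]["url"],
                "encoded_data_length": e["args"]["data"]["encodedDataLength"],
                "duration_ms": (e["ts"] - sends[rid]["ts"]) / 1e3,  # ts is in microseconds
            }
        )

requests = pd.DataFrame(rows)
print(requests.sort_values("duration_ms", ascending=False).head())
```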
71 changes: 71 additions & 0 deletions approaches/dynamic-client/cost_widgets.py
@@ -0,0 +1,71 @@
import numpy as np
import panel as pn


def calculate_level_size(*, level, pixels_per_tile, extra_dimension_length, data_dtype):
    """
    Calculate the uncompressed size for a given zoom level in GB.
    """
    # Data variable: (pixels_per_tile * 2**level)**2 spatial pixels for each
    # element along the extra (e.g., time) dimension, converted from bytes to GB
    data_size = (
        (pixels_per_tile * 2**level) ** 2
        * extra_dimension_length
        * data_dtype.itemsize
        * 1e-9
    )
    # Spatial coordinates: x and y arrays, each with pixels_per_tile * 2**level values
    spatial_coords_size = (
        (pixels_per_tile * 2**level * 2) * data_dtype.itemsize * 1e-9
    )
    # Coordinate array for the extra dimension
    extra_coords_size = extra_dimension_length * data_dtype.itemsize * 1e-9
    return data_size + spatial_coords_size + extra_coords_size


def calculate_pyramid_cost(
    *,
    number_of_zoom_levels,
    pixels_per_tile,
    extra_dimension_length,
    data_dtype,
    data_compression_ratio,
    price_per_GB,
):
    """
    Calculate the monthly storage cost for a pyramid with a given number of
    zoom levels, based on the compressed size and the price per GB.
    """
    data_dtype = np.dtype(data_dtype)
    # Sum the uncompressed sizes of all zoom levels
    pyramid_size = 0
    for level in range(number_of_zoom_levels):
        pyramid_level_size = calculate_level_size(
            level=level,
            pixels_per_tile=pixels_per_tile,
            extra_dimension_length=extra_dimension_length,
            data_dtype=data_dtype,
        )
        pyramid_size += pyramid_level_size
    # Apply the compression ratio, then convert GB to a monthly cost
    pyramid_cost = pyramid_size / data_compression_ratio * price_per_GB
    return f"Pyramid cost: ${pyramid_cost:.2f}/month"


# Define widgets for panel app
extra_dim_widget = pn.widgets.IntSlider(
name="Time dimension length", start=365, end=3650, step=365, value=730
)
pixels_widget = pn.widgets.DiscreteSlider(
name="Pixels per tile", options=[128, 256, 512], value=128
)
zoom_level_widget = pn.widgets.IntSlider(
name="Number of zoom levels", start=1, end=8, step=1, value=4
)
compression_widget = pn.widgets.IntSlider(
name="Data compression ratio", start=5, end=9, step=2, value=7
)
dtype_widget = pn.widgets.Select(
name="Data type", options=["float16", "float32", "float64"], value="float32"
)
price_widget = pn.widgets.FloatSlider(
    name="Storage pricing ($ per GB per month)",
    start=0.02,
    end=0.03,
    step=0.005,
    value=0.02,
)
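As a sketch of how these pieces might be wired together, `pn.bind` can connect the widgets to `calculate_pyramid_cost` so the cost estimate re-renders on every slider change; the layout below is illustrative, not the exact code in `costs.ipynb`, and assumes the definitions from `cost_widgets.py` above are in scope:

```python
import panel as pn

pn.extension()

cost = pn.bind(
    calculate_pyramid_cost,
    number_of_zoom_levels=zoom_level_widget,
    pixels_per_tile=pixels_widget,
    extra_dimension_length=extra_dim_widget,
    data_dtype=dtype_widget,
    data_compression_ratio=compression_widget,
    price_per_GB=price_widget,
)

app = pn.Column(
    zoom_level_widget,
    pixels_widget,
    extra_dim_widget,
    dtype_widget,
    compression_widget,
    price_widget,
    cost,  # updates whenever any widget value changes
)
app.servable()  # serve with `panel serve cost_widgets.py`
```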
348 changes: 348 additions & 0 deletions approaches/dynamic-client/costs.ipynb

Large diffs are not rendered by default.

5 changes: 0 additions & 5 deletions approaches/dynamic-client/costs.qmd

This file was deleted.

6 changes: 2 additions & 4 deletions approaches/dynamic-client/e2e-results-aws-region.ipynb
@@ -57,10 +57,8 @@
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
@@ -236,7 +234,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the action duration. Datasets with larger chunk sizes take longer to render. The AWS region does not have a noticeable impact on rendering time."
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the time to render the data. Datasets with larger chunk sizes take longer to render. The AWS region does not have a noticeable impact on rendering time."
]
},
{
4 changes: 1 addition & 3 deletions approaches/dynamic-client/e2e-results-pixels-per-tile.ipynb
@@ -57,10 +57,8 @@
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
5 changes: 2 additions & 3 deletions approaches/dynamic-client/e2e-results-projection.ipynb
@@ -56,11 +56,10 @@
}
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
6 changes: 2 additions & 4 deletions approaches/dynamic-client/e2e-results-shard-size.ipynb
@@ -57,10 +57,8 @@
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
@@ -220,7 +218,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the action duration. Datasets with larger chunk sizes take longer to render. The shard size does not have a noticeable impact on rendering time."
"Fit a multiple linear regression to the results. The results show that the chunk size strongly impacts the time to render. Datasets with larger chunk sizes take longer to render. The shard size does not have a noticeable impact on rendering time."
]
},
{
approaches/dynamic-client/e2e-results-zarr-version-shards.ipynb
@@ -56,11 +56,10 @@
}
],
"source": [
"import hvplot\n",
"import holoviews as hv\n",
"import pandas as pd\n",
"import hvplot\n",
"import hvplot.pandas # noqa\n",
"\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
@@ -714,7 +713,7 @@
],
"source": [
"model = smf.ols(\n",
" \"duration ~ actual_chunk_size * C(zoom) + C(zarr_version) * C(zoom) + actual_chunk_size * C(zarr_version)\",\n",
" \"duration ~ actual_chunk_size * C(zoom) + C(zarr_version) * C(zoom) + actual_chunk_size * C(zarr_version)\", # noqa\n",
" data=df,\n",
").fit()\n",
"model.summary()"
47 changes: 23 additions & 24 deletions approaches/dynamic-client/e2e-results-zarr-version.ipynb

Large diffs are not rendered by default.

2 changes: 0 additions & 2 deletions approaches/dynamic-client/e2e-results.ipynb
@@ -75,8 +75,6 @@
],
"source": [
"import carbonplan_benchmarks.analysis as cba\n",
"import holoviews as hv\n",
"import hvplot.pandas\n",
"import pandas as pd\n",
"\n",
"pd.options.plotting.backend = \"holoviews\""
10 changes: 9 additions & 1 deletion approaches/dynamic-client/future-areas.qmd
@@ -2,4 +2,12 @@
title: Future Areas
---

COMING SOON.
The benchmarking results and recommendations detailed in this cookbook motivate several possible avenues of further exploration:

1. Performance improvements for ndpyramid to reduce post-processing costs.
2. Integration between ndpyramid and pangeo-forge-recipes to support generating pyramids during dataset creation.
3. Extending the dynamic client approach to visualize datasets without pyramids.
4. Formalizing the representation of pyramids and spatial overviews through a Zarr Enhancement Proposal (ZEP) on multiscales.
5. Extending the dynamic client approach to visualize COGs using Kerchunk.
6. Generalizing the dynamic client approach as a plug-in for general solutions like Mapbox or deck.gl.
7. Supporting and benchmarking additional compression algorithms, data types, and bitrounding.
20 changes: 19 additions & 1 deletion approaches/dynamic-client/recommendations.qmd
@@ -2,4 +2,22 @@
title: Recommendations
---

COMING SOON.
Here we provide recommendations for producing pyramids for performant Zarr visualization on the web. These recommendations are based on the [end-to-end benchmarking results](e2e-results.ipynb) for the [dynamic client](../dynamic-client.qmd) approach. These benchmarks consider the use case of rendering data on a map, but we also discuss how time series visualization could factor into these decisions. Since we eventually aim to remove the pyramid requirement for the dynamic client approach, it is worth noting that many of these recommendations should still apply when rendering raw data on a web map; in that case, however, it would be much more important to also consider the performance implications for scientific computational workflows.

## Zarr Version

The end-to-end benchmarking results showed that [V2 and V3 data are comparable in performance](e2e-results-zarr-version.ipynb). Therefore, we recommend adopting the Zarr V3 specification if your preferred Zarr implementation includes the approved version of the [Zarr V3 spec](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html). At the time of writing, the [Zarrita Python library](https://github.com/scalableminds/zarrita) has implemented the approved Zarr V3 spec but is not recommended for production use. The [Zarr Python library](https://github.com/zarr-developers/zarr-python) is undergoing a [refactor](https://github.com/zarr-developers/zarr-python/discussions/1480) to bring the library up to date with the V3 spec.

## Number of pixels per tile (spatial chunking)

The number of pixels per tile (i.e., spatial chunking) must be a multiple of 16 and is generally 128, 256, or 512. The end-to-end benchmarks tested 128 and 256 pixels per tile and showed that the number of pixels per tile [does not impact rendering performance at a given chunk size](e2e-results-pixels-per-tile.ipynb). However, including more pixels per tile reduces the extent of other dimensions (e.g., time) that fits in a chunk of a given size. Therefore, it would be worth considering fewer pixels per tile (e.g., 128) if visualizing time series is an important use case. By contrast, if only spatial rendering is important and more detail at coarser zoom levels is desired, you may consider increasing the number of pixels per tile to 256. If conformance with "de facto" standards is particularly important, note that 256 pixels is [commonly used](https://wiki.openstreetmap.org/wiki/Zoom_levels) as the spatial width for tiles. Increasing the number of pixels per tile should not increase total storage costs, as the number of zoom levels before reaching full resolution would be correspondingly smaller.
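A back-of-the-envelope sketch of that trade-off, assuming float32 data and a roughly 1 MB uncompressed chunk target:

```python
import numpy as np

target_chunk_bytes = 1e6  # ~1 MB uncompressed chunk
itemsize = np.dtype("float32").itemsize

for pixels_per_tile in (128, 256):
    time_steps = target_chunk_bytes / (pixels_per_tile**2 * itemsize)
    print(f"{pixels_per_tile} pixels per tile -> ~{time_steps:.0f} time steps per chunk")
# 128 pixels per tile -> ~15 time steps per chunk
# 256 pixels per tile -> ~4 time steps per chunk
```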

## Chunk size (non-spatial chunking)

The chunk size was the [strongest driver](e2e-results-zarr-version.ipynb) of the total time required to render datasets. For optimal rendering performance, we recommend targeting chunk sizes under 1 MB for the uncompressed data.
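A quick sketch for checking a chunk layout against that target, assuming (time, y, x) chunks:

```python
import numpy as np


def uncompressed_chunk_mb(*, pixels_per_tile, extra_dimension_length, dtype):
    """Uncompressed size of a (time, y, x) chunk in MB."""
    nbytes = pixels_per_tile**2 * extra_dimension_length * np.dtype(dtype).itemsize
    return nbytes / 1e6


# A 128x128 spatial chunk with 15 time steps of float32 data
print(uncompressed_chunk_mb(pixels_per_tile=128, extra_dimension_length=15, dtype="float32"))
# 0.98304 -> just under the 1 MB target
```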

## Zarr V3 sharding extension

The end-to-end benchmarks showed that the time to render was [slower for sharded V3 datasets relative to V2 datasets](e2e-results-zarr-version-shards.ipynb) for zoom levels greater than 0. However, a primary benefit of the sharding extension is that it allows the same dataset to be accessed via large shards for analysis and smaller chunks for visualization. Further, a single file can store many chunks, which benefits applications that rely on file-based operations. Given these benefits, we recommend leveraging the sharding specification once it has been [reviewed and accepted](https://github.com/zarr-developers/zarr-specs/issues/254) as a [Zarr Enhancement Proposal](https://zarr.dev/zeps/active/ZEP0000.html#how-does-a-zep-become-accepted) (ZEP) and your preferred Zarr implementation includes the approved sharding extension. The voting process for this ZEP is expected to end on October 31, 2023. We found that the [shard size does not impact the time to render](e2e-results-shard-size.ipynb) and therefore recommend a follow-up study on optimal shard structures for computational workflows.

We expect that the performance difference between sharded and non-sharded datasets could be minimized by future optimizations in the loading library, such as through the concatenation of range requests for adjacent chunks at higher zoom levels.
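To illustrate that optimization, here is a sketch of merging byte ranges for chunks stored back-to-back within a shard, so several adjacent chunks can be fetched with one ranged GET; the shard-index details are simplified here:

```python
def merge_ranges(chunks):
    """Merge (offset, length) byte ranges that are contiguous in a shard."""
    merged = []
    for offset, length in sorted(chunks):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            merged[-1][1] += length  # extend the previous range
        else:
            merged.append([offset, length])
    return [tuple(r) for r in merged]


# Four adjacent 64 KB chunks collapse into a single request
print(merge_ranges([(0, 65536), (65536, 65536), (131072, 65536), (196608, 65536)]))
# [(0, 262144)]
```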
