I'm writing a function that uses Pooch to retrieve a dataset and apply some processing to it. Subsequent calls to the function don't re-download the data, but they do re-run the processing steps, which takes some time. Is there a way for Pooch to recognize that the processing has already been done and to retrieve the processed file instead of the unprocessed one? Here's a simplified version of the function:

import pooch
import xarray as xr


def ice_vel() -> xr.DataArray:
    """
    MEaSUREs Phase-Based Antarctica Ice Velocity Map, version 1:
    https://nsidc.org/data/nsidc-0754/versions/1#anchor-1
    Data part of https://doi.org/10.1029/2019GL083826

    Returns
    -------
    xr.DataArray
        Ice speed, computed from the VX and VY velocity components.
    """
    # EarthDataDownloader is defined elsewhere (not shown)
    path = pooch.retrieve(
        url="https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0754.001/1996.01.01/antarctic_ice_vel_phase_map_v01.nc",  # noqa
        downloader=EarthDataDownloader(),
        known_hash=None,
        progressbar=True,
    )
    grd = xr.load_dataset(path)
    # Speed is the magnitude of the (VX, VY) velocity vector
    vel = (grd.VX**2 + grd.VY**2) ** 0.5
    return vel
Hi @mdtanker, thanks for asking! This is where you'd want to make a custom processor. The processor allows you to insert custom code between the file download and the return of the path. For example, this code downloads one of our sample grids, converts the units to m/s², saves the converted grid, and returns the path to it instead:

from pathlib import Path

import pooch
import xarray as xr


def custom_processor(fname, action, pooch):
    "Load the data, calculate something, save it back."
    fname = Path(fname)
    # Rename the file to ***-processed.nc (Path.with_stem needs Python 3.9+)
    fname_processed = fname.with_stem(fname.stem + "-processed")
    # Only recalculate if making a new download or the processed file doesn't exist yet
    if action in ("download", "update") or not fname_processed.exists():
        grid = xr.load_dataarray(fname)
        # Convert from mGal to m/s²
        processed = grid * 1e-5
        # Save to disk
        processed.to_netcdf(fname_processed)
    return str(fname_processed)


# This is now the path to the processed grid (in m/s²)
path = pooch.retrieve(
    url="doi:10.5281/zenodo.5882207/earth-gravity-10arcmin.nc",
    known_hash="md5:56df20e0e67e28ebe4739a2f0357c4a6",
    progressbar=True,
    processor=custom_processor,
)
grd = xr.load_dataset(path)
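To connect this back to the ice velocity function in the question, here is a rough sketch of how the same pattern might be adapted. It is an illustration, not tested against the NSIDC server: `make_cached_processor`, `save_ice_speed`, the `-speed` suffix, and the `downloader` argument are hypothetical names introduced here, and the downloader is assumed to be the `EarthDataDownloader()` from the question.

```python
from pathlib import Path


def make_cached_processor(transform, suffix="-processed"):
    """Build a Pooch processor that runs `transform` only when needed.

    `transform(src, dst)` is a hypothetical callable that reads the raw file
    at `src` and writes the processed result to `dst`.
    """

    def processor(fname, action, pooch):
        src = Path(fname)
        # e.g. data.nc -> data-processed.nc, stored next to the raw file
        dst = src.with_name(src.stem + suffix + src.suffix)
        # Pooch passes action="download" or "update" when the raw file is new
        # or changed, and "fetch" when it was already in the local cache.
        if action in ("download", "update") or not dst.exists():
            transform(src, dst)
        return str(dst)

    return processor


def save_ice_speed(src, dst):
    """Compute the speed from VX and VY and save it as a netCDF file."""
    import xarray as xr  # imported lazily; only needed when recomputing

    grd = xr.load_dataset(src)
    vel = (grd.VX**2 + grd.VY**2) ** 0.5
    vel.to_netcdf(dst)


def ice_vel(downloader):
    """Return the ice speed, re-running the processing only when needed."""
    import pooch
    import xarray as xr

    path = pooch.retrieve(
        url="https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0754.001/1996.01.01/antarctic_ice_vel_phase_map_v01.nc",  # noqa
        downloader=downloader,  # assumed: EarthDataDownloader() from the question
        known_hash=None,
        progressbar=True,
        processor=make_cached_processor(save_ice_speed, suffix="-speed"),
    )
    # The processor returned the path of the processed file, so load that
    return xr.load_dataarray(path)
```

Splitting the caching check from the calculation means the check can be reused for other datasets by passing a different `transform`. Since the processed file sits next to the raw download in the cache, deleting it should be enough to force a recalculation on the next call.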