Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pygmt.grdcut: Refactor to store output in virtualfiles for grids #3115

Draft
wants to merge 36 commits into
base: datatypes/gmtimage
Choose a base branch
from

Conversation

seisman
Copy link
Member

@seisman seisman commented Mar 17, 2024

This PR tries to get rid of temporary files from pygmt.grdcut (#2730).

The GMT_GRID wrapper (PR #2398) can store output 2-D grid in virtual files, but doesn't work for output images. PR #3128 will wrap GMT_IMAGE to store output images in virtual files, but it's still WIP.

This PR makes necessary changes to make pygmt.grdcut work for both grids and images. For images, we still use temporary files but it should be refactored again after PR #3128.

pygmt/src/grdcut.py Outdated Show resolved Hide resolved
pygmt/clib/session.py Outdated Show resolved Hide resolved
pygmt/src/grdcut.py Outdated Show resolved Hide resolved
@seisman seisman changed the title WIP: Initial try to support raster images using temporary files WIP: pygmt.grdcut: Support both grids and images Mar 27, 2024
@seisman seisman force-pushed the datatypes/gmtgrid branch 3 times, most recently from d8f1160 to d0517e0 Compare April 1, 2024 07:11
@seisman seisman changed the base branch from datatypes/gmtgrid to main April 1, 2024 11:32
@seisman seisman added the enhancement Improving an existing feature label Apr 19, 2024
weiji14 added a commit that referenced this pull request Jun 18, 2024
Special case `earth_day` to be loaded as GMT_IMAGE rather than GMT_GRID. Need to use `GMT_OUT|GMT_IS_REFERENCE` in virtualfile_out to avoid segfault, xref #3115.
pygmt/clib/session.py Outdated Show resolved Hide resolved
Comment on lines 2032 to 2034
with GMTTempFile(suffix=".tif") as tmpfile:
self.call_module("write", f"{vfname} {tmpfile.name} -Ti")
return xr.load_dataarray(tmpfile.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 3-band images, we might need to use rioxarray.open_rasterio instead of xarray.load_dataarray? I've been thinking of modifying pygmt.io.load_dataarray to support both 1-band grids and 3-band images (see commit aa8a474 and fc060ea in #2235).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused. After #3128 is finished, we should be able to use self.read_virtualfile(vfname, kind=kind).contents.to_dataarray() to convert the GMT_IMAGE structure into a dataarray, instead of replying on temporary files.

In other words, we should merge changes of #3128 into this branch, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, so this is just a placeholder for now. Yes, this PR will need to wait for #3128, though I'm thinking of cherry-picking some of the clib-related changes here in #3115 to #3128, or opening a separate PR for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original idea of this PR is to make grdcut support both grids (via GMT_GRID) and images (via temporary files because GMT_IMAGE is not wrapped yet). I can continue this PR without waiting for #3128. Maybe it's easier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work too. Finish up this PR #3115 with the placeholder code, and then we'll refactor things in #3128 to remove the tempfile path. Either way, we should get both of these more-or-less ready at the same time, so that we have both of them for PyGMT v0.13.0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to finish this PR without waiting for PR #3128. Two tests are also added which can help test changes in PR #3128, too.

pygmt/datatypes/image.py Outdated Show resolved Hide resolved
pygmt/datatypes/image.py Outdated Show resolved Hide resolved
pygmt/src/grdcut.py Outdated Show resolved Hide resolved
pygmt/src/grdcut.py Outdated Show resolved Hide resolved
Comment on lines 105 to 109
case "file":
realpath = which(grid, download="a")
if isinstance(realpath, list):
realpath = realpath[0]
outkind = "image" if realpath.endswith(".tif") else "grid"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a file is given, there are two different ways to determine if the file is a grid or an image:

  1. Based on file extensions
  2. Read the file (header only if possible) via xarray or rasterio and check the data dims.

I guess option 2 is more robust and preferred.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently use option 1 because it's easy for coding.

pygmt/clib/session.py Outdated Show resolved Hide resolved
dtype=np.uint8,
),
coords={
"band": [1, 2, 3],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An off-topic question:

In [3]: import rioxarray

In [4]: rioxarray.open_rasterio("earth_day_01d_p.tif")
Out[4]:
<xarray.DataArray (band: 3, y: 180, x: 360)> Size: 194kB
[194400 values with dtype=uint8]
Coordinates:
  * band         (band) int64 24B 1 2 3
  * x            (x) float64 3kB -179.5 -178.5 -177.5 ... 177.5 178.5 179.5
  * y            (y) float64 1kB 89.5 88.5 87.5 86.5 ... -86.5 -87.5 -88.5 -89.5
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:             Area
    STATISTICS_MAXIMUM:        255
    STATISTICS_MEAN:           49.507654320988
    STATISTICS_MINIMUM:        10
    STATISTICS_STDDEV:         65.299010755209
    STATISTICS_VALID_PERCENT:  100
    _FillValue:                0
    scale_factor:              1.0
    add_offset:                0.0

The band here is [1, 2, 3], but tile_map, band is set to [0, 1, 2]:

"band": np.array(object=[0, 1, 2], dtype=np.uint8), # Red, Green, Blue

Are there any specific reasons for the differences or any conventions about band?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, the 123 numbering comes from GDAL/rasterio. E.g. if you do gdalinfo ~/.gmt/server/earth/earth_day/earth_day_01d_p.tif, you'll get:

Driver: GTiff/GeoTIFF
Files: /home/user/.gmt/server/earth/earth_day/earth_day_01d_p.tif
Size is 360, 180
...
Band 1 Block=360x7 Type=Byte, ColorInterp=Undefined
  Min=10.000 Max=255.000 
  Minimum=10.000, Maximum=255.000, Mean=49.508, StdDev=65.299
  NoData Value=0
  Metadata:
    STATISTICS_MAXIMUM=255
    STATISTICS_MEAN=49.507654320988
    STATISTICS_MINIMUM=10
    STATISTICS_STDDEV=65.299010755209
    STATISTICS_VALID_PERCENT=100
Band 2 Block=360x7 Type=Byte, ColorInterp=Undefined
  Min=10.000 Max=255.000 
  Minimum=10.000, Maximum=255.000, Mean=49.911, StdDev=62.129
  NoData Value=0
  Metadata:
    STATISTICS_MAXIMUM=255
    STATISTICS_MEAN=49.910941358025
    STATISTICS_MINIMUM=10
    STATISTICS_STDDEV=62.129078727248
    STATISTICS_VALID_PERCENT=100
Band 3 Block=360x7 Type=Byte, ColorInterp=Undefined
  Min=10.000 Max=255.000 
  Minimum=10.000, Maximum=255.000, Mean=65.605, StdDev=44.820
  NoData Value=0
  Metadata:
    STATISTICS_MAXIMUM=255
    STATISTICS_MEAN=65.605154320988
    STATISTICS_MINIMUM=10
    STATISTICS_STDDEV=44.819618148349
    STATISTICS_VALID_PERCENT=100

The band here is [1, 2, 3], but tile_map, band is set to [0, 1, 2]:

Yeah, I simply forgot about GDAL's start from 1 convention when implementing #2125 😅. I'm guessing there's a historical reason for this, and I'm not sure if this is the exact reason why GDAL counts from 1, but optical satellite bands (e.g. from Landsat) are usually numbered Band 1, Band 2, etc, so that could be why? Unsure if we should re-number things, or just leave it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow the GDAL convention and let band indexing start from 1.

)


@pytest.mark.skipif(not _HAS_RIOXARRAY, reason="rioxarray is not installed")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pytest.mark.skipif marker can be removed after #3128 is implemented.

Comment on lines +75 to +76
path = which("@earth_day_01d", download="a")
raster = rioxarray.open_rasterio(path)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines can be replaced by load_blue_marble after #2235.

@seisman seisman added this to the 0.14.0 milestone Aug 5, 2024
@seisman seisman removed this from the 0.14.0 milestone Sep 5, 2024
@seisman seisman changed the base branch from main to datatypes/gmtimage September 20, 2024 08:43
@seisman seisman changed the title pygmt.grdcut: Refactor to store output in virtualfiles for grids [Still use temporary files for images] pygmt.grdcut: Refactor to store output in virtualfiles for grids Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improving an existing feature run/benchmark Trigger the benchmark workflow in PRs run/test-gmt-dev Trigger the GMT Dev Tests workflow in PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants