Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support GeoPandas Polygons and MultiPolygons #1285

Merged
merged 12 commits into from
Oct 16, 2023
Merged

Support GeoPandas Polygons and MultiPolygons #1285

merged 12 commits into from
Oct 16, 2023

Conversation

ianthomas23
Copy link
Member

This is a WIP to add support for rendering Polygons and MultiPolygons from GeoPandas GeoDataFrames. Fixes #1006.

The approach taken is to use .cx to obtain the subset of geometries within the rendered bounds and then use shapely.io.to_ragged_array to obtain contiguous arrays of coordinates and integer offsets. From here it is very similar to the existing SpatialPandas rendering code which uses numba to quickly traverse the geometries. There is no support for dask yet, nor any tests.

Obligatory pretty picture:
time_ragged_gpd

I have included three example notebooks:

  1. natural_earth.ipynb. This uses "naturalearth.land" from geodatasets which is directly read using GeoPandas and converted to SpatialPandas. Outputs are the same for both and rendering times (after the first slower render that compiles the numba code) are about the same on my dev machine (M1 mac) at about 0.26 seconds for 600x1200 pixels.
  2. nyc_buildings.ipynb. Uses the NYC buildings dataset which is used in SpatialPandas examples and has been pre-prepared for such. Download location and code to convert to GeoPandas parquet file is included at the top of the notebook. It demonstrates that the rendered outputs are the same. Time to render the whole dataset at 1000x1000 pixels is 1.6 s for GeoPandas and 0.7 s for SpatialPandas.
  3. nyc_buildings_zoom.ipynb. The same as the previous notebook with a zoom_factor to render a zoomed area centred on the middle of the dataset. For a zoom_factor of 30 GeoPandas is faster at 27 ms compared to SpatialPandas at 39 ms.

Here is a table of the render times as a function of zoom_level for my dev machine.

zoom_level GeoPandas SpatialPandas
1 1.6 s 0.7 s
3 0.45 s 0.23 ms
10 65 ms 61 ms
30 27 ms 39 ms
100 16 ms 30 ms

So for this size of dataset we render GeoPandas slower than SpatialPandas at full resolution, but it scales better with zoom factor so that it is faster at high zoom levels.

I think this is explained by the two different approaches (SpatialPandas I understand, GeoPandas I don't yet). SpatialPandas reads the whole dataset from file as ragged arrays as that is how it is stored. This is fast. To render a subset of the data the ragged arrays are kept as they are and a boolean array is used to identify which polygons are within the bounds and need to be rendered. The rendering loop iterates over all of the polygons and uses the boolean flags to not bother rendering polygons that are not needed.

For GeoPandas and the shapely to_ragged_array conversion, the spatial subset is calculated before the ragged arrays are obtained, and the ragged arrays are dynamically generated. At full resolution the to_ragged_array takes over a second for me. When it comes to rendering there is no need for any boolean flags, we only have the polygons within the bounds and we render each and every one of them.

Pinging @jorisvandenbossche and @martinfleis to see if I have taken a poor approach here and to check if my analysis is correct. I did wonder if the format of saving the GeoPandas file is important here as I just used the default to_parquet options, whereas the SpatialPandas file is highly optimised for this use case.

datashader/glyphs/polygon.py Outdated Show resolved Hide resolved
datashader/glyphs/points.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link

Cool!
I can't say I understand everything exactly, but generally the code that interacts with geopandas looks good. I think your analysis of the differences sounds correct (at least it also matches with what I wrote in the issue, and indeed using cx here ensures we first filter the geometries before converting to ragged arrays).

I did wonder if the format of saving the GeoPandas file is important here as I just used the default to_parquet options, whereas the SpatialPandas file is highly optimised for this use case.

For the timings it shouldn't matter, because once the data is read into memory, the representation in GeoPandas itself is the same independent from the format of the parquet file.
For a full workflow of reading + visualizing, it can of course matter. But right now the defaults are fine, I think (that will store the geometries as WKB). With a project like https://github.com/geoarrow/geoarrow-python we can also read parquet file directly into geoarrow format (i.e. the ragged arrays). But for the initial GeoPandas support here, standard GeoParquet is fine.

@martinfleis
Copy link

Thanks for looking into this!

I have tried timing the performance of cx versus sindex.query resulting in the same dataframe and it seems that at least with the NY buildings dataset you used, there's a space for improvements.

from spatialpandas import io
import shapely

ny = io.parquet.read_parquet("/Users/martin/Downloads/nyc_buildings.parq").to_geopandas()

x = -8230000
y = 4960000
offset = 10

%%timeit
ny.cx[x-offset:x+offset, y-offset:y+offset]

%%timeit
ny.iloc[ny.sindex.query(shapely.box(x-offset,y-offset, x+offset, y+offset))]

The results:

offset cx sindex n_hits
10 29 ms ± 905 µs 349 µs ± 41.6 µs 3
50 28.1 ms ± 372 µs 312 µs ± 9.32 µs 29
250 28.8 ms ± 561 µs 346 µs ± 2.37 µs 572
1000 29.8 ms ± 714 µs 997 µs ± 41.3 µs 7740
5000 46.3 ms ± 130 µs 11.6 ms ± 238 µs 121858
10000 85.5 ms ± 1 ms 37.4 ms ± 1.07 ms 356570
50000 161 ms ± 3.61 ms 120 ms ± 1.71 ms 1157859 (all of them)

All the way to the extent covering the whole array, sindex solution seems to outperform cx. I am not sure how generalisable this is as there were some experiments around this in geopandas (geopandas/geopandas#1949, geopandas/geopandas#2671) that did not prove this to be always a better solution but development on the GEOS side in the meantime could have changed the situation. It is surely worth exploring (also on the geopandas side, if we should not use it in cx internally after all).

@jorisvandenbossche
Copy link

Interesting! For datashader, I would say that's something we should decide on the geopandas side, and then datashader can just keep using .cx? (I am not sure whether datashader would be able to make any smarter decision on what to use compared to geopandas)

@jorisvandenbossche
Copy link

@martinfleis to be fully equivalent, you need a predicate="intersects" in the sindex.query call, and checking my previous benchmarks locally, that's actually making a significant difference for the case when you are selecting a large part of the data. At the same time, that's also a small difference that datashader might not care about, and so potentially a reason for here to use sindex.query instead of cx.

@martinfleis
Copy link

martinfleis commented Oct 6, 2023

True, there can be some minor differences as sindex just checks bounding boxes in this case. Though those are irrelevant for the usage in Datashader. And in the benchmark above, all the dfs are of equal shape, apart from 5000 where cx has one less polygons.

@ianthomas23
Copy link
Member Author

Thanks for both of your comments. I'll deal with Joris' suggestions and continue with dask-geopandas support which I suspect should be relatively straightforward.

Overall this seems really promising and the code additions are simple enough to be really low risk. So I'd be inclined to finish this off as soon as possible and merge it, with geopandas and dask-geopandas being optional dependencies. Then people can use it and we can work on improving the performance over time.

Beyond that it makes sense to also provide direct support for other geopandas geometry types (lines, points, etc) as they will be similar (and simpler) than this polygon support. Outside of Datashader there is probably work to be done in other HoloViz projects (HoloViews, GeoViews, hvPlot) but I'll need to speak to other HoloViz maintainers about this.

@jbednar
Copy link
Member

jbednar commented Oct 6, 2023

This all sounds great! Thanks, everyone.

@codecov
Copy link

codecov bot commented Oct 9, 2023

Codecov Report

Merging #1285 (0bddc97) into main (1c5df74) will increase coverage by 0.03%.
Report is 4 commits behind head on main.
The diff coverage is 88.39%.

@@            Coverage Diff             @@
##             main    #1285      +/-   ##
==========================================
+ Coverage   85.74%   85.77%   +0.03%     
==========================================
  Files          52       52              
  Lines       10870    10974     +104     
==========================================
+ Hits         9320     9413      +93     
- Misses       1550     1561      +11     
Files Coverage Δ
datashader/glyphs/polygon.py 96.15% <96.55%> (+1.34%) ⬆️
datashader/utils.py 76.72% <66.66%> (+0.08%) ⬆️
datashader/glyphs/points.py 87.68% <82.35%> (-0.62%) ⬇️
datashader/core.py 87.74% <80.64%> (-0.52%) ⬇️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ianthomas23
Copy link
Member Author

This is all working now, giving exactly the same results as SpatialPandas using either GeoPandas or Dask-GeoPandas. I have moved the imports from the top of polygon.py to where it is used (thanks to @maximlt and @hoxbro for this recommendation).

I have kept the use of .cx for the spatial filtering rather than sindex.query but I am happy to be led by GeoPandas maintainers in this going forward.

There is no change to the project install dependencies, but I have added a new set of optional dependencies so that users will be able to pip install datashader[geopandas] to get both geopandas and dask-geopandas installed at the same time. I have explicitly pinned shapely >= 2.0.0 so that .to_ragged_array is available and also added a runtime check just before we call it, in case an older version is installed.

I haven't yet added a new page to the docs about this. I was thinking that the best approach would be to add the Point and Line support and then write a new docs page about them all together.

There will indeed need to be some changes in HoloViews to support GeoPandas directly, as at the moment any attempt to pass a GeoPandas.GeoDataFrame from HoloViews to Datashader will force a conversion to SpatialPandas. Probably the best timing for this change is when the Point and Line support has been added to Datashader.

@jorisvandenbossche
Copy link

I have kept the use of .cx for the spatial filtering rather than sindex.query but I am happy to be led by GeoPandas maintainers in this going forward.

The upside of .cx is that it works for both geopandas and dask-geopandas at the same time. But if you want the fastest possible option for geopandas.GeoDataFrame, the .sindex.query approach will typically be a bit faster.

@ianthomas23
Copy link
Member Author

I have switched to using .sindex.query for geopandas and kept .cx for dask-geopandas. It is more complicated, partly because you cannot pass None for unspecified bounds but it seems that -np.inf and +np.inf are OK.

@ianthomas23 ianthomas23 added this to the v0.15.3 milestone Oct 13, 2023
@ianthomas23
Copy link
Member Author

Merging as is. Support for lines and points to follow shortly.

@ianthomas23 ianthomas23 merged commit 9e20700 into main Oct 16, 2023
16 checks passed
@ianthomas23 ianthomas23 deleted the geopandas_2023 branch October 16, 2023 08:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support GeoPandas GeoDataFrames
4 participants