Support GeoPandas Polygons and MultiPolygons #1285
Conversation
Cool!
For the timings it shouldn't matter, because once the data is read into memory, the representation in GeoPandas itself is the same regardless of the format of the Parquet file.
Thanks for looking into this! I have tried timing the performance:

```python
from spatialpandas import io
import shapely

ny = io.parquet.read_parquet("/Users/martin/Downloads/nyc_buildings.parq").to_geopandas()

x = -8230000
y = 4960000
offset = 10
```

```python
%%timeit
ny.cx[x-offset:x+offset, y-offset:y+offset]
```

```python
%%timeit
ny.iloc[ny.sindex.query(shapely.box(x-offset, y-offset, x+offset, y+offset))]
```

The results:

All the way to the extent covering the whole array,
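The two subsetting approaches above can also be compared on a small synthetic dataset. This is an illustrative sketch only: the toy box geometries and query extent are made up, not the NYC buildings data.

```python
import geopandas as gpd
import shapely
from shapely.geometry import box

# Toy stand-in for a real dataset: ten small square polygons along the x axis.
gdf = gpd.GeoDataFrame(geometry=[box(i, 0, i + 0.8, 0.8) for i in range(10)])

xmin, ymin, xmax, ymax = 1.5, -1.0, 5.5, 2.0

# Approach 1: coordinate-based slicing with .cx
subset_cx = gdf.cx[xmin:xmax, ymin:ymax]

# Approach 2: spatial-index query against a shapely box
subset_sindex = gdf.iloc[gdf.sindex.query(shapely.box(xmin, ymin, xmax, ymax))]

print(sorted(subset_cx.index), sorted(subset_sindex.index))  # both [1, 2, 3, 4, 5]
```

The two approaches select the same rows here because the geometries are axis-aligned boxes, so the bounding-box test used by the spatial index happens to be exact.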
Interesting! For datashader, I would say that's something we should decide on the geopandas side, and then datashader can just keep using
@martinfleis to be fully equivalent, you need a
True, there can be some minor differences, since sindex just checks bounding boxes in this case, though those are irrelevant for the usage in Datashader. And in the benchmark above, all the dfs are of equal shape, apart from 5000 where
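The bounding-box point can be demonstrated with a geometry whose bounding box intersects a query box while the geometry itself does not. A hypothetical sketch (toy geometry, made up for illustration) using the `predicate` argument of `sindex.query`:

```python
import geopandas as gpd
import shapely
from shapely.geometry import Polygon

# A thin triangle hugging the diagonal: its bounding box is the unit square,
# but the geometry stays well away from the square's top-left corner.
tri = Polygon([(0, 0), (1, 1), (1, 0.9)])
gdf = gpd.GeoDataFrame(geometry=[tri])

# Query box in the top-left corner: inside the triangle's bbox only.
query = shapely.box(0.0, 0.8, 0.1, 1.0)

candidates = gdf.sindex.query(query)                     # bounding-box test only
exact = gdf.sindex.query(query, predicate="intersects")  # full geometry test

print(len(candidates), len(exact))  # 1 0
```

Without a predicate the index returns the bbox candidate; with `predicate="intersects"` the exact geometry test rejects it.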
Thanks for both of your comments. I'll deal with Joris' suggestions and continue with

Overall this seems really promising and the code additions are simple enough to be really low risk, so I'd be inclined to finish this off as soon as possible and merge it, with

Beyond that it makes sense to also provide direct support for other
This all sounds great! Thanks, everyone.
Codecov Report
```diff
@@            Coverage Diff             @@
##             main    #1285      +/-   ##
==========================================
+ Coverage   85.74%   85.77%   +0.03%
==========================================
  Files          52       52
  Lines       10870    10974     +104
==========================================
+ Hits         9320     9413      +93
- Misses       1550     1561      +11
```

... and 1 file with indirect coverage changes
This is all working now, giving exactly the same results as SpatialPandas using either GeoPandas or Dask-GeoPandas.

I have moved the imports from the top of

I have kept the use of

There is no change to the project install dependencies, but I have added a new set of optional dependencies so that users will be able to

I haven't yet added a new page to the docs about this. I was thinking that the best approach would be to add the Point and Line support first and then write a new docs page covering them all together.

There will indeed need to be some changes in HoloViews to support GeoPandas directly, as at the moment any attempt to pass a `geopandas.GeoDataFrame` from HoloViews to Datashader will force a conversion to SpatialPandas. Probably the best timing for this change is when the Point and Line support has been added to Datashader.
The upside of
I have switched to using
Merging as is. Support for lines and points to follow shortly. |
This is a WIP to add support for rendering Polygons and MultiPolygons from GeoPandas GeoDataFrames. Fixes #1006.
The approach taken is to use `.cx` to obtain the subset of geometries within the rendered bounds, and then use `shapely.io.to_ragged_array` to obtain contiguous arrays of coordinates and integer offsets. From here it is very similar to the existing SpatialPandas rendering code, which uses numba to quickly traverse the geometries. There is no support for dask yet, nor any tests.

Obligatory pretty picture:
I have included three example notebooks:

- `natural_earth.ipynb`. This uses "naturalearth.land" from geodatasets, which is read directly using GeoPandas and converted to SpatialPandas. Outputs are the same for both, and rendering times (after the first slower render that compiles the numba code) are about the same on my dev machine (M1 Mac) at about 0.26 seconds for 600x1200 pixels.
- `nyc_buildings.ipynb`. Uses the NYC buildings dataset, which is used in SpatialPandas examples and has been pre-prepared for such. The download location and code to convert to a GeoPandas parquet file are included at the top of the notebook. It demonstrates that the rendered outputs are the same. Time to render the whole dataset at 1000x1000 pixels is 1.6 s for GeoPandas and 0.7 s for SpatialPandas.
- `nyc_buildings_zoom.ipynb`. The same as the previous notebook, with a `zoom_factor` to render a zoomed area centred on the middle of the dataset. For a `zoom_factor` of 30, GeoPandas is faster at 27 ms compared to SpatialPandas at 39 ms.

Here is a table of the render times as a function of `zoom_factor` for my dev machine.

So for this size of dataset we render GeoPandas slower than SpatialPandas at full resolution, but it scales better with zoom factor, so that it is faster at high zoom levels.
I think this is explained by the two different approaches (SpatialPandas I understand, GeoPandas I don't yet). SpatialPandas reads the whole dataset from file as ragged arrays, as that is how it is stored; this is fast. To render a subset of the data, the ragged arrays are kept as they are and a boolean array is used to identify which polygons are within the bounds and need to be rendered. The rendering loop iterates over all of the polygons and uses the boolean flags to skip the polygons that are not needed.
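The flag-based strategy described above can be sketched in plain Python/NumPy. This is a toy illustration with made-up bounding boxes, not the real numba-compiled render loop:

```python
import numpy as np

# Toy per-polygon bounding boxes: (xmin, ymin, xmax, ymax).
bounds = np.array([
    [0.0, 0.0, 1.0, 1.0],
    [5.0, 5.0, 6.0, 6.0],   # outside the view below
    [2.0, 2.0, 3.0, 3.0],
])

x0, y0, x1, y1 = 0.0, 0.0, 4.0, 4.0  # rendered extent

# One vectorised pass computes the flags; the ragged coordinate
# arrays themselves are left untouched.
inside = (
    (bounds[:, 0] <= x1) & (bounds[:, 2] >= x0)
    & (bounds[:, 1] <= y1) & (bounds[:, 3] >= y0)
)

rendered = 0
for i in range(len(bounds)):
    if not inside[i]:
        continue  # cheap skip: no arrays are rebuilt for the subset
    rendered += 1  # a real loop would rasterise polygon i here

print(rendered)  # 2
```

The key property is that subsetting costs one boolean test per polygon, rather than a rebuild of the coordinate and offset arrays.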
For GeoPandas and the shapely `to_ragged_array` conversion, the spatial subset is calculated before the ragged arrays are obtained, and the ragged arrays are dynamically generated. At full resolution the `to_ragged_array` call takes over a second for me. When it comes to rendering there is no need for any boolean flags: we only have the polygons within the bounds, and we render each and every one of them.

Pinging @jorisvandenbossche and @martinfleis to see if I have taken a poor approach here and to check if my analysis is correct. I did wonder if the format in which the GeoPandas file is saved is important here, as I just used the default `to_parquet` options, whereas the SpatialPandas file is highly optimised for this use case.
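For reference, a minimal sketch of what the ragged-array conversion produces for polygon data, using toy geometries (not the NYC dataset) and the top-level `shapely.to_ragged_array` alias:

```python
import shapely
from shapely.geometry import MultiPolygon, Polygon

geoms = [
    Polygon([(0, 0), (1, 0), (1, 1), (0, 0)]),
    MultiPolygon([
        Polygon([(2, 0), (3, 0), (3, 1), (2, 0)]),
        Polygon([(4, 0), (5, 0), (5, 1), (4, 0)]),
    ]),
]

# Mixed Polygon/MultiPolygon input is promoted to the multi type.
geom_type, coords, offsets = shapely.to_ragged_array(geoms)

print(coords.shape)  # (12, 2): one contiguous array of all ring coordinates
print(len(offsets))  # 3 nested offset arrays: rings -> polygons -> multipolygons
```

The single contiguous `coords` array plus integer `offsets` is exactly the shape of data the numba rendering loops can traverse without touching Python objects.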