Allow storing points/coordinates as struct arrays in addition to interleaved layout #26
Conversation
Thanks for making the PR! Given #16 (comment) I wasn't expecting the PR to switch to mandating storing data as structs. I'll have to keep thinking about getting rid of the fixed size list entirely. I'd love to hear more about the FlatGeobuf community's experience with their layout, which stores …
Well, might be a little late for that. It's already in cuspatial and GDAL at least.
In my mind, I had thought most implementations' low-level memory storage was interleaved, but maybe I'm wrong, and my mind is colored by higher-level constructs like WKB and GeoJSON. As one data point, deck.gl is entirely built upon interleaved …
format.md (Outdated)

```diff
 [variable sized list](https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout)
 arrays. In practice, this means we have additional arrays storing the offsets
 that denote where a new geometry, a new geometry part, or a new polygon ring
 starts.

-**Point**: `FixedSizeList<double>[n_dim]`
+**Point**: `Struct<x: double, y: double, [z: double, [m: double]]>`
```
In this case, is it necessary for `m` to be a `double` at all? For example, if I have a Line representing the path of a ship on the ocean, I might want `m` to be a `Date64`.
Or more ambitiously, but still grounded in the actual example I have in mind, I might want `m` to be `struct<time: Date64, windspeed: float64, winddirection: dictionary, notes: string>`. Or I guess there's `t` as well. I might also just be asking for arbitrary keys in the array where only `x`, `y`, and `z` (or not `z`?) are strictly typed.
Sorry it took me a while to respond here, but this helped me a lot with the wording around why multiple options were allowed (and why more may be considered in the future). A more realistic short-term perspective on this is that if you have an array that looks like `struct<lon: float64, lat: float64, time: Date64, windspeed: float64, winddirection: dictionary, notes: string>`, you can zero-copy cast that to a `geoarrow.point<struct<lon: float64, lat: float64>>`, from which it would be zero copy to create a linestring and pass it through geometry functions.
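To make that concrete, here is a minimal pyarrow sketch of the idea (my own illustration, not part of the PR; the column names and values are hypothetical, and the `geoarrow.point` extension metadata is omitted): selecting the lon/lat children of a wider struct array only re-references the existing child buffers.

```python
import pyarrow as pa

# Hypothetical "ship track" readings: lon/lat plus per-vertex attributes
readings = pa.StructArray.from_arrays(
    [
        pa.array([4.35, 4.36, 4.37]),        # lon
        pa.array([52.01, 52.02, 52.03]),     # lat
        pa.array([0, 60, 120], pa.int64()),  # time (stand-in for Date64)
        pa.array([10.0, 12.5, 11.1]),        # windspeed
    ],
    names=["lon", "lat", "time", "windspeed"],
)

# Building a point-like struct from the lon/lat children is zero copy:
# the new StructArray references the same child buffers.
point_like = pa.StructArray.from_arrays(
    [readings.field("lon"), readings.field("lat")],
    names=["lon", "lat"],
)
print(point_like.type)  # struct<lon: double, lat: double>
```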
I said a bunch in #16 on both sides. But FWIW my gut feeling is that @paleolimbot is right and this may be opening a can of worms worth leaving closed, but I also wonder whether a benefit here is that it means it's no longer necessary to mandate any type on …
It's pointing to a comment of mine, so .. but indeed, I would personally argue for making this inner point representation more flexible and allowing both, potentially at the cost of making the spec more complex and harder to implement/support. But there are good arguments for using one or the other depending on the use case. And I want to be clear: the reason we initially decided to go with interleaved was not driven by WKB (as far as I remember ..), but because most of the related implementations we were aware of at the time actually use interleaved for their in-memory data (cuspatial, datashader, deck.gl, GeoMesa, also GEOS).
💯 I do appreciate that I can try to explain the nested list approach as being "logically" exactly the same as GeoJSON coordinates while having a dual API that's really efficient to operate on
Expanding on this, it seems that, generally speaking, many of the arguments for separate x/y fields are related to storage and querying (it compresses better, it integrates better with Parquet and its indexing system, efficient bounding box queries, ...), while libraries often use interleaved x/y for their in-memory data for processing and visualization purposes. Both of those broad use cases are important to consider for GeoArrow.

As I mentioned above, it certainly comes at the cost of making the specification more complex and harder to implement and support, but one option is to allow both the interleaved (with FixedSizeList as the inner point level) and the separate x/y fields (with Struct as the inner point level), so that the user can choose the representation that gives the best trade-off for their use case (or the library can convert any incoming data in geoarrow format to the representation it uses internally).

Another option to allow this flexibility, somewhat combining both of those (and which was already mentioned on the side in the original discussion at #4 (comment)), could be to always use a Struct as the inner point level, but to allow a field of that struct to be an interleaved xy FixedSizeList. This would allow something like … For the simple case of 2D interleaved xy data, this …
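For concreteness, here is a hedged pyarrow sketch of the representations being discussed (my own illustration, not spec text, and without the geoarrow extension metadata): the interleaved FixedSizeList layout, the separate-field Struct layout, and the hybrid where a Struct wraps an interleaved xy FixedSizeList.

```python
import pyarrow as pa

xs = [0.0, 0.0, 0.0]
ys = [0.0, 1.0, 2.0]

# Separate x/y fields: Struct<x: double, y: double>
struct_points = pa.StructArray.from_arrays(
    [pa.array(xs), pa.array(ys)], names=["x", "y"]
)

# Interleaved: FixedSizeList<double>[2] over one flat xyxyxy buffer
flat = pa.array([v for xy in zip(xs, ys) for v in xy])
interleaved_points = pa.FixedSizeListArray.from_arrays(flat, 2)

# Hybrid: a Struct whose single field is the interleaved xy list
hybrid_points = pa.StructArray.from_arrays([interleaved_points], names=["xy"])

print(struct_points.type)       # struct<x: double, y: double>
print(interleaved_points.type)  # fixed_size_list<item: double>[2]
print(hybrid_points.type)       # struct<xy: fixed_size_list<item: double>[2]>
```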
Just to play devil's advocate: …
Lots of good stuff here, thanks! First, I hope this PR comes across as convivial...it is! I've implemented this both ways in R, both ways in C++, and for a brief time I implemented a version that let you do both (R and C++, "do" meaning create and export Arrays from WKB and WKT plus a few kernels). I totally can and totally will do it all again! The real gem here is Arrow and it's important not to lose sight of that.

Second, on the topic of "why not allow both". I did this in the R package and the prototype C++ library and it's horrible to write generic code for. You have a distinct type/operation for every geometry type, for every inner point representation, and possibly for every dimension combination depending on what you're up to. Templating helps but you still end up with a god-awful factory file listing all the combinations if you want to do anything generic (in this case, I was writing WKT and WKB readers and writers). There are some operations that are zero copy with the …

Third, on the representation of points in other libraries. There are a lot of libraries that use …

Fourth, @bmschmidt in particular highlighted that a Struct gives us more flexibility. We can name our fields "longitude" and "latitude"; "northing" and "easting"; we can include a …

Fifth, it's just plain easier to write geometry operations with a Struct representation. This applies to high-level runtimes like Python, where you can use Arrow compute kernels or zero-copy numpy magic to do …

Sixth, the embedded column statistics in Parquet thing is huge. Arrow is in-memory but the tools around it are designed to scale to bigger than memory, and the geospatial ecosystem (in my biased opinion) could really use them! Pushing a potential intersection filter into a parquet read operation is a big deal and I think it's fine to lean into that.
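As one illustration of the "zero copy numpy magic" point (my own sketch, not from the PR): with the struct layout each child is already a contiguous double array, so something like a bounding box needs no custom kernel, while the interleaved layout goes through the flat values buffer and a strided view.

```python
import pyarrow as pa

# Struct layout: each child converts to numpy without copying
struct_points = pa.StructArray.from_arrays(
    [pa.array([0.0, 0.0, 0.0]), pa.array([0.0, 1.0, 2.0])], names=["x", "y"]
)
x = struct_points.field("x").to_numpy(zero_copy_only=True)
y = struct_points.field("y").to_numpy(zero_copy_only=True)
bbox = (x.min(), y.min(), x.max(), y.max())

# Interleaved layout: view the flat xyxyxy buffer as an (n, 2) array
interleaved = pa.FixedSizeListArray.from_arrays(
    pa.array([0.0, 0.0, 0.0, 1.0, 0.0, 2.0]), 2
)
coords = interleaved.values.to_numpy(zero_copy_only=True).reshape(-1, 2)
bbox_interleaved = (
    coords[:, 0].min(), coords[:, 1].min(),
    coords[:, 0].max(), coords[:, 1].max(),
)
print(bbox, bbox_interleaved)
```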
There's tons of great material already in this thread. I can contribute one of the GPU developers' motivations for originally designing the specification as … Storing the interleaved coordinates also improves coalesced memory access. Kernel threads that use x,y pairs together will always benefit from their adjacency, whereas if they are separated by long blocks of other xs and ys, performance will take a hit. I don't have any on-hand benchmark numbers to demonstrate these design decisions, but they are why we suggested interleaved coordinates.
Oh, @harrism also explains that we wanted vectorized memory access using xy interleaving: https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
cuSpatial's implementation actually exists, right @thomcom?
I definitely agree with columnar xxxxx, yyyyy - never seen any compelling reason to align to rgl's arrangement xyzxyzxyz or xyzhxyzh ... in years of using it 🙏 - everything i did there was column oriented until time came to materialize mesh3d or vis (and, so much activity is about modifying verts)
To avoid more arm-waving on my part, I wrote a post with some downloadable example data in the two forms ( https://dewey.dunnington.ca/post/2022/profiling-point-representations-in-geoarrow/ ). I think there are enough concrete benefits there to consider at the very least adding support for the …
The "hard to support both" claim is just me arm-waving at this point...if and when I have a concrete example of this being difficult is the right time to bring that up. The previous example of my templating may just have been bad library design (I plan on trying again soon).
Is there a good way for me to use cuSpatial to avoid any more ignorance on my part? I'm excited for GPU geometry but don't have the hardware to test it!
This would help me and others understand the motivation and would make great documentation so that people like me don't try to pull something like this again.
We do have some cloud solutions you can play with cuspatial on. Let me call on @taureandyernv to explain the latest state-of-the-art cloud solution. As for benchmarking, I'll talk to @harrism and @isVoid about what we can produce.
Just a couple more points for the non-interleaved representation. I'm currently prototyping the Geography type extension for DuckDB, using S2 as a geom lib. From this perspective, storing coordinates as xxx, yyy makes much more sense and would make integration with GeoArrow much easier. Here's the DuckDB representation proposal of the Geography data type. It maps naturally to DuckDB Struct/List vector types, which are stored as columns internally.

From the storage and serialization/deserialization perspective, data that exhibits spatial locality (like probe data, road networks, admin boundaries) would benefit from non-interleaved xxx, yyy storage due to much better compression. I don't have a benchmark handy, but it's not hard to quickly write one.

Finally, @paleolimbot's point about parquet storage and making use of statistics makes perfect sense: when coordinates are stored separately, scan filters may be used for pruning with range queries like (lat >= a and lat < b), relying on column stats.
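A small sketch of that last point (my own example with made-up file name, columns, and bounds; assumes pyarrow): with x and y in separate Parquet columns, a bounding-box predicate can be pushed into the scan and pruned against per-column statistics.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "lon": [4.35, 5.10, -0.12],
    "lat": [52.01, 52.37, 51.50],
})
pq.write_table(table, "points.parquet")

# Row groups whose lon/lat statistics fall outside the box can be skipped
dataset = ds.dataset("points.parquet", format="parquet")
in_box = dataset.to_table(
    filter=(ds.field("lat") >= 52.0) & (ds.field("lat") < 52.5)
         & (ds.field("lon") >= 4.0) & (ds.field("lon") < 6.0)
)
print(in_box.num_rows)  # 2
```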
Compression has been mentioned a few times, and it's a very testable one!

```r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(purrr)
library(ggplot2)
options <- expand.grid(
type = c("points", "multilines"),
storage = c("xxyyzz", "xyzxyz"),
file_format = c("parquet", "arrow"),
compression = c("uncompressed", "snappy", "gzip", "bz2", "lz4", "zstd", "brotli"),
stringsAsFactors = FALSE
)
tables <- pmap(unique(options[c("type", "storage")]), ~{
file_in <- glue::glue("~/Downloads/{..1}-{..2}.parquet")
read_parquet(file_in, as_data_frame = FALSE)
})
names(tables) <- do.call(paste, unique(options[c("type", "storage")]))
compressed_size <- function(type, storage, file_format, compression) {
table <- tables[[paste(type, storage)]]
tf <- tempfile()
on.exit(unlink(tf))
tryCatch({
if (file_format == "parquet") {
write_parquet(table, tf, compression = compression)
} else {
write_feather(table, tf, compression = compression)
}
as.integer(file.size(tf))
}, error = function(...) NA_integer_)
}
options$file_size <- pmap_int(options, compressed_size)
options <- options[is.finite(options$file_size), ]
ggplot(options, aes(x = type, y = file_size / 2 ^ 20, fill = storage)) +
geom_col(position = "dodge") +
facet_grid(compression ~ file_format) +
  labs(y = "file size (MB)")
```

While the struct representation saves a few MB here and there, I don't think compression is an important consideration in this particular decision. (Or perhaps the dataset I'm using is not a good one to illustrate the advantage.)
I benchmarked compression in the other issue and found, to my surprise: …

I'd guess perhaps there is some mutual information between x and y in most datasets that interleaved reps can leverage?
Quick thought on this: I wonder if the choice of encoding is really important here. I assume that the default encoding in the R API is …
I'm wondering how I should proceed here: I could update this PR to basically include two copies of the spec (one for fixed size list, one for struct) and note that we're investigating the performance tradeoffs? I could leave this PR open? I'd like to start in on a C/C++ implementation in earnest in a few weeks based on nanoarrow and it would be nice to have this discussion resolved (although I could also implement both again...I'm not scared!). @thomcom Has there been any movement on benchmarks? It seems like better performance on GPUs is one of the remaining arguments for keeping the fixed size list and I don't have the ability to help with that.
This would be my vote. Make clear that both versions are "incubating"/"in testing", that we don't know now which to choose, but the end specification is expected to only have one to maximize ease of interoperability.
Thanks for considering my input on the thread. I want to discuss the comments above
I'm not an expert in GPU, however I believe that the SoA approach x[], y[], z[], m[] etc. should not be a problem for the GPU, because as I understand it the memory access should be coalesced, as each thread in the warp will be reading x[i], y[i], z[i]. Quoting from Robert Crovella ( https://stackoverflow.com/users/1695960/robert-crovella , https://stackoverflow.com/questions/37920862/how-many-simultaneous-read-instructions-per-thread-on-modern-gpu ):

"A thread can issue multiple reads (on separate instructions in the instruction stream). Multiple reads from a single thread can be in flight. A read request by itself does not stall the thread. Only when the results of that read are required for a future operation will it stall (if the read data is not available.) The compiler is aware of this and will attempt to schedule reads before the data are actually required. In such scenarios, multiple reads (from a single thread or warp) can be in flight."

There are multiple memory controllers on a modern GPU and I believe that each can be processing a memory request concurrently. So I believe there should be no or very little performance penalty for SoA vs AoS in this case.

On the other hand, for the AoS pattern xyzxyzxyz on a CPU it is much more complicated to do efficient SIMD, as extra instructions are needed to deinterleave the data (to separate all the x to one SIMD register, all the y to another SIMD register, etc.), then do the computation, and then interleave them again to write the result. The extra instructions to deinterleave and interleave complicate writing the program (also as they are different depending on the SIMD architecture) and decrease performance.

I'd like to propose that we do some simple benchmarks, maybe something like …

Also with regard to the compression, it depends on the CRS: if the data is lat and lon then each delta will also be a float and difficult to compress, however if the data is in a planar coordinate system, then each x y usually represents meters and it may be possible to represent them as integers and get much better delta encoding.

Again, I'm not an expert (just adding my 2 cents). Please let me know if I'm mistaken in my analysis.
I'm trying to find an existing benchmark so we don't have to reinvent the wheel. Watch this space.
You are correct, however the ideal for GPUs to help saturate memory bandwidth is actually sort of "SoAoS", where the final S is "small vector structures". In other words, each thread reads not a single float, but a float2 or float4 vector. (Or in other words, each warp reads not 128 bytes but 256 or 512 bytes.) Note we can easily cast an array of interleaved xyxyxy to an array of float2 structs and achieve vectorized loading, but pure columnar SoA data requires a separate load for x and y. See https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

That said, we are not talking about a 2x performance improvement here, more like 10-20% for bandwidth-bound kernels. Compute-bound kernels are not likely to experience any difference (but real applications that are not bandwidth bound are rare in the wild).
I suggest not the sqrt one as that will be more compute bound. What I would test is a simple kernel where the input and output are both x-y coordinates, and the computation is trivial, e.g. scale coordinates by a parameter. One version loads from separate x and y arrays, one loads from a single interleaved array.
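A rough CPU-side analogue of that micro-benchmark in numpy (my own sketch: the sizes, scale factors, and the strided handling of the interleaved case are choices I made, not something from the thread):

```python
import numpy as np
from timeit import timeit

n = 5_000_000
sx, sy = 2.0, 0.5

# SoA: separate x and y arrays
x = np.random.rand(n)
y = np.random.rand(n)

# AoS: a single interleaved xyxyxy array
xy = np.empty(2 * n)
xy[0::2], xy[1::2] = x, y

def scale_soa():
    return x * sx, y * sy

def scale_aos():
    out = np.empty_like(xy)
    out[0::2] = xy[0::2] * sx  # strided access to the x values
    out[1::2] = xy[1::2] * sy  # strided access to the y values
    return out

print("separate x/y:", timeit(scale_soa, number=20))
print("interleaved: ", timeit(scale_aos, number=20))
```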
just one minor comment from me:
You can always encode coordinates as integers by scaling; it doesn't matter what the units are (they might be mm ...). It depends on the domain and the extent they represent, and what and where counts as "precise" or "accurate" is entirely independent of the crs; every coord sys has regions where its particular properties can't hold. (Loving the discussion, thanks!)
I updated the text of this PR to support both! This topic can and should get more discussion outside the realm of this one PR. A notable thing for me that came out of this discussion is that …
@thomcom I recently figured out you can get a GPU + cuSpatial up and running on Google Colab and couldn't resist taking it for a spin. Related to the point representation question, it seems that cuDF doesn't have a structure for a fixed-size list? Does reading Parquet directly into GPU memory via …

I see that the custom shapefile reader was deprecated + removed, although it is/was useful for demonstrating cuSpatial utility internally: the geopandas read + copy to GPU operation is slow despite cuSpatial's computation being very fast. I imagine GeoParquet + GeoArrow encoding will be useful here and I wonder if it would save any trouble if the encoding used cuDF native types?
Regarding the shapefile reader: it was very limited (single polygons only, no multipolygons, no points, no linestrings, etc.) and depended on GDAL. This dependency was making it hard for us to support installation from pip, so we decided to remove it for now. We do have tentative plans for providing GPU-accelerated Geo I/O with an effort starting in the near future (23.08 release most likely). We think that this will be a much better way forward. I'll let @thomcom and / or @trxcllnt answer regarding fixed-size lists in cuDF.
Hey @paleolimbot ! You are correct that cudf doesn't have a fixed size list implementation. IIRC we worked around it by using an extra level of …

I'm quite eager to get an integration with GeoParquet + GeoArrow so I've been watching your work. :) Agree that the GeoPandas integration is very slow, we're planning on starting dedicated I/O work soon.
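If I understand the workaround correctly, it amounts to something like the following pyarrow sketch (my own guess at the idea, not cuDF code): a regular variable-size list with explicit offsets 0, 2, 4, ... standing in for the fixed-size list of 2.

```python
import numpy as np
import pyarrow as pa

flat = pa.array([0.0, 0.0, 0.0, 1.0, 0.0, 2.0])  # interleaved xyxyxy
n_points = len(flat) // 2

# Explicit offsets take the place of the implied fixed list size of 2
offsets = pa.array(np.arange(0, 2 * (n_points + 1), 2), type=pa.int32())
points_as_list = pa.ListArray.from_arrays(offsets, flat)

print(points_as_list.type)  # list<item: double>
print(points_as_list[1])    # [0, 1]
```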
I'm hoping to get some big example files up soon! I was massively impressed by the cudf read parquet + cuSpatial quadtree point-in-polygon join (I got it up to 130 million points before running out of memory).
Tangentially related, but are there some sort of standardised spatial dataset/benchmark queries? Or is this something that would be interesting to draft together? It would be great to have something like a TPC-H/S schema that could be deterministically generated for arbitrary scales where the spatial columns have properties (point density, avg line length, polygon hole-to-shell ratio) similar to those found in real/common datasets (osm, tiger, copernicus).
It would be super nice if that existed (or maybe it does and I don't know about it). I've collected a few things with appropriate licenses here: https://github.com/paleolimbot/geoarrow-data and would like to put some things here: https://github.com/geoarrow/geoarrow-data . The 130 million points I mentioned were all the buildings in the Microsoft US buildings dataset and the join I mentioned was against a shapefile of all U.S. zipcodes. Nationwide road segments and rivers might be nice as well (with a single small state...maybe Vermont...to use as a small-scale trial).
We've done a lot of work on spatial benchmarks and would be happy to collaborate on designing a schema for them.
If we want to add support for struct arrays in addition to fixed size list arrays, should they have a different Arrow extension name? I suppose you'd be able to inspect the child types to tell which encoding the array is using, but it might be easier if it's in the extension name?
I could go either way on that: putting more information in the extension name is not really a problem, but you also have to inspect the structure anyway to make sure the storage type is correct. A case that's come up where extra information in the extension name might help is for XYM geometries, since not all Arrow implementations are able to create field names for a list. In case you're curious what "inspect the child types" looks like in C, the current geoarrow-c implementation for both interleaved and struct-based schemas is here: https://github.com/geoarrow/geoarrow-cpp/blob/main/src/geoarrow/schema_view.c#L9 and tests with all the fun ways you can specify invalid schemas are here: https://github.com/geoarrow/geoarrow-cpp/blob/main/src/geoarrow/schema_view_test.cc#L159 .
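In Python, "inspect the child types" could look something like this sketch (my own rough analogue of the linked C code, ignoring extension metadata and the z/m dimensions):

```python
import pyarrow as pa

def point_storage_kind(point_type: pa.DataType) -> str:
    """Classify the inner point storage as interleaved or struct-based."""
    if pa.types.is_fixed_size_list(point_type):
        return f"interleaved ({point_type.list_size} values per point)"
    if pa.types.is_struct(point_type):
        names = [point_type.field(i).name for i in range(point_type.num_fields)]
        return "struct (" + ", ".join(names) + ")"
    raise ValueError(f"Not a recognized point storage type: {point_type}")

print(point_storage_kind(pa.list_(pa.float64(), 2)))
# interleaved (2 values per point)
print(point_storage_kind(pa.struct([("x", pa.float64()), ("y", pa.float64())])))
# struct (x, y)
```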
Ok! A bunch of data is available in all the geoarrows (wkb, interleaved, struct): https://github.com/geoarrow/geoarrow-data . Feel free to open issues on that repo to add other data sets! (For example, whatever data set cuspatial uses or generates would be welcome!)
Thanks a lot to everybody involved here for the interesting discussion and the many arguments, examples and benchmarks brought up! This has been moving slowly (and apologies on my part for not helping a lot in moving it forward lately), but let's try to change that now!

I think there is clear consensus that the struct array representation has several advantages and that we want to at least enable this representation. Based on my reading, I think there is also general consensus that it is acceptable to (at least short term) do this by allowing both representations (and the PR has already been updated to reflect that). That way, we enable using the struct representation, while still keeping the interleaved representation to support those cases where it is already being used (e.g. to support passing data zero-copy using geoarrow between libraries that both use interleaved).

So the concrete proposal is to get the current PR merged with the proposal to describe both interleaved and struct as options for the representation of coordinates.
Some small comments. Thanks for writing this up @paleolimbot!
format.md (Outdated)

```diff
@@ -103,6 +124,17 @@ is a list of xy vertices. The child name of the outer list should be "polygons";
 the child name of the middle list should be "rings"; the child name of the
 inner list should be "vertices".

+**Well-known-binary (WKB)**: `Binary` or `LargeBinary` or `FixedSizeBinary`
```
Let's leave the addition of WKB (or WKT, or geojson?) for another PR, to focus here on the struct representation?
format.md (Outdated)

```diff
@@ -217,25 +265,25 @@ Logical representation:

 [
-  [[(0, 0), (0, 1), (0, 2)]],
+  [[{x: 0, y: 0}, {x: 0, y: 1}, {x: 0, y: 2}]],
```
@paleolimbot would you be OK with keeping the examples as is here? I personally find the current version a bit easier to read, and the example is mostly to show the additional nesting, regardless of the exact inner representation. We could add a note that for the struct version, all buffers are the same, except the coordinates buffer would be two buffers for x and y.
I reverted the examples! I did add enough clarification so as not to make them confusing (hopefully).
Co-authored-by: Joris Van den Bossche <[email protected]>
Thank you everybody for the lively debate! If there aren't any objections to this proposal as worded in the latest version, I'll merge tomorrow!
The comment by @apps4uco in #16 is the latest in a number expressing surprise at the way that the current geoarrow spec suggests that points be stored. While I think that our current spec (xyzxyzxyzxyz) has value somewhere, it's unintuitive to enough people I've talked to that I think we should consider changing our initial spec version to (xxxxxx, yyyyyy). Since we started blogging and tweeting about this - largely for the purposes of generating this kind of discussion - all we have had are comments about why we didn't choose (xxxxxx, yyyyyy). I became involved after this initial decision had been made and I think it's the perfect time to change it (i.e., before any implementations actually exist).
The best arguments to store points as
Struct <x: [...], y: [...]>
by default are:x
ory
values from the inner points array meaning we might be able to implement more operations without custom C or C++ kernels).FixedSizeList
array might be deprecated in Arrow, meaning we probably shouldn't rely on it to store dimension information (https://issues.apache.org/jira/browse/ARROW-14999).z
orm
values don't have to be rewritten to be used forxyz
,xym
, orxyzm
geometries (the first child is always X and the second child is always Y regardless of dimension). A lot of geometry kernels fall into this category. I'm personally sick of the number of times I've had to parameterizecoord_size
,z_index
andm_index
...this really does make writing kernels harder.The only reason I know about to use the current spec (
[x y, x y, x y]
) is that it more closely matches how WKB does it and that it uses fewer buffers. This isn't to say it doesn't have any other advantages, but I think there's enough anecdotal evidence to put the burden of proof the other way around: if storing coordinates as[x y, x y, x y]
in an Arrow array has value, its value must (in my opinion) be proved.]