-
Notifications
You must be signed in to change notification settings - Fork 299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
st_intersection 30 times slower in 0.9_6 compared to 0.7_4 #1510
Comments
Please report GEOS versions for both cases - and your OS, with your installation choices (MacOS or Windows CRAN binaries, source install, etc.). Windows binaries are now using a more recent GEOS, as I think are MacOS. In both cases, |
sf 0.9_6 is using GDAL 3.8.0 and sf 0.7_4 is using GDAL 3.6.1. I am running this on Windows (although, I have 0.7_4/3.6.1 on a Mac and it's just as quick). I am doing this on a work machine that uses JFrog Artifactory to mirror Windows binaries - installing a package looks like this: install.packages("sf", depends=TRUE, type = "win.binary") |
OK, you mean GEOS, not GDAL. So either something changed in sf, GEOS, or both. Let's see if @edzer can comment before proceeding. I'm assuming that the platforms are otherwise comparable. |
Oops! I should have said GDAL 2.2.3 vs 3.0.4 |
I asked about GEOS, not GDAL. Even though sf may do topology predicates and operations through GDAL, it is the version of GEOS that matters. The older variant of CRAN Windows binaries uses https://github.com/rwinlib/gdal2 GEOS 3.6.1, the newer https://github.com/rwinlib/gdal3 GEOS 3.8.0. |
sf uses GEOS directly, not through GDAL (with the exception of specifying ops in the |
Sorry, didn't check. Do you know of obvious changes in GEOS or the sf interface to GEOS (in Rcpp?) since early 2018? |
I don't see, and cannot recall, any changes made to |
Shall I try to repllicate on Win10 - CRAN sf 0.9-6 and source install from GDAL2 of 0.7-4? Or do you have access? |
That would be great - I don't have (simple) access to windows, and can't install 0.7-4 on my working machine (requires proj_api.h, no longer avail in PROJ 7). |
On sf 0.9-7 with development GEOS (with NG overlay I think) - so it could be a validity handling issue - @geogale don't worry about this, we test as-yet unreleased GEOS etc. versions. This is a message, and the inputs and the output are all valid, F32 time 617 seconds:
|
On Win10, R 4.0.2 and sf 0.9-6 and GDAL3 GEOS 3.8.0 16.95 min (old T420s laptop) |
With GEOS 3.9.0dev, sf was 617 seconds for intersection, but rgeos with the same GEOS is 154 seconds for the same intersection (F32). |
> system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
user system elapsed
27.705 0.007 27.716
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries
> sf_extSoftVersion()
GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H
"3.6.0" "2.2.4" "4.8.0" "false" "false"
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux bullseye/sid
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sf_0.9-7
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 class_7.3-15 crayon_1.3.4 dplyr_0.8.3
[5] assertthat_0.2.1 grid_3.6.2 R6_2.4.1 DBI_1.1.0
[9] magrittr_1.5 e1071_1.7-3 units_0.6-5 pillar_1.4.3
[13] KernSmooth_2.23-16 rlang_0.4.4 tools_3.6.2 glue_1.3.1
[17] purrr_0.3.3 compiler_3.6.2 pkgconfig_2.0.3 classInt_0.4-2
[21] tidyselect_1.0.0 tibble_2.1.3 indicates to me that the regression is not in sf... |
And it is a big regression, isn't it? What are the timings for GEOS 3.8.1 (I'm stuck with 3.9.0dev at the moment - do you have 3.8.0 or 3.8.1 running)? It isn't rwinlib either. |
With
I see
so, yeah, a factor 20. |
FYI.
Below are my values. I actually realized this issue around a month ago.
Therefore, I was using st_intersects() instead.
sf_extSoftVersion()
GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H
"3.8.0" "3.0.4" "6.3.1" "true" "true"
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
user system elapsed
1026.41 3.73 1027.08
Best,
Olcay
…On Tue, Oct 13, 2020 at 9:29 AM Edzer Pebesma ***@***.***> wrote:
With
GEOS 3.8.1, GDAL 3.1.3, PROJ 7.1.1
I see
> system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
user system elapsed
562.452 0.067 562.728
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries
so, yeah, a factor 20.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#1510 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AARWS4CL7RIMCJ4NZWUW6YDSKRP3VANCNFSM4SOK4KTA>
.
--
OLCAY SAHIN
|
Oops: current rgeos with |
Or it's in parts that sf uses, but rgeos doesn't, like the STRtree? |
Unfortunately, for R profiling, |
I tried compiling with |
Maybe I can check STRtree separately in
|
This https://stackoverflow.com/questions/23685206/is-my-using-gperftools-to-profile-a-r-script-with-rcpp-correct didn't look helpful. Ask Dirk? |
sf 0.7-5 is the earliest version that works without proj_api.h; compiled against current GEOS/GDAL/PROJ it also has the regression: library(sf)
# Linking to GEOS 3.8.1, GDAL 3.1.3, PROJ 7.1.1
utils::packageVersion("sf")
# [1] ‘0.7.5’
#download.file("https://www.dropbox.com/s/izol20rxf90apye/Areas.gpkg?dl=1",
# destfile = "Areas.gpkg" , mode='wb')
Boundary1 <- st_transform(st_read("Areas.gpkg"),crs=27700)
# Reading layer `Areas' from data source `/tmp/Areas.gpkg' using driver `GPKG'
# Simple feature collection with 1000 features and 1 field
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -116.1928 ymin: 68218.7 xmax: 640329.7 ymax: 868949.7
# epsg (SRID): 27700
# proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
#download.file("https://www.dropbox.com/s/z8pham6w9ldesfy/Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip?dl=1",
# destfile = "Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip" , mode='wb')
#unzip("Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip", exdir = ".")
Boundary2 <- st_transform(st_read("Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"),crs=27700)
# Reading layer `Local_Authority_Districts__May_2020__Boundaries_UK_BFE' from data source `/tmp/Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp' using driver `ESRI Shapefile'
# Simple feature collection with 379 features and 10 fields
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -116.1928 ymin: 5333.81 xmax: 655989 ymax: 1220302
# epsg (SRID): 27700
# proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
system.time(Boundary_Intersection <- st_intersection(Boundary1, Boundary2))
# user system elapsed
# 581.509 0.383 582.846
# Warning message:
# attribute variables are assumed to be spatially constant throughout all geometries Is one of you able to time the sf 0.7-5 binary on Windows? |
I'll try that in a moment and report back. The However, after asking on geos-devel how to build with overlay-ng (I thought it was enabled in master by default but wasn't) https://lists.osgeo.org/pipermail/geos-devel/2020-October/009750.html, I have:
So maybe GEOS 3.7 introduced the regression but overlay-ng will not only remove it, but will also be faster. I still need to run checks and revdep checks. Maybe 3.9 will build on Solaris (3.7 and 3.8 are broken)?? |
I can only install sf 0.7-5 on Windows with rwinlib/GDAL2, so PROJ 4.9.3. I tried to mix GDAL2 and GDAL3 yesterday without success, as GDAL2 is Rtools < 4, and GDAL3 >= 4 IIUC. I don't think the sf change is related to |
I cannot detect the regression through rgeos, unfortunately:
To try with sf, I'd need to re-build GDAL for each GEOS version too. Looks more like container instances with a succession of version sets. |
I built
and get
This is the identical |
Could you try Overlay-NG - a recent nightly (I had 9 October on my desktop, master from git on gitea on laptop), ../configure --enable-overlayng, and see if you also get the speed-up I'm seeing. As rgeos seems not to see a regression for |
With overlayng-enabled GEOS compiled from github, and > library(sf)
Linking to GEOS 3.9.0dev, GDAL 3.1.1, PROJ 7.1.0 I see > system.time(Boundary_Intersection <- st_intersection(Boundary1,Boundary2))
user system elapsed
16.668 0.040 16.711
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries |
Very good! This is the same improvement that I see. My revdeps are running, so I'll see whether there are any breakages attributable to Overlay-NG. |
And the other issue seems to be related to precision, default set in rgeos to 1e8, set to zero (not set) in sf. When setting to 1e8: library(sf)
# Linking to GEOS 3.8.1, GDAL 3.1.3, PROJ 7.1.1
utils::packageVersion("sf")
# [1] ‘0.9.7’
#download.file("https://www.dropbox.com/s/izol20rxf90apye/Areas.gpkg?dl=1",
# destfile = "Areas.gpkg" , mode='wb')
Boundary1 <- st_set_precision(st_transform(st_read("Areas.gpkg"),crs=27700),
1e8)
# Reading layer `Areas' from data source `/tmp/Areas.gpkg' using driver `GPKG'
# Simple feature collection with 1000 features and 1 field
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -116.1928 ymin: 68218.7 xmax: 640329.7 ymax: 868949.7
# projected CRS: OSGB 1936 / British National Grid
#download.file("https://www.dropbox.com/s/z8pham6w9ldesfy/Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip?dl=1",
# destfile = "Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip" , mode='wb')
#unzip("Local_Authority_Districts__May_2020__Boundaries_UK_BFE-shp.zip", exdir = ".")
Boundary2 <- st_set_precision(st_transform(st_read("Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"),crs=27700),
1e8)
# Reading layer `Local_Authority_Districts__May_2020__Boundaries_UK_BFE' from data source `/tmp/Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp' using driver `ESRI Shapefile'
# Simple feature collection with 379 features and 10 fields
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -116.1928 ymin: 5333.81 xmax: 655989 ymax: 1220302
# projected CRS: OSGB 1936 / British National Grid
system.time(Boundary_Intersection <- st_intersection(Boundary1, Boundary2))
# user system elapsed
# 22.916 0.038 22.962
# Warning message:
# attribute variables are assumed to be spatially constant throughout all geometries |
For me, that closes the issue. |
@edzer just to confirm my layperson understanding. Are you suggesting setting the precision to 1e8 for all |
Yes, perturbing the inputs can have a large effect on algorithm runtime. Most of the time it will probably improve. In other cases it will cause the overlay to fail entirely or return an incorrect result. |
With |
Following this thread with interest, don't understand every part of it but very grateful that this has been discovered and documented for the benefit of everyone working with medium-large datasets. From @edzer's comment above, I guess that this would mean running a command such as st_precision(Boundary1) = 0.00000001 and the same for If a default setting like that has the potential for factor 30 speed-ups, and given most users will not know about let alone read this issue, my first impression is: in favour. |
Robin, I set it to 1e8, not 1e-8. You can indeed set it to |
Aha, so: st_precision(Boundary1) = 1e8 It's not clear to me from the example in https://r-spatial.github.io/sf/reference/st_precision.html what appropriate values are. Is it worth updating that help page with an example that is closer to a real-world scenario? |
P.s. also just spotted that there is the equivalent code above.
|
I agree that for instance
is a more readable form. The documentation of |
Both are readable and it's partly a matter of style, I think style is a continuum and that both ways of writing it represents different ends of the continuum. I would probably have gone for an intermediary approach, but definitely wasn't aiming to comment on the style, just the substance. In answer to your question, no I didn't make it there and probably should have checked. From there I can see the key quote:
Worth saying something like that in the docs for |
For what it's worth I had a think about how to type that 'tidy pipeline' and came up with this, a balance that is readable and concise? f = "Local_Authority_Districts__May_2020__Boundaries_UK_BFE.shp"
Boundary2_original = st_read(f)
Boundary2 = Boundary2_original
st_transform(crs=27700) %>%
st_set_precision(units::set_units(10, nm)) I didn't know about the |
Also as a heads-up @geogale, I wrote a function that you can point to one of the datasets provided by data.gov.uk/ESRI and it auto downloads/unzips the shapefile + tries to read it in a single command - take a look if of use/interest, I've been using it a lot of late: https://github.com/Robinlovelace/ukboundaries/blob/master/R/ukborders-package.R#L6 Reprex: library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
u = "https://opendata.arcgis.com/datasets/910f48f3c4b3400aa9eb0af9f8989bbe_0.zip?outSR=%7B%22latestWkid%22%3A27700%2C%22wkid%22%3A27700%7D"
Boundary2_original = ukboundaries::duraz(u)
#> Using default data cache directory ~/.ukboundaries/cache
#> Use cache_dir() to change it.
#> Reading layer `Local_Authority_Districts__May_2020__UK_BUC' from data source `/tmp/RtmpEfD8o1/reprex174185e970dd2/Local_Authority_Districts__May_2020__UK_BUC.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 379 features and 10 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -116.1928 ymin: 7054.1 xmax: 655653.8 ymax: 1218618
#> projected CRS: OSGB 1936 / British National Grid
#> Removing the following files: Local_Authority_Districts__May_2020__UK_BUC.dbf Local_Authority_Districts__May_2020__UK_BUC.prj Local_Authority_Districts__May_2020__UK_BUC.shp Local_Authority_Districts__May_2020__UK_BUC.shx
Boundary2 = Boundary2_original %>%
st_transform(crs=27700) %>%
st_set_precision(units::set_units(10, nm))
plot(Boundary2) Created on 2020-10-14 by the reprex package (v0.3.0) |
Thanks @Robinlovelace - I have no doubt that's been well used lately! I may be reading the data from the place where it lives before ending up in an Esri interface - but that would be telling! |
Fun (?) fact: GEOS 3.9.0 with Overlay-NG generates the same slivers, but reports them in different order, see r-spatial/asdar-book.org@144f3d5. Same changed order of output breaks geozoning, uribo/kuniezu#6, plotly/plotly.R#1862, sf itself #1512, hypertidy/sfdct#13. |
A note that this behaviour changes in GEOS 3.9 with OverlayNG. OverlayNG should generally run faster if floating (= no) precision is specified, since it can use faster noding algorithms. Conversely, specifying a precision invokes snap-rounding noding, which is generally slower (anywhere from 1.5x to 3x). So the recommendation to use a precision model should probably be revisited (once confirmed by experience by the sf community). |
Hi. I am running the following code:
The intersection takes about 15 minutes to complete when using R 4.0.2 and sf 0.9_6 but only around 30 seconds when using R 3.4.0 and sf 0.7_4. Is there a reason why it is now a lot slower?
The text was updated successfully, but these errors were encountered: