Clarification of benchmarks #4

Open
martinfleis opened this issue Feb 22, 2022 · 4 comments

Comments

@martinfleis
Contributor

Hi,

I'll make a PR changing some of the geopandas benchmarks to more performant versions, but before that I'd like to ask for some clarifications. I understand that the benchmarks are artificial, but before I start coding I want to make sure I understand what the main goal is.

  1. distance
    • you are trying to get an N×N matrix of pairwise distances between all points (both ways?), right?
  2. sample
    • I truly don't understand what this is trying to do :D. Are you trying to get n random points that are within the polygon? A sort of Monte Carlo simulation?

I think I understand the rest.

@martinfleis
Contributor Author

I think I got it. See #5

@kadyb
Owner

kadyb commented Feb 22, 2022

Personally, I wanted to focus on comparing the functions available in packages from a user's perspective, rather than writing the most efficient alternatives. I also think we should compare functions that are similar in terms of features ({sf} as a reference?). I know it's possible to write efficient code using e.g. {Rcpp}, {GEOS} and {data.table}, but I think that's beyond the reach of the vast majority of users.

> distance
> you are trying to get an N×N matrix of pairwise distances between all points (both ways?), right?

Exactly!
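
For readers following along, the N×N result confirmed above can be sketched in a few lines. This is a library-independent illustration in plain Python, not the benchmark code from this repository or from #5; the function name `pairwise_distances` and the sample coordinates are made up for the example, and planar (Euclidean) distance is assumed.

```python
import math


def pairwise_distances(points):
    """Return an N x N nested list of Euclidean distances between (x, y) points.

    Both directions are filled in, so the matrix is symmetric and its
    diagonal is zero.
    """
    return [
        [math.hypot(ax - bx, ay - by) for bx, by in points]
        for ax, ay in points
    ]


# Hypothetical sample coordinates, chosen so the distances are easy to check.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = pairwise_distances(points)
```

Because the distance is symmetric, half of the work is redundant; whether a benchmark computes the full matrix or only one triangle changes the timing, which is why the "both ways?" question matters.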

> sample
> I truly don't understand what this is trying to do :D. Are you trying to get n random points that are within the polygon? A sort of Monte Carlo simulation?

Not quite a Monte Carlo simulation. I think sampling points in polygons is standard practice in GIS :P Later, the coordinates can be retrieved from these geometries, or the points can be used to extract values from a raster. Please check out sf::st_sample() as a reference. Ideally, you would implement this as a function in {geopandas}.
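
To make the idea concrete, here is a minimal sketch of sampling n random points inside a polygon by rejection sampling: draw uniform points in the bounding box and keep those that fall inside. It is plain Python for illustration only; it is not how sf::st_sample() is implemented and not the code from #5. The names `point_in_polygon` and `sample_points`, the L-shaped test polygon, and the restriction to a single ring without holes are all assumptions of the example.

```python
import random


def point_in_polygon(x, y, ring):
    """Even-odd ray casting test for a simple polygon given as (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points(ring, n, seed=None):
    """Draw n uniformly distributed points inside the polygon by rejection."""
    rng = random.Random(seed)
    xs = [vx for vx, _ in ring]
    ys = [vy for _, vy in ring]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    samples = []
    while len(samples) < n:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, ring):  # reject points outside the polygon
            samples.append((x, y))
    return samples


# Hypothetical L-shaped polygon; the square notch x > 1, y > 1 lies outside it.
ring = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
samples = sample_points(ring, 100, seed=42)
```

The cost of this approach depends on how much of the bounding box the polygon fills, so a benchmark built on it measures both the random draws and the point-in-polygon tests.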

@martinfleis
Contributor Author

> Personally, I wanted to focus on comparing the functions available in packages from a user's perspective, rather than writing the most efficient alternatives.

Yup, I've used only functions that are available. As you can see from the discussion on intersects, there could be even faster options.

> compare similar functions in terms of features ({sf} as a reference?)

As far as I know, intersects in sf uses a spatial index under the hood, which is why I opted to use one as well. But I understand if you ignore that solution :).

> Ideally, you would implement this as a function in {geopandas}.

We don't have anything like this right now, but the code I used in #5, replacing your custom loop, is likely quite close to what it would look like if we had it (I'll open an issue to add it in the future).

@kadyb
Owner

kadyb commented Feb 22, 2022

> As far as I know, intersects in sf uses a spatial index under the hood, which is why I opted to use one as well. But I understand if you ignore that solution :).

My mistake; in that case {geopandas} should also use spatial indexes. Not sure if {terra} works the same way, but I believe it does. Edit: {terra} doesn't use spatial indexes.
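
As background on why the index matters for the comparison, the sketch below shows the general idea: use a cheap lookup to find candidate geometries, then run the intersection test only on those instead of on every pair. It is a simplified stand-in, assuming a uniform grid over bounding boxes; real libraries typically use tree-based indexes, and an exact geometry test would follow the bounding-box filter. All names (`boxes_intersect`, `cells`, `build_index`, `query`) and the sample boxes are hypothetical.

```python
from collections import defaultdict
from itertools import product


def boxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cells(box, size):
    """Return the grid cells (column, row) overlapped by a bounding box."""
    x0, y0, x1, y1 = (int(v // size) for v in box)
    return product(range(x0, x1 + 1), range(y0, y1 + 1))


def build_index(boxes, size):
    """Map each grid cell to the indices of the boxes that overlap it."""
    index = defaultdict(list)
    for i, box in enumerate(boxes):
        for cell in cells(box, size):
            index[cell].append(i)
    return index


def query(index, boxes, target, size):
    """Return sorted indices of boxes intersecting target, testing only candidates."""
    candidates = set()
    for cell in cells(target, size):
        candidates.update(index.get(cell, ()))
    return sorted(i for i in candidates if boxes_intersect(boxes[i], target))


# Hypothetical data: only boxes 0 and 2 overlap the target.
boxes = [(0, 0, 1, 1), (5, 5, 6, 6), (0.5, 0.5, 2, 2), (9, 9, 10, 10)]
target = (0.8, 0.8, 1.5, 1.5)
index = build_index(boxes, size=2.0)
hits = query(index, boxes, target, size=2.0)
brute_force = [i for i, b in enumerate(boxes) if boxes_intersect(b, target)]
```

Both approaches return the same answer; the indexed version simply skips boxes that cannot intersect, which is why a benchmark comparing an indexed implementation against a non-indexed one is measuring different algorithms rather than only different packages.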

By "compare similar functions in terms of features", I meant that the functions in {terra} and {sf} have more options (arguments), so I suspect there will be overhead (but probably negligible) due to conditions/transformations.

2 participants