LargeN usage #44

Closed
duccioa opened this issue Aug 22, 2023 · 4 comments

duccioa commented Aug 22, 2023

Hello,
I am a bit confused by the usage and documentation of the largeN parameter in classIntervals, and I would be grateful for some guidance.
The documentation says:

default 3000L, the QGIS sampling threshold; over 3000, the observations presented to "fisher" and "jenks" are either a samp_prop= sample or a sample of 3000, whichever is larger

In classIntervals(), largeN is used as follows:

function (var, n, style = "quantile", rtimes = 3, ..., intervalClosure = c("left", 
    "right"), dataPrecision = NULL, warnSmallN = TRUE, warnLargeN = TRUE, 
    largeN = 3000L, samp_prop = 0.1, gr = c("[", "]")) {
       #.....

        nobs <- length(unique(var))
        
        #.....

        if (warnLargeN && (style %in% c("kmeans", "hclust", "bclust", 
            "fisher", "jenks"))) {
            if (nobs > largeN) {
                warning("N is large, and some styles will run very slowly; sampling imposed")
                sampling <- TRUE
                nsamp <- ifelse(samp_prop * nobs > 3000, as.integer(ceiling(samp_prop * 
                  nobs)), 3000L)
            }
        }
       #....
}

Where nobs <- length(unique(var)).

My understanding is that largeN is the threshold on nobs (the number of unique values in var) above which sampling is imposed.

What I find difficult to understand is that largeN is then not used to compute the sample size; the literal value 3000 is used instead.
3000 is also the default of largeN, but the two values play different roles: largeN is the threshold that triggers sampling, while the hard-coded 3000 is the minimum sample size.
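
To make the distinction concrete, the sample-size expression from the code quoted above can be evaluated on its own. This is only a minimal sketch: nsamp_quoted() is a hypothetical helper written for this illustration, not a classInt function, and it copies nothing but that one expression. Note that largeN does not appear in it at all.

```r
# Hypothetical helper (not part of classInt): the nsamp expression from the
# quoted classIntervals() source, isolated so it can be called directly.
nsamp_quoted <- function(nobs, samp_prop = 0.1) {
  ifelse(samp_prop * nobs > 3000,
         as.integer(ceiling(samp_prop * nobs)),
         3000L)
}

nsamp_quoted(5000)                    # 0.1 * 5000 = 500, not above 3000: 3000
nsamp_quoted(50000)                   # 0.1 * 50000 = 5000, above 3000: 5000
nsamp_quoted(1001, samp_prop = 0.05)  # 3000, which is more than nobs = 1001
```

Whatever value is passed as largeN, the sample size is never smaller than 3000.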

This also causes a problem when largeN < length(var) < 3000, because the sample size of 3000 is then larger than the data:

library(classInt)

large_n = 1000
x = 1:(large_n + 1)
classInt::classIntervals(x, n = 10, style = "fisher", largeN = large_n, samp_prop = 0.05)
#> Warning in classInt::classIntervals(x, n = 10, style = "fisher", largeN = 1000,
#> : N is large, and some styles will run very slowly; sampling imposed
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

Created on 2023-08-22 by the reprex package (v2.0.1)

Shouldn't it be something like nsamp <- min(largeN, nobs * samp_prop)?
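
For illustration, here is one possible shape for such a computation. It is a sketch under stated assumptions, not the change that was later merged into classInt: nsamp_capped() is a hypothetical helper, it keeps the documented "whichever is larger" rule but with largeN in place of the hard-coded 3000, and it caps the result at nobs so that sampling without replacement cannot fail.

```r
# Hypothetical helper (not part of classInt). Assumptions: the sample size is
# the larger of samp_prop * nobs and largeN (following the documentation
# wording), and it is never allowed to exceed the number of observations.
nsamp_capped <- function(nobs, largeN = 3000L, samp_prop = 0.1) {
  target <- max(as.integer(ceiling(samp_prop * nobs)), as.integer(largeN))
  min(target, as.integer(nobs))
}

# Same inputs as the failing reprex above: 1001 values, largeN = 1000.
x <- 1:1001
n_s <- nsamp_capped(length(x), largeN = 1000L, samp_prop = 0.05)
n_s                      # 1000
length(sample(x, n_s))   # sampling without replacement now succeeds
```

Taking min(largeN, nobs * samp_prop) as written above would instead pick the smaller of the two values (51 in this example), so which rule is intended is a design decision for the maintainers.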

Thank you very much for your time.
Duccio

@rsbivand
Member

@duccioa Yes, I think your analysis is correct. I'll try to prepare a fix in a branch during this week. May I ask you to review the changes when I'm ready?

@duccioa
Author

duccioa commented Aug 22, 2023

Roger (if I may call you by your name), I would be honored. I am a big fan of your work in the r-spatial community.

rsbivand added a commit that referenced this issue Aug 24, 2023
rsbivand added a commit that referenced this issue Aug 24, 2023
rsbivand added a commit that referenced this issue Aug 27, 2023
rsbivand added a commit that referenced this issue Aug 29, 2023
rsbivand added a commit that referenced this issue Aug 29, 2023
@rsbivand
Member

@duccioa Thanks very much! I've merged into the main branch now.

@rsbivand
Member

rsbivand commented Sep 5, 2023

Submitted to CRAN.

@rsbivand rsbivand closed this as completed Sep 5, 2023
freebsd-git pushed a commit to freebsd/freebsd-ports that referenced this issue Oct 9, 2023
- Take maintainership

ChangeLog:

Address LargeN usage:
r-spatial/classInt#44