RFC: Define cut() in StatsBase? #228

Closed
nalimilan opened this issue Jan 15, 2017 · 10 comments

Comments

@nalimilan
Member

cut used to be defined in DataArrays, since it returns a PooledDataArray. With the move to CategoricalArrays, where should it live? I could add it to that package, but I figure people are more likely to look for it in StatsBase instead. So should we add it here? Of course that would add a dependency on CategoricalArrays.

This would also allow providing an efficient countmap/addcounts! method for CategoricalArray. Otherwise, CategoricalArrays will have to depend on StatsBase to provide it.
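
To illustrate the efficiency point: a pooled array stores small integer codes that index into a list of levels, so counts can be accumulated in a plain vector indexed by code instead of hashing every element. The following is only a minimal sketch in Base Julia under that assumption; the `codes`/`levels` layout and the name `countmap_codes` are hypothetical and are not the actual StatsBase or CategoricalArrays API.

```julia
# Hypothetical pooled layout: `codes[i]` is the index of element i's level in `levels`.
# Assumes every code is in 1:length(levels).
function countmap_codes(codes::AbstractVector{<:Integer}, levels::AbstractVector)
    counts = zeros(Int, length(levels))   # one slot per level
    for c in codes
        counts[c] += 1                    # direct indexing, no hashing of values
    end
    return Dict(zip(levels, counts))      # level => count, including zero counts
end

result = countmap_codes([1, 2, 1, 3, 1], ["a", "b", "c"])
```

Unlike a generic element-by-element count, this version also reports levels that never occur, with a count of zero.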

@johnmyleswhite
Member

I'd be happy to see cut not be automatically linked to categorical variables, in which case I think it makes sense for it to live in StatsBase.

@nalimilan
Member Author

Ah, so that's a third possibility. The implicit assumption in my description was that it would return a CategoricalArray by default, with the option of passing a different return type as the first argument. Do you suggest returning an Array{String} by default? On the one hand, it sounds simpler; on the other hand, it would be a terrible waste of space and of information (orderedness).
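
As a rough illustration of the "return type as first argument" idea, here is a sketch that uses only Base Julia. Everything in it is an assumption made for the example: the name `cut_sketch`, the half-open interval convention, and the label format are hypothetical and do not describe any existing implementation.

```julia
# Index of the half-open bin [breaks[i], breaks[i+1]) that contains each value.
function bin_codes(x::AbstractVector{<:Real}, breaks::AbstractVector{<:Real})
    issorted(breaks) || throw(ArgumentError("breaks must be sorted"))
    return map(x) do v
        i = searchsortedlast(breaks, v)   # last break that is <= v
        1 <= i < length(breaks) || throw(ArgumentError("value $v is outside the breaks"))
        i
    end
end

# One label per bin, e.g. "[0, 10)".
bin_labels(breaks) = ["[$(breaks[i]), $(breaks[i + 1]))" for i in 1:length(breaks) - 1]

# The first argument selects the return type; the default is Vector{String}.
cut_sketch(::Type{Vector{Int}}, x, breaks) = bin_codes(x, breaks)
cut_sketch(::Type{Vector{String}}, x, breaks) = bin_labels(breaks)[bin_codes(x, breaks)]
cut_sketch(x, breaks) = cut_sketch(Vector{String}, x, breaks)

codes = cut_sketch(Vector{Int}, [1, 12, 5], [0, 10, 20])
labels = cut_sketch([1, 12, 5], [0, 10, 20])
```

The integer codes keep the ordering of the bins, whereas the string result repeats a full label for every element and only sorts lexicographically, which is the loss of space and orderedness mentioned above.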

@nalimilan
Member Author

Any more comments?

@ararslan
Member

I like the idea of a type argument to allow returning a CategoricalArray, but I'm not sure introducing a dependency on CategoricalArrays is the best choice for StatsBase. It seems to me that StatsBase should try to remain as lightweight as possible (indeed, its only package dependency is currently Compat, not even StatsFuns). With that in mind, I'm not sure what the best default would be, e.g. Array{String} or something similar, but it would make sense to me for it to be some Base type.

@andreasnoack
Member

I'd vote for having it in CategoricalArrays. Wouldn't there be other statistical functions you'd like to define for CategoricalArrays such that a dependency on StatsBase would be required anyway? Maybe we could define a placeholder method in StatsBase such that other array packages can define their own.
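
A placeholder of this kind could look roughly like the sketch below: a generic function declared with no methods in one module and extended in another. The module names, the `BinnedVector` type, and the method signature are all hypothetical stand-ins chosen for the example, not the real packages or their API.

```julia
module StatsBaseSketch            # hypothetical stand-in for StatsBase
    export cut
    # Generic function with zero methods; other packages add their own.
    function cut end
end

module ArrayPackageSketch         # hypothetical stand-in for an array package
    import ..StatsBaseSketch: cut # import so that new methods extend the same function

    struct BinnedVector
        codes::Vector{Int}        # bin index of each element
        breaks::Vector{Float64}   # bin boundaries
    end

    # This package's own method for the placeholder (no range checking in this sketch).
    cut(x::AbstractVector{<:Real}, breaks::AbstractVector{<:Real}) =
        BinnedVector([searchsortedlast(breaks, v) for v in x], collect(Float64, breaks))
end

using .StatsBaseSketch
binned = cut([1, 12], [0, 10, 20])
```

Calling `cut` before any package has added a method would raise a `MethodError`, which is the surprising behavior for users that a stub without an implementation brings.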

@ararslan
Member

> Maybe we could define a placeholder method in StatsBase such that other array packages can define their own.

That's consistent with what we do with the modeling stuff (though that should be moved to StatsModels).

@nalimilan
Member Author

It would be weird to provide a function in StatsBase without any implementation, though. This is less surprising for statistical models since you cannot fit them without additional packages anyway. Also I'm not sure what other packages would need to define methods for cut.

I'm fine with implementing cut in CategoricalArrays. Though at some point I'm afraid keeping StatsBase lightweight could turn out to be limiting, unless we create another heavier stats package. It feels weird that one needs to load one package to call cut (because of the dependency on CategoricalArrays), and another package to compute frequency tables (because of the dependency on NamedArrays); there are probably other examples of this. It could be useful to provide a batteries included package once the ecosystem settles.

@andreasnoack
Member

> It could be useful to provide a batteries included package once the ecosystem settles.

I completely agree with that, but when we originally discussed StatsBase it also seemed that StatsBase shouldn't be the batteries included statistics package. For example, a package for full-featured statistical modeling would have to depend on Distributions, but we decided that the dependency order should be the other way around. Maybe it is time to create an umbrella package for statistics so that users have an easy way to get a lot of functionality. The main problem is the DataFrames situation, but I guess we could still begin work on an umbrella package if we clearly state that it is a work in progress.

@ararslan
Member

Perhaps we could repurpose Stats.jl for that. 🙂

@nalimilan
Member Author

Yes, the name StatsBase is quite explicit. We could put extra features into Stats instead, or have it load and reexport extra packages (incidentally, that's what it's called in R).
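
For readers unfamiliar with reexporting, the sketch below shows the general mechanism with plain Base Julia modules: an umbrella module brings in a name from a lighter module and exports it again. The module and function names are hypothetical placeholders, not the real Stats or StatsBase packages.

```julia
module LightStatsSketch               # hypothetical stand-in for a lightweight package
    export mean_sketch
    mean_sketch(x) = sum(x) / length(x)
end

module UmbrellaStatsSketch            # hypothetical stand-in for an umbrella package
    using ..LightStatsSketch: mean_sketch
    export mean_sketch                # reexport: users of the umbrella get the name too
end

# Loading only the umbrella module makes `mean_sketch` available.
using .UmbrellaStatsSketch
m = mean_sketch([1, 2, 3])
```

With this arrangement the lightweight module keeps its small dependency footprint, while users who want everything load only the umbrella.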
