Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should the faster countmap for CategoricalArray sit in this repo? #340

Open
xiaodaigh opened this issue Jan 14, 2018 · 9 comments
Open

Should the faster countmap for CategoricalArray sit in this repo? #340

xiaodaigh opened this issue Jan 14, 2018 · 9 comments

Comments

@xiaodaigh
Copy link
Contributor

Should the faster countmap for CategoricalArray sit in this repo? Or in the CategoricalArrays.jl. Arguments can be made for both. But I am leaning towards CategoricalArrays.jl.

So that

  1. This only supports Base types
  2. Changes in CategoricalArrays.jl should not impact StatsBase.jl
@nalimilan
Copy link
Member

That's a difficult question. I don't think CategoricalArrays should depend on StatsBase given that StatsBase depends on other packages like SpecialFunctions, which isn't pure Julia. OTOH StatsBase could depend on CategoricalArrays since it doesn't add new dependencies (except that we are about to add a dependency on JSON, which is unfortunate but needed for now, see JuliaData/CategoricalArrays.jl#96). I'm not sure how others feel about this. See also #228.

For now my approach has been to support CategoricalArrays in FreqTables. Maybe we should move counting functionality there, and include FreqTables in the Stats meta-package we've been talking about for a long time. The advantage of that approach is that people wouldn't be lead to believe we only provide very basic functions like countmap and counts, which are not very user-friendly and only accept one variable at a time.

@xiaodaigh
Copy link
Contributor Author

How about a CategoricalArraysTools.jl package where things like faster sort and countmap can go?

@nalimilan
Copy link
Member

Well, these functions are not only for CategoricalArray, and actually your faster sorting methods are for strings. I think it makes more sense to have frequency tables in FreqTables and sorting in SortingAlgorithms.

@xiaodaigh
Copy link
Contributor Author

I mean there are goinfg to be sort and countmap that are specialised for CategoricalArrays

@nalimilan
Copy link
Member

Yes, but since these functions also support other types it would be weird to have to load a package which has "CategoricalArray" in its name.

@xiaodaigh
Copy link
Contributor Author

I don't mean all fast sorting methods. Just the methods that has signature sort(::CategoricalArrays,...) and also countmap(::CategoricalArrays,...). Those specialised functions sit in CategoricalArrayTools.jl

That would work I think.

@nalimilan
Copy link
Member

Then you'd have the slow methods by default, unless you do using CategoricalArrayTools? That's an unusual pattern AFAIK and it's not very intuitive.

@xiaodaigh
Copy link
Contributor Author

thats a good point. so we should have them in CategoricalArrays

@nalimilan
Copy link
Member

For reference, optimized methods can now be implemented in StatsBase using DataAPI.refarray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants