optionally make AbstractDataVec be like an R factor #6

Closed
HarlanH opened this issue Jul 15, 2012 · 20 comments
@HarlanH
Contributor

HarlanH commented Jul 15, 2012

There should be a way to enforce a fixed set of pool items in a DV, and to optionally flag the ordering as important. It may also be useful to have meta-data for contrast construction -- or maybe this isn't the appropriate place for it (cf. R).

@doobwa
Contributor

doobwa commented Aug 15, 2012

Allowing a predetermined set of pool items is implemented in 2711fd8.

I agree that ordering flags and contrast options are still needed.

@HarlanH
Contributor Author

HarlanH commented Sep 23, 2012

Some valuable discussion in #58. This story should be expanded -- define methods for AbstractDataVec to allow categorical/factor-like behavior, with varying performance trade-off depending on Pooled or non-pooled implementations.

@HarlanH
Contributor Author

HarlanH commented Sep 25, 2012

OK, here's my interface-level proposal. Implementation details would differ between DataVecs and PooledDataVecs.

Each ADV would have a field called datatype::DataType, probably implemented as:

bitstype 8 DataType
@enum DataType NOMINAL ORDINAL INTERVAL RATIO

An ADV{T<:Number} would default to the RATIO type (except maybe Bool, which might default to NOMINAL). Any other element type would default to NOMINAL. Non-numeric types can only be NOMINAL or ORDINAL. Numeric types can be set to any of the four, which would give proper categorical behavior for, e.g., UIDs.
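The defaulting rule above could be expressed with dispatch on the element type. This is only a sketch in current Julia syntax: `MeasurementLevel`, `default_level`, and `allowed_level` are hypothetical names (the enum is not called `DataType` here because that name is taken by a built-in Julia type), and treating Bool as NOMINAL is the "maybe" from the proposal, not a settled decision.

```julia
# Hypothetical names for illustration; not an existing package API.
@enum MeasurementLevel NOMINAL ORDINAL INTERVAL RATIO

# Default level chosen from the element type.
default_level(::Type{Bool}) = NOMINAL        # assumption: Bool is treated as categorical
default_level(::Type{<:Number}) = RATIO      # every other numeric element type
default_level(::Type) = NOMINAL              # strings and everything else

# Numeric element types may take any level; others only NOMINAL or ORDINAL.
allowed_level(::Type{<:Number}, level::MeasurementLevel) = true
allowed_level(::Type, level::MeasurementLevel) = level in (NOMINAL, ORDINAL)
```

Under this sketch a UID column would simply be an integer vector whose level is set to NOMINAL by hand.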

Each ADV would have an optional Domain, which can be a Set or Range. If present, elements would be checked for membership against the Domain, and an error thrown if an element is not in the Domain. A common use case would be an ASCIIString DV with NOMINAL type and a Set of possible values, which would be equivalent to an R factor.
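A minimal sketch of the membership check described above, written against a plain `Vector` rather than a real DataVec. `checked_push!` is a hypothetical helper, and throwing a `DomainError` is one possible choice of error type, not something the proposal specifies.

```julia
# Hypothetical helper: append `value` only if it belongs to the declared domain.
function checked_push!(data::Vector{T}, domain::Union{Set{T},AbstractRange{T}}, value) where {T}
    value in domain || throw(DomainError(value, "value is not in the declared domain"))
    return push!(data, value)
end

levels = Set(["Low", "Medium", "High"])
x = ["Low", "Medium", "High"]

checked_push!(x, levels, "Medium")    # accepted: "Medium" is in the domain
# checked_push!(x, levels, "Tiny")    # would throw a DomainError
```

The same function accepts a range as the domain, for example `checked_push!([1, 2], 1:5, 3)`.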

Each ADV of ORDINAL type may have an Ordering specified, which is a function that provides an ordering of the elements, à la isless(a,b). By default, isless is used, which gives alphanumeric ordering for strings, numeric ordering for numbers, chronological ordering for dates (if we had a date type), etc. A common use case would be an ASCIIString DV with ORDINAL type and a Domain with Ordering "Completely Agree", "Agree", "Neither Agree nor Disagree", etc.
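One way to build such an Ordering is to rank each level by its position in an ordered list. The sketch below uses plain vectors and hypothetical names (`levels_in_order`, `level_rank`, `ordinal_isless`); it shows the idea, not how a DataVec would store it.

```julia
# Levels listed from lowest to highest; the position defines the order.
levels_in_order = ["Low", "Medium", "High"]
level_rank = Dict(level => i for (i, level) in enumerate(levels_in_order))

# An isless-style comparison that uses the declared order instead of string order.
ordinal_isless(a, b) = level_rank[a] < level_rank[b]

responses = ["High", "Low", "Medium", "Low"]

sorted = sort(responses; lt = ordinal_isless)
below_medium = [ordinal_isless(r, "Medium") for r in responses]
```

Sorting with `lt = ordinal_isless` puts "Low" first and "High" last, whereas the default string order would put "High" first.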

Methods might look like:

# maximally verbose way -- there would be shortcuts
x = DataVec(["Low", "Medium", "High"])
setType(x, ORDINAL)
setDomain(x, ["High", "Medium", "Low"])
orderingDict = {"High" => 3, "Medium" => 2, "Low" => 1}
setOrdering(x, (a,b) -> isless(orderingDict[a], orderingDict[b]))

# or probably something like this could be made to do the same:
x = DataVec(["Low", "Medium", "High"], @options datatype=ORDINAL)

# use the obvious ways
if getType(x) == ORDINAL
  ...
end

push(x, "Medium") #OK
push(x, "Tiny") #error!

Statistical routines would read this meta-data and act appropriately when building model matrices and similar.

Notes:

x[1] < x[3] will give false, because it evaluates as "Low" < "High" using plain string comparison, which ignores the Ordering. Not sure if there's any way around this. But x .< "Medium" should give the expected answer.

Because implementation details would differ for a DataVec vs PooledDataVec, probably want to use getters and setters instead of fields.

The Domain for PDVs would presumably double as the pool.

Thoughts, @doobwa , @johnmyleswhite , @tshort ?

@doobwa
Contributor

doobwa commented Sep 25, 2012

Thanks for digging into this.

One concern is that some methods will behave differently depending on the type. Would it be cleaner to have NominalDataVec, OrdinalDataVec, etc? Seems like it might get a bit crowded if we go that direction. (One might view this as an implementation detail, but since we're talking about having a DataType field I figured it was fair game.) For example, how would mean(dv) work?

@tshort
Contributor

tshort commented Sep 25, 2012

I'm not very proficient here, but it looks well thought out. As far as function names, I'd prefer settype, setdomain, and so on. Or, use underscores (set_type, set_domain). camelCase doesn't seem to be used much in Julia code.

Concatenation or other combining may get tricky for some combinations.

@HarlanH
Contributor Author

HarlanH commented Sep 25, 2012

Chris, that's an interesting idea. It might well be more Julian to use the type system and multiple dispatch here. I wonder then how pooled data might work? Without multiple inheritance, we'd either need NominalDataVec and NominalPooledDataVec <: AbstractNominalDataVec, or NominalPooledDataVec and OrdinalPooledDataVec <: AbstractPooledDataVec. Or, we could just have Nominal and Ordinal be Pooled by implementation, and Interval and Ratio be non-pooled? That would be closer to what we have now.
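As a rough illustration of the last option (Nominal and Ordinal pooled by implementation, Ratio non-pooled), the hierarchy might look like the sketch below in current Julia syntax. All type and function names are hypothetical, and `supports_mean` is only there to show how the earlier question about mean(dv) could be answered by dispatch.

```julia
abstract type AbstractDataVec{T} end
abstract type AbstractCategoricalVec{T} <: AbstractDataVec{T} end

# Pooled by implementation: refs index into the pool of distinct values.
struct NominalVec{T} <: AbstractCategoricalVec{T}
    refs::Vector{UInt16}
    pool::Vector{T}
end

struct OrdinalVec{T} <: AbstractCategoricalVec{T}
    refs::Vector{UInt16}
    pool::Vector{T}      # assumption: pool order doubles as the level order
end

# Non-pooled numeric storage.
struct RatioVec{T<:Number} <: AbstractDataVec{T}
    data::Vector{T}
end

# Behavior that depends on the measurement level is chosen by dispatch.
supports_mean(::AbstractDataVec) = false
supports_mean(::RatioVec) = true
```

Shared categorical behavior would then be written once against `AbstractCategoricalVec`.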

Tom, yes, combinations are an interesting point that I hadn't thought about yet.

More to ponder...!


@tshort
Contributor

tshort commented Sep 25, 2012

I had to do some googling just to figure out what each of these meant. I'm inclined to think that your "by implementation" idea is the best:

"Or, we could just have Nominal and Ordinal be Pooled by implementation, and Interval and Ratio be non-pooled?"

It might make sense to have Nominal and Ordinal share an abstract type because they will share some functions.

Is it really worth it to separate out Ratio and Interval types? I haven't run across that in R before.

@HarlanH
Contributor Author

HarlanH commented Sep 26, 2012

It may or may not be useful to have Interval. It's not supported by R -- you're right. See this Wikipedia page: http://en.wikipedia.org/wiki/Level_of_measurement I don't know whether making this distinction would be useful or annoying in practice.

The only question about the R-like solution, with Nominal and Ordinal being Pooled, is what to do with things like the "Categorical User ID" case. We certainly can allow that to be part of a NominalDataVec{UInt64, UInt64} (both the pool and the data are UInt64s), but it's going to be quite space-inefficient, with a fairly major performance hit relative to a non-pooled implementation. The question is whether that's enough motivation to have both the NominalDataVec and NominalPooledDataVec cases, making everything that much more complex.
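The space cost can be made concrete with a small sketch. It assumes every ID is distinct and that both the references and the pool are stored as UInt64, as in the NominalDataVec{UInt64, UInt64} case above; the variable names are illustrative only.

```julia
ids = UInt64.(1:1_000)                       # 1,000 distinct user IDs

# Pooled layout: one pool entry per distinct ID plus one reference per element.
pool = unique(ids)
slot = Dict(v => UInt64(i) for (i, v) in enumerate(pool))
refs = [slot[v] for v in ids]

plain_bytes = sizeof(ids)                    # storage for the non-pooled vector
pooled_bytes = sizeof(refs) + sizeof(pool)   # storage for refs plus pool
```

With no repeated values the pooled layout needs twice the memory of the plain vector, and every element access goes through an extra lookup.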


@tshort
Contributor

tshort commented Sep 26, 2012

For the "Categorical User ID" case, my first thought is to just use NominalDataVecs. If there's demand, a non-pooled type could be added as another type that shares an abstract type with Nominal and Ordinal.

On naming, what do you think of Factor as the name for NominalDataVec or as the abstract type that covers Nominal and Ordinal? That might help R users.

@doobwa
Contributor

doobwa commented Sep 26, 2012

Or how about CategoricalVec? Didn't pandas end up using Categorical as a name for this?

I agree that Factor would be good for R converts like me, but I never really liked that name in the first place.

@HarlanH
Contributor Author

HarlanH commented Sep 26, 2012

I'm OK with us eventually ending up with R's solution (although I agree with Chris -- Categorical or Nominal is better than Factor); I just want us to make sure it's the most reasonable option...

Here's another random thought. What if we make a distinction between is-a and has-a relationships in the type hierarchy? That is, what if the is-a relationships (and user-visible types) are Nominal/Categorical, Ordinal, Interval (maybe) and Ratio, but objects of these types have (rather than are) VectorDataVec (what we now call DataVec) or PooledDataVec objects in a (Union?) slot? Then, as long as VDV and PDV objects have a consistent interface (via an abstract type above them), the N/O/I/R objects can use either one. We'd set things up so that Interval and Ratio always use VDVs, while Nominal and Ordinal start with PDVs but can convert to VDVs if they overflow a 16-bit pool. The Pooled/Vector distinction would be entirely invisible to the user.

In the long run, this might make additional sense when we start thinking about optimizations for memory-mapped and indexed data, where instead of a single underlying vector in the VDV case, you probably want a blocked data structure instead.

(I started writing this proposal as an unlikely brainstorm, but now I sorta like it...!)
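A minimal sketch of the has-a arrangement, in current Julia syntax. The names `AbstractStorage`, `VectorStorage`, `PooledStorage`, `getvalue`, and `Nominal` are hypothetical stand-ins for the VDV/PDV types discussed above; the point is only that the user-visible type forwards to whichever storage it holds.

```julia
# Storage layer: two interchangeable layouts behind one abstract type.
abstract type AbstractStorage{T} end

struct VectorStorage{T} <: AbstractStorage{T}
    data::Vector{T}
end

struct PooledStorage{T} <: AbstractStorage{T}
    refs::Vector{UInt16}
    pool::Vector{T}
end

getvalue(s::VectorStorage, i::Integer) = s.data[i]
getvalue(s::PooledStorage, i::Integer) = s.pool[s.refs[i]]
Base.length(s::VectorStorage) = length(s.data)
Base.length(s::PooledStorage) = length(s.refs)

# User-visible layer: a Nominal *has* a storage object rather than *being* one.
struct Nominal{T}
    storage::AbstractStorage{T}
end

Base.getindex(n::Nominal, i::Integer) = getvalue(n.storage, i)
Base.length(n::Nominal) = length(n.storage)

a = Nominal{String}(VectorStorage(["x", "y", "x"]))
b = Nominal{String}(PooledStorage(UInt16[1, 2, 1], ["x", "y"]))
```

Here `a` and `b` hold the same values and answer `a[i]` and `b[i]` identically, so code written against `Nominal` never sees which layout is in use.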


@ghost

ghost commented Sep 26, 2012

It sounds like it's worth implementing or trying out some test code to see how it feels. On the conversion when overflowing a 16-bit pool, I think we still need provisions for a larger pool. This is especially important for strings; it doesn't take many repeats to justify having a pool.
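One possible provision for larger pools is to pick the reference width from the number of distinct values instead of fixing it at 16 bits. The sketch below is an assumption about how that could work; `ref_type` and `pool_values` are hypothetical names.

```julia
# Narrowest unsigned integer type that can index a pool of `n` entries.
function ref_type(n::Integer)
    if n <= typemax(UInt8)
        return UInt8
    elseif n <= typemax(UInt16)
        return UInt16
    elseif n <= typemax(UInt32)
        return UInt32
    else
        return UInt64
    end
end

# Split a vector into (refs, pool), choosing the reference width from the pool size.
function pool_values(values::AbstractVector)
    pool = unique(values)
    R = ref_type(length(pool))
    slot = Dict(v => R(i) for (i, v) in enumerate(pool))
    refs = R[slot[v] for v in values]
    return refs, pool
end

refs, pool = pool_values(["a", "b", "a", "a"])
```

For strings the pool pays off quickly, since each repeat costs only one small reference instead of another copy of the string.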

@HarlanH
Contributor Author

HarlanH commented Sep 26, 2012

OK, I'll plan on starting a "newdatavec" branch soon and playing with some of these ideas...


@johnmyleswhite
Contributor

I've been thinking about this lately as model matrices are close to being the only major hole for me left in DataFrames. I'm starting to think that R's factor type is an error: it conflates the storage properties of our PooledDataVec with the modeling properties of a categorical variable.

Put another way: there's no reason why the categoricalness of a variable needs to depend upon the way in which it's stored. If I want to store a categorical variable as a Float64, that shouldn't be a problem.

This line of argument leads to thinking of Factor as a property of Formula and not a new data type. The only trouble with that is the absence of pre-specified levels for that factor.

But that's actually a serious problem for DataStreams as well, because you don't want to assume that you will read the entire data set just to learn about the levels of a factor. You want to have a natural way of specifying the levels manually.

@HarlanH
Contributor Author

HarlanH commented Dec 13, 2012

Yes. I agree that conflating storage and types of data is a problem. Although R has its global string pool that minimizes some of the issues, at least for strings.

Do you have any thoughts about the Nominal/Ordinal/Interval/Ratio property idea? Would that address your concerns?

Is Factor a property of Formula, or an operation you can apply to a DataVec (of whatever type) to form contrasts?

I don't see why DataStreams can't use a Nominal type and just grow the set of levels as they're seen.

@johnmyleswhite
Contributor

I like the idea of distinguishing all of the classical levels of measurement. I think that Factor might need to be split into Ordinal, etc. if we do that.

I would think that Factor could be both a keyword for the Formula DSL and an operation you can do inside of Julia to produce dummy variables, like pandas' get_dummies() function.
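As a sketch of the explicit operation, the function below builds one indicator column per level from any vector and a manually supplied list of levels. `dummies` is a hypothetical name; contrast coding for a model matrix would usually also drop a reference level, which this sketch does not do.

```julia
# One 0/1 indicator column per level; rows follow the order of `values`.
function dummies(values::AbstractVector, levels::AbstractVector)
    m = zeros(Int, length(values), length(levels))
    for (j, level) in enumerate(levels), (i, v) in enumerate(values)
        m[i, j] = (v == level)
    end
    return m
end

colors = ["red", "blue", "red"]
indicator = dummies(colors, ["red", "green", "blue"])
```

Because the levels are passed in rather than discovered, a level that never occurs ("green" here) still gets its own all-zero column, and the storage type of `values` does not matter.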

The trouble with DataStreams is that growing the set of levels could be a nightmare for things like fitting a logistic regression online using SGD. Suddenly you need to insert a new value/column/matrix section into all of your parameter estimates. It's doable, but a hassle. It gets much worse when you have things like online estimation of a Hessian that's derived from the parameters, which are derived from the dummy columns. In that case a new dummy column has to send signals to all of the other data structures that they need to be enlarged.

@HarlanH
Contributor Author

HarlanH commented Dec 13, 2012

Yes, I like having both implicit and explicit control over dummy variables.

That DataStream problem seems like an inherent problem that we're not going to be able to fix with better data structures. It needs an algorithmic solution...


@johnmyleswhite
Contributor

I agree: we need an algorithmic solution. My sense is that you need to specify in advance all of the levels for a DataStream's factors, possibly using a PooledDataVec that has unseen levels pre-allocated. My thinking on this is still pretty hazy, but I'm probably only a week or two away from releasing general-purpose SGD code for simple linear models fit to arbitrary DataStreams, as long as there are no categorical variables involved.
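A sketch of what declaring the levels in advance could look like for streamed rows. `FixedLevels`, `nlevels`, and `encode` are hypothetical names, and raising an error on an undeclared level is just one possible policy.

```julia
# Levels are declared once, before any data is read.
struct FixedLevels{T}
    slot::Dict{T,Int}
end

FixedLevels(levels::AbstractVector{T}) where {T} =
    FixedLevels{T}(Dict(level => i for (i, level) in enumerate(levels)))

nlevels(f::FixedLevels) = length(f.slot)

# Encode one streamed value as an indicator vector of fixed length.
function encode(f::FixedLevels, value)
    x = zeros(Float64, nlevels(f))
    x[f.slot[value]] = 1.0        # an undeclared level raises a KeyError
    return x
end

color_levels = FixedLevels(["red", "green", "blue"])
weights = zeros(nlevels(color_levels))    # parameter vector sized up front

row = encode(color_levels, "green")
```

Since the width of every encoded row is known before the stream starts, the parameter vector and anything derived from it never needs to be enlarged mid-fit.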

nalimilan pushed a commit that referenced this issue Jul 8, 2017
Deprecate complete_cases() if favor of completecases(), and complete_cases!() in favor of dropnull!(). Add a dropnull() variant.

Also change completecases() to return a BitArray instead of an Array{Bool}.
rofinn pushed a commit that referenced this issue Aug 17, 2017
Deprecate complete_cases() if favor of completecases(), and complete_cases!() in favor of dropnull!(). Add a dropnull() variant.

Also change completecases() to return a BitArray instead of an Array{Bool}.
nalimilan pushed a commit that referenced this issue Aug 25, 2017
Deprecate complete_cases() if favor of completecases(), and complete_cases!() in favor of dropnull!(). Add a dropnull() variant.

Also change completecases() to return a BitArray instead of an Array{Bool}.
quinnj pushed a commit that referenced this issue Sep 2, 2017
Deprecate complete_cases() if favor of completecases(), and complete_cases!() in favor of dropnull!(). Add a dropnull() variant.

Also change completecases() to return a BitArray instead of an Array{Bool}.
@quinnj
Member

quinnj commented Sep 7, 2017

@nalimilan, is this covered by your work in CategoricalArrays.jl now?

@nalimilan
Member

Yes, it's so old that I'm not even sure what this issue was about.

nalimilan pushed a commit that referenced this issue May 26, 2022
Replace read_rda() by FileIO integration
6 participants