optionally make AbstractDataVec be like an R factor #6

HarlanH · 2012-07-15T15:03:07Z

There should be a way to enforce a fixed set of pool items in a DV, and to optionally flag the ordering as important. It may also be useful to have meta-data for contrast construction -- or maybe this isn't the appropriate place for it (cf. R).

doobwa · 2012-08-15T19:04:54Z

Allowing a predetermined set of pool items is implemented in 2711fd8.

I agree that ordering flags and contrast options are still needed.

HarlanH · 2012-09-23T14:31:30Z

Some valuable discussion in #58. This story should be expanded -- define methods for AbstractDataVec to allow categorical/factor-like behavior, with varying performance trade-off depending on Pooled or non-pooled implementations.

HarlanH · 2012-09-25T18:17:33Z

OK, here's my interface-level proposal. Implementation details would differ between DataVecs and PooledDataVecs.

Each ADV would have a field called datatype::DataType, probably implemented as:

bitstype 8 DataType
@enum DataType NOMINAL ORDINAL INTERVAL RATIO

An ADV{T<:Number} would default to a RATIO type (except maybe Bool, which might default to NOMINAL). Any other element type would default to NOMINAL. Non-numeric types can only be NOMINAL or ORDINAL. Numeric types can be set to any of the four, which would give proper categorical behavior for, e.g., UIDs.
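A minimal sketch of these defaults in current Julia syntax. The names `MeasurementLevel`, `default_level`, and `allowed_levels` are hypothetical and are not part of the proposal; the enum is renamed here only because `DataType` is already a type in Julia's Base.

```julia
# Hypothetical names; `MeasurementLevel` avoids clashing with Base.DataType.
@enum MeasurementLevel NOMINAL ORDINAL INTERVAL RATIO

# Defaults from the proposal, selected by dispatch on the element type.
default_level(::Type{Bool}) = NOMINAL        # the "maybe" case in the proposal
default_level(::Type{<:Number}) = RATIO
default_level(::Type) = NOMINAL              # everything non-numeric

# Non-numeric element types may only be NOMINAL or ORDINAL.
allowed_levels(::Type{<:Number}) = (NOMINAL, ORDINAL, INTERVAL, RATIO)
allowed_levels(::Type) = (NOMINAL, ORDINAL)

default_level(Float64)   # RATIO
default_level(String)    # NOMINAL
```

Using dispatch for the defaults keeps the rule table in one place; whether the level lives in a field or in the type itself is the question the rest of the thread debates.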

Each ADV would have an optional Domain, which can be a Set or Range. If present, elements would be checked for membership against the Domain, and an error thrown if an element is not in the Domain. A common use case would be an ASCIIString DV with NOMINAL type and a Set of possible values, which would be equivalent to an R factor.

Each ADV of ORDINAL type may have an Ordering specified, which is a function that provides an ordering of the elements, à la isless(a,b). By default, isless is used, which gives alphanumeric ordering for strings, numeric ordering for numbers, chronological ordering for dates (if we had a date type), etc. A common use case would be an ASCIIString DV with ORDINAL type and a Domain whose Ordering is "Completely Agree", "Agree", "Neither Agree nor Disagree", etc.

Methods might look like:

# maximally verbose way -- there would be shortcuts
x = DataVec(["Low", "Medium", "High"])
setType(x, ORDINAL)
setDomain(x, ["High", "Medium", "Low"])
orderingDict = {"High" => 3, "Medium" => 2, "Low" => 1}
setOrdering(x, (a,b) -> isless(orderingDict[a], orderingDict[b]))

# or probably something like this could be made to do the same:
x = DataVec(["Low", "Medium", "High"], @options datatype=ORDINAL)

# use the obvious ways
if getType(x) == ORDINAL
  ...
end

push(x, "Medium") #OK
push(x, "Tiny") #error!
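
The interface above can be approximated in current Julia as follows. This is only a sketch under stated assumptions: `OrdinalVec`, `level_rank`, and `isless_ordinal` are hypothetical names, and the Domain and Ordering are collapsed into one ordered list of levels instead of a separate Set plus comparison function.

```julia
# Hypothetical sketch: a plain vector plus an ordered list of permitted values.
struct OrdinalVec{T}
    data::Vector{T}
    levels::Vector{T}   # the Domain, listed from lowest to highest
    function OrdinalVec(data::Vector{T}, levels::Vector{T}) where {T}
        all(x -> x in levels, data) ||
            throw(ArgumentError("data contains a value outside the domain"))
        new{T}(data, levels)
    end
end

# The position of a value in the level list doubles as its rank.
level_rank(v::OrdinalVec, x) = findfirst(y -> y == x, v.levels)

# Comparison that follows the declared ordering, not string order.
isless_ordinal(v::OrdinalVec, a, b) = level_rank(v, a) < level_rank(v, b)

# Appending checks membership in the domain first.
function Base.push!(v::OrdinalVec, x)
    x in v.levels || throw(ArgumentError("$(repr(x)) is not in the domain"))
    push!(v.data, x)
    return v
end

x = OrdinalVec(["Low", "Medium", "High"], ["Low", "Medium", "High"])
[isless_ordinal(x, e, "Medium") for e in x.data]   # [true, false, false]
push!(x, "Medium")                                  # accepted
# push!(x, "Tiny") would throw an ArgumentError
```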

Statistical routines would read this meta-data and act appropriately when building model matrices and similar.

Notes:

x[3] < x[1] will give true, because this evaluates as "High" < "Low" under plain string comparison, ignoring the Ordering. Not sure if there's any way around this. But x .< "Medium" should give the expected answer.

Because implementation details would differ for a DataVec vs PooledDataVec, probably want to use getters and setters instead of fields.

The Domain for PDVs would presumably double as the pool.

Thoughts, @doobwa , @johnmyleswhite , @tshort ?

doobwa · 2012-09-25T18:50:39Z

Thanks for digging into this.

One concern is that some methods will behave differently depending on the type. Would it be cleaner to have NominalDataVec, OrdinalDataVec, etc? Seems like it might get a bit crowded if we go that direction. (One might view this as an implementation detail, but since we're talking about having a DataType field I figured it was fair game.) For example, how would mean(dv) work?
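
To make the mean(dv) question concrete, here is a small sketch of the separate-types alternative. The type names `AbstractLevelVec`, `NominalVec`, and `RatioVec` are hypothetical, and the behavior chosen for nominal data (throwing an error) is one assumption among several possible ones.

```julia
using Statistics  # standard library; provides `mean`

abstract type AbstractLevelVec{T} end

struct NominalVec{T} <: AbstractLevelVec{T}
    data::Vector{T}
end

struct RatioVec{T<:Number} <: AbstractLevelVec{T}
    data::Vector{T}
end

# With one type per measurement level, dispatch decides what `mean` does.
Statistics.mean(v::RatioVec) = mean(v.data)
Statistics.mean(v::NominalVec) =
    throw(ArgumentError("mean is undefined for nominal data"))

mean(RatioVec([1.0, 2.0, 3.0]))   # 2.0
# mean(NominalVec([101, 102, 103])) would throw, even though the IDs are numeric
```

With a datatype field instead, the same decision would be an `if` on the field inside a single `mean` method.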

tshort · 2012-09-25T18:53:28Z

I'm not very proficient here, but it looks well thought out. As far as function names, I'd prefer settype, setdomain, and so on. Or, use underscores (set_type, set_domain). camelCase doesn't seem to be used much in Julia code.

Concatenation or other combining may get tricky for some combinations.

HarlanH · 2012-09-25T20:50:33Z

Chris, that's an interesting idea. It might well be more Julian to use the
type system and multiple dispatch here. I wonder then how pooled data might
work? Without multiple inheritance, we'd either need NominalDataVec and
NominalPooledDataVec <: AbstractNominalDataVec, or NominalPooledDataVec and
OrdinalPooledDataVec <: AbstractPooledDataVec. Or, we could just have
Nominal and Ordinal be Pooled by implementation, and Interval and Ratio be
non-pooled? That would be closer to what we have now.

Tom, yes, combinations are an interesting point that I hadn't thought about
yet.

More to ponder...!

tshort · 2012-09-25T23:44:18Z

I had to do some googling just to figure out what each of these meant. I'm inclined to think that your "by implementation" idea is the best:

"Or, we could just have Nominal and Ordinal be Pooled by implementation, and Interval and Ratio be non-pooled?"

It might make sense to have Nominal and Ordinal share an abstract type because they will share some functions.

Is it really worth it to separate out Ratio and Interval types? I haven't run across that in R before.

HarlanH · 2012-09-26T00:22:39Z

It may or may not be useful to have Interval. It's not supported by R --
you're right. See this Wikipedia page:
http://en.wikipedia.org/wiki/Level_of_measurement
I don't know whether making this distinction would be useful or annoying
in practice.

The only question about the R-like solution, with Nominal and Ordinal being
Pooled, is what to do with things like the "Categorical User ID" case. We
certainly can allow that to be part of a NominalDataVec{UInt64, UInt64}
(both the pool and the data are UInt64s), but it's going to be quite
space-inefficient, with a fairly major performance hit relative to a
non-pooled implementation. The question is whether that's enough motivation
to have both the NominalDataVec and NominalPooledDataVec cases, and making
everything that much more complex.
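
A back-of-the-envelope sketch of the space cost, assuming every user ID is distinct and ignoring container overhead and any lookup table a pooled vector might keep. The helper names `plain_bytes` and `pooled_bytes` are hypothetical.

```julia
# Rough storage estimates in bytes for n elements of type T.
plain_bytes(n, ::Type{T}) where {T} = n * sizeof(T)

# A pooled layout stores one reference of type R per element,
# plus one pool entry of type T per distinct value.
pooled_bytes(n, nunique, ::Type{T}, ::Type{R}) where {T,R} =
    n * sizeof(R) + nunique * sizeof(T)

n = 1_000_000
plain_bytes(n, UInt64)               # 8_000_000
pooled_bytes(n, n, UInt64, UInt64)   # 16_000_000 when every ID is distinct
```

Under these assumptions the pooled layout takes twice the space of the plain vector, because the pool saves nothing when no value repeats.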

tshort · 2012-09-26T00:37:09Z

For the "Categorical User ID" case, my first thought is to just use NominalDataVecs. If there's demand, a non-pooled type could be added as another type that shares an abstract type with Nominal and Ordinal.

On naming, what do you think of Factor as the name for NominalDataVec or as the abstract type that covers Nominal and Ordinal? That might help R users.

doobwa · 2012-09-26T00:43:30Z

Or how about CategoricalVec? Didn't pandas end up using Categorical as a name for this?

I agree that Factor would be good for R converts like me, but I never really liked that name in the first place.

HarlanH · 2012-09-26T00:55:11Z

I'm OK with us eventually ending up with R's solution (although I agree
with Chris -- Categorical or Nominal is better than Factor), I just want us
to make sure it's the most reasonable option...

Here's another random thought. What if we make a distinction between is-a
and has-a relationships in the type hierarchy. That is, what if the is-a
relationships (and user-visible types) are Nominal/Categorical, Ordinal,
Interval (maybe) and Ratio. But objects of these types have (rather than
are) VectorDataVec (what we now call DataVec) or PooledDataVec objects in
a (Union?) slot. Then, as long as VDV and PDV objects have a consistent
interface (via an abstract type above them), the N/O/I/R objects can use
either one. Then, we'd set things up so that Interval and Ratio always use
VDVs, while Nominal and Ordinal start with PDVs but can convert to VDVs if
they overflow a 16-bit pool. The Pooled/Vector distinction would be
entirely invisible to the user.
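
A rough sketch of the has-a idea, with hypothetical names throughout (`AbstractStorage`, `VectorStorage`, `PooledStorage`, `Nominal`, `getvalue`). As a simplification, it picks the storage once at construction time instead of converting when a pool overflows later.

```julia
abstract type AbstractStorage{T} end

struct VectorStorage{T} <: AbstractStorage{T}
    values::Vector{T}
end

struct PooledStorage{T} <: AbstractStorage{T}
    refs::Vector{UInt16}   # 16-bit indices into the pool
    pool::Vector{T}
end

# The consistent interface both storage kinds must provide.
Base.length(s::VectorStorage) = length(s.values)
Base.length(s::PooledStorage) = length(s.refs)
getvalue(s::VectorStorage, i) = s.values[i]
getvalue(s::PooledStorage, i) = s.pool[s.refs[i]]

# The user-visible type *has* a storage object instead of *being* one.
struct Nominal{T}
    storage::AbstractStorage{T}
end

function Nominal(values::Vector{T}) where {T}
    pool = unique(values)
    if length(pool) > typemax(UInt16)
        # Too many distinct values for a 16-bit pool: fall back to a plain vector.
        return Nominal{T}(VectorStorage(values))
    end
    index = Dict(v => UInt16(i) for (i, v) in enumerate(pool))
    refs = UInt16[index[v] for v in values]
    return Nominal{T}(PooledStorage(refs, pool))
end

Base.length(n::Nominal) = length(n.storage)
Base.getindex(n::Nominal, i::Integer) = getvalue(n.storage, i)

small = Nominal(["a", "b", "a"])       # pooled behind the scenes
large = Nominal(collect(1:70_000))     # more than 65_535 distinct values
```

Indexing and `length` behave the same for `small` and `large`, so the Pooled/Vector distinction stays out of the caller's view.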

In the long run, this might make additional sense when we start thinking
about optimizations for memory-mapped and indexed data, where instead of a
single underlying vector in the VDV case, you probably want a blocked data
structure instead.

(I started writing this proposal as an unlikely brainstorm, but now I sorta
like it...!)

ghost · 2012-09-26T01:07:20Z

It sounds like it's worth implementing or trying out some test code to see how it feels. On the conversion when overflowing a 16-bit pool, I think we still need provisions for a larger pool. This is especially important for strings; it doesn't take many repeats to justify having a pool.

HarlanH · 2012-09-26T11:01:07Z

OK, I'll plan on starting a "newdatavec" branch soon and playing with some of
these ideas...

johnmyleswhite · 2012-12-13T22:23:39Z

I've been thinking about this lately as model matrices are close to being the only major hole for me left in DataFrames. I'm starting to think that R's factor type is an error: it conflates the storage properties of our PooledDataVec with the modeling properties of a categorical variable.

Put another way: there's no reason why the categoricalness of a variable needs to depend upon the way in which it's stored. If I want to store a categorical variable as a Float64, that shouldn't be a problem.

This line of argument leads to thinking of Factor as a property of Formula and not a new data type. The only trouble with that is the absence of pre-specified levels for that factor.

But that's actually a serious problem for DataStreams as well, because you don't want to assume that you will read the entire data set just to learn about the levels of a factor. You want to have a natural way of specifying the levels manually.

HarlanH · 2012-12-13T22:45:15Z

Yes. I agree that conflating storage and types of data is a problem, although R has its global string pool, which minimizes some of the issues, at least for strings.

Do you have any thoughts about the Nominal/Ordinal/Interval/Ratio property idea? Would that address your concerns?

Is Factor a property of Formula, or an operation you can apply to a DataVec (of whatever type) to form contrasts?

I don't see why DataStreams can't use a Nominal type and just grow the set of levels as they're seen.

johnmyleswhite · 2012-12-13T22:56:51Z

I like the idea of distinguishing all of the classical levels of measurement. I think that Factor might need to be split into Ordinal, etc. if we do that.

I would think that Factor could be both a keyword for the Formula DSL and an operation you can do inside of Julia to produce dummy variables like pandas' get_dummies() function.
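
As an illustration of the "operation" half of that idea, here is a sketch of a dummy-variable helper that takes the levels explicitly, so that it works on values of any storage type. The name `dummies` and its signature are hypothetical and do not correspond to an existing function.

```julia
# Hypothetical helper: one indicator column per level, in the order given.
function dummies(values::AbstractVector, levels::AbstractVector)
    m = zeros(Int, length(values), length(levels))
    for (i, v) in enumerate(values)
        j = findfirst(isequal(v), levels)
        j === nothing && throw(ArgumentError("unexpected level $(repr(v))"))
        m[i, j] = 1
    end
    return m
end

# Levels are given up front, so "b" gets a column even though it never occurs.
dummies(["a", "c", "a"], ["a", "b", "c"])   # rows: [1 0 0], [0 0 1], [1 0 0]

# The same helper works for a categorical variable stored as Float64.
dummies([1.0, 2.0, 1.0], [1.0, 2.0])        # rows: [1 0], [0 1], [1 0]
```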

The trouble with DataStreams is that growing the set of levels could be a nightmare for things like fitting a logistic regression online using SGD. Suddenly you need to insert a new value/column/matrix section into all of your parameter estimates. It's doable, but a hassle. It gets much worse when you have things like online estimation of a Hessian that's derived from the parameters, which are derived from the dummy columns. In that case a new dummy column has to send signals to all of the other data structures that they need to be enlarged.
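
A toy sketch of the bookkeeping described here, with hypothetical names (`OnlineDummies`, `column!`). It tracks only one coefficient per level; a real online fit would have to enlarge every dependent structure in the same way.

```julia
# Hypothetical online state: one coefficient per level seen so far.
struct OnlineDummies{T}
    levels::Vector{T}
    coef::Vector{Float64}
end

# Return the dummy column for `x`, enlarging the state if the level is new.
function column!(s::OnlineDummies, x)
    j = findfirst(isequal(x), s.levels)
    if j === nothing
        push!(s.levels, x)
        push!(s.coef, 0.0)   # a Hessian estimate would need a new row and column too
        j = length(s.levels)
    end
    return j
end

s = OnlineDummies(["a"], [0.5])
column!(s, "a")   # 1, nothing changes
column!(s, "b")   # 2, and both `levels` and `coef` have grown
```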

HarlanH · 2012-12-13T22:59:54Z

Yes, I like having both implicit and explicit control over dummy variables.

That DataStream problem seems like an inherent problem that we're not going
to be able to fix with better data structures. It needs an algorithmic
solution...

johnmyleswhite · 2012-12-13T23:04:31Z

I agree: we need an algorithmic solution. My sense is that you need to specify in advance all of the levels for a DataStream's factors, possibly using a PooledDataVec that has unseen levels pre-allocated. My thinking on this is still pretty hazy, but I'm probably only a week or two away from releasing general purpose SGD code for simple linear models fit to arbitrary DataStreams as long as there are no categorical variables involved.

quinnj · 2017-09-07T03:05:38Z

@nalimilan, is this covered by your work in CategoricalArrays.jl now?

nalimilan · 2017-09-07T13:02:52Z

Yes, it's so old that I'm not even sure what this issue was about.

HarlanH mentioned this issue Jul 19, 2012

Metadata for columns and/or DataFrames #35

Closed

HarlanH mentioned this issue Sep 23, 2012

PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58

Closed

HarlanH mentioned this issue Jan 25, 2013

PooledDataArray's with more than 2^16 levels #172

Closed

This was referenced Feb 21, 2013

Add levels!/unique! for PooledDataArrays. #201

Closed

More PooledDataArray level-related functions #203

Closed

nalimilan mentioned this issue Nov 9, 2013

Rename "makefactors" argument to readtable() #399

Closed

cjprybol mentioned this issue Aug 18, 2017

WIP: DataTables.jl Backport #1214

Closed

4 tasks

nalimilan closed this as completed Sep 7, 2017

nalimilan pushed a commit that referenced this issue Jan 29, 2019

Fix #6 by supporting AbstractString inputs

7ca4166

nalimilan pushed a commit that referenced this issue May 26, 2022

Merge pull request #6 from JuliaStats/ast/fileio_integration

2ff1091

Replace read_rda() by FileIO integration

nalimilan pushed a commit that referenced this issue May 26, 2022

Merge pull request #6 from JuliaStats/ast/fileio_integration

0ba4dd4

Replace read_rda() by FileIO integration
