Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PooledDataVecs should have a user-specifiable type parameter allowing 1, 2, 4, or 8-byte levels #58

Closed
doobwa opened this issue Sep 13, 2012 · 8 comments
Labels

Comments

@doobwa
Copy link
Contributor

doobwa commented Sep 13, 2012

The motivation: in some instances PooledDataVecs might have many levels (i.e. .more than 65000). The plan is to automatically convert the refs to Uint32 if we are about to overflow. One option is to make PooledDataVec a parametric type depending both on the type in the vector but also on the type of the reference, e.g. PooledDataVec{U, T}.

See #55 for a discussion.

@HarlanH
Copy link
Contributor

HarlanH commented Sep 16, 2012

Hm, my original intention was for PooledDataVec to be an implementation of AbstractDataVec, with different performance characteristics but (mostly or maybe entirely) identical interfaces to DataVec. Thus the UInt16 decision and the proposal for auto-conversion. Note that a PooledDataVec was NOT intended to be either necessary or sufficient for Factor-types, just the typical underlying representation. I'd prefer that we build appropriate meta-data and methods for either type that allows it to be used as a categorical/factor, with arbitrary element types, which would address Chris' UserID use case, I think. If you've got unique values in each row, the additional indirection doesn't help performance any. (cc @tshort , @StefanKarpinski for comments...)

@doobwa
Copy link
Contributor Author

doobwa commented Sep 17, 2012

Aside from the typical number of levels, what is your view on the difference between PooledDataVecs and Factors/CategoricalVec? Do you see one as a subset of the other? For me, the differences are currently mostly just semantics. On the other hand, we may want Factor to contain some additional metadata over what PooledDataVecs currently contain (e.g. perhaps a default contrast?). To help this conversation along, would you mind elaborating on some of your motivating cases you have in mind for PooledDataVecs?

@StefanKarpinski
Copy link
Member

For the record, I think that automatic conversion on overflow is a bad way to go. It adds checking overhead left and right and it will make all sorts of type inference things go completely haywire. It's much better, imo, to just let programmer pick how big a type they want, and change it easily. If and when you encounter the case that you need more levels, change the type in the user code.

I think Harlan's got the right idea about abstracting categorical data as an abstract type and having pooled vectors be a specific implementation. It's a perfectly valid use case to have unique categorical values and want to treat them categorically, but not pool their storage (since there's only one of each value). I'm not sure, but I'm guessing most cases with 65K+ levels would be in that case, although maybe not. Adding all those additional pointers (8 bytes each!) is not going to help when dealing with big data.

That was why it seemed likely to me that having pooled vectors use Uint16 would be sufficient. But you can always have pooled vector types with a type parameter (using only one byte when there are ≤ 256 levels is a nice win) and have separate unpooled categorical vectors so that you can pick the best implementation for each use case.

@doobwa
Copy link
Contributor Author

doobwa commented Sep 17, 2012

Now that I realize that we're planning a separate type for categorical data, I agree that automatic conversion is a bad direction. Maybe I should close this? Is there an existing issue for developing/discussing this new type?

@HarlanH
Copy link
Contributor

HarlanH commented Sep 23, 2012

OK, yes, let's call #6 the issue for better supporting categorical values. I'll make a note there about the issues that arose here.

@HarlanH
Copy link
Contributor

HarlanH commented Sep 23, 2012

And this issue should be to allow PDV's to have parametric sizes, but still defaulting to 2 bytes.

@johnmyleswhite
Copy link
Contributor

This shouldn't too hard if we make PDA's a parametric type whose parameter is the type of the refs. I already pulled all of the hardcoded references to Uint16 out of the code, which will hopefully make this change easier.

@garborg
Copy link
Contributor

garborg commented Jan 12, 2015

Closing here. At least partially resolved upstream.

@garborg garborg closed this as completed Jan 12, 2015
nalimilan pushed a commit that referenced this issue May 26, 2022
Support AltRep SXP and related fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants