Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stop using bizarre types #451

Closed
lewisl opened this issue Jun 8, 2019 · 6 comments
Closed

stop using bizarre types #451

lewisl opened this issue Jun 8, 2019 · 6 comments

Comments

@lewisl
Copy link

lewisl commented Jun 8, 2019

When CSV reads a csv file and creates a DataFrame, it uses bizarro types for the columns. These types are unsupported by most packages in Base.

There is no reason for this:

CSV.Column{Union{Missing, String},Union{Missing, PooledString}}

Just use the standard column types used by the DataFrames package. All you do is make work for others. You also create work for the maintainers of Base, assuming they are willing to create methods to accommodate your new type. Generally, it should become unJulian to create new types when an existing type is essentially the same. The bar should be VERY HIGH on creating new types.

Please eliminate these types. Functions that don't work with them include convert, lowercase, and occursin. CORRECTION: The problem was there were actually missing values present, which broke the conversions or input. My comment still stands. Types and methods are great; overuse is not.

@lewisl
Copy link
Author

lewisl commented Jun 8, 2019

To explain context, I could not use skip missing because I needed to generate a mask based on occursin that would have the same index positions as numeric arrays (one-hot) generated from the original DataFrame.

Thus,


function find_in_col(searchstr, col)
    res = []
    for it in col
        if ismissing(it)
            push!(res, false)
        else 
            push!(res, occursin(searchstr, lowercase(it)))
        end
    end
    return res
end

Now, that I have this I could make an array comprehension, but it's quick enough and this is an infrequent task of data preparation, not something that occurs in a loop.

@JeffBezanson
Copy link

Base has a lot of support for the Missing type. Yes, there are sometimes reasons to prefer e.g. DataValue, but I'm not aware of any missing data type in Julia that automatically supports every function, if that is even possible.

@quinnj
Copy link
Member

quinnj commented Jun 9, 2019

These types are unsupported by most packages in Base.

On the contrary, CSV.Column implements the AbstractArray interface, which makes it work with a huge number of packages, including Base and stdlibs.

There is no reason for this

Again, on the contrary, using a custom, read-only CSV.Column type allows for an important, 3-fold mix of useful functionality: 1) we don't need to materialize a full Vector, but Column is a view into the underlying file data itself w/ position/type information, this is at least 2-3x memory savings; 2) it's vastly more efficient for the common case when a user doesn't need every column in a file anyway, this can lead to 15-30% performance improvements when parsing files; and 3) by implementing the AbstractArray interface, it can be used with pretty much any data ecosystem package for full functionality.

Just use the standard column types used by the DataFrames package

I'm curious where/why you believe there are "standard" column types that DataFrames uses? DataFrames has tried very hard over the years to abstract away all columnar operations to work on any AbstractArray, including PooledArrays, CategoricalArrays, StringArray, NamedArrays, and the list goes on. Or are you referring to the difference between Int64 and Union{Int64, Missing}? Using a Union{T, Missing}, is definitely the standard, blessed way to represent missing data in Julia (see the official manual chapter here).

Generally, it should become unJulian to create new types when an existing type is essentially the same. The bar should be VERY HIGH on creating new types.

I very much disagree with this statement. IMO, one of the most powerful features of Julia is allowing customer, user-defined types that participate in full compiler optimization and special treatment as those provided by the language itself. This allows powerful things like all of the custom array types, custom number types, and endless optimization, machine learning, and differential equation packages to all work seamlessly together. Standard interfaces are indeed critically important to implement and adhere to, but custom types open up a world of data structure optimization w/o inconveniencing users. If you've somehow gotten off on the wrong foot in Julia by thinking it's bad to define custom types, I'd urge you to re-read the official Julia docs manula and chat on the public slack or discourse forums; it's a pretty widely held belief that custom types are a bread and butter type of feature of the language.

It sounds like you've resolved your issue then? Do you have any more comments or concerns? In the future, I'd recommend opening first-time issues w/ a bit more politeness; it's always fair to question why things were done a certain way and actually very helpful to point out inconveniences you run into; it's less productive to start off demanding a change one way or another without perhaps having the full context of why certain design decisions were made. This is open source, I work on this in the evenings after I put my kids to bed because I know the space well and enjoy the challenge of applying a fun, performant new language and its features to a pretty old software problem (reading csvs); I'm always happy to discuss the whys or hows of what's going on the in the package, but I'd prefer to do it with a polite tone.

@quinnj quinnj closed this as completed Jun 9, 2019
@lewisl
Copy link
Author

lewisl commented Jun 9, 2019 via email

@nalimilan
Copy link
Member

FWIW if you prefer to get a DataFrame with more classic types (at the expense of making a copy), you can use DataFrame(CSV.File(...)) or CSV.read(..., copycols=true).

@lewisl
Copy link
Author

lewisl commented Jun 9, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants