stop using bizarre types #451

lewisl · 2019-06-08T16:47:33Z

When CSV reads a csv file and creates a DataFrame, it uses bizarro types for the columns. These types are unsupported by most packages in Base.

There is no reason for this:

CSV.Column{Union{Missing, String},Union{Missing, PooledString}}

Just use the standard column types used by the DataFrames package. All you do is make work for others. You also create work for the maintainers of Base, assuming they are willing to create methods to accommodate your new type. Generally, it should become unJulian to create new types when an existing type is essentially the same. The bar should be VERY HIGH on creating new types.

Please eliminate these types. Functions that don't work with them include convert, lowercase, and occursin. CORRECTION: The problem was there were actually missing values present, which broke the conversions or input. My comment still stands. Types and methods are great; overuse is not.


lewisl · 2019-06-08T17:13:52Z

To explain the context: I could not use skipmissing because I needed to generate a mask based on occursin that would have the same index positions as the numeric arrays (one-hot) generated from the original DataFrame.

Thus,


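# Build a Bool mask with one entry per element of col,
# treating missing entries as "no match".
function find_in_col(searchstr, col)
    res = Bool[]  # typed vector rather than Vector{Any}
    for it in col
        if ismissing(it)
            push!(res, false)
        else
            push!(res, occursin(searchstr, lowercase(it)))
        end
    end
    return res
end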

Now that I have this, I could turn it into an array comprehension, but it's quick enough, and this is an infrequent data-preparation task, not something that runs in a loop.
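For reference, the comprehension mentioned above could look roughly like the sketch below. The name `find_in_col_mask` and the sample data are hypothetical, not from the thread; the logic is the same as `find_in_col`, with missing entries mapped to `false` so the mask keeps the original index positions.

```julia
# Comprehension form of find_in_col; `find_in_col_mask` is a made-up name.
# The short-circuit && ensures lowercase/occursin never see a missing value.
find_in_col_mask(searchstr, col) =
    [!ismissing(it) && occursin(searchstr, lowercase(it)) for it in col]

col = ["Apple Pie", missing, "banana", "PINEAPPLE"]  # illustrative data
mask = find_in_col_mask("apple", col)
```

Because every element of the comprehension is a `Bool`, the result is a `Vector{Bool}` of the same length as the input column.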

JeffBezanson · 2019-06-08T23:15:22Z

Base has a lot of support for the Missing type. Yes, there are sometimes reasons to prefer e.g. DataValue, but I'm not aware of any missing data type in Julia that automatically supports every function, if that is even possible.
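As a rough illustration of that point, the sketch below uses only Base functions on a small made-up vector. It shows a few operations that handle `missing` directly, and one (`lowercase`) that has no method for it, which matches the kind of failure reported at the top of the thread.

```julia
v = [1, missing, 3]               # illustrative data

total = sum(skipmissing(v))       # iterate over non-missing entries only
filled = coalesce.(v, 0)          # replace each missing with a default
propagated = v .+ 1               # arithmetic propagates missing
flags = ismissing.(v)             # elementwise test for missing

# Not every function accepts missing: lowercase has no method for it.
threw = try
    lowercase(missing)
    false
catch err
    err isa MethodError
end
```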

quinnj · 2019-06-09T00:13:49Z

> These types are unsupported by most packages in Base.

On the contrary, CSV.Column implements the AbstractArray interface, which makes it work with a huge number of packages, including Base and stdlibs.

> There is no reason for this

Again, on the contrary, using a custom, read-only CSV.Column type allows for an important, 3-fold mix of useful functionality: 1) we don't need to materialize a full Vector, but Column is a view into the underlying file data itself w/ position/type information, this is at least 2-3x memory savings; 2) it's vastly more efficient for the common case when a user doesn't need every column in a file anyway, this can lead to 15-30% performance improvements when parsing files; and 3) by implementing the AbstractArray interface, it can be used with pretty much any data ecosystem package for full functionality.
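To make the AbstractArray point concrete, here is a minimal sketch of a read-only vector type that parses values on access instead of storing a materialized `Vector`. `LazyIntColumn` is a hypothetical type invented for illustration and is not how `CSV.Column` is actually implemented; the vector of strings stands in for positions into the file data.

```julia
# Hypothetical read-only column: each cell is parsed from raw text when indexed.
struct LazyIntColumn <: AbstractVector{Union{Int, Missing}}
    cells::Vector{String}   # stand-in for offsets into underlying file data
end

# The two methods required by the AbstractArray interface for a read-only array.
Base.size(c::LazyIntColumn) = (length(c.cells),)
function Base.getindex(c::LazyIntColumn, i::Int)
    s = c.cells[i]
    return isempty(s) ? missing : parse(Int, s)   # empty cell means missing
end
Base.IndexStyle(::Type{LazyIntColumn}) = IndexLinear()

col = LazyIntColumn(["10", "", "32"])

total = sum(skipmissing(col))      # generic Base functions work unchanged
n_missing = count(ismissing, col)
materialized = collect(col)        # copy into an ordinary Vector when needed
```

With only `size` and `getindex` defined, iteration, reductions, and `collect` come from Base's generic fallbacks, which is the mechanism that lets a custom column type interoperate with other packages.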

> Just use the standard column types used by the DataFrames package

I'm curious where/why you believe there are "standard" column types that DataFrames uses? DataFrames has tried very hard over the years to abstract away all columnar operations to work on any AbstractArray, including PooledArrays, CategoricalArrays, StringArray, NamedArrays, and the list goes on. Or are you referring to the difference between Int64 and Union{Int64, Missing}? Using a Union{T, Missing} is definitely the standard, blessed way to represent missing data in Julia (see the official manual chapter: https://docs.julialang.org/en/latest/manual/missing/).
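A small Base-only sketch of the `Union{T, Missing}` distinction, using made-up data: a vector whose element type merely allows `missing` converts cleanly to the narrower type, while one that actually contains a `missing` does not. This is the situation described in the correction to the original post.

```julia
clean = Union{Int, Missing}[1, 2, 3]        # missing allowed, none present
dirty = Union{Int, Missing}[1, missing, 3]  # an actual missing value

narrowed = convert(Vector{Int}, clean)      # works: every element is an Int

# Conversion fails once a real missing is present.
failed = try
    convert(Vector{Int}, dirty)
    false
catch err
    err isa MethodError                     # no conversion from Missing to Int
end
```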

> Generally, it should become unJulian to create new types when an existing type is essentially the same. The bar should be VERY HIGH on creating new types.

I very much disagree with this statement. IMO, one of the most powerful features of Julia is allowing custom, user-defined types that participate in full compiler optimization and receive the same special treatment as those provided by the language itself. This allows powerful things like all of the custom array types, custom number types, and endless optimization, machine learning, and differential equation packages to all work seamlessly together. Standard interfaces are indeed critically important to implement and adhere to, but custom types open up a world of data structure optimization w/o inconveniencing users. If you've somehow gotten off on the wrong foot in Julia by thinking it's bad to define custom types, I'd urge you to re-read the official Julia manual and chat on the public slack or discourse forums; it's a pretty widely held belief that custom types are a bread-and-butter feature of the language.

It sounds like you've resolved your issue then? Do you have any more comments or concerns? In the future, I'd recommend opening first-time issues w/ a bit more politeness; it's always fair to question why things were done a certain way and actually very helpful to point out inconveniences you run into; it's less productive to start off demanding a change one way or another without perhaps having the full context of why certain design decisions were made. This is open source; I work on this in the evenings after I put my kids to bed because I know the space well and enjoy the challenge of applying a fun, performant new language and its features to a pretty old software problem (reading csvs). I'm always happy to discuss the whys or hows of what's going on in the package, but I'd prefer to do it with a polite tone.

lewisl · 2019-06-09T01:21:36Z

All fair. Actual missing values were the problem. I respect you guys and need to tone it down.

There is a trend toward more types. The burden of accommodating new types falls on the consuming function/module/package, so one package’s new type creates a burden diffused across many other functions. Implementing new methods, or verifying that a more inclusive abstract type can be used, can lag. It would help if Julia had typecasting so that a sending (returning) function could choose safe types to offer as alternative returns. That has performance implications, but it would provide a stopgap until key packages have new methods or updated function signatures.

I find myself doing more manual conversions. Examples are substrings (most Base string functions accept them, but some packages do not). Subarrays, reshaped arrays, and deferred transpose arrays all cause some problems. I expect some challenges with SparseArrays, but it is amazing how many array functions just work, albeit more slowly (unavoidable).

I understand that deferring the in-memory rearranging provides performance and reduced-memory benefits. But it seems these benefits are being delivered even below their break-even threshold, favoring high-end cases that some research communities really benefit from, at the risk of increased complexity for many (some?) other users. These deferred (or “lazy”) types offer real benefits, but usage is not always transparent and the benefit might be realized primarily at very large scale.

I have one module that relies on preallocating arrays and updating in place for performance. These have to be preallocated with a concrete type, so I have to jump through some hoops with Union types or by selectively preallocating different types of arrays. It’s usually a surprise when I discover a new type to handle. Have you ever looked at the type for views on unions of different types of multidimensional arrays, some of which can use fixed strides and some of which cannot? It all can work, but it gets complicated.
As a simple example, there is probably no way to make returning strings instead of type SubString the default (for example...) while offering an option to return the substring type for someone doing a Shakespeare concordance. The more sophisticated developer could use the named parameter that changes the return type. A different trade-off. There seems to be a tendency to prefer advanced usage and sophisticated use cases, which can hurt the approachability of the language. The features are awesome, but the trade-offs should at least be thought about.

Open source economics are challenging. There are no paychecks, until and unless corporate consumers of the software decide it’s worth paying some people, at a foundation or among their own employees, who “get” to devote a large % of their time to key infrastructure OSS. Julia has received some nice grants but is still too young in its life cycle to get the support to pay more “volunteers”.

- Lewis
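The manual conversions mentioned in this comment can be sketched with Base and the LinearAlgebra standard library, as below. The data is made up; each line turns a lazy or view-like value into its plain, independently stored counterpart, at the cost of a copy.

```julia
using LinearAlgebra   # standard library; provides the lazy Transpose wrapper

s = "hello world"
sub = SubString(s, 1, 5)         # lazy slice that shares storage with s
str = String(sub)                # independent String copy

A = [1 2; 3 4]
v = view(A, :, 1)                # SubArray: no copy of the first column
col1 = collect(v)                # plain Vector{Int}

At_lazy = transpose(A)           # lazy wrapper, no data moved
At = Matrix(At_lazy)             # materialized Matrix{Int}
```

Whether the copy is worth it depends on data size, which is the trade-off being debated in the thread.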

nalimilan · 2019-06-09T10:54:35Z

FWIW if you prefer to get a DataFrame with more classic types (at the expense of making a copy), you can use DataFrame(CSV.File(...)) or CSV.read(..., copycols=true).

lewisl · 2019-06-09T16:08:33Z

Thank you. I’ll try both. I’d forgotten about the copycols parameter. You are right that both solve it, giving everyone the choice of what works in a given situation. This should be the approach for any of the “lazy evaluation” methods (informal use of the word methods) that provide performance and memory conservation when needed for huge datasets and not for smaller things: a really simple choice.

quinnj closed this as completed Jun 9, 2019
