RFC/Julep: Introduce getproperty on Array for built-in data tables. #30646

Open: 1 commit to be merged into master

andyferris
Member

@andyferris andyferris commented Jan 8, 2019

This is a bit of a fun one, while the v1.2 development cycle is young. I'm not sure if this approach will receive universal support, but discussion of this may be helpful, so here goes.

Julia is an awesome language for maths with arrays (linear algebra) and for a while now I've been wondering if we can make it equally ergonomic and awesome for other typical data operations with array-like data structures, such as tables. For example, relations are collections of (named) tuples and Array{<:NamedTuple} seems like a relatively natural candidate to behave as a relation. In fact, the community (e.g. Tables.jl) generally seems to be trending towards "table rows are things that we can do getproperty on to get cell values" and "table columns iterate cell values", and from tables themselves we might want to extract rows or columns via iteration or getproperty, respectively.

This PR uses a simple trick to make our built-in arrays friendlier to use as data tables. It automatically broadcasts getproperty over Array (greedily via map in this basic implementation, though in theory we could take a lazy approach and support setproperty! and so on). Here's a simple example:

julia> table = [(a=1, b=true), (a=2, b=false), (a=3, b=true)]
3-element Array{NamedTuple{(:a, :b),Tuple{Int64,Bool}},1}:
 (a = 1, b = true) 
 (a = 2, b = false)
 (a = 3, b = true) 

julia> table.a
3-element Array{Int64,1}:
 1
 2
 3

julia> table.b
3-element Array{Bool,1}:
  true
 false
  true
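
For readers who want to try this behaviour without patching Base, here is a minimal sketch. It assumes a hypothetical helper named `column` (not part of this PR) that eagerly maps `getproperty` over the elements, which is roughly what `table.a` does in the proposal.

```julia
# Hypothetical helper (not part of the PR): eagerly map `getproperty`
# over the elements, mirroring what `table.a` does in this proposal.
column(A::AbstractArray, name::Symbol) = map(x -> getproperty(x, name), A)

table = [(a=1, b=true), (a=2, b=false), (a=3, b=true)]

col_a = column(table, :a)   # new vector holding the `a` values
col_b = column(table, :b)   # new vector holding the `b` values
```

Like the basic implementation in the PR, this allocates a fresh array on every call.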

Finally, it's notable that this is useful for general data manipulation in broader contexts. It works on structs and so on, not just NamedTuple elements. For example, for doing complex linear algebra this is relatively nice:

julia> a = rand(Complex{Float64}, 2, 2)
2×2 Array{Complex{Float64},2}:
 0.876236+0.403676im  0.349794+0.233915im
 0.476042+0.530344im  0.913785+0.263034im

julia> a.re
2×2 Array{Float64,2}:
 0.876236  0.349794
 0.476042  0.913785

julia> a.im
2×2 Array{Float64,2}:
 0.403676  0.233915
 0.530344  0.263034

Looking forward speculatively, I wonder if we could make this a part of the AbstractArray spec for Julia 2.0 (obviously a breaking change) where library writers would use getfield and helper functions to manipulate custom array internals, and mostly external users don't directly access the fields/properties of arrays anyway.

A simple trick to make our built-in arrays behave as data tables.
Automatically broadcasts `getproperty` over array `Array` (greedily via
`map`).
@andyferris andyferris added speculative Whether the change will be implemented is speculative julep Julia Enhancement Proposal needs tests Unit tests are required for this change needs docs Documentation for this change is required labels Jan 8, 2019
@c42f
Member

c42f commented Jan 8, 2019

This seems tantalizing and thought provoking and obviously incomplete :-) Some quick thoughts.

We should have a good native interface for relational algebra, I agree 100% with that. If that ends up involving getproperty with AbstractArray, that's a good reason to do something like this to "reserve expectations", even though we couldn't follow it all the way through until 2.0.

It seems to me that we'd want a view-based version of this for memory efficiency. Also, ideally, have setindex work, though the difficulties of #11902 seem the same as what we'd have here?

Agreed that getproperty seems to be a great interface for "data things" and the ecosystem is converging there. But does it encourage people to do the wrong thing for "non data" uses? (Is .re and .im really a public interface for Complex? I don't think so.)

@vchuravy
Member

vchuravy commented Jan 8, 2019

But does it encourage people to do the wrong thing for "non data" uses? (Is .re and .im really a public interface for Complex? I don't think so.)

For me that is somewhat the crux here. It encourages people to know and care about the internals of data types, which up till now we have discouraged for the sake of composability. I am somewhat fine with that for the category of types that are records/rows, but the Complex example already makes me hesitant about this.

The slightly crazy proposal would be to make A.real work, which would return a FieldView(A, f), where f is the accessor function, but now we are chiefly in the realm of using lazy array operations and I don't think we want to go down the road of A.real.abs
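
A lazy view of this kind could look roughly like the sketch below. Only the name `FieldView` and its argument order come from the comment above; the type parameters, the constructor, and the use of the internal `Base.promote_op` to choose an element type are assumptions made for illustration.

```julia
# Hypothetical lazy view: applies the accessor `f` to each element on access.
struct FieldView{T,N,A<:AbstractArray,F} <: AbstractArray{T,N}
    parent::A
    f::F
end

function FieldView(A::AbstractArray, f)
    # Base.promote_op is an internal helper that asks inference for the
    # result type of `f`; it is used here only to pick an element type.
    T = Base.promote_op(f, eltype(A))
    return FieldView{T,ndims(A),typeof(A),typeof(f)}(A, f)
end

# Minimal read-only AbstractArray interface.
Base.size(v::FieldView) = size(v.parent)
Base.getindex(v::FieldView{T,N}, I::Vararg{Int,N}) where {T,N} = v.f(v.parent[I...])

A = [complex(1.0, 2.0) complex(3.0, 4.0);
     complex(5.0, 6.0) complex(7.0, 8.0)]

re = FieldView(A, real)   # no copy is made; elements are computed on access
```

Using the accessor function `real` rather than the field name `re` keeps the view independent of how Complex is laid out internally.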

@piever
Contributor

piever commented Jan 8, 2019

I wanted to point out that all of this is already implemented in StructArrays:

julia> using StructArrays

julia> table = StructArray([(a=1, b=true), (a=2, b=false), (a=3, b=true)])
3-element StructArray{NamedTuple{(:a, :b),Tuple{Int64,Bool}},1,NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Bool,1}}}}:
 (a = 1, b = true) 
 (a = 2, b = false)
 (a = 3, b = true) 

julia> table.a
3-element Array{Int64,1}:
 1
 2
 3

julia> table.b
3-element Array{Bool,1}:
  true
 false
  true

julia> a = StructArray{Complex{Float64}}(undef, 2, 2);

julia> rand!(a)
2×2 StructArray{Complex{Float64},2,NamedTuple{(:re, :im),Tuple{Array{Float64,2},Array{Float64,2}}}}:
 0.212628+0.705269im   0.396557+0.974033im
 0.452444+0.0929322im   0.42598+0.739531im

julia> a.re
2×2 Array{Float64,2}:
 0.212628  0.396557
 0.452444  0.42598 

with the extra advantage that storage is column-wise, so that table.a is not allocating.

The reason why using StructArray is not fully satisfying is that standard operations in Base tend to give you an Array, broadcast being the chief example, for instance things like broadcast((a,b) -> (a=a, b=b), 1:10, 1:10) return an Array even though storing the result as a StructArray may be more efficient, especially as the number of fields in the named tuple increases.

So a counter-proposal would be, rather than using Array of NamedTuple as a table, to "by default" materialize iterators of NamedTuple to a StructArray. Given that broadcast does not always return an Array (for example, it returns a BitArray when the return type is Bool), what I would find ideal is to define a concept of defaultarray(::Type), which is the default array for a given type. It could be BitArray for Bool, DataValueArray for DataValue, StructArray for NamedTuple, StringArray for AbstractString, CategoricalArray for categorical values etc., analogously to arrayof in IndexedTables. Possible ways forward would be:

  1. The extreme option: move all packages that define the defaultarray for types in Base to Base and make this the default materializer for broadcast (not sure this is a good idea)

  2. A milder option: create a low-dependency package that defines defaultarray as well as a macro @defaultarray that acts analogously to @views and changes all calls to broadcast (and potentially other things like list comprehension) to actually materialize the result in the array type recommended by defaultarray. Base only needs to support having custom materializer in these operations (broadcast and maybe map and list comprehension), but this is I imagine non-breaking.
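
As a rough illustration of the `defaultarray` idea, the sketch below uses only container types that ship with Base. The method definitions and the helper `collect_default` are hypothetical; a real design would hook into broadcast rather than convert after the fact.

```julia
# Hypothetical trait: which container type should hold elements of type T.
defaultarray(::Type{T}) where {T} = Array{T}
defaultarray(::Type{Bool}) = BitArray

# Hypothetical materializer: collect an iterator, then convert the result
# to the container recommended by `defaultarray`.
function collect_default(itr)
    v = collect(itr)
    return defaultarray(eltype(v))(v)
end

flags = collect_default(x > 1 for x in 1:3)   # Bool elements end up in a BitArray
nums  = collect_default(2x for x in 1:3)      # Int elements stay in an Array
```

A package providing StructArray could then add a method such as `defaultarray(::Type{<:NamedTuple})` without Base needing to know about it.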

I hope the "counter-proposal" does not derail the discussion, I wrote it here as I think it's relevant in that in my view it would make this PR less necessary.

@JeffBezanson
Sponsor Member

Well said, @c42f , I agree.

I don't like this specific proposal, since it is basically a "vectorized" function of the sort we don't tend to have anymore. Now, I know the alternative map(_.x, A) (which we can't even write yet of course) is really verbose. I don't have a great solution. $ is available; maybe A$x? Unfortunately .. is not really available since it's a separate operator used for other things.

So a counter-proposal would be, rather than using Array of NamedTuple as a table, to "by default" materialize iterators of NamedTuple to a StructArray.

That would be totally fine. Just to state the obvious though, an array of NamedTuples should have the same interface. It would be nice to have something like StructArrays more integrated with Base functions, as you describe.

@davidanthoff
Contributor

Query.jl also uses this kind of construct for groups: if one groups a table, one gets an array of Groupings, and each group is an AbstractArray of rows. And you can use the g.colA syntax to then access a column in this group, essentially in the same way as in the original proposal here. The implementation is here (and it is a view implementation, like @c42f suggested, it doesn't make a copy).

I think that it makes for a very neat interface in this (and other) table contexts. But I'm not convinced that this should be made available in Base for all arrays. The examples with e.g. complex numbers really seem like cases that we don't want to think of as a table. I think I'm generally not convinced that we should generically treat Vector{Foo} as a table just because Foo has some properties. There seem to me to be way too many examples where that would be an entirely incorrect interpretation.

@andyferris
Member Author

since it is basically a "vectorized" function of the sort we don't tend to have anymore

I thought of this as quite different to the vectorized functions that we got rid of. To me, this is more in the spirit of +.

We have overloaded + with AbstractArrays because we have decided arrays of numbers form a vector space. (Note: we don't ensure the eltype is a Number, we just map the + over all the elements and hope it works). This is a useful enough operation (Julia's linear algebra support) that we tolerate this even in the presence of stuff like Vector{String} and so-on, where the operation clearly doesn't have a correct interpretation.

Upon usage, it feels very natural to use the same function to extract a column of a table by its name as to extract a cell from a row by its name. I speculate that there is possibly some simple algebra of tables/columns/rows/cell values that you could write down in a formal sense. Just like + on vector spaces, it is simply a happy coincidence that the column extraction operation is more-or-less equivalent to mapping/broadcasting/"vectorizing" the same operation (getproperty) over the elements. (I suspect that in both linear algebra and relational algebra, the simplicity of the relationship is key to its utility). In this "algebra" you end up with the beautiful symmetry table[i].name === table.name[i].
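
The symmetry can be checked today by holding the same data in both layouts. In the sketch below (variable names are illustrative), `rows` is row-oriented and `cols` is column-oriented; what the PR adds is the ability to write the right-hand form directly on the row-oriented array.

```julia
rows = [(a=1, b=true), (a=2, b=false), (a=3, b=true)]   # array of NamedTuples
cols = (a=[1, 2, 3], b=[true, false, true])              # NamedTuple of arrays

# Row store: index first, then property access.
# Column store: property access first, then index.
symmetric = all(rows[i].a === cols.a[i] && rows[i].b === cols.b[i]
                for i in eachindex(rows))
```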

Just to state the obvious though, an array of NamedTuples should have the same interface.

To me, this is key, to make it an interface different implementations can share.

And yes, we can introduce more efficient implementations (along the lines of StructArray / TypedTables.Table), but since Array has no fields it seemed nice and convenient to experiment with the abstract interface here first...

It encourages people to know and care about the internals of data-types

Gosh, this really wasn't the intention... if you should not mess with the fields of one element, you definitely should not mess with the fields of all the elements!

Again, we don't restrict + at the method level for only arrays suitable for linear algebra. The example with Complex was just a cutesy consequence of the PR. (The fact that it "worked" is a bit like the way matrix multiplication of arrays of strings sometimes "works").

@c42f
Member

c42f commented Jan 9, 2019

The symmetry of table[i].name === table.name[i] is striking and beautiful (I really appreciated this while reviewing the TypedTables documentation). A formal connection would be super interesting too though I guess the connection to the original relational algebra is tenuous because a relation is a set rather than something which is indexable (is there a standard mathematical extension which mirrors the conventions used in practical database implementations?).

My gut feeling so far about this proposal for getproperty is that it's too specific to data analysis to have in the AbstractArray interface, and it would be better to agree on an AbstractTable interface. (But maybe we could have AbstractTable <: AbstractArray like you do in TypedTables?) This means we wouldn't have AbstractArray{<:NamedTuple} <: AbstractTable. But maybe that's not a problem if we could customize the container that iterators are materialized into in broadcast and list comprehensions?

@andyferris
Member Author

andyferris commented Jan 9, 2019

My gut feeling so far about this proposal for getproperty is that it's too specific to data analysis to have in the AbstractArray interface, and it would be better to agree on an AbstractTable interface.

I would like to address this - it is a very reasonable concern and gets to the crux of why I wanted to submit this PR and open a discussion.

A little while after starting Julia, I was wondering: why do we pile all this linear algebra stuff on top of AbstractArray? We have a flexible type system and free abstractions, I thought, so why not have a separate part of the type tree (maybe a subtype of arrays, maybe not) that behaves as linear algebra vectors and matrices? That is, why not simply leave arrays with a more bare-bones interface and have a Tensor wrapper or AbstractTensor interface (this is what I called it in my head) to enable the linear algebra extensions?

These days, I realize the joy and productivity of Julia has something to do with how rich the interface for arrays (and other basic types) is. You can use really simple syntax to define a container like an Array, and do heaps with it. Multiple dispatch means an external library, like LinearAlgebra, can pile on a variety of new "verbs" (functions) to do linear algebra with arrays (notably, the verbs can also work with other, non-array data, if you want). To tie it together, LinearAlgebra only really relies on a very slight amount of "type piracy" with + and * to make it feel totally built-in, rather than like an external library.

Now, to address being "specific to data analysis" - I actually feel this is a very strategic target for us as a technical language, equally important as linear algebra. Comparing against Python, for example, it's my opinion that AbstractArray and LinearAlgebra completely dominate numpy in terms of ease, flexibility and performance. This addresses a large fraction of the "technical computing" audience but still leaves data analysis. We have some great packages, but to succeed I'd love us to similarly completely dominate pandas. Ideally, it needs to be compelling enough to convince users of pandas, R, etc. to switch - with the kind of jump in productivity and performance that convinced MATLAB and numpy users to learn a brand-new language for doing scientific computing.

Thus, I'm imagining a future where it's ridiculously easy to create, access and manipulate a table. As easy as linear algebra is now. Really, for the strategic direction I'd absolutely have to bow to Jeff, Stefan, Viral and so on, but to me it seems that it could potentially be worthwhile to have a "linear algebra" level of integration between arrays and data analysis. (Whether or not that is precisely what is suggested in this PR is a different story, of course, but I saw the getproperty overload here as being analogous to LinearAlgebra taking liberties with + and * - most other data analysis functionality can be provided orthogonally to Base, and in my observation the fields/properties of user-defined arrays are rarely used as an external interface).

@c42f
Member

c42f commented Jan 9, 2019

Thus, I'm imagining a future where it's ridiculously easy to create, access and manipulate a table

Yes, exactly! What you've got here addresses (in prototype) the access case when you happen to have an array of named tuples. On the other hand, if we made it so that common constructions like comprehensions of named tuples could somehow return a Table <: AbstractArray{<:NamedTuple}, rather than a normal Array, this seems just as easy to use, and more targeted in the use of getproperty.

But for comprehensions that would mean breaking compatibility in returning something which is not an Array.

I'm not sure whether it's been done to death in data circles, but is it also worth mentioning the availability of the { ... } syntax, yet again, as a syntactic construct for creating and manipulating data?

@andyferris
Member Author

Keep in mind that row-based storage of tables is still valid and entirely useful (and not to be treated as second-class), and should definitely follow the same interface as columnar storage, even if/when columnar storage makes it to Base.

IMO the idea of having a RowTable wrapper for Array{<:NamedTuple} to gift it a tabular interface sucks for exactly the same reason my earlier thoughts about a Tensor wrapper for Array{<:Number} sucked for giving it a linear algebra interface.

Like Jeff said, "Just to state the obvious though, an array of NamedTuples should have the same interface." If you look at Tables.jl however it seems that this interface currently is getproperty for column access (unless we introduce something like $ and get Tables.jl to use that instead of . - it's uglier but currently table$:name is fully available for use by packages. Unfortunately, maintaining the symmetry of (table$:name)[i] === table[i]$:name would mean field access of rows now gets uglier too!).

is it also worth mentioning the availability of the { ... } syntax, yet again, as a syntactic construct for creating and manipulating data?

I wonder if this would be suitable for the columnar-storage version of collect? That is, [(a=f(x), b=g(x)) for x in X] makes a row-based table and {(a=f(x), b=g(x)) for x in X} makes a column-based table? Another option is that it becomes the syntax sugar for column-based table construction, {a=[...], b=[...]}. Maybe it can be both simultaneously? (For those reading this, there has been a very long history of suggestions for usage of {...}, notably as a shortcut for Tuple{...} amongst many other things.)

@piever
Contributor

piever commented Jan 10, 2019

Just to state the obvious though, an array of NamedTuples should have the same interface.

This is an interesting point, to find a common interface I think there are three options:

  1. This PR: the column interface is getproperty and we overload it for array of NamedTuple
  2. Your previous proposal of adding a new syntax for column access, for example df$a
  3. Keep two distinct concepts of table: a row container (like Array{NamedTuple}, StructArrays, IndexedTable, TypedTables.Table) or a column container (NamedTuple of arrays, I think DataFrame). The distinction in my mind is whether getindex returns a column or a row. We could then use a simple infix or postfix operator to go from one to the other. Here I'm using \prime just as a placeholder for this operator; I'm not sure what the optimal choice would be. So if v is an Array of NamedTuple, v′ should probably return a NamedTuple of lazy arrays, v′.a would return a lazy array corresponding to the a column, and one could also do v′[3:5] to select columns 3 to 5. For an IndexedTable, this would return columns(t), a NamedTuple of arrays (we can do it eagerly here because it doesn't allocate).

UPDATE: rewrote the example with a postfix operator as I think it works better with getproperty syntactically

@davidanthoff
Contributor

Just for the record: my current thinking about {} is that it should not be used for anything in base or the language, but left "free" so that macros can use it in their DSLs. I think there is something very powerful about not giving meaning to every syntax in base, and thereby leaving room for macros to come up with elegant syntax for their domains that doesn't conflict with something in base.

@stevengj
Member

stevengj commented Jan 10, 2019

Another possible syntax would be something like table[.a] (which currently fails to parse), and have it lower to Base.refproperty(table, :a) or similar.

I would tend to prefer something where the syntax indicates the type of operation, as opposed to table.a where you don't know until compile-time what is happening. Having it syntactically apparent gives us the options to (a) fuse it with dot calls in a broadcast and (b) change it to use a view in @views.

@StefanKarpinski
Sponsor Member

Ooo! I really like that syntax. table[.a] is quite nice.

@JeffBezanson
Sponsor Member

We could potentially make .a mean x->map(_.a, x), giving the syntax .a(table). A bit fiddly perhaps but maybe could work.

@davidanthoff
Contributor

One thing to keep in mind is auto complete. It would be nice if the proposal could at some point lead to a good auto complete story in IDEs. I do think that means something like .a(table) would not be ideal; I think in general the table variable should probably come first.

@JeffBezanson
Sponsor Member

Maybe a good syntax is just something like cols(table).a. You just need to say you're viewing it as a column container for purposes of accessing a property. That could be defined for normal Vectors.
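
One way `cols(table).a` could be made to work for ordinary Vectors is sketched below. The wrapper type `Cols` and the `propertynames` method are assumptions for illustration, not an existing API; the comment above only proposes the spelling `cols(table).a`.

```julia
# Hypothetical wrapper: view a vector of rows as a container of columns.
struct Cols{A<:AbstractArray}
    rows::A
end
cols(A::AbstractArray) = Cols(A)

# Use getfield internally, because getproperty is overloaded for the wrapper.
Base.getproperty(c::Cols, name::Symbol) =
    map(r -> getproperty(r, name), getfield(c, :rows))

# Report the column names so that tab completion has something to offer.
Base.propertynames(c::Cols) = fieldnames(eltype(getfield(c, :rows)))

table = [(a=1, b=true), (a=2, b=false), (a=3, b=true)]

col_a = cols(table).a
colnames = propertynames(cols(table))
```

Because the table variable comes first and `propertynames` is defined, this spelling is also compatible with the auto complete concern raised earlier in the thread.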

@c42f
Member

c42f commented Jan 11, 2019

On the other hand there's something incredibly appealing and natural about the symmetry of table[i].col == table.col[i] and this already works and makes sense for the various Table types which exist in the ecosystem. If we're willing to accept cols(array).a, we may as well write it RowTable(array).a, for some RowTable wrapper type which presents an AbstractTable interface to row storage. I don't think this is about viewing a vector as columns, but rather creating a Table from an Array. The core issue of this PR being whether an Array should be a Table automatically, no wrappers required.

Anyway, if the . syntax won't fly, can we at least have a syntax which preserves the symmetry? @ appears to be kind of available, as table@col doesn't currently parse (though I don't exactly understand why) and looks pretty nice:

table[i]@col == table@col[i]

@c42f
Member

c42f commented Jan 11, 2019

we may as well write it RowTable(array).a, for some RowTable

Or better, simply table(array).a?

@c42f
Member

c42f commented Jan 11, 2019

Perhaps using getproperty for AbstractArray would be less objectionable if we had actual syntactic conventions for public vs private field access. Right now, there's just no way to know from syntax alone whether you're dipping into the internals or using the public API for a type. This is not actually a good situation for maintaining large codebases, but we've been largely ignoring it.

This line of thought leads to the following possible solution:

  • Create a very simple way for types to opt out of having the getfield fallback for getproperty on private fields. For example, an underscore prefix convention for private field names would allow the default getproperty to infer which ones should be left out (unfortunately breaking, though I've very rarely seen an underscore prefix used in practice).
  • For implementers, add a syntax for getfield for internal field access. value@field could be that syntax, if I'm not mistaken in it being available. It could even lower to getfield(value, :_field).
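
The opt-out described in the first bullet can be approximated per type in current Julia. The sketch below assumes the underscore-prefix convention from that bullet; the type `Account` and its fields are made up for illustration.

```julia
struct Account
    owner::String
    _balance::Int   # underscore prefix marks the field as private (assumed convention)
end

# Public property access refuses underscore-prefixed names.
function Base.getproperty(acc::Account, name::Symbol)
    if startswith(String(name), "_")
        throw(ArgumentError("`$name` is private; use getfield inside the implementation"))
    end
    return getfield(acc, name)
end

# Only advertise the public names.
Base.propertynames(::Account) =
    Tuple(n for n in fieldnames(Account) if !startswith(String(n), "_"))

acc = Account("alice", 100)
public_owner = acc.owner                 # allowed: public property
internal = getfield(acc, :_balance)      # implementers go through getfield
```

Here `getfield(acc, :_balance)` plays the role that the proposed `value@field` syntax would play for implementers.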

If we could solve the private vs public interface problem at the syntactic level, we could much more happily overload getproperty for AbstractArray. It would also be ideal if we did this in a way which gives the most information to linters; having it be syntactic would surely help with that.

@stevengj
Member

stevengj commented Jan 11, 2019

@c42f, the issue to me is not public vs. private, it is broadcasted vs. non-broadcasted. Even if x.foo syntactically implied getproperty on a "public" property foo, you still wouldn't be able to distinguish syntactically (i.e., without knowing types & dispatch) between table.column (field access broadcasted over some array-like collection) and something like factorization.U or pythonobject.member where there is no collection.

@andyferris
Member Author

andyferris commented Jan 11, 2019

Exactly - I actually expected that people would think it's crazy that all AbstractArrays might be forced into using the same getproperty overload, making it hard to access the "real" fields of a SparseArray, or the strange interaction with factorizations that Steven just pointed out, or whatever.

This PR should not change the interface you use on the elements (or what constitutes good practice for interacting with the elements from the perspective of software maintainability).

@andyferris
Member Author

Regarding .a.

table[.a] looks similar to table[:a], which already is relatively normal for dataframes. Just wondering what might be gained by the new syntax.

We could potentially make .a mean x->map(_.a, x), giving the syntax .a(table). A bit fiddly perhaps but maybe could work.

Haha this reminded me of a crazy experiment I did once. It's very interesting to me that making Symbol callable can be quite useful for tabular operations.

(s::Symbol)(x) = getproperty(x, s)

This gives things like

map(:a, table) # extract column `a` from `table`
:a.(table) # like above, with broadcast
count(:a, table) # count how many `true`s in column `a` of `table`
filter(:a, table) # filter rows of table where row.a == true

# from SplitApplyCombine
group(:a, :b, table) # group table.b with grouping keys from table.a
innerjoin(:a, :b, table1, table2) # join tables on `table1.a == table2.b`

@yurivish
Contributor

yurivish commented Jan 11, 2019

Callable symbols are an idiom in Clojure (where they're called keywords):

Keywords implement IFn for invoke() of one argument (a map) with an optional second argument (a default value).

(def population {:zombies 2700, :humans 9})
(:zombies population)
;=> 2700

@c42f
Member

c42f commented Jan 11, 2019

This PR should not change the interface you use on the elements

Of course not. But looking at the reasons for not using getproperty expressed in the thread, one of the main early objections was the worry that it encouraged private field access. So thinking about that seems relevant.

@JeffBezanson
Sponsor Member

JeffBezanson commented Jan 11, 2019

If we're willing to accept cols(array).a, we may as well write it RowTable(array).a, for some RowTable wrapper type

Not entirely --- the data ecosystem already uses columns and rows functions in this way, and you might want to define columns to return various different types. In some cases it might just return its argument.

@stevengj
Member

table[.a] looks similar to table[:a], which already is relatively normal for dataframes. Just wondering what might be gained by the new syntax.

The ability to fuse with dot calls or transform to a view with @views.

@c42f
Member

c42f commented Jan 11, 2019

Not entirely --- the data ecosystem already uses columns and rows functions in this way, and you might want to define cols to return various different types. In some cases it might just return its argument.

Very true, which is why I suggested a function called table to make a "table like thing" from which it should be possible to get (a view of) the columns, regardless of whether they're stored column or row-wise.

@tkf
Member

tkf commented Jan 11, 2019

As a variation on @stevengj's idea, and sticking with the idea that what we need is a broadcasted version of getproperty (i.e., I want to insert one more . somewhere), how about:

table :: Vector{NamedTuple{(:average, :SD),Tuple{Float64,Float64}}}
table.(.average)         # get :average column
table.(.SD ./ .average)  # compute CV (w/o materializing :average and :SD columns)
table[.average .> 1]     # Boolean indexing (w/o materializing the Boolean vector)

(They throw parse error ATM.)
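
For comparison, the three proposed expressions correspond roughly to the following code in current Julia. This is a sketch with made-up data; the eager `map` and `filter` calls stand in for whatever fused lowering the new syntax would get.

```julia
table = [(average=2.0, SD=0.5), (average=0.5, SD=0.1), (average=4.0, SD=1.0)]

# table.(.average): extract the :average column.
averages = map(r -> r.average, table)

# table.(.SD ./ .average): compute CV row by row, without building the columns first.
cv = map(r -> r.SD / r.average, table)

# table[.average .> 1]: keep rows satisfying the predicate, without a Boolean vector.
selected = filter(r -> r.average > 1, table)
```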

table.(.average) is three more characters than table.average but being able to fuse row-wise operation and getproperty sounds appealing to me. I'm thinking they return concrete vectors like Vector{Float64} (rather than a view) so that it can be more naturally fused with standard Julia dot syntax like table.(.x .+ .y) ./ z. Once #19198 is resolved, we'd also have a lazy non-materializing/view version of column table.(.average) (e.g., @lazy table.(.average)).

I suggested table.(.average) instead of table[.average] to get a column since non-materializing Boolean indexing like table[.average .> 1] sounds useful too. This doesn't fit in current broadcasting mechanism, but I imagine a similar approach that lowers the expression to a lazy object and then materializes it would work.

(I expect there are many pandas users who don't like prefixing every column name with dataframe name. I guess it would be appealing for such users.)

@c42f
Member

c42f commented Jan 12, 2019

@tkf Now we're talking! This syntax seems tricky and "inside out" compared to the usual broadcasting of a function over a collection. But it's incredibly appealing if we can have a broadcast-driven syntax which scopes the fields to their tabular context.

In particular, can this or something similar express all the unary tabular operations of the relational algebra? It's kind of close.

@tkf
Member

tkf commented Jan 12, 2019

In particular, can this or something similar express all the unary tabular operations of the relational algebra? It's kind of close.

I'm a noob when it comes to the relational algebra, but reading Wikipedia pages, how about

Projection: unique(table.((Age = .Age, Weight = .Weight))) for, using Julia 1.0 syntax (and ignoring allocation ATM),

unique((row -> (Age = row.Age, Weight = row.Weight)).(table))

Selection: table[.Age == .Weight] for

table[(row -> row.Age == row.Weight).(table)]

Rename... seems to require a function to do it but can already be done? I imagine you can do it by using a function of the form rename(row::NamedTuple, from => to) like rename.(table, Ref(:Name => :EmployeeName)).
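
The Julia 1.0 forms of these three operations can be run as follows. The sample data is made up, and `rename` is a hypothetical helper written to match the signature suggested above; it is not an existing Base function.

```julia
table = [(Name="Ann", Age=30, Weight=30),
         (Name="Bob", Age=40, Weight=70),
         (Name="Cy",  Age=30, Weight=30)]

# Projection: keep two columns and drop duplicate rows.
projected = unique((row -> (Age=row.Age, Weight=row.Weight)).(table))

# Selection: keep rows satisfying a predicate.
selected = table[(row -> row.Age == row.Weight).(table)]

# Rename (hypothetical helper): rebuild the NamedTuple with one key replaced.
function rename(row::NamedTuple, p::Pair{Symbol,Symbol})
    from, to = p
    newnames = map(n -> n == from ? to : n, keys(row))
    return NamedTuple{newnames}(values(row))
end

renamed = rename.(table, Ref(:Name => :EmployeeName))
```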

This syntax seems tricky and "inside out" compared to the usual broadcasting of a function over a collection.

Yeah, to be honest, I thought this was a bit of an odd syntax too. So I wondered if it was possible to "derive" this syntax from smaller sub-rules. Here is something I came up with:

  • An identifier prefixed by a dot like .a is lowered to a "property Lens". For now, a lens l = .a is something that supports get(x, l) to do x.a. (See the full lens law in the link).

  • "Define" (x::Any)(l::Lens) = get(x, l). This is impossible until #14919 (v0.5 "cannot add methods to an abstract type" when overriding call) is resolved (hence the quotes). But, supposing it's possible, we have:

    l = .a
    @assert x(l) === x.a
  • Make container-of-callables broadcastable (i.e., #22129, "Broadcasting tuple functions") with the syntax container.(args...) which is equivalent to ((f, args...) -> f(args...)).(container, args...) in Julia 1.0.

From those rules, I think we have table.(.average). But this is still not enough for table.(.SD ./ .average). So here is one more rule:

  • A maximal expression on the right-hand side containing at least one identifier prefixed with a dot and no dot call (see footnote 1) is lowered to a "getter"; i.e., something you can pass as the second argument of get. For example:

    g = .SD / .average
    @assert row(g) === get(row, g) === row.SD / row.average

This lets us do table.(.SD / .average) (although ./ should be / instead).

Note that table.(.average) only uses "half" of the lens functionality. It'd be nice if we could combine the syntax for #11902 / #21912 at the same time. So, how about lowering

y = x(.a = value)

to

l = .a
s = Setter(l, value)
y = x(s)

which is defined to be equivalent to

l = .a
y = set(x, l, value)

i.e., replace the field a of x with value and create a new object y.

This can be handy for updating columns like table.(.Name = titlecase(.Name)).
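
The get/set pair above can be emulated for NamedTuple rows in current Julia. In the sketch below, `PropertyLens`, `lensget` and `lensset` are hypothetical names chosen to avoid clashing with Base.get; they only illustrate the non-mutating update that the proposed `x(.a = value)` would lower to.

```julia
# Hypothetical property lens; the field name is carried in the type parameter.
struct PropertyLens{name} end

lensget(x, ::PropertyLens{name}) where {name} = getproperty(x, name)

# Non-mutating update for NamedTuples: build a new value with one field replaced.
lensset(x::NamedTuple, ::PropertyLens{name}, value) where {name} =
    merge(x, NamedTuple{(name,)}((value,)))

l = PropertyLens{:Name}()
table = [(Name="ann", Age=30), (Name="bob", Age=40)]

# Rough equivalent of the proposed `table.(.Name = titlecase(.Name))`.
updated = map(row -> lensset(row, l, titlecase(lensget(row, l))), table)
```

The original `table` is left untouched, which matches the "create a new object y" semantics described above.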

Some more thoughts:

Footnotes

  1. This is an ugly part... but required for the whole table.(.average) and table.(.SD / .average) to not get lowered to a getter.

@bramtayl
Contributor

Split-apply-combine already has invert(table).a, which makes sense to me. Swap levels of organization, then do the access. table.a doesn't seem intuitive to me: getproperty should access items, not sub-items

@bramtayl
Contributor

Off-topic: I really like the semantics of invert. invert(([1, 2], ["a", "b"])) makes a lot more sense to me than zip
