Convert to using only symbols for column names. #509

Merged
merged 1 commit into from
Jan 29, 2014

tshort
Contributor

@tshort tshort commented Jan 28, 2014

Obviously, this is a big change to the user interface.

I've only changed the tests and source code. If this is acceptable, I'll update the documentation.

@johnmyleswhite
Contributor

I'm on board with this conceptually. Let me review the actual code and we'll try to merge this soon.

I'm not so worried about breaking things as long as we don't update METADATA without a few days' notice.

@@ -306,4 +306,5 @@ function data(rc::RComplex)
BitArray(imag(rc.data) .== R_NA_FLOAT64))
end

-DataFrame(rl::RList) = DataFrame(map(x->data(x), rl.data), rl.attr["names"].data)
+DataFrame(rl::RList) = DataFrame(map(x->data(x), rl.data),

If we're reading in files from RDA format, we probably need to do column name cleaning before we can make symbols since there will almost certainly be columns with a . in their names.

@johnmyleswhite
Contributor

Going to take me a bit more time to fully review, but passing tests makes me think most things will work. We do need to be cautious about the RDA files, which will be full of invalid symbols.

@tshort
Contributor Author

tshort commented Jan 28, 2014

That's a good point. Note that this pull request does not force symbols to be valid Julia identifiers. For example, you can do d[symbol("hi there")] = [1:5].

julia> d = DataFrame(x = [1:5], ɸ = 4)
5x2 DataFrame
|-------|----|-----|
| Row # | :x | :ɸ |
| 1     | 1  | 4   |
| 2     | 2  | 4   |
| 3     | 3  | 4   |
| 4     | 4  | 4   |
| 5     | 5  | 4   |

julia> d[symbol("hi there")] = [2:6]
5-element Array{Int64,1}:
 2
 3
 4
 5
 6

julia> d
5x3 DataFrame
|-------|----|-----|-----------|
| Row # | :x | :ɸ | :hi there |
| 1     | 1  | 4   | 2         |
| 2     | 2  | 4   | 3         |
| 3     | 3  | 4   | 4         |
| 4     | 4  | 4   | 5         |
| 5     | 5  | 4   | 6         |

Do you know if Julia has a name-cleansing function?

@johnmyleswhite
Contributor

I don't believe that Julia has a name-cleansing function. We have a simple one in cleannames!.
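
For readers following along, here is a minimal sketch of what such name cleansing might look like. It is not the `cleannames!` implementation; `clean_name` and `clean_names` are hypothetical helpers, and the code assumes Julia 1.x, where the `symbol(...)` call used in this thread is spelled `Symbol(...)`.

```julia
# Hypothetical helpers for illustration; NOT the DataFrames `cleannames!` code.
# Assumes Julia 1.x, where the thread's `symbol(...)` is spelled `Symbol(...)`.
function clean_name(name::AbstractString)
    # Replace every character that is not a letter, digit, or underscore.
    cleaned = replace(name, r"[^\p{L}\p{N}_]" => "_")
    # An identifier cannot be empty or start with a digit, so prefix those cases.
    if isempty(cleaned) || isdigit(first(cleaned))
        cleaned = "x" * cleaned
    end
    return Symbol(cleaned)
end

# Accepts strings or symbols and returns a vector of cleaned symbols.
clean_names(names) = [clean_name(string(n)) for n in names]
```

With this sketch, an R-style name such as `"Sepal.Length"` would become `:Sepal_Length`. It deliberately ignores reserved words and duplicate names, which a real cleaner would probably need to handle.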

@johnmyleswhite
Contributor

Wrote some code to check and standardize identifiers: https://gist.github.com/johnmyleswhite/8681455

@johnmyleswhite
Contributor

This all looks good to me. If you're comfortable with it, I'm good to go.

@nalimilan
Member

I guess the main reason for this change is to improve performance, since symbols are more efficient to handle than strings? Just to be sure I understand. This indeed sounds like a good idea.

@tshort
Contributor Author

tshort commented Jan 29, 2014

John, I'm good with it.

@nalimilan, symbols have the following advantages:

  • Symbols seem a bit more Julian. For example, [:a => 33] is more common than ["a" => 33].
  • It cuts down on the noise a bit for indexing operations (fewer, less obtrusive characters).
  • It seems more compatible if Julia allows us to override ., so we can do df.colA.
  • Performance should be better (untested).

Strings have some advantages:

  • Easier operations on column names like uppercasing or operations with regular expressions. With symbols, you have to convert to strings and then back to symbols.
  • Strings are more familiar to R users.
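
The round trip mentioned in the first point can be sketched as follows. This is an illustration only, assuming Julia 1.x (`Symbol(...)` rather than the `symbol(...)` used elsewhere in this thread); the column names are made up.

```julia
# Illustrative only; the column names here are invented for the example.
colnames = [:colA, :colB, :other]

# Uppercasing: Symbol -> String -> Symbol.
upper = [Symbol(uppercase(string(n))) for n in colnames]

# Regex filtering: match against the string form, keep the original symbols.
matching = [n for n in colnames if occursin(r"^col", string(n))]
```

The extra `string`/`Symbol` conversions are the cost being weighed against the cleaner indexing syntax.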

@nalimilan
Member

@tshort Thanks for the explanation!

@johnmyleswhite
Contributor

So, I'd like to merge this as soon as possible. But it would be good to let @garborg see this change first, so that he can handle the way it interacts with his changes to readtable, which I also really like. Hopefully Sean can rebase his PR on top of these changes without a ton of work.

@johnmyleswhite
Contributor

For future reference, the thing I like about symbols is that they make it harder to generate column names at the REPL that aren't valid Julia identifiers. This will help cut down on the number of cases in which somebody creates a column name that they subsequently find too hard to work with.

Also, a few functions currently accept symbols in some places and strings in others as column names. Forcing more consistency will make things easier.

@garborg
Contributor

garborg commented Jan 29, 2014

I don't see any issues with rebasing the readtable PR.

@johnmyleswhite
Contributor

Ok. Then, @tshort, pull the switch whenever you're ready. Let the breakage begin!

@tshort
Contributor Author

tshort commented Jan 29, 2014

@johnmyleswhite, I don't think I can pull the switch to commit. Either I don't have permissions, or Github has changed, and I can't find the right button to push.

@johnmyleswhite
Contributor

You have permissions now.

tshort added a commit that referenced this pull request Jan 29, 2014
Convert to using only symbols for column names.
@tshort tshort merged commit fc1113e into JuliaData:master Jan 29, 2014
@johnmyleswhite
Contributor

Question: should we not show the : when displaying the names of each column in pretty-printed output?

@tshort
Contributor Author

tshort commented Jan 29, 2014

I was wondering that, too. I left it as is, thinking that it might
reinforce the use of symbols. I'm fine either way.

@johnmyleswhite
Contributor

Since we were showing strings without any quotes, I'd be onboard with dropping the initial colon. But let's hear what a few other folks think before making a decision.

FYI: I really like working with symbols now that I'm running this patch. No more pesky quote matching. 😄

@HarlanH
Contributor

HarlanH commented Jan 29, 2014

Just to chime in. I like this change -- I think it'll be better for users.
But it also sorta makes adding column metadata to DataFrames a higher
priority, so that the nice new print methods can use a prettyName column
name in the header. (Adding unit and domain and level of measurement and
some info about NAs would be useful too.) There's an existing github issue
around metadata somewhere... Ah #35.

@johnmyleswhite
Contributor

I'm onboard with metadata, but if we're going to add it we need to agree for sure on what we unambiguously intend to support.

I'd really like us to put out a spec for DataFrames by March 1st that contains stuff we promise to support indefinitely into the future and code that passes tests based on that spec. That way we can direct users to 0.3 for a long time to come while we make changes without fear of screwing anybody over.

@nalimilan
Member

Supporting column meta-data, in particular labels, is indeed the logical complement of restricting column names to valid symbol names.

Regarding the printing, I also think it would be better to remove the : since, once you know how it works, it does not add anything. Obviously, that's a detail anyway.

@johnmyleswhite
Contributor

Let's move discussion to #35.

simonster added a commit that referenced this pull request Jan 31, 2014
nalimilan pushed a commit that referenced this pull request May 26, 2022
Convert to using only symbols for column names.
5 participants