
Restrictions on column labels #7

Closed
TomAugspurger opened this issue May 19, 2020 · 17 comments

@TomAugspurger

One of the uncontroversial points from #2 is that DataFrames have column labels / names. I'd like to discuss two specific points on this before merging the results into that issue.

  1. What type can the column labels be? Should they be limited to just strings?
  2. Do we require uniqueness of column labels?

I'm a bit unsure whether these are getting too far into the implementation side of things. Should we just take no stance on either of these?


My responses:

  1. We should probably allow labels to be any type.

Operations like crosstab / pivot place a column from the input dataframe into the column labels of the output.

We'll need to be careful with how this interacts with the indexing API, since a label like the tuple ('my', 'label') might introduce ambiguities (e.g. if the full list of labels is ['my', 'label', ('my', 'label')]).
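To illustrate the ambiguity, here is a minimal pure-Python sketch, not tied to any particular dataframe API: a tuple passed to an indexer could name either a single tuple-labeled column or a list of two string-labeled columns.

```python
# Suppose the full list of labels is ['my', 'label', ('my', 'label')].
labels = ["my", "label", ("my", "label")]
selector = ("my", "label")

# Interpretation 1: the selector names the single column ('my', 'label').
as_single_label = selector in labels
# Interpretation 2: the selector is a sequence of two labels, 'my' and 'label'.
as_list_of_labels = all(k in labels for k in selector)

print(as_single_label, as_list_of_labels)  # True True -> both readings are valid
```

Both interpretations hold at once, which is exactly why the indexing API would need a rule (or a restriction on label types) to disambiguate.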

Is it reasonable to require each label to be hashable? Pandas requires this, to facilitate lookup in a hashtable.

  2. We cannot require uniqueness.

Dataframes are commonly used to wrangle real-world data into shape, and real-world data is messy. If an implementation wants to ensure uniqueness (perhaps on a per-object basis) then it can offer that separately. But the API should at least allow for duplicates.

@datapythonista
Member

I think those are very good points to discuss.

My preference would be:

  1. Only strings allowed
  2. Required to be unique

The reason is the same in both cases: I think there is a trade-off between complexity of the standard/implementation and flexibility of the tool. From my own experience with the kind of data and projects I've worked on, the increase in complexity is not worth it.

I do see use cases, for example:

  • The columns represent years, and an int is a more natural way to keep the labels
  • Opening a csv file with duplicate column names

But even if cases like these become a bit trickier, I still think that being able to assume string types and uniqueness will simplify enough things that it's worth it.

Of course, we can leave this out of the standard, and let dataframe implementations decide. But I think consumers of dataframes will also face an, IMHO, unreasonable increase in complexity. Think of df[col] raising a "column not unique" error, or returning a two-column dataframe. Both cases will require a decent amount of extra code to handle robustly in the consumer. And df[col] (or its equivalent in the standard) is probably the most common API call they'll make.

For the types, it's not my preference, but I would be ok accepting a small subset of types (e.g. bool, int, maybe datetime.date). But I wouldn't allow things like float (df[3.141592] is nonsense to me; I can expand on why if needed), or tuples, and certainly not mutable objects.

@maartenbreddels

  • Only strings allowed
  • Required to be unique

I agree. I think that limitation is fair, not too limiting, and makes it much easier as a user and library author.

If you think about going from dataframes to other libraries, e.g. visualization, where labels can appear in a plot, or to JSON-like structures, where they can be keys, it's going to be messy if we don't require that.

@TomAugspurger
Author

TomAugspurger commented May 28, 2020

and makes it much easier as a user

Unless you want to read a dataset that has duplicate labels :) Though perhaps we just require that any IO routine that reads from a store that allows duplicates (e.g. read_csv) includes a parameter to mangle duplicate labels? That seems like a decent tradeoff: deal with all the messiness of real-world data at the boundary.
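A minimal sketch of what such mangling at the IO boundary might look like (the helper name is hypothetical; the '.N' suffix style mirrors what pandas' mangle_dupe_cols historically produced):

```python
from collections import Counter

def mangle_duplicates(labels):
    """Append '.1', '.2', ... to repeated labels so the result is unique.

    Simplified sketch: it does not guard against a pre-existing 'a.1'
    colliding with a generated one.
    """
    seen = Counter()
    out = []
    for name in labels:
        if seen[name]:
            out.append(f"{name}.{seen[name]}")
        else:
            out.append(name)
        seen[name] += 1
    return out

print(mangle_duplicates(["a", "b", "a", "a"]))  # ['a', 'b', 'a.1', 'a.2']
```

Doing this inside read_csv-style routines keeps the messiness of real-world data at the boundary, as suggested above.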

@amueller

I'm not sure I understand how the pivot-table point was addressed. Does that mean that pivoting will convert a value to string for the column name?

@TomAugspurger
Author

I think the string-only column names hasn't been addressed / discussed much.

@datapythonista
Member

I'm not sure I understand how the pivot-table point was addressed.

That's a good point. My opinion:

I don't think pivoting should be part of the standard. Surely a nice feature for some users, but I'm personally fine with different implementations. In every package, or provided as third-party extensions.

I think it should be quite easy to add a wrapper to a dataframe that maps any (hashable) value to a string, and shows those values to the user instead (so the user doesn't see the actual string values). So, even if the underlying dataframe has this restriction, implementations or third-party packages can provide something "fancier" to the user. It's surely somewhat tricky, but moving complexity out of the core dataframe standard, into implementations and third-party packages, seems like a good deal to me.

@rgommers
Member

+1 as well for string-only and uniqueness.

I don't think pivoting should be part of the standard.

That is a separate decision that is unnecessary to mix in here.

Does that mean that pivoting will convert a value to string for the column name?

That seems like a decent solution. Any implementation can ensure the resulting strings are unique (e.g. append '_0', '_1' in case of duplicates).
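A sketch of that suggestion (hypothetical helper name): convert pivoted values to strings, then append '_0', '_1', ... only where the stringified names collide.

```python
from collections import Counter

def stringify_unique(values):
    """Convert values to strings, then uniquify duplicates with '_0', '_1', ...

    Sketch of the pivot behavior suggested above; labels that are already
    unique after stringification are left untouched.
    """
    names = [str(v) for v in values]
    counts = Counter(names)
    seen = Counter()
    out = []
    for n in names:
        if counts[n] > 1:
            out.append(f"{n}_{seen[n]}")
            seen[n] += 1
        else:
            out.append(n)
    return out

# The int 2019 and the string "2019" stringify identically, so both get suffixes.
print(stringify_unique([2019, "2019", 2020]))  # ['2019_0', '2019_1', '2020']
```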

@TomAugspurger
Author

It seems like the preference here is to require that column labels must be

  1. unique
  2. strings

We'll want to specify / provide guidance on when and how to mangle duplicate columns should they arise (e.g. when reading from a CSV file). And we'll want to specify what should happen when a dataframe operation introduces duplicate labels (each of these should probably raise):

  • concat([df1, df2], axis="columns") with duplicate labels between df1 and df2
  • non-unique indexers like df.loc[:, ['A', 'A']]
  • pd.DataFrame(columns=['A', 'a']).rename(columns=str.upper)
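A minimal sketch of the "raise on duplicates" behavior these operations could share (pure Python; the helper name is hypothetical):

```python
def check_unique(labels):
    """Raise ValueError if the combined labels are not unique.

    Sketch of the proposed check that concat-, loc-, and rename-style
    operations would run on their resulting column labels.
    """
    seen = set()
    for name in labels:
        if name in seen:
            raise ValueError(f"duplicate column label: {name!r}")
        seen.add(name)

# e.g. concat([df1, df2], axis="columns") with overlapping labels:
df1_cols = ["A", "B"]
df2_cols = ["A"]
try:
    check_unique(df1_cols + df2_cols)
except ValueError as e:
    print(e)  # duplicate column label: 'A'
```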

@tdimitri

tdimitri commented Jun 13, 2020

Tom, I appreciate you trying to move things forward. I also agree that column labels should be unique and strings. This also makes saving a DataFrame/Dataset/Table for cross platform much easier.

What happens with a column name collision? I suggest we have a kwarg (perhaps 'collide'?) that determines what happens. By default we might append '_1', '_2', '_3', etc. for each collision.

If the user specifies:

  • collide=None (the default): apply the configured collision resolution (like appending a number); this can be configured
  • collide=False: raise an error if a column name collides
  • collide=True: override any defaults and allow collisions
  • collide='_x' (or any string): append '_x' to the column name upon collision
  • collide=func: call the function when a collision occurs
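A sketch of how the proposed collide options could dispatch (all names hypothetical; this is one reading of the proposal, not a settled API):

```python
def resolve_collision(name, existing, collide=None):
    """Resolve a column-name collision per the proposed 'collide' kwarg.

    collide=None  -> append a counter suffix ('_1', '_2', ...)
    collide=False -> raise on collision
    collide=True  -> allow the duplicate through unchanged
    collide=str   -> append that string
    collide=func  -> delegate to the callable
    """
    if name not in existing:
        return name
    if collide is False:
        raise ValueError(f"column name collision: {name!r}")
    if collide is True:
        return name
    if callable(collide):
        return collide(name, existing)
    if isinstance(collide, str):
        return name + collide
    # collide=None: numbered suffix, skipping suffixes already taken
    i = 1
    while f"{name}_{i}" in existing:
        i += 1
    return f"{name}_{i}"

print(resolve_collision("price", {"price"}))                # price_1
print(resolve_collision("price", {"price"}, collide="_x"))  # price_x
```

The checks are ordered so that the bool sentinels are handled before the string and callable cases, since True/False would otherwise be shadowed.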

@TomAugspurger
Author

What happens with a column name collision? I suggest we have a kwarg (perhaps 'collide' ?) that determines what happens.

At least for data IO methods like read_csv we'll want to provide that (pandas calls it mangle_dupe_cols).

I'm less certain of the need for it in methods that might introduce duplicate column labels in the course of normal operation, like concat. If we're trying to minimize the surface area of the API, a collide argument in pd.concat would be equivalent to handling the duplicates prior to concatenating, pd.concat([df1.rename(...), df2.rename(...)]). This is less user-friendly, but is easier on us.

@jbrockmendel
Contributor

Just checking, requiring unique all-string columns means "to satisfy the spec, you must support unique all-string columns", not "to satisfy the spec, you must support only unique all-string columns", right?

@rgommers
Member

rgommers commented Jul 8, 2022

@jbrockmendel the intent was the latter (only). That seems to be preferred from both a usability and "avoid complexity" point of view. This issue is quite old, but IIRC in more recent conversations there was pretty universal agreement on this. And the interchange protocol has that requirement as well: https://data-apis.org/dataframe-protocol/latest/design_requirements.html#protocol-design-requirements

@jbrockmendel
Contributor

so to be spec-compliant pandas would have to deprecate support for non-unique columns and non-string columns?

@rgommers
Member

rgommers commented Jul 8, 2022

For context: there is a significant tension between preserving backwards compatibility in libraries and not wanting to simply standardize the way things work now, for behavior/features that many maintainers don't like. As a result, library maintainers are not planning to implement the whole standard in their main namespace (or perhaps with some kind of switch, see gh-79).

I'd expect this to be one of those things where Pandas would either not want to deprecate this at all, or quite slowly.

@jbrockmendel
Contributor

I expect this is one of many things I'll have to get used to, but I find this confusing.

Saying "implementation X must support Y" seems reasonable. Continuing with "and it must not support Z" seems unnecessary and counterproductive. If I apply the same reasoning to the description of arrays being contiguous, that means you're not allowed to support strided arrays, so e.g. df.iloc[::2] would have to either be disallowed or make a copy?

@rgommers
Member

rgommers commented Jul 9, 2022

Saying "implementation X must support Y" seems reasonable. Continuing with "and it must not support Z" seems unnecessary and counterproductive.

Agreed - I think there are very few cases of this though. In general, we'd expect libraries to offer a superset of functionality of what's in a standard. So that means that if a user or downstream package author restricts themselves to the standardized set of APIs and to inputs that are supported, they have portable code. And if they go beyond it, they don't.

Maybe I was wrong above. It's possible for a library to support non-unique/non-string columns, as long as the behavior is compliant for any methods/functions in the standard. Additionally, it'd be good to have sane defaults outside of that, so for example all I/O routines and other standard ways of creating dataframes would default to producing unique string names. Otherwise it's too easy to write non-portable code. But then an explicit .rename(['col1', 'col2', 42]) may be fine.

If I apply the same reasoning to the description of arrays being contiguous, that means you're not allowed to support strided arrays, so e.g. df.iloc[::2] would have to either be disallowed or make a copy?

No, that's certainly not intended. The copy is only necessary in __dataframe__, for interchange with another library.

@MarcoGorelli
Contributor

MarcoGorelli commented Dec 20, 2023

if a user or downstream package author restricts themselves to the standardized set of APIs and to inputs that are supported, they have portable code. And if they go beyond it, they don't

this has generally been the guiding principle - so pandas need not forbid non-string column names, but anyone writing something like

df = data.__dataframe_consortium_standard__()
df.assign((df.col('a') + df.col('b')).rename(1999))

can't expect the above to produce dataframe-agnostic code

I think the opening issue / question has been addressed then, so closing, but please let me know if I've misunderstood and I'll reopen.
