
Restrictions on column labels #7

Closed
TomAugspurger opened this issue May 19, 2020 · 17 comments

@TomAugspurger

One of the uncontroversial points from #2 is that DataFrames have column labels / names. I'd like to discuss two specific points on this before merging the results into that issue.

  1. What type can the column labels be? Should they be limited to just strings?
  2. Do we require uniqueness of column labels?

I'm a bit unsure whether these are getting too far into the implementation side of things. Should we just take no stance on either of these?


My responses:

  1. We should probably allow labels to be any type.

Operations like crosstab / pivot place a column from the input dataframe into the column labels of the output.

We'll need to be careful with how this interacts with the indexing API, since a label like the tuple ('my', 'label') might introduce ambiguities (e.g. if the full list of labels is ['my', 'label', ('my', 'label')]).
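To illustrate the ambiguity, here is a minimal pure-Python sketch, not tied to any particular dataframe API: a tuple passed to an indexer could name either a single tuple-labeled column or a list of two string-labeled columns.

```python
# Suppose the full list of labels is ['my', 'label', ('my', 'label')].
labels = ["my", "label", ("my", "label")]
selector = ("my", "label")

# Interpretation 1: the selector names the single column ('my', 'label').
as_single_label = selector in labels
# Interpretation 2: the selector is a sequence of two labels, 'my' and 'label'.
as_list_of_labels = all(k in labels for k in selector)

print(as_single_label, as_list_of_labels)  # True True -> both readings are valid
```

Both interpretations hold at once, which is exactly why the indexing API would need a rule (or a restriction on label types) to disambiguate.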

Is it reasonable to require each label to be hashable? Pandas requires this, to facilitate lookup in a hashtable.

  2. We cannot require uniqueness.

Dataframes are commonly used to wrangle real-world data into shape, and real-world data is messy. If an implementation wants to ensure uniqueness (perhaps on a per-object basis) then it can offer that separately. But the API should at least allow for duplicates.

@datapythonista
Member

I think those are very good points to discuss.

My preference would be:

  1. Only strings allowed
  2. Required to be unique

The reason is the same in both cases: I think there is a trade-off between complexity of the standard/implementation and flexibility of the tool. From my own experience with the kind of data and projects I've worked on, the increase in complexity is not worth it.

I do see use cases, for example:

  • The columns represent years, and an int is a more natural way to keep the labels
  • Opening a csv file with duplicate column names

But even if cases like these become a bit trickier, I still think that being able to assume string types and uniqueness will simplify enough things that it's worth it.

Of course, we can leave this out of the standard, and let dataframe implementations decide. But I think consumers of dataframes will also face an, IMHO, unreasonable increase in complexity. Think of df[col] raising a "column not unique" error, or returning a two-column dataframe. Both cases will require a decent amount of extra code to handle robustly in the consumer. And df[col] (or its equivalent in the standard) is probably the most common API call they'll make.

For the types, it's not my preference, but I would be ok accepting a small subset of types (e.g. bool, int, maybe datetime.date). But I wouldn't allow things like float (df[3.141592] is nonsense to me; I can expand on why if needed), or tuples, and certainly not mutable objects.

@maartenbreddels

  • Only strings allowed
  • Required to be unique

I agree. I think that limitation is fair, not too limiting, and makes it much easier as a user and library author.

If you think about going from dataframes to other libraries, e.g. visualization, where labels can appear in a plot, or to JSON-like structures, where they can be keys, it's going to be messy if we don't require that.

@TomAugspurger
Author

TomAugspurger commented May 28, 2020

and makes it much easier as a user

Unless you want to read a dataset that has duplicate labels :) Though perhaps we just require that any IO routine that reads from a store that allows duplicates (e.g. read_csv) includes a parameter to mangle duplicate labels? That seems like a decent tradeoff: deal with all the messiness of real-world data at the boundary.
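A minimal sketch of what such mangling at the IO boundary might look like (the helper name is hypothetical; the '.N' suffix style mirrors what pandas' mangle_dupe_cols historically produced):

```python
from collections import Counter

def mangle_duplicates(labels):
    """Append '.1', '.2', ... to repeated labels so the result is unique.

    Simplified sketch: it does not guard against a pre-existing 'a.1'
    colliding with a generated one.
    """
    seen = Counter()
    out = []
    for name in labels:
        if seen[name]:
            out.append(f"{name}.{seen[name]}")
        else:
            out.append(name)
        seen[name] += 1
    return out

print(mangle_duplicates(["a", "b", "a", "a"]))  # ['a', 'b', 'a.1', 'a.2']
```

Doing this inside read_csv-style routines keeps the messiness of real-world data at the boundary, as suggested above.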

@amueller

I'm not sure I understand how the pivot-table point was addressed. Does that mean that pivoting will convert a value to string for the column name?

@TomAugspurger
Author

I think the string-only column names hasn't been addressed / discussed much.

@datapythonista
Member

I'm not sure I understand how the pivot-table point was addressed.

That's a good point. My opinion:

I don't think pivoting should be part of the standard. Surely a nice feature for some users, but I'm personally fine with different implementations. In every package, or provided as third-party extensions.

I think it should be quite easy to add a wrapper to a dataframe that maps any (hashable) value to a string, and shows those values to the user instead (so the user doesn't see the actual string values). So, even if the underlying dataframe has this restriction, implementations or third-party packages can provide something "fancier" to the user. It's surely somewhat tricky, but moving complexity out of the core dataframe standard, into implementations and third-party packages, seems like a good deal to me.

@rgommers
Member

+1 as well for string-only and uniqueness.

I don't think pivoting should be part of the standard.

That is a separate decision that is unnecessary to mix in here.

Does that mean that pivoting will convert a value to string for the column name?

That seems like a decent solution. Any implementation can ensure the resulting strings are unique (e.g. append '_0', '_1' in case of duplicates).
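A sketch of that suggestion (hypothetical helper name): convert pivoted values to strings, then append '_0', '_1', ... only where the stringified names collide.

```python
from collections import Counter

def stringify_unique(values):
    """Convert values to strings, then uniquify duplicates with '_0', '_1', ...

    Sketch of the pivot behavior suggested above; labels that are already
    unique after stringification are left untouched.
    """
    names = [str(v) for v in values]
    counts = Counter(names)
    seen = Counter()
    out = []
    for n in names:
        if counts[n] > 1:
            out.append(f"{n}_{seen[n]}")
            seen[n] += 1
        else:
            out.append(n)
    return out

# The int 2019 and the string "2019" stringify identically, so both get suffixes.
print(stringify_unique([2019, "2019", 2020]))  # ['2019_0', '2019_1', '2020']
```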

@TomAugspurger
Author

It seems like the preference here is to require that column labels must be

  1. unique
  2. strings

We'll want to specify / provide guidance on when and how to mangle duplicate columns should they arise (e.g. when reading from a CSV file). And we'll want to specify what should happen when a dataframe operation introduces duplicate labels (each of these should probably raise):

  • concat([df1, df2], axis="columns") with duplicate labels between df1 and df2
  • non-unique indexers like df.loc[:, ['A', 'A']]
  • pd.DataFrame(columns=['A', 'a']).rename(columns=str.upper)
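A minimal sketch of the "raise on duplicates" behavior these operations could share (pure Python; the helper name is hypothetical):

```python
def check_unique(labels):
    """Raise ValueError if the combined labels are not unique.

    Sketch of the proposed check that concat-, loc-, and rename-style
    operations would run on their resulting column labels.
    """
    seen = set()
    for name in labels:
        if name in seen:
            raise ValueError(f"duplicate column label: {name!r}")
        seen.add(name)

# e.g. concat([df1, df2], axis="columns") with overlapping labels:
df1_cols = ["A", "B"]
df2_cols = ["A"]
try:
    check_unique(df1_cols + df2_cols)
except ValueError as e:
    print(e)  # duplicate column label: 'A'
```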

@tdimitri

tdimitri commented Jun 13, 2020

Tom, I appreciate you trying to move things forward. I also agree that column labels should be unique and strings. This also makes saving a DataFrame/Dataset/Table for cross platform much easier.

What happens with a column name collision? I suggest we have a kwarg (perhaps 'collide'?) that determines what happens. By default we might append '_1', '_2', '_3', etc. for each collision.

If the user specifies:

  • collide=None (the default): apply the configured collision resolution (like appending a number); this can be configured
  • collide=False: raise an error if a column name collides
  • collide=True: override any defaults and allow collisions
  • collide='_x' (or any string): append '_x' to the column name upon collision
  • collide=func: call the function when a collision occurs
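A sketch of how the proposed collide options could dispatch (all names hypothetical; this is one reading of the proposal, not a settled API):

```python
def resolve_collision(name, existing, collide=None):
    """Resolve a column-name collision per the proposed 'collide' kwarg.

    collide=None  -> append a counter suffix ('_1', '_2', ...)
    collide=False -> raise on collision
    collide=True  -> allow the duplicate through unchanged
    collide=str   -> append that string
    collide=func  -> delegate to the callable
    """
    if name not in existing:
        return name
    if collide is False:
        raise ValueError(f"column name collision: {name!r}")
    if collide is True:
        return name
    if callable(collide):
        return collide(name, existing)
    if isinstance(collide, str):
        return name + collide
    # collide=None: numbered suffix, skipping suffixes already taken
    i = 1
    while f"{name}_{i}" in existing:
        i += 1
    return f"{name}_{i}"

print(resolve_collision("price", {"price"}))                # price_1
print(resolve_collision("price", {"price"}, collide="_x"))  # price_x
```

The checks are ordered so that the bool sentinels are handled before the string and callable cases, since True/False would otherwise be shadowed.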

@TomAugspurger
Author

What happens with a column name collision? I suggest we have a kwarg (perhaps 'collide' ?) that determines what happens.

At least for data IO methods like read_csv we'll want to provide that (pandas calls it mangle_dupe_cols).

I'm less certain of the need for it in methods that might introduce duplicate column labels in the course of normal operation, like concat. If we're trying to minimize the surface area of the API, a collide argument in pd.concat would be equivalent to handling the duplicates prior to concatenating, pd.concat([df1.rename(...), df2.rename(...)]). This is less user-friendly, but is easier on us.

@jbrockmendel
Contributor

Just checking, requiring unique all-string columns means "to satisfy the spec, you must support unique all-string columns", not "to satisfy the spec, you must support only unique all-string columns", right?

@rgommers
Member

rgommers commented Jul 8, 2022

@jbrockmendel the intent was the latter (only). That seems to be preferred from both a usability and "avoid complexity" point of view. This issue is quite old, but IIRC in more recent conversations there was pretty universal agreement on this. And the interchange protocol has that requirement as well: https://data-apis.org/dataframe-protocol/latest/design_requirements.html#protocol-design-requirements

@jbrockmendel
Contributor

so to be spec-compliant pandas would have to deprecate support for non-unique columns and non-string columns?

@rgommers
Member

rgommers commented Jul 8, 2022

For context: there is a significant tension between preserving backwards compatibility in libraries and not wanting to simply standardize the way things work now, for behavior/features that many maintainers don't like. As a result, library maintainers are not planning to implement the whole standard in their main namespace (or perhaps with some kind of switch, see gh-79).

I'd expect this to be one of those things where Pandas would either not want to deprecate this at all, or quite slowly.

@jbrockmendel
Contributor

I expect this is one of many things I'll have to get used to, but I find this confusing.

Saying "implementation X must support Y" seems reasonable. Continuing with "and it must not support Z" seems unnecessary and counterproductive. If I apply the same reasoning to the description of arrays being contiguous, that means you're not allowed to support strided arrays, so e.g. df.iloc[::2] would have to either be disallowed or make a copy?

@rgommers
Member

rgommers commented Jul 9, 2022

Saying "implementation X must support Y" seems reasonable. Continuing with "and it must not support Z" seems unnecessary and counterproductive.

Agreed - I think there are very few cases of this though. In general, we'd expect libraries to offer a superset of functionality of what's in a standard. So that means that if a user or downstream package author restricts themselves to the standardized set of APIs and to inputs that are supported, they have portable code. And if they go beyond it, they don't.

Maybe I was wrong above. It's possible for a library to support non-unique/non-string columns, as long as the behavior is compliant for any methods/functions in the standard. Additionally, it'd be good to have sane defaults outside of that, so for example all I/O routines and other standard ways of creating dataframes would default to producing unique string names. Otherwise it's too easy to write non-portable code. But then an explicit .rename(['col1', 'col2', 42]) may be fine.

If I apply the same reasoning to the description of arrays being contiguous, that means you're not allowed to support strided arrays, so e.g. df.iloc[::2] would have to either be disallowed or make a copy?

No, that's certainly not intended. The copy is only necessary in __dataframe__, for interchange with another library.

@MarcoGorelli
Contributor

MarcoGorelli commented Dec 20, 2023

if a user or downstream package author restricts themselves to the standardized set of APIs and to inputs that are supported, they have portable code. And if they go beyond it, they don't

this has generally been the guiding principle - so pandas need not forbid non-string column names, but anyone writing something like

df = data.__dataframe_consortium_standard__()
df.assign((df.col('a') + df.col('b')).rename(1999))

can't expect the above to produce dataframe-agnostic code

I think the opening issue / question has been addressed then, so closing, but please let me know if I've misunderstood and I'll reopen.
