Get and set column names #21

Closed
datapythonista opened this issue Jul 1, 2020 · 2 comments

Comments

@datapythonista
Member

Regarding column names, this proposal, similar to what pandas currently does, uses a columns property to get and set column names.

In #7, the preference is to restrict column names to strings and to disallow duplicates.

The proposed API with an example is:

>>> df = dataframe({'col1': [1, 2], 'col2': [3, 4]})
>>> df.columns = 'foo', 'bar'
>>> df.columns = ['foo', 'bar']
>>> df.columns = map(str.upper, df.columns)
>>> df.columns
['FOO', 'BAR']

And the following cases would fail:

>>> df.columns = 1
TypeError: Columns must be an iterable, not int
>>> df.columns = 'foo'
TypeError: Columns must be an iterable, not str
>>> df.columns = 'foo', 1
TypeError: Column names must be str, int found
>>> df.columns = 'foo', 'bar', 'foobar'
ValueError: Expected 2 column names, found 3
>>> df.columns = 'foo', 'foo'
ValueError: Column names cannot be duplicated. Found duplicates: foo

Some things that people may want to discuss:

  • Using a different name for the property (e.g. column_names)
  • Being able to set a single column name with df.columns[0] = 'foo' (the proposal doesn't allow it)
  • The return type of the columns (the proposal returns a Python list, pandas returns an Index)
  • Setting the name of the only column of a one-column dataframe with df.columns = 'foo' (the proposal requires an iterable, so df.columns = ['foo'] or equivalent is needed).

In case it's useful, this is the implementation of the examples:

import collections.abc
import typing


class dataframe:
    def __init__(self, data):
        # Iterating a dict yields its keys, which become the column names.
        self._columns = list(data)

    @property
    def columns(self) -> typing.List[str]:
        # Return a copy so callers cannot bypass the setter's validation
        # by mutating the internal list in place.
        return list(self._columns)

    @columns.setter
    def columns(self, names: typing.Iterable[str]):
        # A str is itself iterable, so it has to be rejected explicitly.
        if not isinstance(names, collections.abc.Iterable) or isinstance(names, str):
            raise TypeError(f'Columns must be an iterable, not {type(names).__name__}')

        # Materialize once: map objects and generators can only be consumed once.
        names = list(names)

        for name in names:
            if not isinstance(name, str):
                raise TypeError(f'Column names must be str, {type(name).__name__} found')

        if len(names) != len(self._columns):
            raise ValueError(f'Expected {len(self._columns)} column names, found {len(names)}')

        if len(set(names)) != len(names):
            # Sort so the error message is deterministic.
            duplicates = sorted({name for name in names if names.count(name) > 1})
            raise ValueError(f'Column names cannot be duplicated. Found duplicates: {", ".join(duplicates)}')

        self._columns = names
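
On the return-type question above, one option not spelled out in the proposal is to expose the names as a tuple, so that item assignment such as df.columns[0] = 'foo' fails loudly instead of mutating internal state. The sketch below is only an illustration of that idea; the class name TupleColumnsFrame is hypothetical and not part of the proposal.

```python
import typing


class TupleColumnsFrame:
    """Hypothetical variant: column names are exposed as an immutable tuple."""

    def __init__(self, data: typing.Mapping[str, list]):
        # Iterating a mapping yields its keys, used here as column names.
        self._columns = tuple(data)

    @property
    def columns(self) -> typing.Tuple[str, ...]:
        return self._columns


df = TupleColumnsFrame({'col1': [1, 2], 'col2': [3, 4]})

try:
    df.columns[0] = 'foo'  # tuples do not support item assignment
except TypeError as exc:
    item_assignment_error = exc
else:
    item_assignment_error = None
```

With a list return type, the same statement would either change the stored names without validation or modify a throwaway copy, depending on whether the getter copies.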
@datapythonista
Member Author

As pointed out in the meeting, the API in the description assumes the dataframe can be mutated in place (in this case, its labels). This has been discussed in #10, but it's still not decided which kind of API we want in terms of mutability. These are the main options:

  1. Mutable
df.columns = 'a', 'b'
df['a'] = 2
  2. Immutable
df = df.set_columns('a', 'b')
df = df.assign(a=2)
  3. Support both APIs (as pandas does)
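
To make the immutable option concrete, here is a minimal sketch of what a set_columns method could look like if it returns a new object and leaves the original untouched. The class name ImmutableFrame and its internals are assumptions for illustration only; set_columns is the name used in the option above, but its exact signature and validation rules were not specified.

```python
import typing


class ImmutableFrame:
    """Hypothetical dataframe whose labels cannot be changed in place."""

    def __init__(self, data: typing.Mapping[str, list]):
        self._data = dict(data)

    @property
    def columns(self) -> typing.Tuple[str, ...]:
        return tuple(self._data)

    def set_columns(self, *names: str) -> 'ImmutableFrame':
        for name in names:
            if not isinstance(name, str):
                raise TypeError(f'Column names must be str, {type(name).__name__} found')
        if len(names) != len(self._data):
            raise ValueError(f'Expected {len(self._data)} column names, found {len(names)}')
        if len(set(names)) != len(names):
            raise ValueError('Column names cannot be duplicated')
        # Pair the new names with the existing columns in order. The column
        # values are shared with the original frame, not copied.
        return ImmutableFrame(dict(zip(names, self._data.values())))


original = ImmutableFrame({'col1': [1, 2], 'col2': [3, 4]})
renamed = original.set_columns('a', 'b')
```

Here original.columns still holds the old names after the call, which is the property that distinguishes this style from the setter-based one.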

@rgommers
Member

The consensus ended up being: "no mutability". So we cannot set column names, and the interchange protocol has a simple column_names property at the dataframe level.

This seems resolved, closing.
