Get and set column names #21

datapythonista · 2020-07-01T16:08:28Z

Regarding column names, the following proposal, similar to what pandas currently does, uses a columns property to get and set column names.

In #7, the preference is to restrict column names to strings and not allow duplicates.

The proposed API with an example is:

>>> df = dataframe({'col1': [1, 2], 'col2': [3, 4]})
>>> df.columns = 'foo', 'bar'
>>> df.columns = ['foo', 'bar']
>>> df.columns = map(str.upper, df.columns)
>>> df.columns
['FOO', 'BAR']

And the following cases would fail:

>>> df.columns = 1
TypeError: Columns must be an iterable, not int
>>> df.columns = 'foo'
TypeError: Columns must be an iterable, not str
>>> df.columns = 'foo', 1
TypeError: Column names must be str, int found
>>> df.columns = 'foo', 'bar', 'foobar'
ValueError: Expected 2 column names, found 3
>>> df.columns = 'foo', 'foo'
ValueError: Column names cannot be duplicated. Found duplicates: foo

Some things that people may want to discuss:

Using a different name for the property (e.g. column_names)
Being able to set a single column name with df.columns[0] = 'foo' (the proposal doesn't allow it)
The return type of the columns (the proposal returns a Python list, pandas returns an Index)
Setting the column name of a single-column dataframe with df.columns = 'foo' (the proposal requires an iterable, so df.columns = ['foo'] or equivalent is needed).

In case it's useful, this is the implementation of the examples:

import collections.abc
import typing


class dataframe:
    def __init__(self, data):
        # Iterating a dict yields its keys, which become the column names
        self._columns = list(data)

    @property
    def columns(self) -> typing.List[str]:
        # Return a copy so callers cannot bypass validation by mutating the list
        return list(self._columns)

    @columns.setter
    def columns(self, names: typing.Iterable[str]):
        # A str is iterable too, but it is rejected so 'foo' is not read as 'f', 'o', 'o'
        if not isinstance(names, collections.abc.Iterable) or isinstance(names, str):
            raise TypeError(f'Columns must be an iterable, not {type(names).__name__}')

        # Materialize once, since iterators such as map() can only be consumed once
        names = list(names)

        for name in names:
            if not isinstance(name, str):
                raise TypeError(f'Column names must be str, {type(name).__name__} found')

        if len(names) != len(self._columns):
            raise ValueError(f'Expected {len(self._columns)} column names, found {len(names)}')

        if len(set(names)) != len(names):
            # Sort so the error message is deterministic
            duplicates = sorted(name for name in set(names) if names.count(name) > 1)
            raise ValueError(f'Column names cannot be duplicated. Found duplicates: {", ".join(duplicates)}')

        self._columns = names


datapythonista · 2020-07-02T18:07:57Z

As pointed out in the meeting, the API in the description assumes the dataframe (in this case its labels) can be mutated in place. This has been discussed in #10, but it's still not decided which kind of API we want in terms of mutability. These would be the main options:

Mutable

df.columns = 'a', 'b'
df['a'] = 2

Immutable

df = df.set_columns('a', 'b')
df = df.assign(a=2)

Support both APIs (as pandas does)
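
To make the immutable option more concrete, here is a minimal sketch. It assumes the hypothetical set_columns method from the example above, applies the same checks as the setter in the description, and returns a new object instead of modifying the original. The class name ImmutableFrame and the _data attribute are made up for illustration, not part of any agreed API.

```python
class ImmutableFrame:
    """Hypothetical frame whose column labels cannot be changed in place."""

    def __init__(self, data):
        # data maps column name -> list of values
        self._data = dict(data)

    @property
    def columns(self):
        # A tuple does not support item assignment, so the labels
        # cannot be changed through the value returned here
        return tuple(self._data)

    def set_columns(self, *names):
        if len(names) != len(self._data):
            raise ValueError(f'Expected {len(self._data)} column names, found {len(names)}')
        for name in names:
            if not isinstance(name, str):
                raise TypeError(f'Column names must be str, {type(name).__name__} found')
        if len(set(names)) != len(names):
            raise ValueError('Column names cannot be duplicated')
        # Pair the new names with the existing values, in order, in a new object
        return ImmutableFrame(dict(zip(names, self._data.values())))


df = ImmutableFrame({'col1': [1, 2], 'col2': [3, 4]})
renamed = df.set_columns('a', 'b')
print(df.columns)       # ('col1', 'col2'), the original is unchanged
print(renamed.columns)  # ('a', 'b')
```

The trade-off is that every rename allocates a new object, but no other reference to df can observe its labels changing.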

rgommers · 2021-06-25T20:33:37Z

The consensus ended up being: "no mutability". So we cannot set column names, and the interchange protocol has a simple column_names property at the dataframe level.

This seems resolved, closing.
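
For readers landing here later, the "no mutability" outcome can be illustrated with a read-only property. This is only a sketch of the idea: the ReadOnlyFrame class and its _data attribute are hypothetical, and the code is not the interchange protocol's actual definition.

```python
class ReadOnlyFrame:
    """Hypothetical frame exposing column names without a way to set them."""

    def __init__(self, data):
        # data maps column name -> list of values
        self._data = dict(data)

    @property
    def column_names(self):
        # No setter is defined, so assigning to column_names raises AttributeError
        return list(self._data)


df = ReadOnlyFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df.column_names)  # ['col1', 'col2']

try:
    df.column_names = ['a', 'b']
except AttributeError:
    print('column names cannot be set')
```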

datapythonista mentioned this issue Jul 1, 2020

Dataframe MVP #14

Closed

datapythonista mentioned this issue Aug 7, 2020

Dataframe interchange protocol #25

Closed

rgommers closed this as completed Jun 25, 2021
