Pandas is a package for data manipulation and analysis in Python. It is highly popular in both the scientific and commercial communities. Thanks to its productivity and efficiency, Pandas is well suited to ad-hoc exploratory work, prototyping, and use in production systems. Pandas is also highly flexible; however, this means there is more than one way to skin a Panda.
This opinionated guide presents best practices for writing Pandas and DataFrame code that is more consistent, reliable, maintainable, and readable. It is aimed mainly at code used in production systems, not ad-hoc exploratory work.
Why do we need this? Pandas users come from diverse backgrounds (e.g. Data Scientist, Data Engineer, Researcher, Software Engineer) and language experiences (e.g. SQL, MATLAB, Java), which can lead to inconsistent coding styles and the use of bad practices.
```python
# Good
df['column']

# Bad
df.column
```
When selecting a column in a DataFrame, users should use dictionary-like selection, e.g. `df['column']`, and not property selection, e.g. `df.column`.
Why:
- It makes it more explicit to the reader that you are accessing a column, not a standard property or method.
- Not every column name can be represented as a property: it must be a valid Python identifier and must not clash with an existing DataFrame method or attribute (see the example below).
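For example (a minimal illustration; the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'total count': [1, 2], 'min': [3, 4]})

df['total count']  # works for any column name
# df.total count   # SyntaxError: not a valid Python identifier
df['min']          # selects the column
df.min             # returns the built-in DataFrame.min method, not the column
```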
```python
# Good
df = df.drop(columns='A')

# Bad
df.drop(columns='A', inplace=True)
```
Some operations can be performed either in place or by re-assignment. Users should always opt for re-assignment for better readability. With a few exceptions, both forms perform a data copy, so there is no performance difference between them.
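Re-assignment also keeps method chaining available, since `inplace=True` returns `None`. A minimal sketch (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Re-assignment composes naturally into a chain
df = df.drop(columns='A').rename(columns={'B': 'b'})

# The inplace form returns None, so the equivalent chain fails:
# df.drop(columns='A', inplace=True).rename(columns={'B': 'b'})
# AttributeError: 'NoneType' object has no attribute 'rename'
```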
```python
# Good (file input)
df = pd.read_csv('data.csv')
df = df[['col1', 'col2', 'col3']]

# Good (collection input)
df = pd.DataFrame.from_records(list_of_dicts)
df = df[['col1', 'col2', 'col3']]
```
After initialising a DataFrame, users should explicitly select the columns to use, even if all of them are selected. This creates a clear contract between the code and the reader about what is expected in the downstream data schema.
It is also recommended to explicitly specify data types. When reading files, data types can subtly change with small alterations to the input, e.g. from integer to float, or from numerical to string. It is helpful for the program to throw an error if the input is unexpected; otherwise it will continue silently.
```python
# Good
df = pd.read_csv('data.csv', dtype={'col1': 'str', 'col2': 'int', 'col3': 'float'})
df = df[['col1', 'col2', 'col3']]
```
```python
# Good
df_3 = df_1.merge(
    df_2,
    how='inner',
    on='col1',
    validate='1:1'
)

# Bad
df_3 = df_1.merge(df_2)
```
Users should explicitly specify the `how` and `on` parameters of a merge for readability, even if the default parameters would produce the same result.
It is also important to validate the merge type using the `validate` parameter; this prevents unexpected "merge duplication". Upstream data can subtly change and start producing multiple rows per merge key. If no validation is performed, the merge is silently promoted to "many-to-one" or "many-to-many", and the data multiplies and duplicates. It is therefore useful to have the merge assumptions explicitly stated.
Also, don't deduplicate rows after a merge to remove merge duplication. Remove duplicates before joining, or better still, determine why there are unexpected duplicate keys and remove them upstream. Merge duplication is computationally and memory expensive and produces hard-to-debug data bugs.
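A small sketch of how `validate` catches this (the frames and keys are made up):

```python
import pandas as pd

df_1 = pd.DataFrame({'col1': [1, 2], 'left_val': ['a', 'b']})
df_2 = pd.DataFrame({'col1': [1, 1, 2], 'right_val': ['x', 'y', 'z']})  # key 1 is duplicated

# Silently produces 3 rows instead of the expected 2
df_3 = df_1.merge(df_2, how='inner', on='col1')

# Fails fast instead: raises pandas.errors.MergeError because the
# right-hand keys are not unique, so the merge is not one-to-one
df_3 = df_1.merge(df_2, how='inner', on='col1', validate='1:1')
```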
```python
# Good
df['new_col_float'] = np.nan
df['new_col_int'] = pd.Series(dtype='int')
df['new_col_str'] = pd.Series(dtype='object')

# Bad
df['new_col_int'] = 0
df['new_col_str'] = ''
```
If a new empty column is needed, always use NaN values. Never use "filler" values such as zeros or empty strings. This preserves the ability to use methods such as `isnull` or `notnull`.
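A quick illustration of what filler values destroy (made-up column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3]})

df['new_col_float'] = np.nan  # missing data stays visible
df['new_col_int'] = 0         # filler value

df['new_col_float'].isnull()  # True, True, True
df['new_col_int'].isnull()    # False, False, False -- missingness is lost
```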
```python
# Bad
def func_1(df: pd.DataFrame) -> pd.DataFrame:
    df['new_col'] = df['col1'] + 1
    return df

df = func_1(df)
```
In large code bases, it can be tempting to keep adding columns to a DataFrame and to use it as the data-exchange format between methods and functions. This, however, leads to code that is difficult to maintain, because readers cannot determine the schema of a DataFrame without running the code. It is preferable to use dataclasses as the interchange format, as the class definition is explicit and effectively immutable (on Python < 3.7, use NamedTuples).
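A minimal sketch of the dataclass approach (the record types and fields are made up for illustration):

```python
from dataclasses import dataclass
from typing import List

# The schema of each exchange format is stated explicitly and the
# instances are effectively immutable (frozen=True)
@dataclass(frozen=True)
class InputRecord:
    col1: int

@dataclass(frozen=True)
class OutputRecord:
    col1: int
    new_col: int

def func_1(rows: List[InputRecord]) -> List[OutputRecord]:
    return [OutputRecord(col1=r.col1, new_col=r.col1 + 1) for r in rows]
```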
Contributions and discussion are very welcome. Please create an issue or pull request.