Skip to content
This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Add pandas index checks #200

Merged
merged 5 commits into from
Oct 14, 2022
Merged

Add pandas index checks #200

merged 5 commits into from
Oct 14, 2022

Conversation

skrawcz
Copy link
Collaborator

@skrawcz skrawcz commented Sep 22, 2022

Adds some basic index type checking to df creation

Related to issue #191, this commit is to help surface index type issues.

Specifically:

  1. Warn if there are index type mismatches.
  2. Require you to set your logger to debug if you want to see more details.
  3. Provide a "ResultBuilder" class that uses strict index type matching so if you
    want to error on index type mismatches, this is the results builder to use.

I don't think we should build anything more custom unless there's a clear
common use case - user contributed result builders sound like an interesting idea here though.

To use the new result builder the code should be the following:

from hamilton import base, driver
strict_builder = base.StrictIndexTypePandasDataFrameResult()
adapter = base.SimplePythonGraphAdapter(strict_builder)
...
dr =  driver.Driver(config, *modules, adapter=adapter)
df = dr.execute(...)  # this will now error if index types mismatch.

Changes

  • one clean up commit around imports
  • one commit with the index type changes
  • one commit to handle python 3.6 support and pandas index classes changing

Testing

  1. Unit tests.
  2. Checked things in the REPL locally against the examples in Better error handling for errors in the execute() stage #191.

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python - local testing

  • python 3.6
  • python 3.7
  • python 3.8
  • python 3.9

Importing `typing` and using that as a prefix was getting unsightly.
So moved to importing the types explicitly.
Related to issue #191, this commit is to help surface index type issues.

Specifically:

1. Warn if there are index type mismatches.
2. Require you to set your logger to debug if you want to see more details.
3. Provide a "ResultBuilder" class that uses strict index type matching so if you
want to error on index type mismatches, this is the results builder to use.

I don't think we should build anything more custom unless there's a clear
common use case - user contributed result builders sound like an interesting idea.
Pandas dropped support for python 3.6 in something like 1.2.
So pandas 1.1.5 is what we're using in our CI system, and that
does not have the `NDArrayBackedExtensionIndex` type.

So I'm guessing here, but looking at the 1.1.5 source, we instead
want `ExtensionIndex`.
Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good -- curious on some decisions though around time-series

hamilton/base.py Show resolved Hide resolved
hamilton/base.py Outdated Show resolved Hide resolved
hamilton/base.py Outdated Show resolved Hide resolved
hamilton/base.py Outdated Show resolved Hide resolved
hamilton/base.py Show resolved Hide resolved
hamilton/base.py Outdated Show resolved Hide resolved
TIL: you can create a dataframe and pass in an index object
and it'll happily use it as a column.

So this test should exist for dataframe creation since it's a valid case.

But for the index type checking, I'm adding it here even though
it does not have an explicit index. Therefore, one could make the argument it
doesn't qualify here. But, I'd rather push people to be explicit
in their code, e.g. if they want to be strict on indexes, then they should
make the index a series, rather than passing an Index object.
So that way it's clearer what's going on and why.
I decided to use private functions to the static ones because
I don't really want them used outside of that function.
Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is all good except the time-series stuff is confusing me.

hamilton/base.py Show resolved Hide resolved
hamilton/base.py Show resolved Hide resolved
@elijahbenizzy elijahbenizzy self-requested a review October 14, 2022 20:32
@skrawcz skrawcz merged commit 5a9de3f into main Oct 14, 2022
@skrawcz skrawcz deleted the add_pandas_index_checks branch October 14, 2022 20:33
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants