Skip to content
This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

PandasDataFrameResult: Convert non-list values to single row frame #243

Merged
merged 6 commits into from
Dec 14, 2022

Conversation

ianhoffman
Copy link
Contributor

When trying to run an intermediate node which produces a scalar in the hello_world example, Hamilton throws an error:

WARNING:hamilton.base:It appears no Pandas index type was detected. This will likely break when trying to create a DataFrame. E.g. are you requesting all scalar values? Use a different result builder or return at least one Pandas object with an index. Ignore this warning if you're using DASK for now.
ERROR:hamilton.driver:-------------------------------------------------------------------
Oh no an error! Need help with Hamilton?
Join our slack and ask for help! https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg
-------------------------------------------------------------------

Traceback (most recent call last):
  File "my_script.py", line 29, in <module>
    df = dr.execute(output_columns)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/driver.py", line 142, in execute
    raise e
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/driver.py", line 139, in execute
    return self.adapter.build_result(**outputs)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/base.py", line 171, in build_result
    raise ValueError(f"Cannot build result. Cannot handle type {value}.")
ValueError: Cannot build result. Cannot handle type 28.333333333333332.

If we can run an entire DAG, it seems like we should be able to run any sub-DAG of the DAG.

Changes

Updates PandasDataFrameResult.build_result() to convert scalar values into dataframes.

How I tested this

Updated unit tests.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@skrawcz
Copy link
Collaborator

skrawcz commented Dec 12, 2022

Awesome thanks.

Note: This will still break if you try to get multiple "scalar" values back. 🤔
But I think we have enough information to do the same fix with a "requisite" if statement.

It's getting late, but we can chat and see if you'd like to add these extras, or I can do it.

@ianhoffman
Copy link
Contributor Author

Awesome thanks.

Note: This will still break if you try to get multiple "scalar" values back. 🤔 But I think we have enough information to do the same fix with a "requisite" if statement.

It's getting late, but we can chat and see if you'd like to add these extras, or I can do it.

Pushed a fix... let me know what you think. I guess I also could have done this by checking whether len(all_index_types) == len(no_indexes). But that felt like maybe not the intended usage of PandasDataFrameResult.pandas_index_types.

hamilton/base.py Outdated
@@ -168,7 +168,12 @@ def build_result(**outputs: Dict[str, Any]) -> pd.DataFrame:
return value
elif isinstance(value, pd.Series):
return pd.DataFrame(outputs)
raise ValueError(f"Cannot build result. Cannot handle type {value}.")

if not any(isinstance(value, (pd.DataFrame, pd.Series)) for value in outputs.values()):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also check for scalars explicitly here, e.g. (int, str, float, bool) or something like that. Since this will break for valid outputs:

>>> pd.DataFrame({'foo': [1, 2], 'bar': [3, 4]})
   foo  bar
0    1    3
1    2    4
>>> pd.DataFrame([{'foo': [1, 2], 'bar': [3, 4]}])
      foo     bar
0  [1, 2]  [3, 4]

Don't know how much we want to worry about this edge case, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could specifically check for list, but I think this is OK. My opinion is that the goal of this should be (a) to get it not to break and (b) to get it not to lose any information. If folks don't like the way we join it they can use the dict result or build a custom ResultBuilder. Might be nice logging a warning though with the info so they know what to do...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm — we already do log some sort of warning:

❯ python3 my_script.py
WARNING:hamilton.base:It appears no Pandas index type was detected. This will likely break when trying to create a DataFrame. E.g. are you requesting all scalar values? Use a different result builder or return at least one Pandas object with an index. Ignore this warning if you're using DASK for now.
   spend_std_dev
0      17.224014

This isn't the most intuitive though, since things now work when you request all scalar values (but may not work when you request a list).

@skrawcz said he was gonna revisit that error message; my two cents are that we could

  • log nothing if all scalars are requested (since it works)
  • log if lists are requested, OR do a best effort attempt to resolve lists into dataframes. Probably logging is good enough.

Copy link
Collaborator

@skrawcz skrawcz Dec 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the general case we're dealing with is if there is no "index" for pandas to know what to do.

Test cases to cover:

  • list like things (e.g. builtin types and np types and "sequence"/"generator" types)
  • scalars
  • other objects

and then the mixture of them, e.g. list & scalar == fine, list & list == fine (if they're the same length), scalar & scalar == fine, scalar & object == fine, etc.

Copy link
Contributor Author

@ianhoffman ianhoffman Dec 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we only need to worry about cases where pd.DataFrame(outputs) won't work... e.g. when we need to do pd.DataFrame([outputs]) instead. Yeah, there are times when pd.DataFrame(outputs) will blow up – e.g. if you pass a combination of dicts and lists — which we aren't yet validating, but I think that's out of scope for now (not that we should never take it on). Pandas already emits pretty decent error messages in those cases. So I'm tempted to say that the code is good enough as is (pending a switch to is_list_like) and that we should add tests for other cases but not explicitly handle them here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. pandas handles these cases perfectly fine:

>>> pd.DataFrame({'foo': {'a': 1, 'b': 2}, 'bar': {'c': 2, 'd': 3}})
   foo  bar
a  1.0  NaN
b  2.0  NaN
c  NaN  2.0
d  NaN  3.0
>>> pd.DataFrame({'foo': [3, 4], 'bar': [1, 2]})
   foo  bar
0    3    1
1    4    2

And it blows up in an informative way here:

>>> pd.DataFrame({'foo': [3, 4], 'bar': {'a': 1, 'b': 2}})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/frame.py", line 663, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 669, in _extract_index
    raise ValueError(
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

Casting to a list does fix the above error, but the output isn't worth much:

>>> pd.DataFrame([{'foo': [3, 4], 'bar': {'a': 1, 'b': 2}}])
      foo               bar
0  [3, 4]  {'a': 1, 'b': 2}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm wondering if this set of rules is just too complicated and it should break when there's no index. Might not be worth trying to intuit it...

So:

1 series/DF or more -> works
0 series -> breaks

Any more complex and we're setting ourselves up for unintuitive behavior later. Maybe the break says "You should use the dict adapter if you don't have a dataframe output"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did that, we'd need to swap out adapters to run intermediate nodes, which seems painful for debugging. I think it's pretty reasonable to convert scalars into a single-row, multi-column DF and display them — that's helpful for end-users and not super complex.

I do think anything beyond that is overdoing it, though.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah -- I think that makes sense overall. 👍 Still not 100% sure about it being smart, but I think its not too complex (as you said). And, result builders are cheap -- so long as they don't end up dropping data its not the end of the world.

Thinking out loud: just to confirm, this is for the use-case of just running a piece of the DAG, right? E.G. cell-by-cell in your jupyter notebook/kernel example? I wonder if there's a special way to provide a spine. E.G. have something like a configuration element that the DAG-builder knows about, or something similar. Also we could have guess_spine as a parameter to enable less strict checking...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this.

this is for the use-case of just running a piece of the DAG, right?

Yup.

I wonder if there's a special way to provide a spine. E.G. have something like a configuration element that the DAG-builder knows about, or something similar. Also we could have guess_spine as a parameter to enable less strict checking...

That could work. Alternatively my kernel could just provide its own custom result builder which does the behavior implemented in this PR.

I do think most users won't care about these sort of advanced configuration options though; they probably expect Hamilton to "just work" without any configuration. It seems to me that visualizing >=1 scalar values is a pretty legit use-case, so IMO that should "just work" without config. Whether we allow config to support more advanced use cases is a different question. I do think we should support config for advanced use cases, but I don't think it's urgent.

@skrawcz
Copy link
Collaborator

skrawcz commented Dec 12, 2022

Actually it seems we should be able rely more on pandas here -- e.g. https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_list_like.html 🤔

@ianhoffman
Copy link
Contributor Author

https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_list_like.html

Ah, very nice. I'll look into switching to that later.

* Use `is_list_like` instead of explicitly checking for dataframes and series
* Add a bunch of tests for building a result from different types, such as scalars, lists, numpy arrays, mixtures of arrays and dicts, etc.
hamilton/base.py Outdated Show resolved Hide resolved
hamilton/base.py Outdated Show resolved Hide resolved
hamilton/base.py Outdated Show resolved Hide resolved
tests/test_base.py Outdated Show resolved Hide resolved
* Tests for objects and scalars + dicts
* Improve comment to mention objects as well as scalars
@ianhoffman ianhoffman changed the title Handle scalar values in PandasDataFrameResult PandasDataFrameResult: Convert non-list values to single row frame Dec 13, 2022
@ianhoffman ianhoffman requested review from skrawcz and elijahbenizzy and removed request for skrawcz and elijahbenizzy December 13, 2022 23:57
@@ -119,14 +119,26 @@ def test_SimplePythonDataFrameGraphAdapter_check_input_type_mismatch(node_type,
assert actual is False


def _gen_ints(n: int) -> typing.Iterator[int]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nitty nit pick:

  1. typing could be Generator
  2. You can use yield from
def _gen_int(n) -> Generator[int, None, None]:
   ...:     yield from range(n)

Copy link
Contributor Author

@ianhoffman ianhoffman Dec 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Iterator[T] is a subtype of Generator[T, …], but I could be wrong (this is actually why I messaged you about mypy).

Point taken about using yield from.

Copy link
Collaborator

@skrawcz skrawcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good enough!

Great thanks for the help @ianhoffman !

@skrawcz
Copy link
Collaborator

skrawcz commented Dec 14, 2022

Only TODO will be to update the check index warning.

@ianhoffman
Copy link
Contributor Author

Only TODO will be to update the check index warning.

Yeah, I can do that too - just not sure what we want the behavior to be. Maybe we just shouldn’t log if all scalars/objects were provided? Since we are providing this IMO pretty reasonable output in that case. Or did you have something else in mind?

@skrawcz
Copy link
Collaborator

skrawcz commented Dec 14, 2022

Only TODO will be to update the check index warning.

Yeah, I can do that too - just not sure what we want the behavior to be. Maybe we just shouldn’t log if all scalars/objects were provided? Since we are providing this IMO pretty reasonable output in that case. Or did you have something else in mind?

We can do that in a follow up PR/commit.

@skrawcz skrawcz merged commit bcdbe13 into stitchfix:main Dec 14, 2022
skrawcz added a commit that referenced this pull request Dec 14, 2022
Following #243 we need to adjust the warning
to make sense. Since we are not in danger of things
breaking imminently. So the warning is just there for
the user now to indicate how Pandas might be resolving
indexes to create the dataframe.
@skrawcz skrawcz mentioned this pull request Dec 14, 2022
7 tasks
@skrawcz
Copy link
Collaborator

skrawcz commented Dec 14, 2022

@ianhoffman #246

skrawcz added a commit that referenced this pull request Dec 19, 2022
* Adjusts index type check warnings

Following #243 we need to adjust the warning
to make sense. Since we are not in danger of things
breaking imminently. So the warning is just there for
the user now to indicate how Pandas might be resolving
indexes to create the dataframe.

* Fixes bug with showing count of no index outputs

This message was just wrong with the count displayed.
So fixing it to reflect how many no index outputs there
actually are.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants