PandasDataFrameResult: Convert non-list values to single row frame #243

ianhoffman · 2022-12-12T01:36:20Z

When trying to run an intermediate node which produces a scalar in the hello_world example, Hamilton throws an error:

WARNING:hamilton.base:It appears no Pandas index type was detected. This will likely break when trying to create a DataFrame. E.g. are you requesting all scalar values? Use a different result builder or return at least one Pandas object with an index. Ignore this warning if you're using DASK for now.
ERROR:hamilton.driver:-------------------------------------------------------------------
Oh no an error! Need help with Hamilton?
Join our slack and ask for help! https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg
-------------------------------------------------------------------

Traceback (most recent call last):
  File "my_script.py", line 29, in <module>
    df = dr.execute(output_columns)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/driver.py", line 142, in execute
    raise e
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/driver.py", line 139, in execute
    return self.adapter.build_result(**outputs)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/hamilton/base.py", line 171, in build_result
    raise ValueError(f"Cannot build result. Cannot handle type {value}.")
ValueError: Cannot build result. Cannot handle type 28.333333333333332.

If we can run an entire DAG, it seems like we should be able to run any sub-DAG of the DAG.

Changes

Updates PandasDataFrameResult.build_result() to convert scalar values into dataframes.
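
For illustration only (a minimal sketch, not the code in this PR): pandas refuses to build a frame from a dict of bare scalars, but wrapping the dict in a list gives a single-row frame. The output name `spend_mean` is assumed here; `spend_std_dev` and the values come from the logs in this thread.

```python
import pandas as pd

# Hypothetical outputs of a sub-DAG that returns only scalars.
outputs = {"spend_mean": 28.333333333333332, "spend_std_dev": 17.224014}

try:
    pd.DataFrame(outputs)  # dict of scalars: pandas has no index to build rows from
except ValueError as err:
    print(f"pandas refuses: {err}")

# Wrapping the dict in a list treats it as one record, i.e. one row.
single_row = pd.DataFrame([outputs])
print(single_row)
```

Each requested output becomes a column, so no information is lost.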

How I tested this

Updated unit tests.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

skrawcz · 2022-12-12T07:22:27Z

Awesome thanks.

Note: This will still break if you try to get multiple "scalar" values back. 🤔
But I think we have enough information to do the same fix with a "requisite" if statement.

It's getting late, but we can chat and see if you'd like to add these extras, or I can do it.

ianhoffman · 2022-12-12T13:45:18Z

Pushed a fix... let me know what you think. I guess I also could have done this by checking whether len(all_index_types) == len(no_indexes). But that felt like maybe not the intended usage of PandasDataFrameResult.pandas_index_types.

ianhoffman · 2022-12-12T13:59:39Z

hamilton/base.py

@@ -168,7 +168,12 @@ def build_result(**outputs: Dict[str, Any]) -> pd.DataFrame:
                return value
            elif isinstance(value, pd.Series):
                return pd.DataFrame(outputs)
-            raise ValueError(f"Cannot build result. Cannot handle type {value}.")
+
+        if not any(isinstance(value, (pd.DataFrame, pd.Series)) for value in outputs.values()):


We could also check for scalars explicitly here, e.g. (int, str, float, bool) or something like that. Since this will break for valid outputs:

>>> pd.DataFrame({'foo': [1, 2], 'bar': [3, 4]})
   foo  bar
0    1    3
1    2    4
>>> pd.DataFrame([{'foo': [1, 2], 'bar': [3, 4]}])
      foo     bar
0  [1, 2]  [3, 4]

Don't know how much we want to worry about this edge case, though.

Could specifically check for list, but I think this is OK. My opinion is that the goal of this should be (a) to get it not to break and (b) to get it not to lose any information. If folks don't like the way we join it they can use the dict result or build a custom ResultBuilder. Might be nice logging a warning though with the info so they know what to do...

Hmm — we already do log some sort of warning:

❯ python3 my_script.py
WARNING:hamilton.base:It appears no Pandas index type was detected. This will likely break when trying to create a DataFrame. E.g. are you requesting all scalar values? Use a different result builder or return at least one Pandas object with an index. Ignore this warning if you're using DASK for now.
   spend_std_dev
0      17.224014

This isn't the most intuitive though, since things now work when you request all scalar values (but may not work when you request a list).

@skrawcz said he was gonna revisit that error message; my two cents are that we could:

* log nothing if all scalars are requested (since it works)
* log if lists are requested, OR do a best effort attempt to resolve lists into dataframes. Probably logging is good enough.

The general case we're dealing with is when there is no "index" for pandas to know what to do with.

Test cases to cover:

* list-like things (e.g. builtin types, np types, and "sequence"/"generator" types)
* scalars
* other objects

And then mixtures of them, e.g. list & scalar == fine, list & list == fine (if they're the same length), scalar & scalar == fine, scalar & object == fine, etc.
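
A quick sketch of how plain pandas treats those mixtures (illustrative column names, not the tests added in this PR):

```python
import pandas as pd

# list & scalar: pandas broadcasts the scalar down the list's length.
mixed = pd.DataFrame({"a": [1, 2], "b": 3})

# list & list of equal length: one row per element.
lists = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# scalar & object: works once the dict is wrapped in a list (one record).
scalars = pd.DataFrame([{"a": 1, "b": object()}])

# list & list of different lengths: pandas raises a ValueError.
try:
    pd.DataFrame({"a": [1, 2], "b": [3, 4, 5]})
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mixed, lists, scalars, mismatch_raised, sep="\n\n")
```

Only the all-scalar/object case needs the list wrapper; the others already work (or fail loudly) with `pd.DataFrame(outputs)`.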

I think that we only need to worry about cases where pd.DataFrame(outputs) won't work... e.g. when we need to do pd.DataFrame([outputs]) instead. Yeah, there are times when pd.DataFrame(outputs) will blow up – e.g. if you pass a combination of dicts and lists — which we aren't yet validating, but I think that's out of scope for now (not that we should never take it on). Pandas already emits pretty decent error messages in those cases. So I'm tempted to say that the code is good enough as is (pending a switch to is_list_like) and that we should add tests for other cases but not explicitly handle them here.

E.g. pandas handles these cases perfectly fine:

>>> pd.DataFrame({'foo': {'a': 1, 'b': 2}, 'bar': {'c': 2, 'd': 3}})
   foo  bar
a  1.0  NaN
b  2.0  NaN
c  NaN  2.0
d  NaN  3.0
>>> pd.DataFrame({'foo': [3, 4], 'bar': [1, 2]})
   foo  bar
0    3    1
1    4    2

And it blows up in an informative way here:

>>> pd.DataFrame({'foo': [3, 4], 'bar': {'a': 1, 'b': 2}})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/frame.py", line 663, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 669, in _extract_index
    raise ValueError(
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

Casting to a list does fix the above error, but the output isn't worth much:

>>> pd.DataFrame([{'foo': [3, 4], 'bar': {'a': 1, 'b': 2}}])
      foo               bar
0  [3, 4]  {'a': 1, 'b': 2}

Yeah I'm wondering if this set of rules is just too complicated and it should break when there's no index. Might not be worth trying to intuit it...

So:

1 series/DF or more -> works
0 series -> breaks

Any more complex and we're setting ourselves up for unintuitive behavior later. Maybe the break says "You should use the dict adapter if you don't have a dataframe output"

If we did that, we'd need to swap out adapters to run intermediate nodes, which seems painful for debugging. I think it's pretty reasonable to convert scalars into a single-row, multi-column DF and display them — that's helpful for end-users and not super complex.

I do think anything beyond that is overdoing it, though.

Yeah -- I think that makes sense overall. 👍 Still not 100% sure about it being smart, but I think it's not too complex (as you said). And result builders are cheap -- so long as they don't end up dropping data, it's not the end of the world.

Thinking out loud: just to confirm, this is for the use-case of just running a piece of the DAG, right? E.G. cell-by-cell in your jupyter notebook/kernel example? I wonder if there's a special way to provide a spine. E.G. have something like a configuration element that the DAG-builder knows about, or something similar. Also we could have guess_spine as a parameter to enable less strict checking...

Missed this.

this is for the use-case of just running a piece of the DAG, right?

Yup.

I wonder if there's a special way to provide a spine. E.G. have something like a configuration element that the DAG-builder knows about, or something similar. Also we could have guess_spine as a parameter to enable less strict checking...

That could work. Alternatively my kernel could just provide its own custom result builder which does the behavior implemented in this PR.

I do think most users won't care about these sort of advanced configuration options though; they probably expect Hamilton to "just work" without any configuration. It seems to me that visualizing >=1 scalar values is a pretty legit use-case, so IMO that should "just work" without config. Whether we allow config to support more advanced use cases is a different question. I do think we should support config for advanced use cases, but I don't think it's urgent.

skrawcz · 2022-12-12T17:49:24Z

Actually it seems we should be able rely more on pandas here -- e.g. https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_list_like.html 🤔

ianhoffman · 2022-12-12T17:49:47Z

Ah, very nice. I'll look into switching to that later.
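
A minimal sketch of what an `is_list_like`-based check could look like. `build_frame` is a hypothetical helper for illustration, not Hamilton's actual `build_result()`:

```python
from typing import Any, Dict

import pandas as pd
from pandas.api.types import is_list_like


def build_frame(outputs: Dict[str, Any]) -> pd.DataFrame:
    """Hypothetical helper: build a frame from named outputs."""
    if not any(is_list_like(value) for value in outputs.values()):
        # Only scalars/objects were requested: treat the dict as one record.
        return pd.DataFrame([outputs])
    # At least one list-like value: let pandas align and broadcast as usual.
    return pd.DataFrame(outputs)


print(build_frame({"a": 1, "b": "x"}))     # single row; strings are not list-like
print(build_frame({"a": [1, 2], "b": 3}))  # two rows; the scalar is broadcast
```

Note that `is_list_like` treats strings as scalars and dicts as list-like, so mixed dict/list inputs still fall through to pandas' own error handling.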

* Tests for objects and scalars + dicts
* Improve comment to mention objects as well as scalars

elijahbenizzy · 2022-12-14T05:28:52Z

tests/test_base.py

@@ -119,14 +119,26 @@ def test_SimplePythonDataFrameGraphAdapter_check_input_type_mismatch(node_type,
    assert actual is False


+def _gen_ints(n: int) -> typing.Iterator[int]:


Very nitty nit pick:

typing could be Generator

You can use yield from

def _gen_int(n) -> Generator[int, None, None]:
    yield from range(n)

I think Iterator[T] is a subtype of Generator[T, …], but I could be wrong (this is actually why I messaged you about mypy).

Point taken about using yield from.
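
On the subtype question: as far as the standard library is concerned, the relationship runs the other way (`Generator` is a subclass of `Iterator`), so either annotation is valid on a generator function. A small check, with illustrative function names:

```python
import collections.abc
from typing import Generator, Iterator


def gen_ints_iterator(n: int) -> Iterator[int]:
    # Looser annotation: callers only rely on iteration.
    yield from range(n)


def gen_ints_generator(n: int) -> Generator[int, None, None]:
    # Stricter annotation: also promises send()/throw()/close().
    yield from range(n)


# Every generator is an iterator, but not every iterator is a generator.
print(issubclass(collections.abc.Generator, collections.abc.Iterator))
print(issubclass(collections.abc.Iterator, collections.abc.Generator))
print(list(gen_ints_iterator(3)), list(gen_ints_generator(3)))
```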

skrawcz

I think this is good enough!

Great thanks for the help @ianhoffman !

skrawcz · 2022-12-14T18:55:55Z

Only TODO will be to update the check index warning.

ianhoffman · 2022-12-14T19:05:22Z

Yeah, I can do that too - just not sure what we want the behavior to be. Maybe we just shouldn’t log if all scalars/objects were provided? Since we are providing this IMO pretty reasonable output in that case. Or did you have something else in mind?

skrawcz · 2022-12-14T19:05:52Z

We can do that in a follow up PR/commit.

skrawcz · 2022-12-14T19:35:02Z

@ianhoffman #246

* Adjusts index type check warnings

Following #243, we need to adjust the warning so that it makes sense, since we are no longer in danger of things breaking imminently. The warning is now just there to tell the user how Pandas might be resolving indexes to create the dataframe.

* Fixes bug with showing count of no index outputs

The message displayed the wrong count. This fixes it to reflect how many no-index outputs there actually are.

Handle scalar values in PandasDataFrameResult

9a8385c

handle requests for all scalar values

d4433e9

ianhoffman added 2 commits December 12, 2022 23:19

PR feedback:

d8f0452

* Use `is_list_like` instead of explicitly checking for dataframes and series
* Add a bunch of tests for building a result from different types, such as scalars, lists, numpy arrays, mixtures of arrays and dicts, etc.

omit test for nested arrays since they break in Python 3.6. Would be nice to have this at some point though.

04cdc75