-
Notifications
You must be signed in to change notification settings - Fork 37
PandasDataFrameResult: Convert non-list values to single row frame #243
Conversation
Awesome thanks. Note: This will still break if you try to get multiple "scalar" values back. 🤔 It's getting late, but we can chat and see if you'd like to add these extras, or I can do it. |
Pushed a fix... let me know what you think. I guess I also could have done this by checking whether |
hamilton/base.py
Outdated
@@ -168,7 +168,12 @@ def build_result(**outputs: Dict[str, Any]) -> pd.DataFrame: | |||
return value | |||
elif isinstance(value, pd.Series): | |||
return pd.DataFrame(outputs) | |||
raise ValueError(f"Cannot build result. Cannot handle type {value}.") | |||
|
|||
if not any(isinstance(value, (pd.DataFrame, pd.Series)) for value in outputs.values()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also check for scalars explicitly here, e.g. (int, str, float, bool)
or something like that. Since this will break for valid outputs:
>>> pd.DataFrame({'foo': [1, 2], 'bar': [3, 4]})
foo bar
0 1 3
1 2 4
>>> pd.DataFrame([{'foo': [1, 2], 'bar': [3, 4]}])
foo bar
0 [1, 2] [3, 4]
Don't know how much we want to worry about this edge case, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could specifically check for list, but I think this is OK. My opinion is that the goal of this should be (a) to get it not to break and (b) to get it not to lose any information. If folks don't like the way we join it they can use the dict result or build a custom ResultBuilder
. Might be nice logging a warning though with the info so they know what to do...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm — we already do log some sort of warning:
❯ python3 my_script.py
WARNING:hamilton.base:It appears no Pandas index type was detected. This will likely break when trying to create a DataFrame. E.g. are you requesting all scalar values? Use a different result builder or return at least one Pandas object with an index. Ignore this warning if you're using DASK for now.
spend_std_dev
0 17.224014
This isn't the most intuitive though, since things now work when you request all scalar values (but may not work when you request a list).
@skrawcz said he was gonna revisit that error message; my two cents are that we could
- log nothing if all scalars are requested (since it works)
- log if lists are requested, OR do a best effort attempt to resolve lists into dataframes. Probably logging is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the general case we're dealing with is if there is no "index" for pandas to know what to do.
Test cases to cover:
- list like things (e.g. builtin types and np types and "sequence"/"generator" types)
- scalars
- other objects
and then the mixture of them, e.g. list & scalar == fine, list & list == fine (if they're the same length), scalar & scalar == fine, scalar & object == fine, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we only need to worry about cases where pd.DataFrame(outputs)
won't work... e.g. when we need to do pd.DataFrame([outputs])
instead. Yeah, there are times when pd.DataFrame(outputs)
will blow up – e.g. if you pass a combination of dicts and lists — which we aren't yet validating, but I think that's out of scope for now (not that we should never take it on). Pandas already emits pretty decent error messages in those cases. So I'm tempted to say that the code is good enough as is (pending a switch to is_list_like
) and that we should add tests for other cases but not explicitly handle them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g. pandas handles these cases perfectly fine:
>>> pd.DataFrame({'foo': {'a': 1, 'b': 2}, 'bar': {'c': 2, 'd': 3}})
foo bar
a 1.0 NaN
b 2.0 NaN
c NaN 2.0
d NaN 3.0
>>> pd.DataFrame({'foo': [3, 4], 'bar': [1, 2]})
foo bar
0 3 1
1 4 2
And it blows up in an informative way here:
>>> pd.DataFrame({'foo': [3, 4], 'bar': {'a': 1, 'b': 2}})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/frame.py", line 663, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
index = _extract_index(arrays)
File "/Users/ian.hoffman/src/hamilton/examples/hello_world/.venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 669, in _extract_index
raise ValueError(
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
Casting to a list does fix the above error, but the output isn't worth much:
>>> pd.DataFrame([{'foo': [3, 4], 'bar': {'a': 1, 'b': 2}}])
foo bar
0 [3, 4] {'a': 1, 'b': 2}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm wondering if this set of rules is just too complicated and it should break when there's no index. Might not be worth trying to intuit it...
So:
1 series/DF or more -> works
0 series -> breaks
Any more complex and we're setting ourselves up for unintuitive behavior later. Maybe the break says "You should use the dict adapter if you don't have a dataframe output"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we did that, we'd need to swap out adapters to run intermediate nodes, which seems painful for debugging. I think it's pretty reasonable to convert scalars into a single-row, multi-column DF and display them — that's helpful for end-users and not super complex.
I do think anything beyond that is overdoing it, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- I think that makes sense overall. 👍 Still not 100% sure about it being smart, but I think its not too complex (as you said). And, result builders are cheap -- so long as they don't end up dropping data its not the end of the world.
Thinking out loud: just to confirm, this is for the use-case of just running a piece of the DAG, right? E.G. cell-by-cell in your jupyter notebook/kernel example? I wonder if there's a special way to provide a spine. E.G. have something like a configuration element that the DAG-builder knows about, or something similar. Also we could have guess_spine
as a parameter to enable less strict checking...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this.
this is for the use-case of just running a piece of the DAG, right?
Yup.
I wonder if there's a special way to provide a spine. E.G. have something like a configuration element that the DAG-builder knows about, or something similar. Also we could have guess_spine as a parameter to enable less strict checking...
That could work. Alternatively my kernel could just provide its own custom result builder which does the behavior implemented in this PR.
I do think most users won't care about these sort of advanced configuration options though; they probably expect Hamilton to "just work" without any configuration. It seems to me that visualizing >=1 scalar values is a pretty legit use-case, so IMO that should "just work" without config. Whether we allow config to support more advanced use cases is a different question. I do think we should support config for advanced use cases, but I don't think it's urgent.
Actually it seems we should be able rely more on pandas here -- e.g. https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_list_like.html 🤔 |
Ah, very nice. I'll look into switching to that later. |
* Use `is_list_like` instead of explicitly checking for dataframes and series * Add a bunch of tests for building a result from different types, such as scalars, lists, numpy arrays, mixtures of arrays and dicts, etc.
…nice to have this at some point though.
* Tests for objects and scalars + dicts * Improve comment to mention objects as well as scalars
@@ -119,14 +119,26 @@ def test_SimplePythonDataFrameGraphAdapter_check_input_type_mismatch(node_type, | |||
assert actual is False | |||
|
|||
|
|||
def _gen_ints(n: int) -> typing.Iterator[int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nitty nit pick:
- typing could be
Generator
- You can use
yield from
def _gen_int(n) -> Generator[int, None, None]:
...: yield from range(n)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Iterator[T]
is a subtype of Generator[T, …]
, but I could be wrong (this is actually why I messaged you about mypy).
Point taken about using yield from
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good enough!
Great thanks for the help @ianhoffman !
Only TODO will be to update the check index warning. |
Yeah, I can do that too - just not sure what we want the behavior to be. Maybe we just shouldn’t log if all scalars/objects were provided? Since we are providing this IMO pretty reasonable output in that case. Or did you have something else in mind? |
We can do that in a follow up PR/commit. |
Following #243 we need to adjust the warning to make sense. Since we are not in danger of things breaking imminently. So the warning is just there for the user now to indicate how Pandas might be resolving indexes to create the dataframe.
* Adjusts index type check warnings Following #243 we need to adjust the warning to make sense. Since we are not in danger of things breaking imminently. So the warning is just there for the user now to indicate how Pandas might be resolving indexes to create the dataframe. * Fixes bug with showing count of no index outputs This message was just wrong with the count displayed. So fixing it to reflect how many no index outputs there actually are.
When trying to run an intermediate node which produces a scalar in the
hello_world
example, Hamilton throws an error:If we can run an entire DAG, it seems like we should be able to run any sub-DAG of the DAG.
Changes
Updates
PandasDataFrameResult.build_result()
to convert scalar values into dataframes.How I tested this
Updated unit tests.
Checklist