Using tables/dataframes for parameterization #196
@ropeladder just to confirm my understanding:

@elijahbenizzy thoughts?
Seems reasonable -- want to get the naming right though, so it's clear what it's doing. I'd prefer to make this simpler though -- is there a way to just pass in a single dataframe? And to clarify, how many parameterizations would the above examples create? Looks like 4? Any chance you'd mind writing down how it would look in the current world so we could compare? Implementation should be straightforward -- I think we can just pass to the superclass. We might, however, need to do something a bit more complex (depending on whether we'll know the mapping of column names to positions, as the function is not known on instantiation, only on call. Can work around this though.)
Passing in just a single dataframe was my initial thought, but I couldn't figure out a clear way to differentiate input from output columns -- until just now, when I realized this is possibly a good use for a multilevel column index. I'm thinking each of the two examples above would create one parameterization per row, so two each. The two examples above each have two separate output columns, so they would be kind of clunky in the current world:

```python
@parameterize(
    outdf1a={"outputcols": value(["outseries1a", "outseries2a"]), "input1": source("inseries1a"), "input2": source("inseries2a"), "input3": value(5.0)},
    outdf1b={"outputcols": value(["outseries1b", "outseries2b"]), "input1": source("inseries1b"), "input2": source("inseries2b"), "input3": value(0.2)},
)
def my_func(outputcols: list[str], input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    df = pd.concat([input1 * 2, input2 * input3], axis=1)
    df.columns = outputcols
    return df


@extract_columns("outseries1a", "outseries2a")
def my_disaggregator1a(outdf1a: pd.DataFrame) -> pd.DataFrame:
    return outdf1a


@extract_columns("outseries1b", "outseries2b")
def my_disaggregator1b(outdf1b: pd.DataFrame) -> pd.DataFrame:
    return outdf1b
```

Version of this using a dataframe:

```python
df = pd.DataFrame(
    [
        ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
        # ...
    ],
    columns=[
        # first level configures whether the column is an output ("out") or an
        # input, and for inputs whether it's a "source" or a "value"
        ["out", "out", "source", "source", "value"],
        # second level specifies column names (corresponding to function
        # arguments and, if outputting multiple columns, output dataframe columns)
        ["output1", "output2", "input1", "input2", "input3"],
    ],
)


@parameterize_frame(df)
def my_func(output1: str, output2: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    # this outputs a dataframe, but Hamilton then automatically "extracts" the
    # columns as per their names in the "out" dataframe columns
    return pd.DataFrame({output1: input1 * 2, output2: input2 * 3})
```

With just one output column, this is what I'm imagining:

```python
df = pd.DataFrame(
    [
        ["outseries1a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "inseries1b", "inseries2b", 0.2],
        # ...
    ],
    columns=[
        ["out", "source", "source", "value"],
        ["output1", "input1", "input2", "input3"],
    ],
)


@parameterize_frame(df)
def my_func(output1: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    # if there's a single column it could maybe just return a series instead
    # and pick up the name from the first column of the dataframe
    return pd.DataFrame({output1: input1 * 2 + input2 * 3})
```
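As a hedged sketch of how such a two-level column index could be interpreted (not actual Hamilton code -- the roles and parsing logic here are assumptions drawn from the examples above), each row can be walked to recover the output names and the per-argument input bindings:

```python
import pandas as pd

# Hypothetical sketch: split a two-level-column dataframe into one
# parameterization per row. The first column level is the role
# ("out"/"source"/"value"); the second is the argument/output name.
df = pd.DataFrame(
    [
        ["outseries1a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "inseries1b", "inseries2b", 0.2],
    ],
    columns=pd.MultiIndex.from_arrays(
        [
            ["out", "source", "source", "value"],       # role of each column
            ["output1", "input1", "input2", "input3"],  # argument/output name
        ]
    ),
)

parameterizations = []
for _, row in df.iterrows():
    # output column names this row produces
    outputs = {name: row[(role, name)] for role, name in df.columns if role == "out"}
    # input bindings, tagged with whether they are upstream sources or literals
    inputs = {name: (role, row[(role, name)]) for role, name in df.columns if role != "out"}
    parameterizations.append((outputs, inputs))  # one parameterization per row
```

Each tuple in `parameterizations` then carries everything a decorator would need to generate one node.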
@ropeladder OK, awesome, this makes a lot of sense to me now. Specifically, this is doing both a parameterization and a column extraction at once. Implementation should be straightforward; want to noodle on generalizability for this.
@elijahbenizzy any updates on this?
@skrawcz haven't had a chance to look yet. @ropeladder would love to sync up -- happy to provide this as a recipe/think about including it in the main framework. If I implement it, would you have a chance to take it for a spin?
@elijahbenizzy Sure, I've got some (private) code I could try it out on and/or I'd be happy to cook up some basic examples.
Awesome @ropeladder -- sorry for the delay in response. I'll take some time in the next week to implement a prototype, then we can give it a spin/connect about it!
Alright, proved out that I could do this using only the existing parameterize and extract APIs.
But overall I quite like what this is trying to do -- it took implementing it to see the value. Haven't tested thoroughly as I have to run, but the code is quite simple... #227
Alright, so with the following example (rewrote slightly):

```python
df = pd.DataFrame(
    [
        ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
        # ...
    ],
    # Had to switch the level order, as indices have to be unique.
    # First level: column names (corresponding to function arguments and,
    # if outputting multiple columns, output dataframe columns).
    # Second level: whether the column is an output ("out") or an input,
    # and for inputs whether it's a "source" or a "value".
    columns=[
        ["output1", "output2", "input1", "input2", "input3"],
        ["out", "out", "source", "source", "value"],
    ],
)


@parameterize_frame(df)
def my_func(
    output1: str, output2: str, input1: pd.Series, input2: pd.Series, input3: float
) -> pd.DataFrame:
    return pd.DataFrame(
        {output1: input1 * input2 * input3, output2: input1 + input2 + input3}
    )
```

Here's what the DAG looks like. Only weirdness is I generate temporary node names for the dataframes we generate. I'm pretty happy with this as an implementation, but I'd like to suggest reframing the API. What we're trying to do is combine parameterization with column extraction:

```python
@parameterize_extract(
    extract_type="dataframe",  # could be dict? We also have others...
    extract_mapping={
        ("outseries1a", "outseries1b"): dict(input1=source("inseries1a"), input2=source("inseries2a"), input3=value(5.0)),
        ("outseries2a", "outseries2b"): dict(input1=source("inseries1b"), input2=source("inseries2b"), input3=value(5.0)),
    },
)
def my_func(input1, input2, input3) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)
```

Note the above could easily be done with a df as well. Primary differences are: (1) abstraction to allow for different extract types -- tbd on what this would look like... Thoughts?
And an update -- I've implemented something similar to the above, still figuring out an edge case or two:

```python
@parameterize_extract(
    ParameterizedExtract(("outseries1a", "outseries2a"), {"input1": source("inseries1a"), "input2": source("inseries1b"), "input3": value(1.6)}),
    ParameterizedExtract(("outseries1b", "outseries2b"), {"input1": source("inseries2a"), "input2": source("inseries2b"), "input3": value(0.2)}),
)
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)
```
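To make the semantics concrete, here's a hedged sketch (not the actual Hamilton internals) of what an expansion like the one above could do: run the function once per `ParameterizedExtract` and rename the result's columns to the declared output names. Instead of Hamilton's `source()`/`value()` helpers, plain tagged tuples stand in for the bindings:

```python
from dataclasses import dataclass

import pandas as pd


# Hypothetical stand-in for Hamilton's ParameterizedExtract.
@dataclass
class ParameterizedExtract:
    outputs: tuple       # names for the columns of the returned dataframe
    input_mapping: dict  # arg name -> ("source", node_name) or ("value", literal)


def simulate_parameterize_extract(fn, extracts, available):
    """Run fn once per extract, renaming each result's columns to the
    declared output names so each column could become its own node."""
    columns = {}
    for ext in extracts:
        kwargs = {
            arg: available[ref] if kind == "source" else ref
            for arg, (kind, ref) in ext.input_mapping.items()
        }
        frame = fn(**kwargs)
        frame.columns = list(ext.outputs)
        for name in ext.outputs:
            columns[name] = frame[name]
    return columns


def my_func(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)


available = {"inseries1a": pd.Series([1.0, 2.0]), "inseries1b": pd.Series([3.0, 4.0])}
extracted = simulate_parameterize_extract(
    my_func,
    [
        ParameterizedExtract(
            ("outseries1a", "outseries2a"),
            {"input1": ("source", "inseries1a"), "input2": ("source", "inseries1b"), "input3": ("value", 2.0)},
        )
    ],
    available,
)
# extracted maps "outseries1a"/"outseries2a" to their computed series
```

This is only meant to illustrate the shape of the expansion; the real decorator builds DAG nodes rather than eagerly computing values.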
I can see why the latest example is nice, but I'm imagining it in the context of the PoC I was trying out, where I would end up putting in 7 ParameterizedExtract lines -- which means repeating ParameterizedExtract(), the input column names, and also the input type (source/value) in a fairly verbose way. Instead, if the interface worked like that, I would probably rewrite it as something like this:

```python
data_names = [
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
    ["outseries1c", "outseries2c", "inseries1c", "inseries2c", 0.2],
    ["outseries1d", "outseries2d", "inseries1d", "inseries2d", 0.2],
    ["outseries1e", "outseries2e", "inseries1e", "inseries2e", 0.2],
    ["outseries1f", "outseries2f", "inseries1f", "inseries2f", 0.2],
    ["outseries1g", "outseries2g", "inseries1g", "inseries2g", 0.2],
]


@parameterize_extract(
    *[
        ParameterizedExtract(
            (row[0], row[1]),
            {"input1": source(row[2]), "input2": source(row[3]), "input3": value(row[4])},
        )
        for row in data_names
    ]
)
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)
```

There may be a nicer way to do this I'm missing, but the issue here is that it still requires a fairly awkward 'translation' layer between the data definition and the function signature. E.g. if I'm reading through these series definitions, it feels like a lot of extra overhead to think through, compared to having a dataframe I can view directly with relevant labels.
OK, so I have some code that you can play around with if you want! It has both implementations (dataframes uses the other). See PR #227. To install, you can:
Definitely see your point -- what I have is overly verbose. I guess my concern about the dataframe API is that it takes a lot to figure out what's happening. We can support both, but I wonder if there's a best of both worlds. It would also be nice to reuse notions from the existing decorators:

```python
@parameterize_extract(
    (("outseries1a", "outseries2a"), {"input1": source("inseries1a"), "input2": source("inseries2a"), "input3": value(5.0)}),
    (("outseries1b", "outseries2b"), {"input1": source("inseries1b"), "input2": source("inseries2b"), "input3": value(0.2)}),
)
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)
```

But I guess the question is -- do you want the
Just tried it out. For my use case, source and value were the same across parameterizations... and I guess I'm having trouble imagining how they wouldn't be without a fair bit of logic in the function itself, since then the function has to handle getting either a series or a value. More broadly, I see your point that the dataframe API takes effort to read. Maybe splitting it out would help:

```python
values = [
    ["outseries1a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "inseries1b", "inseries2b", 0.2],
    # ...
]
column_types = ["out", "source", "source", "value"]
column_varnames = ["output1", "input1", "input2", "input3"]


@parameterize_frame(values, types=column_types, names=column_varnames)
def my_func(output1: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame({output1: input1 * 2 + input2 * 3})
```

(...and maybe you could still pass in the full multi-indexed dataframe instead of the separate lists.)
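The two spellings are interchangeable: the three flat pieces above are just a deconstructed MultiIndex, so a decorator accepting them could rebuild the same two-level-column frame internally. A minimal sketch (variable names are assumptions carried over from the example):

```python
import pandas as pd

# The flat lists from the example above...
values = [
    ["outseries1a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "inseries1b", "inseries2b", 0.2],
]
column_types = ["out", "source", "source", "value"]
column_varnames = ["output1", "input1", "input2", "input3"]

# ...can be zipped back into the multi-indexed dataframe form.
frame = pd.DataFrame(
    values,
    columns=pd.MultiIndex.from_arrays([column_varnames, column_types]),
)
```

So supporting both APIs is mostly a question of which construction the user finds more readable, not of implementation cost.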
Awesome, yeah, I think that makes a lot of sense. Will polish -- I like having two APIs as well (one if you need each parameterization to be fully customized, the other if you want each parameterization to have the same structure). Also think we can remove them from the function signatures, as I do with the current implementation. I'm going to polish and get it out tomorrow in an RC version!
This provides a convenience API for #196. The idea is that people want to do a parameterization and extract columns operator at once -- this should be easy. The cool thing is that this just uses the parameterize and extract APIs. It also has a from_df() function to allow for passing in a dataframe to be more concise.

@ropeladder any feedback on the release candidate? See https://hamilton-opensource.slack.com/archives/C03AJNGDGQL/p1669601959199269 for details.
This has been pushed in the 1.16.0 release. See docs:
Is your feature request related to a problem? Please describe.
When using Hamilton to create lots of new columns using @parameterize, an explicitly tabular approach could make things cleaner and also could allow for multi-column outputs.

Describe the solution you'd like
These aren't perfect but they should give an idea of what I'm thinking. Using a table instead of a dataframe:

Describe alternatives you've considered
Sticking with @parameterize, which requires specifying names in @parameterize and then again in the function signature.