This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Using tables/dataframes for parameterization #196

Closed
ropeladder opened this issue Sep 14, 2022 · 19 comments

Comments

@ropeladder

Is your feature request related to a problem? Please describe.
When using Hamilton to create lots of new columns using @parameterize, an explicitly tabular approach could make things cleaner and also could allow for multi-column outputs.

Describe the solution you'd like
These aren't perfect but they should give an idea of what I'm thinking:

df = pd.DataFrame([
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a"],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b"]
],
columns=["output1", "output2", "input1", "input2"])

@parameterize_frame(df, sources=df.columns[2:], values=None, out_names=df.columns[:2])  # input names are matched based on argument names for my_func
def my_func(input1: pd.Series, input2: pd.Series) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3])

Using a table instead of a dataframe:

table = [
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a"],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b"]
]
input_types = [
    [source, source]
]

@parameterize_table(table, in_names=[2, 3], in_types=input_types, out_names=[0, 1])  # 0-indexed: columns 0-1 are outputs, 2-3 are inputs
def my_func(input1: pd.Series, input2: pd.Series) -> List[pd.Series]:
    return [input1 * 2, input2 * 3]

Describe alternatives you've considered

  • Passing in destructured dicts doesn't offer an integrated way to extract columns. It also feels redundant naming the series in @parameterize and then again in the function signature (see the sketch below).
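
For reference, a minimal sketch of the dict-based approach being described, using the existing @parameterize/source/value API; the series names here are placeholders, not from the issue:

import pandas as pd
from hamilton.function_modifiers import parameterize, source

# Each parameterization is spelled out as a dict; the input series end up named
# both inside the dict and again as arguments in the function signature.
configs = {
    "outseries1a": dict(input1=source("inseries1a"), input2=source("inseries2a")),
    "outseries1b": dict(input1=source("inseries1b"), input2=source("inseries2b")),
}

@parameterize(**configs)
def my_func(input1: pd.Series, input2: pd.Series) -> pd.Series:
    return input1 * 2 + input2 * 3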
@skrawcz
Collaborator

skrawcz commented Sep 14, 2022

@ropeladder just to confirm my understanding:

  1. is your assumption that column order maps directly to function parameter arguments?
  2. would these tables & dataframes be purpose built for input to the parameterized function? Or do you see them coming from somewhere else / being used in another context?

@ropeladder
Author

  1. In the dataframe example column order wouldn't matter because the column names match up with the arguments. In the table example it would matter.
  2. Yes, my thinking was these would be built specifically for the parameterized function.

@skrawcz
Collaborator

skrawcz commented Sep 15, 2022

@elijahbenizzy thoughts?

@elijahbenizzy
Collaborator

elijahbenizzy commented Sep 15, 2022

Seems reasonable -- want to get the naming right though so it's clear what it's doing. I'd prefer to make this simpler though -- is there a way to just pass in a single dataframe? And to clarify, how many parameterizations would the above examples create? Looks like 4? Any chance you'd mind writing down how it would look in the current world so we could compare?

Implementation should be straightforward -- I think we can just pass to the superclass. We might, however, need to do something a bit more complex (depending on whether we'll know the mapping of column names to positions, as the function is not known at instantiation, only at call time. Can work around this though.)

@ropeladder
Author

Passing in just a single dataframe was my initial thought but I couldn't figure out a clear way to differentiate input from output columns. Until just now when I realized this is possibly a good use for a multilevel column index.

I'm thinking each of the two examples above would create one parameterization per row. So, two each.

The two examples above each have two separate output columns, so they would be kind of clunky in the current @parameterize environment -- something like this:

@parameterize(
    outdf1a={"outputcols": value(['outseries1a', 'outseries2a']), "input1": source("inseries1a"), "input2": source("inseries2a"), "input3": value(5.0)},
    outdf1b={"outputcols": value(['outseries1b', 'outseries2b']), "input1": source("inseries1b"), "input2": source("inseries2b"), "input3": value(0.2)})
def my_func(outputcols: list[str], input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * input3], columns=outputcols)

@extract_columns('outseries1a', 'outseries2a')
def my_disaggregator1a(outdf1a: pd.DataFrame) -> pd.DataFrame:
    return outdf1a

@extract_columns('outseries1b', 'outseries2b')
def my_disaggregator1b(outdf1b: pd.DataFrame) -> pd.DataFrame:
    return outdf1b

Version of this using @parameterize_frame and multiindices:

df = pd.DataFrame([
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
    # ...
],
columns=[
    ["out", "out", "source", "source", "value"],  # configure whether column is source or value and also whether it's input ("source", "value") or output ("out")
    ["output1", "output2", "input1", "input2", "input3"]])  # specify column names (corresponding to function arguments and (if outputting multiple columns) output dataframe columns)

@parameterize_frame(df)
def my_func(output1: str, output2: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3], columns=[output1, output2])  # this outputs a dataframe but hamilton then automatically 'extracts' the columns as per their names in the "out" dataframe columns

With just one output column, this is what I'm imagining @parameterize_frame would look like:

df = pd.DataFrame([
    ["outseries1a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "inseries1b", "inseries2b", 0.2],
    # ...
],
columns=[
    ["out", "source", "source", "value"], # configure whether column is source or value and also whether it's input ("source", "value") or output ("out")
    ["output1", "input1", "input2", "input3"]])  # specify column names (corresponding to function arguments and (if outputting multiple columns) output dataframe columns)

@parameterize_frame(df)
def my_func(output1: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3], columns=[output1])  # if there's a single column it could maybe just return a series instead and pick up the name from the first column of the dataframe

@elijahbenizzy
Collaborator

@ropeladder OK, awesome, this makes a lot of sense to me now. Specifically, this is doing both an extract_columns and a parameterize operation at once, all specified by a dataframe. Thanks for the clarification!

Implementation should be straightforward; I want to noodle on generalizability for this.

@skrawcz
Collaborator

skrawcz commented Oct 11, 2022

@elijahbenizzy any updates on this?

@elijahbenizzy
Collaborator

@skrawcz haven't had a chance to look yet. @ropeladder would love to sync up -- happy to provide this as a recipe/think about including it in the main framework. If I implement it, would you have a chance to take it for a spin?

@ropeladder
Author

@elijahbenizzy Sure, I've got some (private) code I could try it out on and/or I'd be happy to cook up some basic examples.

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 8, 2022

Awesome @ropeladder -- sorry for the delay in response. I'll take some time in the next week to implement a prototype, then we can give it a spin/connect about it!

@elijahbenizzy
Collaborator

Alright, proved out that I could do this using only extract_columns + parameterize. Before we get this out I need to:

  1. Mull over the API
  2. Fix all the hacks
  3. Let you test it out!

But overall I quite like what this is trying to do -- it took implementing it to see the value.

Haven't tested thoroughly as I have to run but the code is quite simple... #227

@elijahbenizzy
Collaborator

Alright, so with the following example (rewrote slightly):

df = pd.DataFrame(
    [
        ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
        # ...
    ],
    # Have to switch the two levels as indices have to be unique
    columns=[
        # specify column names (corresponding to function arguments and, if outputting
        # multiple columns, output dataframe columns)
        ["output1", "output2", "input1", "input2", "input3"],
        # configure whether column is source or value and also whether it's
        # input ("source", "value") or output ("out")
        ["out", "out", "source", "source", "value"],
    ],
)

@parameterize_frame(df)
def my_func(
    output1: str, output2: str, input1: pd.Series, input2: pd.Series, input3: float
) -> pd.DataFrame:
    return pd.DataFrame(
        {output1: input1 * input2 * input3, output2: input1 + input2 + input3}
    )
Here's what the DAG looks like. The only weirdness is that I generate temporary node names for the dataframes we generate.

[DAG visualization image]

I'm pretty happy with this as an implementation, but I'd like to suggest reframing the API. What we're trying to do is combine @parameterize and @extract_columns in a way that avoids repetition. So, what about something like this:

@parameterize_extract(
    extract_type='dataframe', #could be dict? We also have others...
    extract_mapping={
        ('outseries1a', 'outseries1b') : dict(input1=source('inseries1a'), input2= source('inseries2a'), input3=value(5.0)),
        ('outseries2a', 'outseries2b') : dict(input1=source('inseries1b'), input2= source('inseries2b'), input3=value(5.0)),
})
def my_func(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame(
        [input1 * input2 * input3, input1 + input2 + input3]
    )

Note the above could easily be done with a df as well. Primary differences are:

(1) abstraction to allow for different extract types -- tbd on what this would look like...
(2) you can actually have different numbers of columns per func (should we though?)
(3) we use the order of the columns in the decorator to avoid rewriting pieces
(4) we re-use the same functions

Thoughts?

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 16, 2022

And an update -- I've implemented something similar to the above, still figuring out an edge case or two:

@parameterize_extract(
    ParameterizedExtract(('outseries1a', 'outseries2a'), {'input1': source('inseries1a'), 'input2': source('inseries1b'), 'input3': value(1.6)}),
    ParameterizedExtract(('outseries1b', 'outseries2b'), {'input1': source('inseries2a'), 'input2': source('inseries2b'), 'input3': value(0.2)})
)
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)

@ropeladder
Author

I can see why the latest example is nice, but I'm imagining it in the context of the PoC I was trying out, where I would end up putting in 7 ParameterizedExtract lines -- which means repeating ParameterizedExtract(), the input column names, and also the input type (source/value) in a fairly verbose way.

Instead, if the interface worked like that, I would probably rewrite it as something like this:

data_names = [
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
    ["outseries1c", "outseries2c", "inseries1c", "inseries2c", 0.2],
    ["outseries1d", "outseries2d", "inseries1d", "inseries2d", 0.2],
    ["outseries1e", "outseries2e", "inseries1e", "inseries2e", 0.2],
    ["outseries1f", "outseries2f", "inseries1f", "inseries2f", 0.2],
    ["outseries1g", "outseries2g", "inseries1g", "inseries2g", 0.2]
]
@parameterize_extract(*[
    ParameterizedExtract((row[0], row[1]), {'input1': source(row[2]), 'input2': source(row[3]), 'input3': value(row[4])})
    for row in data_names
])
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)

There may be a nicer way to do this I'm missing, but the issue here is that it still requires a fairly awkward 'translation' layer between the data definition and the function definition signature. E.g. if I'm reading through these series definitions it feels like a lot of extra overhead to think through, as compared to having a dataframe I can view directly with relevant labels.

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 16, 2022

OK, so I have some code that you can play around with if you want! It has both implementations (the dataframe one builds on the other).

See PR #227. To install, you can:

  1. Git clone
  2. pip install -e . (if you're using pip)
  3. Import it from hamilton.experimental.decorators (minimal sketch below)
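
A minimal sketch of those steps, assuming the branch from PR #227; the checkout commands are illustrative, and the import names are the ones used earlier in this thread:

# git clone the Hamilton repository, then check out the PR #227 branch, e.g.:
#   git fetch origin pull/227/head && git checkout FETCH_HEAD
#   pip install -e .
import pandas as pd
from hamilton.experimental.decorators import ParameterizedExtract, parameterize_extract, parameterize_frame
from hamilton.function_modifiers import source, value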

Definitely see your point -- what I have is overly verbose. I guess my concern about the dataframe API is that it takes a lot to figure out what's happening. We can support both, but I wonder if there's a best of both worlds. It would also be nice to reuse notions from extract or parameterize. The following is shorter but not nearly as concise as yours....

@parameterize_extract(
    (("outseries1a", "outseries2a"), {"input1": source("inseries1a"), "input2": source("inseries2a"), "input3": value(5.0)}),
    (("outseries1b", "outseries2b"), {"input1": source("inseries1b"), "input2": source("inseries2b"), "input3": value(0.2)}),
)
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)

But I guess the question is -- do you want the source/value to be the same across all parameterizations? Will they all output the same number of columns? Or would one be a source and the other a value? Going to noodle, but might be able to support both/simplify the one I have.

@ropeladder
Author

Just tried out @parameterize_extract and @parameterize_frame on my main use case, they both work great. I was confused by @parameterize_frame as currently written because it required the output columns to be in the function signature, but once I fixed that it was good.

For my use case, source and value were the same across parameterizations... and I guess I'm having trouble imagining how they wouldn't be without a fair bit of logic in the function itself, since then the function has to handle getting either a series or a value.
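
To illustrate that point, a hypothetical example (not from the thread) of the branching the function body would need if, say, input3 were a source in one parameterization and a value in another:

from typing import Union

import pandas as pd

def my_func(input1: pd.Series, input2: pd.Series, input3: Union[pd.Series, float]) -> pd.DataFrame:
    # The same argument can now arrive as either a series or a scalar, so the
    # function has to branch on the type before doing anything useful.
    scale = input3 if isinstance(input3, pd.Series) else float(input3)
    return pd.concat([input1 * input2 * scale, input1 + input2 + scale], axis=1)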

More broadly, I see your point that @parameterize_frame requires a lot of contextual knowledge to understand what's going on (in particular, what the column indices mean and which one is which). My first examples are a bit better in that regard since they make the column identities a bit more explicit. Something like this might be a good compromise:

values = [
    ["outseries1a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "inseries1b", "inseries2b", 0.2],
    # ...
]
column_types = ["out", "source", "source", "value"]
column_varnames = ["output1", "input1", "input2", "input3"]

@parameterize_frame(values, types=column_types, names=column_varnames)
def my_func(output1: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3], columns=[output1])

(...and maybe you could still pass in the full multi-indexed dataframe instead of the types and names arguments if you wanted.)
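
For comparison, a sketch of what that full multi-indexed dataframe could look like, reusing the layout from the earlier @parameterize_frame examples (illustrative only):

import pandas as pd

# Same information as values / column_types / column_varnames above, folded into one
# dataframe with a two-level column index.
df = pd.DataFrame(
    [
        ["outseries1a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "inseries1b", "inseries2b", 0.2],
        # ...
    ],
    columns=pd.MultiIndex.from_arrays([
        ["output1", "input1", "input2", "input3"],  # variable names (function arguments / output columns)
        ["out", "source", "source", "value"],       # whether each column is an output, a source, or a value
    ]),
)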

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 27, 2022

(quoting @ropeladder's comment above)

Awesome, yeah, I think that makes a lot of sense. Will polish -- I like having two APIs as well (one if you need each parameterization to be fully spelled out, the other if you want every parameterization to have the same structure).

I also think we can remove the output columns from the function signatures, as I do with parameterize_extract.

I'm going to polish and get out tomorrow in an RC version!

elijahbenizzy added a commit that referenced this issue Nov 28, 2022
This provides a convenience API for #196. The idea is that people want to do a parameterization and an extract-columns operation at once -- this should be easy. The cool thing is that this just uses the parameterize and extract APIs. It also has a from_df() function to allow for passing in a dataframe to be more concise.
@skrawcz
Collaborator

skrawcz commented Dec 6, 2022

@ropeladder any feedback on the release candidate? See https://hamilton-opensource.slack.com/archives/C03AJNGDGQL/p1669601959199269 for details.

@skrawcz
Collaborator

skrawcz commented Feb 9, 2023

This has been pushed in the 1.16.0 release. See docs:

@skrawcz skrawcz closed this as completed Feb 9, 2023