This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Using tables/dataframes for parameterization #196

Closed
ropeladder opened this issue Sep 14, 2022 · 19 comments

Comments

@ropeladder

Is your feature request related to a problem? Please describe.
When using Hamilton to create lots of new columns using @parameterize, an explicitly tabular approach could make things cleaner and also could allow for multi-column outputs.

Describe the solution you'd like
These aren't perfect but they should give an idea of what I'm thinking:

df = pd.DataFrame([
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a"],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b"]
],
columns=["output1", "output2", "input1", "input2"])

@parameterize_frame(df, sources=df.columns[2:], values=None, out_names=df.columns[:2])  # input names are matched based on argument names for my_func
def my_func(input1: pd.Series, input2: pd.Series) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3])

Using a table instead of a dataframe:

table = [
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a"],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b"]
]
input_types = [
    [source, source]
]

@parameterize_table(table, in_names=[2, 3], in_types=input_types, out_names=[0, 1])  # 0-indexed: columns 0-1 are outputs, 2-3 are inputs
def my_func(input1: pd.Series, input2: pd.Series) -> List[pd.Series]:
    return [input1 * 2, input2 * 3]

Describe alternatives you've considered

  • Passing in destructured dicts doesn't offer an integrated way to extract columns. It also feels redundant naming the series in @parameterize and then again in the function signature (see the sketch below).
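
For reference, a minimal sketch of the dict-based approach being described, using the existing @parameterize/source/value API; the series names here are placeholders, not from the issue:

import pandas as pd
from hamilton.function_modifiers import parameterize, source

# Each parameterization is spelled out as a dict; the input series end up named
# both inside the dict and again as arguments in the function signature.
configs = {
    "outseries1a": dict(input1=source("inseries1a"), input2=source("inseries2a")),
    "outseries1b": dict(input1=source("inseries1b"), input2=source("inseries2b")),
}

@parameterize(**configs)
def my_func(input1: pd.Series, input2: pd.Series) -> pd.Series:
    return input1 * 2 + input2 * 3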
@skrawcz
Collaborator

skrawcz commented Sep 14, 2022

@ropeladder just to confirm my understanding:

  1. is your assumption that column order maps directly to function parameter arguments?
  2. would these tables & dataframes be purpose built for input to the parameterized function? Or do you see them coming from somewhere else / being used in another context?

@ropeladder
Author

  1. In the dataframe example column order wouldn't matter because the column names match up with the arguments. In the table example it would matter.
  2. Yes, my thinking was these would be built specifically for the parameterized function.

@skrawcz
Collaborator

skrawcz commented Sep 15, 2022

@elijahbenizzy thoughts?

@elijahbenizzy
Collaborator

elijahbenizzy commented Sep 15, 2022

Seems reasonable -- want to get the naming right though so it's clear what it's doing. I'd prefer to make this simpler though -- is there a way to just pass in a single dataframe? And to clarify, how many parameterizations would the above examples create? Looks like 4? Any chance you'd mind writing down how it would look in the current world so we could compare?

Implementation should be straightforward -- I think we can just pass to the superclass. We might, however, need to do something a bit more complex (depending on whether we'll know the mapping of column names to positions, as the function is not known at instantiation, only at call time. Can work around this though.)

@ropeladder
Author

Passing in just a single dataframe was my initial thought but I couldn't figure out a clear way to differentiate input from output columns. Until just now when I realized this is possibly a good use for a multilevel column index.

I'm thinking each of the two examples above would create one parameterization per row. So, two each.

The two examples above each have two separate output columns, so they would be kind of clunky in the current @parameterize environment -- something like this:

@parameterize(
    outdf1a={"outputcols": value(['outseries1a', 'outseries2a']), "input1": source("inseries1a"), "input2": source("inseries2a"), "input3": value(5.0)},
    outdf1b={"outputcols": value(['outseries1b', 'outseries2b']), "input1": source("inseries1b"), "input2": source("inseries2b"), "input3": value(0.2)})
def my_func(outputcols: list[str], input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * input3], columns=outputcols)

@extract_columns('outseries1a', 'outseries2a')
def my_disaggregator1a(outdf1a: pd.DataFrame) -> pd.DataFrame:
    return outdf1a

@extract_columns('outseries1b', 'outseries2b')
def my_disaggregator1b(outdf1b: pd.DataFrame) -> pd.DataFrame:
    return outdf1b

Version of this using @parameterize_frame and multiindices:

df = pd.DataFrame([
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
    # ...
],
columns=[
    ["out", "out", "source", "source", "value"],  # configure whether column is source or value and also whether it's input ("source", "value") or output ("out")
    ["output1", "output2", "input1", "input2", "input3"]])  # specify column names (corresponding to function arguments and (if outputting multiple columns) output dataframe columns)

@parameterize_frame(df)
def my_func(output1: str, output2: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3], columns=[output1, output2])  # this outputs a dataframe but hamilton then automatically 'extracts' the columns as per their names in the "out" dataframe columns

With just one output column, this is what I'm imagining @parameterize_frame would look like:

df = pd.DataFrame([
    ["outseries1a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "inseries1b", "inseries2b", 0.2],
    # ...
],
columns=[
    ["out", "source", "source", "value"], # configure whether column is source or value and also whether it's input ("source", "value") or output ("out")
    ["output1", "input1", "input2", "input3"]])  # specify column names (corresponding to function arguments and (if outputting multiple columns) output dataframe columns)

@parameterize_frame(df)
def my_func(output1: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3], columns=[output1])  # if there's a single column it could maybe just return a series instead and pick up the name from the first column of the dataframe

@elijahbenizzy
Collaborator

@ropeladder OK, awesome, this makes a lot of sense to me now. Specifically, this is doing both an extract_columns and a parameterize operation at once, all specified by a dataframe. Thanks for the clarification!

Implementation should be straightforward; I want to noodle on generalizability for this.

@skrawcz
Collaborator

skrawcz commented Oct 11, 2022

@elijahbenizzy any updates on this?

@elijahbenizzy
Collaborator

@skrawcz haven't had a chance to look yet. @ropeladder would love to sync up -- happy to provide this as a recipe/think about including it in the main framework. If I implement it, would you have a chance to take it for a spin?

@ropeladder
Author

@elijahbenizzy Sure, I've got some (private) code I could try it out on and/or I'd be happy to cook up some basic examples.

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 8, 2022

Awesome @ropeladder -- sorry for the delay in response. I'll take some time in the next week to implement a prototype, then we can give it a spin/connect about it!

@elijahbenizzy
Collaborator

Alright, proved out that I could do this using only extract_columns + parameterize. Before we get this out I need to:

  1. Mull over the API
  2. Fix all the hacks
  3. Let you test it out!

But overall I quite like what this is trying to do -- it took implementing it to see the value.

Haven't tested thoroughly as I have to run but the code is quite simple... #227

@elijahbenizzy
Collaborator

Alright, so with the following example (rewrote slightly):

df = pd.DataFrame(
    [
        ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
        # ...
    ],
    # Have to switch the two levels as indices have to be unique
    columns=[
        # specify column names (corresponding to function arguments and, if outputting
        # multiple columns, output dataframe columns)
        ["output1", "output2", "input1", "input2", "input3"],
        # configure whether column is source or value and also whether it's
        # input ("source", "value") or output ("out")
        ["out", "out", "source", "source", "value"],
    ],
)

@parameterize_frame(df)
def my_func(
    output1: str, output2: str, input1: pd.Series, input2: pd.Series, input3: float
) -> pd.DataFrame:
    return pd.DataFrame(
        {output1: input1 * input2 * input3, output2: input1 + input2 + input3}
    )
Here's what the DAG looks like. The only weirdness is that I generate temporary node names for the dataframes we generate.

[DAG visualization image]

I'm pretty happy with this as an implementation, but I'd like to suggest reframing the API. What we're trying to do is combine @parameterize and @extract_columns in a way that avoids repetition. So, what about something like this:

@parameterize_extract(
    extract_type='dataframe', #could be dict? We also have others...
    extract_mapping={
        ('outseries1a', 'outseries1b') : dict(input1=source('inseries1a'), input2= source('inseries2a'), input3=value(5.0)),
        ('outseries2a', 'outseries2b') : dict(input1=source('inseries1b'), input2= source('inseries2b'), input3=value(5.0)),
})
def my_func(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame(
        [input1 * input2 * input3, input1 + input2 + input3]
    )

Note the above could easily be done with a df as well. Primary differences are:

(1) abstraction to allow for different extract types -- tbd on what this would look like...
(2) you can actually have different numbers of columns per func (should we though?)
(3) we use the order of the columns in the decorator to avoid rewriting pieces
(4) we re-use the same functions

Thoughts?

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 16, 2022

And an update -- I've implemented something similar to the above, still figuring out an edge case or two:

@parameterize_extract(
    ParameterizedExtract(('outseries1a', 'outseries2a'), {'input1': source('inseries1a'), 'input2': source('inseries1b'), 'input3': value(1.6)}),
    ParameterizedExtract(('outseries1b', 'outseries2b'), {'input1': source('inseries2a'), 'input2': source('inseries2b'), 'input3': value(0.2)})
)
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)

@ropeladder
Author

I can see why the latest example is nice, but I'm imagining it in the context of the PoC I was trying out, where I would end up putting in 7 ParameterizedExtract lines -- which means repeating ParameterizedExtract(), the input column names, and also the input type (source/value) in a fairly verbose way.

Instead, if the interface worked like that, I would probably rewrite it as something like this:

data_names = [
    ["outseries1a", "outseries2a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "outseries2b", "inseries1b", "inseries2b", 0.2],
    ["outseries1c", "outseries2c", "inseries1c", "inseries2c", 0.2],
    ["outseries1d", "outseries2d", "inseries1d", "inseries2d", 0.2],
    ["outseries1e", "outseries2e", "inseries1e", "inseries2e", 0.2],
    ["outseries1f", "outseries2f", "inseries1f", "inseries2f", 0.2],
    ["outseries1g", "outseries2g", "inseries1g", "inseries2g", 0.2]
]
@parameterize_extract(*[
    ParameterizedExtract((row[0], row[1]), {'input1': source(row[2]), 'input2': source(row[3]), 'input3': value(row[4])})
    for row in data_names
])
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)

There may be a nicer way to do this I'm missing, but the issue here is that it still requires a fairly awkward 'translation' layer between the data definition and the function definition signature. E.g. if I'm reading through these series definitions it feels like a lot of extra overhead to think through, as compared to having a dataframe I can view directly with relevant labels.

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 16, 2022

OK, so I have some code that you can play around with if you want! It has both implementations (the dataframe one builds on the other).

See PR #227. To install, you can:

  1. Git clone
  2. pip install -e . (if you're using pip)
  3. Import it from hamilton.experimental.decorators (minimal sketch below)
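
A minimal sketch of those steps, assuming the branch from PR #227; the checkout commands are illustrative, and the import names are the ones used earlier in this thread:

# git clone the Hamilton repository, then check out the PR #227 branch, e.g.:
#   git fetch origin pull/227/head && git checkout FETCH_HEAD
#   pip install -e .
import pandas as pd
from hamilton.experimental.decorators import ParameterizedExtract, parameterize_extract, parameterize_frame
from hamilton.function_modifiers import source, value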

Definitely see your point -- what I have is overly verbose. I guess my concern about the dataframe API is that it takes a lot to figure out what's happening. We can support both, but I wonder if there's a best of both worlds. It would also be nice to reuse notions from extract or parameterize. The following is shorter but not nearly as concise as yours....

@parameterize_extract(
    (("outseries1a", "outseries2a"), {"input1": source("inseries1a"), "input2": source("inseries2a"), "input3": value(5.0)}),
    (("outseries1b", "outseries2b"), {"input1": source("inseries1b"), "input2": source("inseries2b"), "input3": value(0.2)}),
)
def my_func_parameterized_extract(input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.concat([input1 * input2 * input3, input1 + input2 + input3], axis=1)

But I guess the question is -- do you want the source/value to be the same across all parameterizations? Will they all output the same number of columns? Or would one be a source and the other a value? Going to noodle, but might be able to support both/simplify the one I have.

@ropeladder
Author

Just tried out @parameterize_extract and @parameterize_frame on my main use case, they both work great. I was confused by @parameterize_frame as currently written because it required the output columns to be in the function signature, but once I fixed that it was good.

For my use case, source and value were the same across parameterizations... and I guess I'm having trouble imagining how they wouldn't be without a fair bit of logic in the function itself, since then the function has to handle getting either a series or a value.
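
To illustrate that point, a hypothetical example (not from the thread) of the branching the function body would need if, say, input3 were a source in one parameterization and a value in another:

from typing import Union

import pandas as pd

def my_func(input1: pd.Series, input2: pd.Series, input3: Union[pd.Series, float]) -> pd.DataFrame:
    # The same argument can now arrive as either a series or a scalar, so the
    # function has to branch on the type before doing anything useful.
    scale = input3 if isinstance(input3, pd.Series) else float(input3)
    return pd.concat([input1 * input2 * scale, input1 + input2 + scale], axis=1)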

More broadly, I see your point that @parameterize_frame requires a lot of contextual knowledge to understand what's going on (in particular, what the column indices mean and which one is which). My first examples are a bit better in that regard since they make the column identities a bit more explicit. Something like this might be a good compromise:

values = [
    ["outseries1a", "inseries1a", "inseries2a", 5.0],
    ["outseries1b", "inseries1b", "inseries2b", 0.2],
    # ...
]
column_types = ["out", "source", "source", "value"]
column_varnames = ["output1", "input1", "input2", "input3"]

@parameterize_frame(values, types=column_types, names=column_varnames)
def my_func(output1: str, input1: pd.Series, input2: pd.Series, input3: float) -> pd.DataFrame:
    return pd.DataFrame([input1 * 2, input2 * 3], columns=[output1])

(...and maybe you could still pass in the full multi-indexed dataframe instead of the types and names arguments if you wanted.)
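
For comparison, a sketch of what that full multi-indexed dataframe could look like, reusing the layout from the earlier @parameterize_frame examples (illustrative only):

import pandas as pd

# Same information as values / column_types / column_varnames above, folded into one
# dataframe with a two-level column index.
df = pd.DataFrame(
    [
        ["outseries1a", "inseries1a", "inseries2a", 5.0],
        ["outseries1b", "inseries1b", "inseries2b", 0.2],
        # ...
    ],
    columns=pd.MultiIndex.from_arrays([
        ["output1", "input1", "input2", "input3"],  # variable names (function arguments / output columns)
        ["out", "source", "source", "value"],       # whether each column is an output, a source, or a value
    ]),
)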

@elijahbenizzy
Collaborator

elijahbenizzy commented Nov 27, 2022

(quoting @ropeladder's comment above)

Awesome, yeah, I think that makes a lot of sense. Will polish -- I like having two APIs as well (one if you need each parameterization to be fully spelled out, the other if you want every parameterization to have the same structure).

I also think we can remove the output columns from the function signatures, as I do with parameterize_extract.

I'm going to polish and get out tomorrow in an RC version!

elijahbenizzy added a commit that referenced this issue Nov 28, 2022
This provides a convenience API for #196. The idea is that people want to do a parameterization and an extract-columns operation at once -- this should be easy. The cool thing is that this just uses the parameterize and extract APIs. It also has a from_df() function to allow for passing in a dataframe to be more concise.
@skrawcz
Collaborator

skrawcz commented Dec 6, 2022

@ropeladder any feedback on the release candidate? See https://hamilton-opensource.slack.com/archives/C03AJNGDGQL/p1669601959199269 for details.

@skrawcz
Collaborator

skrawcz commented Feb 9, 2023

This has been pushed in the 1.16.0 release. See docs:

@skrawcz skrawcz closed this as completed Feb 9, 2023