Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Requires incorrectly treats all luigi.Parameter as strings #1607

Closed
dlstadther opened this issue Mar 24, 2016 · 22 comments
Closed

Dynamic Requires incorrectly treats all luigi.Parameter as strings #1607

dlstadther opened this issue Mar 24, 2016 · 22 comments

Comments

@dlstadther
Copy link
Collaborator

As noted by the title, dynamic requirement Parameters are treated as strings (where the required class is expected luigi.Parameter()). DateParameter, IntParameter, etc are treated appropriately.

This issue was found when dynamically requiring an arbitrary number of redshift.S3CopyToTable statements to temporary tables. Columns and queries are passed as lists, but errors were being throw because these lists were being interpreted as strings.

I have written the following luigi code to note this error (I have noted the actual output in both cases - normal requires, as well as dynamic requires):

import datetime

import luigi
from luigi import mock

__author__ = "Dillon Stadther"
__date__ = '2016-03-24'


class TestTask(luigi.Task):
    date = luigi.DateParameter(default=datetime.date(2016, 03, 24))

    tmp_path = luigi.Parameter(config_path=dict(section='path', name='tmp_path'))
    table_path = luigi.Parameter(config_path=dict(section='path', name='table_path'))

    #def requires(self):
    #    yield TestClass(
    #        date=self.date,                # prints "2016-03-24" and "2016-03-17"
    #        string_in='Hello World',       # prints "Hello World" and "H"
    #        list_in=['Hello', 'World'],    # prints "('Hello', 'World')" and "Hello"
    #        tuple_in=('foo', 'bar'),       # prints "('foo', 'bar')" and "foo"
    #        int_in=10                      # prints "10" and "20"
    #    )

    def output(self):
        return mock.MockTarget('test_requires')

    def run(self):
        yield TestClass(
            date=self.date,                 # prints "2016-03-24" and "2016-03-17"
            string_in='Hello World',        # prints "Hello World" and "H"
            list_in=['Hello', 'World'],     # prints "('Hello', 'World')" and "("
            tuple_in=('foo', 'bar'),        # prints "('foo', 'bar')" and "("
            int_in=10                       # prints "10" and "20"
        )
        self.output().open('w').close()


class TestClass(luigi.Task):
    date = luigi.DateParameter()

    string_in = luigi.Parameter(default='')
    list_in = luigi.Parameter(default=[])
    tuple_in = luigi.Parameter(default=())
    int_in = luigi.IntParameter(default=0)

    tmp_path = luigi.Parameter(config_path=dict(section='path', name='tmp_path'))

    def output(self):
        return mock.MockTarget('test_out')

    def run(self):
        print(self.date)
        print(self.date + datetime.timedelta(days=-7))  # should print the date 7 days ago
        print(self.string_in)
        print(self.string_in[0])        # should print first character
        print(self.list_in)
        print(self.list_in[0])          # should print first element
        print(self.tuple_in)
        print(self.tuple_in[0])         # should print first element
        print(self.int_in)
        print(self.int_in * 2)          # should print double the int

        self.output().open('w').close()


if __name__ == "__main__":
    luigi.run(['TestTask', '--workers', '1'])
@erikbern
Copy link
Contributor

nice catch. is it simple to fix?

@dlstadther
Copy link
Collaborator Author

I've identified the result of the error, but not the cause of it.

I'm not sure exactly where to find the execution of dynamic requirements in Luigi Core. Can you point me in the right direction, @erikbern ?

@erikbern
Copy link
Contributor

should be somewhere in worker.py (which is a terrible mess unfortunately)

@dlstadther
Copy link
Collaborator Author

@erikbern Thanks. I'll take a look and see what i can find.

@erikbern
Copy link
Contributor

excellent!

@dlstadther
Copy link
Collaborator Author

I'm pretty lost when trying to decode what all is happening within worker.py...

I'll spend some more time trying to decipher when I return from PTO next week.

@davemt
Copy link
Contributor

davemt commented Mar 24, 2016

Check out https://github.com/spotify/luigi/blob/master/luigi/worker.py#L104

That is where you'll see the yielded dependencies are put on a multiprocessing Queue and then picked up by Worker class and loaded to be added to scheduler.

@dlstadther
Copy link
Collaborator Author

I can read through Worker.py and vaguely understand the logic, but am still lost when it comes to how this bug arises.

I see that DictParameter has been added to parameter.py. Would it be worth adding ListParameter and TupleParameter to ensure the type of Parameter being pass (and possibly bypass this dynamic requires bug), @erikbern ?

@erikbern
Copy link
Contributor

Do you have any custom parameters?

The Parameter class implements the methods parse and serialize – maybe you need to implement them

I haven't looked at the code

@erikbern
Copy link
Contributor

Looks like here it uses task.to_str_params which should serialize each parameter: https://github.com/spotify/luigi/blob/master/luigi/worker.py#L132

@dlstadther
Copy link
Collaborator Author

I do not have any customer parameters. Presently, Parameter.parse(...) just returns the value and serialize(...) casts the value to a string.

So yes, task.to_str_params calls serialize(...) which casts the list and tuple to a string. Which would then cause iteration over them to return the first character rather than the first element.

So although Parameter notes that it is an untyped Parameter, it assumes it has a StrParameter based on its serialize(...) method.

It then seems necessary to implement a ListParameter and TupleParameter.

If you believe otherwise, I'll go your route.

@davemt
Copy link
Contributor

davemt commented Mar 29, 2016

So to_str_params is called for every task when it is added to scheduler, not just dynamic deps. We may be jumping to conclusions on the issue.

@davemt
Copy link
Contributor

davemt commented Mar 29, 2016

@davemt
Copy link
Contributor

davemt commented Mar 29, 2016

With this said I would not expect passing lists to a standard Parameter object to work so I am surprised the requires() case even works...

@davemt
Copy link
Contributor

davemt commented Mar 29, 2016

Ah, dynamic dependencies have load_task called on them before they are added to worker's task cache whereas regular requires() do not. This requires that a parameter value can be properly serialized/deserialized.

@dlstadther
Copy link
Collaborator Author

So then, dynamic requires functioned as expected. The issue is that Parameter is often misused (non-string types were declared as Parameter)?

@erikbern
Copy link
Contributor

It might be good to do a warnings.warn in case serialize returns a non-string

@dlstadther
Copy link
Collaborator Author

Parameter.serialize(x) returns str(x). However, Parameter.normalize(x) and Parameter.parse(x) just returns x.

Since normalize is rarely overridden (and i'm not exactly sure what cases it's necessary), I'm inclined to leave it be.

Should parse then cast to string? Or would you prefer the warnings.warn route if x is not a string?

@dlstadther
Copy link
Collaborator Author

@erikbern , thoughts on

class Parameter(object):
    ....

    def parse(self, x):
        return str(x)

verus

class Parameter(object):
    ....

    def parse(self, x):
        if not isinstance(x, basestring) or not isinstance(x, str):
            warnings.warn("Parameter %r is not of type string." % x)
        return x

?

@erikbern
Copy link
Contributor

yes the latter looks good

@dlstadther
Copy link
Collaborator Author

As issues are solved via PRs, they need to be closed. I just came across the keywords used to reference and auto close issues. Maybe we can try to enforce this more? Keep things clean and organized.

@Tarrasch
Copy link
Contributor

I actually knew that keyword. Please start using it. I think it's hard to enforce as not everybody knows about it. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants