
SDK - Added support for raw input artifact argument values to ContainerOp #791

Conversation

@Ark-kun (Contributor) commented Feb 7, 2019

Only constant string artifact arguments are supported for now (no artifact passing yet).

Usage:

def consumer_op(text):
    return dsl.ContainerOp(
        name='consumer',
        image='alpine',
        command=['cat', dsl.InputArtifactArgument(text)],
    )

or

def consumer_op(text):
    return dsl.ContainerOp(
        name='consumer',
        image='alpine',
        command=['cat', '/tmp/inputs/text/data'],
        input_artifact_arguments=[
            dsl.InputArtifactArgument(argument=text, path='/tmp/inputs/text/data'), # path is optional
        ],
    )

Full artifact passing DSL:

# Op producing artifacts (already available)
def producer_op(text):
    return dsl.ContainerOp(
        name='producer',
        image='alpine',
        command=['sh', '-c', 'echo ' + text + ' > /tmp/output.txt'],
        #output_artifact_paths={'text-artifact': '/tmp/output.txt'}, # Use file_outputs
        file_outputs={'text-artifact': '/tmp/output.txt'},
    )

# Op consuming artifacts (this PR)
def consumer_op(text_artifact):
    return dsl.ContainerOp(
        name='consumer',
        image='alpine',
        command=['cat', dsl.InputArtifactArgument(text_artifact)],
    )

# Pipeline with artifact passing (not in this PR)
@pipeline()
def my_pipeline():
    producer_task = producer_op('Hello world!')
    consumer_task = consumer_op(producer_task.outputs['text-artifact'])
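
Since the artifact-passing part of the pipeline above is not in this PR, here is a minimal end-to-end sketch of exercising only the raw constant-value path that this PR adds; the compile call assumes the standard kfp compiler entry point, and dsl.InputArtifactArgument is the name used in the examples above:

import kfp.compiler
from kfp import dsl

def consumer_op(text):
    return dsl.ContainerOp(
        name='consumer',
        image='alpine',
        command=['cat', dsl.InputArtifactArgument(text)],
    )

@dsl.pipeline(name='raw-artifact-demo')
def raw_artifact_pipeline():
    # Only a constant string is passed - no artifact passing between ops yet.
    consumer_op('Hello world!')

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(raw_artifact_pipeline, 'raw_artifact_pipeline.tar.gz')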



@neuromage (Contributor):

How does this relate to @hongye-sun's proposal btw?

@Ark-kun (Contributor, Author) commented Feb 28, 2019

How does this relate to @hongye-sun's proposal btw?

This PR was added before @hongye-sun's doc was created. I created this PR to enable the "raw" artifact arguments (that Hongye wanted previously) in a way that's aligned with the future artifact passing support.

I think that this design follows the Argo way.

@hongye-sun (Contributor):

Can we put this PR on hold, as we haven't closed on the artifact interface yet?

@hongye-sun (Contributor):

/hold

@Ark-kun Ark-kun changed the title SDK - Added support for raw artifact values to ContainerOp SDK - Added support for raw input artifact argument values to ContainerOp Mar 22, 2019
@Ark-kun Ark-kun force-pushed the SDK---Added-support-for-raw-artifact-values-to-ContainerOp branch 2 times, most recently from 0dd0c3e to 7bd74cf on May 14, 2019 at 22:55
    text_input_path = '/inputs/text/data'
    return dsl.ContainerOp(
        name='component_with_input_artifact',
        input_artifact_paths={'text': text_input_path},
Contributor:

Why are input_artifact_paths and input_artifact_arguments needed? Can we have one field for all input artifacts?

Contributor:

Why not expose the full Argo input_artifacts spec here?

@Ark-kun (Contributor, Author) commented May 17, 2019:

Why not expose the full Argo input_artifacts spec here?

We are moving away from hard-coding any artifact storage information in the pipeline Argo YAML. This makes the pipelines more portable, which is a common request from customers and will become increasingly important.
Argo is moving in the same direction, where all artifact repository information is configured in cluster-side ConfigMaps.

Generally, I do not want to expose more than we currently need, especially features that can be hard to support with different artifact passing systems (volumes, Airflow, Docker, etc.).

Let's look at Argo's spec for input artifacts: https://github.com/argoproj/argo/blob/master/pkg/apis/workflow/v1alpha1/types.go#L308

  • name - Supported
  • path - Supported
  • mode - Not desirable: Not requested at this moment. Hard to support on GCS, passive volumes etc.
  • from - N/A: Not supported for containers
  • ArtifactLocation - Not desirable to expose: We want to prevent hard-coding any artifact repository information at the component level. The same effect can be achieved at the argument level instead.
  • globalName - Not desirable: Not requested at this moment. We do not want components to directly affect any global state.
  • ArchiveStrategy - Not desirable to expose: Too Argo-specific. Bad compatibility with other data storage systems.
  • optional - Might be desirable.

So, apart from name and path, which I'm already including in this PR, there is only one additional field that could be desirable: optional.

Would you like me to replace
input_artifact_paths={'text': text_input_path},
with
input_artifacts=[InputArtifactSpec(name='text', path=text_input_path)],
?

I'll be happy to do so.

Contributor:

Could you explain why input_artifact_arguments is needed? Why not combine it with input_artifact_paths into one?

I still think that we should support Argo's full artifact feature set, and ContainerOp is a low-level API that is Argo-specific. I am fine with exposing part of Argo's features for now, but I don't want to make the interface too opinionated. Maybe we should call a design meeting to discuss this.

@Ark-kun (Contributor, Author) commented May 22, 2019:

Could you explain why input_artifact_arguments is needed? Why not combine it with input_artifact_paths into one?

As a side note: please try to give more actionable feedback. If you want something to be different, provide a mockup of how you want it to look and what behavior you want.

They are two different concepts: "input declaration" and "argument passing".
Both involve inputs/parameters, but they are very different concepts.
These two lines are not the same:

# This line has input declarations
def my_func(param1: int = 3, param2 : str = 'bla'):
# This line has argument passing
my_func(param1=10, param2='bar')

Also, you can only declare a function's parameters once, but you can pass arguments to it many times.
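
For instance, a trivial illustration of one declaration with many argument passings:

# One declaration of the parameters...
def my_func(param1: int = 3, param2: str = 'bla'):
    print(param1, param2)

# ...but arguments can be passed to it many times, with different values
my_func(param1=10, param2='bar')
my_func(param1=42, param2='baz')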

Argo has the same distinction:
Please check the relevant Argo examples: https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml

This is Argo input artifact definition:

    inputs:
      artifacts:
      - name: message
        path: /tmp/message

This is Argo artifact argument passing:

        template: print-message
        arguments:
          artifacts:
          - name: message
            from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"

These are two separate concepts and two separate spec sections, and Argo does not mix them.

How can you even mix them when the same template is used multiple times? In this case there is a single input definition section and multiple argument passing sections. How can you merge them together?

P.S. Combining input definitions and artifact passing is invalid in Argo, for good and obvious reasons. This is invalid:

  container:
    inputs:
      artifacts:
      - name: message
        path: /tmp/message
        from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"

@Ark-kun (Contributor, Author):

Generally, I prefer feedback to be actionable.
Also, it's nice to have small PRs, so it's good when acting on feedback does not increase the scope and size of the PR. We have had cases where reviewers' requests turned a small, focused PR into a 5000+ line monster. It would be really great if we could keep PR scopes as small as possible.

Contributor:

I am confused by the interface, so I asked about the difference between input_artifact_arguments and input_artifact_paths and why you need both. I also suggested having a design discussion about the confusion here. I am not sure why you think it's non-actionable feedback.

Now I understand the difference from your last comment, but I still think it's unnecessary in the DSL. Why do you need the user to provide the name of the input artifact and duplicate it every time they declare an input? For the same reason, we don't ask users to declare each input parameter when they define the ContainerOp. For example, you can combine them into one field like the following:

    return dsl.ContainerOp(
        name='component_with_input_artifact',
        input_artifacts={'/path/to/artifact1': artifact_value},
    )

The name of the input artifact can be auto-generated. When converting to the Argo template, the compiler can take care of generating the template and argument parts separately.

@Ark-kun (Contributor, Author) commented May 23, 2019:

For example, you can combine them into one field like the following:

Thank you for the actionable suggestion.

The name of the input artifact can be auto-generated.

Hmm. What about auto-generating the path instead?

input_artifacts={'artifact1': artifact_value},

How does that look to you?

Should input_artifacts be a list or a dictionary?
Should we create a data type to describe the input artifact entry?
E.g.

input_artifacts=[InputArtifact(path='/path/to/artifact1', argument=artifact_value)],

The same reason applies why we don't ask user to declare each input parameters when they define the ContainerOp.

ContainerOp is a somewhat strange mixture of Argo's ContainerTemplate and DagTask. It has parts of both (although some parts are missing). This has already brought us some problems, especially when Ning was implementing type checking. (To check argument types, you need to know the input types, and for that you need the inputs. Ning had to introduce a new decorator and do some tricks just to recover the information about inputs.)

@Ark-kun (Contributor, Author):

The name of the input artifact can be auto-generated. When converting to argo template, the compiler can take care of generating the template and argument parts separately.

This makes sense and I've implemented it. I even simplified it further (path and name can also be specified, but they're optional).

    return dsl.ContainerOp(
        name='consumer',
        image='alpine',
        command=['cat', dsl.InputArtifactArgument("Hello world")],
    )
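
For illustration, a hedged sketch of the same op with the optional fields spelled out; the name and path keyword arguments here are assumptions based on the comment above (they can be specified, but are optional and auto-generated when omitted):

    return dsl.ContainerOp(
        name='consumer',
        image='alpine',
        # name= and path= are assumed keyword spellings; both are optional per the comment above
        command=['cat', dsl.InputArtifactArgument('Hello world', name='text', path='/tmp/inputs/text/data')],
    )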

@hongye-sun (Contributor):

Do you mind sharing your design for how the full artifact DSL API will look?

@Ark-kun (Contributor, Author) commented May 17, 2019

Do you mind sharing your design for how the full artifact DSL API will look?

Sure. Here is the link demonstrating the whole DSL (producing, consuming, passing): #336 (comment) (also included in this PR's description).
Passing artifacts uses the same syntax as passing parameters.
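
To make the parallel concrete, a small sketch (the 'some-output' name is hypothetical; 'text-artifact' is from the pipeline example in the description):

# Passing a parameter output between ops (already supported):
consumer_task = consumer_op(producer_task.outputs['some-output'])
# Passing an artifact output (the future DSL sketched in the description) looks identical:
consumer_task = consumer_op(producer_task.outputs['text-artifact'])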

@Ark-kun Ark-kun force-pushed the SDK---Added-support-for-raw-artifact-values-to-ContainerOp branch from 1af6b7c to 6bd95bd on August 28, 2019 at 07:55
Also renamed input_artifact_arguments to artifact_argument_paths in the ContainerOp's constructor
@Ark-kun Ark-kun force-pushed the SDK---Added-support-for-raw-artifact-values-to-ContainerOp branch from 582b6cd to 4ab43ec on August 28, 2019 at 22:55
getattr is too fragile and can be broken by renames.
@Ark-kun (Contributor, Author) commented Aug 29, 2019

Talked with Hongye offline a while ago.
Our priorities have shifted and he's pretty busy with a lot of work on his hands, so it might not be the best use of his time to review this PR. Fortunately, Hongye has agreed that someone else can review this PR if they have time.

I still consider this to be a step towards the important goal of supporting artifact passing, so maybe someone can help with the review.
/assign @numerology @ajchili
/hold cancel

@Ark-kun (Contributor, Author) commented Aug 29, 2019

/cc @numerology @ajchili

@Ark-kun (Contributor, Author) commented Aug 29, 2019

/assign @numerology @ajchili

@ajchili (Member) left a comment:

Small nitpick questions before an lgtm.

Review threads on sdk/python/kfp/compiler/_op_to_template.py and sdk/python/kfp/dsl/_container_op.py (resolved).
Added the test_input_path_placeholder_with_constant_argument test
@ajchili (Member) commented Aug 29, 2019

/lgtm

@Ark-kun (Contributor, Author) commented Aug 29, 2019

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0fc68bb into kubeflow:master Aug 29, 2019