This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Datacatalog cache #5

Merged
merged 5 commits into from
Sep 11, 2019


chanadian
Contributor

This PR adds the catalog client, which caches task executions in DataCatalog.

The steps to cache a task execution are:

  1. Create a dataset for the task
  2. Create an artifact that represents the execution, along with the artifact data that represents the execution output
  3. Tag the artifact with a unique hash of the input values

When retrieving a cached entry:

  1. Compute the tag by hashing the input values
  2. Check if a tagged artifact exists with that hash

Here is what the fields in Catalog look like when a task execution is cached:

Every task instance is represented as a DataSet:

Dataset {
  project: Flyte project the task was registered in
  domain: Flyte domain for the task execution
  name: flyte_task-<taskName>
  version: <discoverable_version>-<hash(input params)>-<hash(output params)>
// note that a change in the signature will change the hashes and create a new version of the Dataset
}
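As a rough illustration of the naming scheme above, the sketch below builds a dataset name and version from a task signature. The helper names (`hashParams`, `datasetName`, `datasetVersion`), the use of SHA-256, the 8-character truncation, and the representation of a signature as a name-to-type map are all assumptions made for this example; they are not taken from the PR's code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// hashParams hashes a parameter signature (name -> type name).
// Names are sorted first so the hash does not depend on map iteration order.
// SHA-256 and the 8-character truncation are assumptions of this sketch.
func hashParams(params map[string]string) string {
	names := make([]string, 0, len(params))
	for n := range params {
		names = append(names, n)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, n+":"+params[n])
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, ",")))
	return hex.EncodeToString(sum[:])[:8]
}

// datasetName follows the flyte_task-<taskName> pattern from the description.
func datasetName(taskName string) string {
	return "flyte_task-" + taskName
}

// datasetVersion follows <discoverable_version>-<hash(inputs)>-<hash(outputs)>.
func datasetVersion(discoverableVersion string, inputs, outputs map[string]string) string {
	return fmt.Sprintf("%s-%s-%s", discoverableVersion, hashParams(inputs), hashParams(outputs))
}

func main() {
	outputs := map[string]string{"result": "string"}
	v1 := datasetVersion("1.0", map[string]string{"a": "int"}, outputs)
	v2 := datasetVersion("1.0", map[string]string{"a": "float"}, outputs)

	fmt.Println(datasetName("my_task"))
	fmt.Println("same version after signature change:", v1 == v2)
}
```

Changing the type of input `a` changes the input hash, so the two versions differ and the modified task would get its own Dataset.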

Every task execution is represented as an Artifact in the Dataset above:

Artifact {
  id: uuid
  Metadata: [executionName, executionVersion]
}

with outputs as ArtifactData:

ArtifactData {
  Name: <output-name>
  value: <literal value of the output>
}

To retrieve the Artifact, we use the tag associated with the Artifact which is composed of:

ArtifactTag {
  Name: flyte_cached-<unique hash of the input values>
}
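The write and read steps listed earlier can be sketched with a toy in-memory catalog. This is only an illustration under stated assumptions: `memCatalog`, `Put`, `Get`, and `inputTag` are hypothetical names, inputs and outputs are simplified to string maps, and SHA-256 stands in for whatever hash the real client uses. It does not call the DataCatalog service.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Artifact is a simplified stand-in for a DataCatalog artifact.
type Artifact struct {
	ID      string
	Outputs map[string]string // output name -> literal value (ArtifactData)
}

// memCatalog is a hypothetical in-memory catalog for a single dataset.
type memCatalog struct {
	artifacts map[string]Artifact // artifact ID -> artifact
	tags      map[string]string   // tag name -> artifact ID
}

func newMemCatalog() *memCatalog {
	return &memCatalog{artifacts: map[string]Artifact{}, tags: map[string]string{}}
}

// inputTag hashes the input values in sorted key order so the tag is
// deterministic, then applies the flyte_cached- prefix.
func inputTag(inputs map[string]string) string {
	keys := make([]string, 0, len(inputs))
	for k := range inputs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%q=%q;", k, inputs[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return "flyte_cached-" + hex.EncodeToString(sum[:])
}

// Put stores the outputs as an artifact and tags it with the input hash.
func (c *memCatalog) Put(id string, inputs, outputs map[string]string) {
	c.artifacts[id] = Artifact{ID: id, Outputs: outputs}
	c.tags[inputTag(inputs)] = id
}

// Get recomputes the tag from the inputs and looks up the tagged artifact.
func (c *memCatalog) Get(inputs map[string]string) (Artifact, bool) {
	id, ok := c.tags[inputTag(inputs)]
	if !ok {
		return Artifact{}, false
	}
	a, ok := c.artifacts[id]
	return a, ok
}

func main() {
	c := newMemCatalog()
	c.Put("artifact-1", map[string]string{"x": "1", "y": "2"}, map[string]string{"sum": "3"})

	if a, ok := c.Get(map[string]string{"y": "2", "x": "1"}); ok {
		fmt.Println("cache hit:", a.Outputs["sum"])
	}
	if _, ok := c.Get(map[string]string{"x": "5", "y": "2"}); !ok {
		fmt.Println("cache miss")
	}
}
```

Because the tag is derived only from the input values, a lookup needs no knowledge of the artifact ID: the same inputs always map to the same tag, and different inputs miss.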

@chanadian
Contributor Author

@kumare3 could you please take a look? @surindersinghp reviewed it previously and was OK with it.

The PR adds the catalog client to cache task executions in DataCatalog. Added tests, documentation and configs. In the future, when we move the IDL to FlyteIDL, we can remove the legacy client.

}
logger.Debugf(ctx, "Created tag: %v, for task: %v", tagName, task.Id)

// TODO: We should create the artifact + tag in a transaction when the service supports that
Contributor

Why?

Contributor Author

Because it will avoid race conditions. For example, if two runs of the same task happen in parallel, we may not be tagging the artifact we just created in this service call. It's not a huge deal because either of those artifacts will work, but it can be unexpected behavior.

Contributor

@kumare3 left a comment

LGTM

@chanadian merged commit 2a9e28f into master on Sep 11, 2019
@chanadian deleted the datacatalog-cache branch on September 11, 2019 20:39
EngHabu pushed a commit that referenced this pull request Oct 8, 2019
kumare3 pushed a commit to nuclyde-io/flytepropeller that referenced this pull request Feb 4, 2021
* Update the executions model to add cluster column
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023