Datacatalog cache #5

chanadian · 2019-09-10T00:10:36Z

This PR adds a catalog client that caches task executions in DataCatalog.

The steps to cache a task execution are:

1. Create a dataset for the task.
2. Create an artifact that represents the execution, along with the artifact data that represents the execution output.
3. Tag the artifact with a unique hash of the input values.
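As a rough illustration of the three write steps, here is a minimal sketch against an in-memory stand-in for the service. The `Catalog` type, `NewCatalog`, `CacheExecution`, and the example dataset, artifact, and hash values are all hypothetical and are not the client added in this PR; only the `flyte_cached-` tag prefix comes from the description.

```go
package main

import "fmt"

// Catalog is a hypothetical in-memory stand-in for the DataCatalog service.
type Catalog struct {
	datasets  map[string]bool              // dataset name -> exists
	artifacts map[string]map[string]string // artifact ID -> output name -> value
	tags      map[string]string            // dataset name + "/" + tag name -> artifact ID
}

// NewCatalog returns an empty catalog.
func NewCatalog() *Catalog {
	return &Catalog{
		datasets:  map[string]bool{},
		artifacts: map[string]map[string]string{},
		tags:      map[string]string{},
	}
}

// CacheExecution performs the three steps in order and returns the tag name.
// inputHash is assumed to be a precomputed hash of the input values.
func (c *Catalog) CacheExecution(dataset, artifactID, inputHash string, outputs map[string]string) string {
	// Step 1: create the dataset for the task (a no-op if it already exists).
	c.datasets[dataset] = true

	// Step 2: create the artifact and store each output as artifact data.
	data := make(map[string]string, len(outputs))
	for name, value := range outputs {
		data[name] = value
	}
	c.artifacts[artifactID] = data

	// Step 3: tag the artifact with the hash of the input values.
	tag := "flyte_cached-" + inputHash
	c.tags[dataset+"/"+tag] = artifactID
	return tag
}

func main() {
	c := NewCatalog()
	tag := c.CacheExecution("flyte_task-sum", "artifact-1", "abc123", map[string]string{"total": "3"})
	fmt.Println("tagged artifact-1 with", tag)
}
```

In this sketch steps 2 and 3 are separate writes, so they are not atomic.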

When retrieving a cached entry:

1. Compute the tag by hashing the input values.
2. Check whether an artifact tagged with that hash exists.
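The lookup can be sketched as follows, assuming inputs are simple name/value strings and the tags live in a plain map. `TagName` and `Lookup` are hypothetical helpers, and SHA-256 over sorted keys is an assumption, not necessarily the hash the client uses; the `flyte_cached-` prefix is the one from this description.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// TagName derives a deterministic tag from the input values.
// Keys are sorted so that map iteration order does not affect the hash.
func TagName(inputs map[string]string) string {
	names := make([]string, 0, len(inputs))
	for name := range inputs {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		// %q quotes each part so that "a=b" + "c" cannot collide with "a" + "b=c".
		fmt.Fprintf(h, "%q=%q;", name, inputs[name])
	}
	return "flyte_cached-" + hex.EncodeToString(h.Sum(nil))
}

// Lookup reports the artifact ID tagged for these inputs, if any.
func Lookup(tags map[string]string, inputs map[string]string) (string, bool) {
	id, ok := tags[TagName(inputs)]
	return id, ok
}

func main() {
	tags := map[string]string{
		TagName(map[string]string{"a": "1", "b": "2"}): "artifact-1",
	}

	id, hit := Lookup(tags, map[string]string{"b": "2", "a": "1"})
	fmt.Println("same inputs:", id, hit)

	_, hit = Lookup(tags, map[string]string{"a": "1", "b": "3"})
	fmt.Println("different inputs:", hit)
}
```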

Here is what the fields in Catalog look like when they are cached:

Every task instance is represented as a DataSet:

Dataset {
  project: Flyte project the task was registered in
  domain: Flyte domain for the task execution
  name: flyte_task-<taskName>
  version: <discoverable_version>-<hash(input params)>-<hash(output params)>
  // Note: a change in the task signature changes the hashes and creates a new version of the Dataset.
}
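To make the naming scheme concrete, here is a minimal sketch that builds the dataset name and version from a task signature. `DatasetName`, `DatasetVersion`, and `hashSignature` are hypothetical helpers; representing the signature as parameter name to type name maps and hashing it with a truncated SHA-256 are assumptions made only for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashSignature hashes parameter names and type names in sorted order.
// The 8-character truncation is only to keep the example output short.
func hashSignature(params map[string]string) string {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		fmt.Fprintf(h, "%q:%q;", name, params[name])
	}
	return hex.EncodeToString(h.Sum(nil))[:8]
}

// DatasetName follows the flyte_task-<taskName> pattern.
func DatasetName(taskName string) string {
	return "flyte_task-" + taskName
}

// DatasetVersion follows <discoverable_version>-<hash(input params)>-<hash(output params)>.
func DatasetVersion(discoverableVersion string, inputs, outputs map[string]string) string {
	return fmt.Sprintf("%s-%s-%s", discoverableVersion, hashSignature(inputs), hashSignature(outputs))
}

func main() {
	inputs := map[string]string{"a": "integer", "b": "integer"}
	outputs := map[string]string{"total": "integer"}

	fmt.Println(DatasetName("sum"))
	fmt.Println(DatasetVersion("1.0", inputs, outputs))

	// Changing a parameter type changes the hash, and therefore the version.
	inputs["b"] = "float"
	fmt.Println(DatasetVersion("1.0", inputs, outputs))
}
```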

Every task execution is represented as an Artifact in the Dataset above:

Artifact {
  id: uuid
  Metadata: [executionName, executionVersion]
}

with outputs as ArtifactData:

ArtifactData {
  Name: <output-name>
  value: <literal value of the output>
}

To retrieve the Artifact, we use its associated tag, which is composed of:

ArtifactTag {
  Name: flyte_cached-<unique hash of the input values>
}

chanadian · 2019-09-10T19:22:55Z

@kumare3 could you please take a look? @surindersinghp reviewed it previously and was OK with it.

The PR adds the catalog client to cache task executions in DataCatalog, along with tests, documentation, and configs. Once we move the IDL to FlyteIDL, we can remove the legacy client.

kumare3 · 2019-09-10T20:06:52Z

pkg/controller/catalog/datacatalog/datacatalog.go

+	}
+	logger.Debugf(ctx, "Created tag: %v, for task: %v", tagName, task.Id)
+
+	// TODO: We should create the artifact + tag in a transaction when the service supports that


A transaction would avoid race conditions. For example, if two runs of the same task happen in parallel, we may not be tagging the artifact we just created in this service call. It's not a huge deal because either of those artifacts will work, but the behavior can be unexpected.

kumare3

LGTM

chanadian added 2 commits September 9, 2019 17:13

Cache task executions to DataCatalog

c81a0a8

Correct metadata for execution name

8ba3946

chanadian force-pushed the datacatalog-cache branch from 2dacbf6 to 8ba3946 Compare September 10, 2019 00:13

chanadian added 3 commits September 10, 2019 10:40

Specify insecure connection with config

2404756

Add more comments

876c9a9

go-lint cleanup

4366de7

kumare3 reviewed Sep 10, 2019


kumare3 approved these changes Sep 10, 2019


chanadian merged commit 2a9e28f into master Sep 11, 2019

chanadian deleted the datacatalog-cache branch September 11, 2019 20:39

EngHabu pushed a commit that referenced this pull request Oct 8, 2019

Merge pull request #5 from lyft/datacatalog-cache

f23e3be

Datacatalog cache

kumare3 pushed a commit to nuclyde-io/flytepropeller that referenced this pull request Feb 4, 2021

Update the executions model (flyteorg#5)

cb3201b

* Update the executions model to add cluster column

eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023

Merge pull request flyteorg#5 from lyft/datacatalog-cache

224b94d

Datacatalog cache
