Prototype dataset factories #2510

Closed
merelcht opened this issue Apr 12, 2023 · 1 comment

Description

Subtask of #2423

Context

To fully assess the dataset factories solution, we need an at least partially functioning prototype. This will give insight into the complexity of the solution, the risks involved, and potential drawbacks we haven't yet considered in discussions.

Things to keep in mind or try out while prototyping

  • Responsibility for creating default datasets and for pattern matching should sit in the DataCatalog, not in the Runner. Currently, default dataset creation happens in the Runner, but this was always odd and was only meant to be a temporary solution.
  • Factory definition + syntax should ideally go into the catalog so you'd have:
def create_spark_dataset(dataset_name: str, *chunks):
    # e.g. here chunks=["root_namespace", "something-instead-the-*", "spark"]
    return SparkDataSet(filepath=f"data/{chunks[0]}/{chunks[1]}.parquet", file_format="parquet")

"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet
  • Ideally config loaders shouldn't know about/deal with this special syntax.
  • Catalog validation should happen lazily somehow, or only on explicit catalog entries.
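
To make the points above concrete, here is a minimal sketch of how catalog-side pattern matching could work. It is not Kedro's API: `pattern_to_regex`, `fill_template` and `SketchCatalog` are hypothetical names, the catalog works on plain config dicts rather than real dataset objects, and the `{"type": "MemoryDataSet"}` fallback is only a stand-in for whatever default the Runner creates today. It also assumes one reading of the syntax above: `{*}` is a wildcard, and any other `{...}` chunk matches its literal text.

```python
import re
from typing import Any, Dict, List

# Hypothetical sketch, not Kedro's API.
_CHUNK = re.compile(r"\{([^{}]*)\}")


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn "{a}.{*}@{b}" into a regex with one capture group per chunk."""
    parts: List[str] = []
    pos = 0
    for m in _CHUNK.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal separators
        inner = m.group(1)
        # Assumption: "{*}" is a wildcard; other chunks match their literal text.
        parts.append("(.+?)" if inner == "*" else f"({re.escape(inner)})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


def fill_template(template: Dict[str, Any], chunks: List[str]) -> Dict[str, Any]:
    """Substitute "{chunks[i]}" placeholders in every string value."""
    return {
        key: value.format(chunks=chunks) if isinstance(value, str) else value
        for key, value in template.items()
    }


class SketchCatalog:
    """Toy catalog: explicit entries first, then factory patterns, then a default."""

    def __init__(self, explicit: Dict[str, Dict[str, Any]],
                 factories: Dict[str, Dict[str, Any]]) -> None:
        self._explicit = dict(explicit)
        self._factories = [(pattern_to_regex(p), cfg) for p, cfg in factories.items()]

    def resolve(self, name: str) -> Dict[str, Any]:
        # Resolution is lazy: a pattern is only expanded when a name is requested.
        if name in self._explicit:
            return self._explicit[name]
        for regex, template in self._factories:
            match = regex.fullmatch(name)
            if match:
                return fill_template(template, list(match.groups()))
        # Placeholder default, decided by the catalog instead of the Runner.
        return {"type": "MemoryDataSet"}


catalog = SketchCatalog(
    explicit={"cars": {"type": "pandas.CSVDataSet", "filepath": "data/cars.csv"}},
    factories={
        "{root_namespace}.{*}@{spark}": {
            "type": "spark.SparkDataSet",
            "filepath": "data/{chunks[0]}/{chunks[1]}.parquet",
            "file_format": "parquet",
        }
    },
)
resolved = catalog.resolve("root_namespace.my_table@spark")
```

Because the config loader only hands over plain dicts, it never needs to understand the pattern syntax, and only explicit entries would need validating up front.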

Question to answer

  1. Can this be implemented in a non-breaking way?
@merelcht

This prototype has been discussed and the feedback is written up in #2423 (comment) and in the comments on the PR (#2560).

This feature is now ready to be implemented properly.
