Prototype parsing rules for dataset factories #2508

Closed
merelcht opened this issue Apr 12, 2023 · 3 comments

@merelcht
Member

Description

Subtask of #2423

Context

For the dataset factories solution we'll need a way to parse the pattern syntax that dataset names are matched against.
parse (https://github.com/r1chardj0n3s/parse) is a library whose pattern-matching syntax is based on Python's format() syntax applied in reverse: instead of filling values into a template, it extracts them from a string.

Example of what our syntax could be:

"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet

And the function that would create the dataset entry:

from kedro.extras.datasets.spark import SparkDataSet  # assumed import path (Kedro 0.18.x)

def create_spark_dataset(dataset_name: str, *chunks):
    # e.g. here chunks=["root_namespace", "something-instead-of-the-*", "spark"]
    return SparkDataSet(filepath=f"data/{chunks[0]}/{chunks[1]}.parquet", file_format="parquet")

Things to investigate

The goal of this ticket is to experiment with that library and come up with rules on how the matching should work.

It's especially important to determine what needs to happen when two patterns match the same dataset name, e.g. france.companies@spark against:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

"{dataset_name}@spark":
  type: spark.SomethingElse
  filepath: data/{dataset_name}.parquet
  file_format: parquet
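
To make the ambiguity concrete, here is a minimal sketch that uses only the standard library as a stand-in for parse (it is not the parse library itself). The helper names `match_pattern` and `resolve_all` are hypothetical, and each placeholder is assumed to match non-greedily.

```python
import re
from typing import Dict, Optional


def match_pattern(pattern: str, name: str) -> Optional[Dict[str, str]]:
    """Return the captured fields if `name` matches `pattern`, else None."""
    # Splitting on "{field}" alternates literal text (even index)
    # and field names (odd index).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+?)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    found = re.fullmatch(regex, name)
    return found.groupdict() if found else None


# The two catalog patterns from the example, reduced to their filepath templates.
CATALOG = {
    "{root_namespace}.{dataset_name}@spark": "data/{root_namespace}/{dataset_name}.parquet",
    "{dataset_name}@spark": "data/{dataset_name}.parquet",
}


def resolve_all(name: str) -> Dict[str, str]:
    """Map every matching pattern to the filepath it would produce for `name`."""
    resolved = {}
    for pattern, filepath in CATALOG.items():
        fields = match_pattern(pattern, name)
        if fields is not None:
            resolved[pattern] = filepath.format(**fields)
    return resolved


print(resolve_all("france.companies@spark"))
```

Both patterns match: the first resolves to data/france/companies.parquet and the second to data/france.companies.parquet, so a tie-breaking rule is needed.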

Pattern matching is a common problem in URL matching, so take inspiration from web frameworks that solve that problem, e.g. Ruby/React.

Initial thoughts on rules

  1. The most explicit pattern should match first
  2. Then match alphabetically (?)
@noklam
Contributor

noklam commented Apr 28, 2023

I investigated how Ruby/Django/React handle this in their routers. They differ slightly, but they all follow a clear declaration order, e.g. in Django you have a urls.py with a list of routes, and the first match wins.

I am in favor of a simple solution at this stage, which is simply following the declaration order. We could add more complicated things like "local scope" or namespaces later when needed.
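
A declaration-order rule could look roughly like the sketch below, in the spirit of a router walking its route list. It is an illustration under assumptions, not an existing Kedro API: `pattern_to_regex` and `first_match` are hypothetical names, and the standard library `re` module stands in for parse.

```python
import re
from typing import Optional, Sequence


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn "{field}" placeholders into named, non-greedy capture groups."""
    parts = re.split(r"\{(\w+)\}", pattern)
    return re.compile(
        "".join(
            f"(?P<{part}>.+?)" if index % 2 else re.escape(part)
            for index, part in enumerate(parts)
        )
    )


def first_match(patterns: Sequence[str], name: str) -> Optional[str]:
    """Return the first declared pattern that matches `name`, or None."""
    for pattern in patterns:
        if pattern_to_regex(pattern).fullmatch(name):
            return pattern
    return None


# Patterns in the order they were declared.
DECLARED = [
    "{root_namespace}.{dataset_name}@spark",
    "{dataset_name}@spark",
]

print(first_match(DECLARED, "france.companies@spark"))
```

With this rule the outcome depends only on ordering: reversing `DECLARED` makes `{dataset_name}@spark` win for the same dataset name.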

@antonymilne
Contributor

antonymilne commented May 4, 2023

One heuristic for "most specific pattern should match first" would be to count the number of {} in the expression (in fact, off the top of my head I can't think of any other way to do it).
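
As a rough sketch of that heuristic combined with the alphabetical tie-break proposed earlier in the thread: the code below assumes that fewer placeholders means a more specific pattern, which is only one possible reading. `placeholder_count` and `rank_patterns` are hypothetical names.

```python
import re
from typing import Iterable, List

# Matches one "{...}" placeholder, including an empty "{}".
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def placeholder_count(pattern: str) -> int:
    """Count the "{...}" placeholders in a pattern."""
    return len(PLACEHOLDER.findall(pattern))


def rank_patterns(patterns: Iterable[str]) -> List[str]:
    """Order patterns by fewest placeholders first, then alphabetically."""
    return sorted(patterns, key=lambda pattern: (placeholder_count(pattern), pattern))


ranked = rank_patterns(
    [
        "{root_namespace}.{dataset_name}@spark",
        "{dataset_name}@spark",
        "france.companies@spark",
    ]
)
print(ranked)
```

Under this assumption an exact name ranks first, and `{dataset_name}@spark` ranks ahead of `{root_namespace}.{dataset_name}@spark`. Whether that ordering is the desired one is exactly what the rules need to settle.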

@noklam following declaration order would be a great way to do this, but the problem is that we merge together multiple config files so the pattern that you're matching against could come from a different file. In this case it's not clear what declaration order means. Ideally we would e.g. give priority to a matching pattern defined in the same file first before looking in other files, but I don't think that will be easily possible due to the way we merge config files together since they're all treated equally. And what if you match against a pattern in two different files? Then we'd need to work out a rule for e.g. giving preference to the file based on alphabetical order. Besides, the information about what file the definition comes from is long gone by the time we would need it (when instantiating the dataset rather than when doing config loading).
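
The loss of file information can be illustrated with a simplified sketch. This is not Kedro's actual config loader: the file names and the `merge_catalogs` helper are hypothetical, and the merge is reduced to a plain dictionary update.

```python
from typing import Dict

# Hypothetical catalog files, each already loaded into a dictionary.
FILES: Dict[str, Dict[str, dict]] = {
    "catalog_a.yml": {
        "{dataset_name}@spark": {"type": "spark.SomethingElse"},
    },
    "catalog_b.yml": {
        "{root_namespace}.{dataset_name}@spark": {"type": "spark.SparkDataSet"},
    },
}


def merge_catalogs(files: Dict[str, Dict[str, dict]]) -> Dict[str, dict]:
    """Merge all files into one flat catalog, treating every file equally."""
    merged: Dict[str, dict] = {}
    for filename in sorted(files):
        merged.update(files[filename])
    return merged


merged = merge_catalogs(FILES)
# Only patterns and their configs survive; the originating file name is gone.
print(list(merged))
```

Once the catalog is flattened like this, nothing in `merged` records which file a pattern came from, so a "same file first" priority cannot be applied at dataset instantiation time.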

So doing this based on declaration order would be great and very simple, but I suspect there's just not a good way to get this to work in our case. Hence the need to define our own rules (like most explicit pattern matches first, followed by alphabetical). This is my main reservation against this whole approach, but unfortunately I don't see a way round it.

@merelcht
Member Author

The prototype has been finished and will be used in the full implementation for #2423.
