Prototype parsing rules for dataset factories #2508

merelcht · 2023-04-12T10:18:28Z

Description

Subtask of #2423

Context

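For the dataset factories solution we'll need a way to parse the pattern syntax that dataset names are matched against.
Parse (https://github.com/r1chardj0n3s/parse) is a library whose pattern-matching syntax is the reverse of Python's format() syntax: it extracts values from a string using the same kind of template that format() would use to build it.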

Example of what our syntax could be:

"{root_namespace}.{*}@{spark}":
  type: spark.SparkDataSet
  filepath: data/{chunks[0]}/{chunks[1]}.parquet
  file_format: parquet

And the function that would create the dataset entry:

from kedro.extras.datasets.spark import SparkDataSet  # location in kedro 0.18.x

def create_spark_dataset(dataset_name: str, *chunks):
    # e.g. here chunks=["root_namespace", "something-instead-of-the-*", "spark"]
    return SparkDataSet(filepath=f"data/{chunks[0]}/{chunks[1]}.parquet", file_format="parquet")

Things to investigate

The goal of this ticket is to experiment with that library and come up with rules on how the matching should work.

It's especially important to determine what needs to happen when two patterns would match the same dataset, e.g. france.companies@spark against:

"{root_namespace}.{dataset_name}@spark":
  type: spark.SparkDataSet
  filepath: data/{root_namespace}/{dataset_name}.parquet
  file_format: parquet

"{dataset_name}@spark":
  type: spark.SomethingElse
  filepath: data/{dataset_name}.parquet
  file_format: parquet
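
A minimal sketch of the ambiguity, for illustration only. It does not use parse itself: pattern_to_regex and match_pattern are hypothetical stdlib helpers that treat each {name} placeholder as a non-greedy named group, which is only an approximation of how a format-style matcher might behave. Both patterns match france.companies@spark, but they capture different fields and produce different file paths.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn "{name}" placeholders into non-greedy named groups; escape the rest."""
    # re.split with one capture group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(regex)


def match_pattern(pattern: str, dataset_name: str):
    """Return the captured fields as a dict, or None if the pattern does not match."""
    m = pattern_to_regex(pattern).fullmatch(dataset_name)
    return m.groupdict() if m else None


# Pattern -> filepath template, copied from the two catalog entries above.
patterns = {
    "{root_namespace}.{dataset_name}@spark": "data/{root_namespace}/{dataset_name}.parquet",
    "{dataset_name}@spark": "data/{dataset_name}.parquet",
}

results = {}
for pattern, filepath_template in patterns.items():
    fields = match_pattern(pattern, "france.companies@spark")
    if fields is not None:
        # Fill the template with whatever this pattern captured.
        results[pattern] = filepath_template.format(**fields)

for pattern, filepath in results.items():
    print(f"{pattern!r} -> {filepath}")
```

Under these assumptions the second pattern swallows the dot and captures dataset_name as "france.companies", so a rule is needed to decide which entry wins.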

Pattern matching is a common problem in URL routing, so take inspiration from web frameworks that solve that problem, e.g. Ruby/React.

Initial thoughts on rules

The most explicit pattern should match first
Then match alphabetically (?)


noklam · 2023-04-28T10:49:41Z

I did some investigation into how Ruby/Django/React handle this in their routers. They differ slightly, but they all follow a clear declaration order. For example, in Django you have a urls.py which holds a list of routes, and the first match wins.

I am in favor of a simple solution at this stage, which is simply to follow the declaration order. We could add more complicated things like "local scope" or namespaces later when needed.
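
A rough sketch of what "first match wins" could look like for dataset patterns. Everything here is hypothetical: ROUTES, the hand-written regexes and resolve() are illustration only, not Kedro or Django code.

```python
import re

# Ordered like a router table: earlier entries take priority.
ROUTES = [
    (
        re.compile(r"(?P<root_namespace>[^.@]+)\.(?P<dataset_name>[^.@]+)@spark"),
        "spark.SparkDataSet",
    ),
    (re.compile(r"(?P<dataset_name>.+)@spark"), "spark.SomethingElse"),
]


def resolve(dataset_name: str, routes=ROUTES):
    """Return (type, captured fields) for the first declared pattern that matches."""
    for regex, dataset_type in routes:
        m = regex.fullmatch(dataset_name)
        if m:
            return dataset_type, m.groupdict()
    return None


print(resolve("france.companies@spark"))
print(resolve("companies@spark"))
# Reversing the declaration order changes which entry wins.
print(resolve("france.companies@spark", list(reversed(ROUTES))))
```

The appeal is that the rule is trivial to explain: whoever writes the catalog controls priority by ordering the entries.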

antonymilne · 2023-05-04T14:25:18Z

One heuristic for "most specific pattern should match first" would be to count the number of {} in the expression (in fact, off the top of my head I can't think of any other way to do it).
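
A small sketch of that heuristic combined with the alphabetical tie-break from the issue description. The comment does not say which direction counts as "more specific", so treating fewer placeholders as more specific is an assumption here, exposed as the hypothetical fewest_first flag. rank_patterns and PLACEHOLDER are made-up names.

```python
import re

# Matches any "{...}" placeholder, named or not.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def rank_patterns(patterns, fewest_first=True):
    """Order patterns by placeholder count, breaking ties alphabetically.

    fewest_first=True assumes fewer placeholders means "more specific";
    flip it to try the opposite reading.
    """
    sign = 1 if fewest_first else -1
    return sorted(
        patterns,
        key=lambda p: (sign * len(PLACEHOLDER.findall(p)), p),
    )


patterns = [
    "{root_namespace}.{dataset_name}@spark",
    "{dataset_name}@spark",
    "france.{dataset_name}@spark",
    "{dataset_name}@pandas",
]

for pattern in rank_patterns(patterns):
    print(pattern)
```

Note that the count alone cannot separate "france.{dataset_name}@spark" from "{dataset_name}@spark" (one placeholder each), so the alphabetical tie-break ends up deciding between them.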

@noklam following declaration order would be a great way to do this, but the problem is that we merge together multiple config files so the pattern that you're matching against could come from a different file. In this case it's not clear what declaration order means. Ideally we would e.g. give priority to a matching pattern defined in the same file first before looking in other files, but I don't think that will be easily possible due to the way we merge config files together since they're all treated equally. And what if you match against a pattern in two different files? Then we'd need to work out a rule for e.g. giving preference to the file based on alphabetical order. Besides, the information about what file the definition comes from is long gone by the time we would need it (when instantiating the dataset rather than when doing config loading).

So doing this based on declaration order would be great and very simple, but I suspect there's just not a good way to get this to work in our case. Hence the need to define our own rules (like most explicit pattern matches first, followed by alphabetical). This is my main reservation about this whole approach, but unfortunately I don't see a way round it.
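
To make the merging concern concrete, here is a toy sketch. It assumes the config files end up as plain dicts and uses dict.update as a stand-in for whatever the config loader really does; catalog_a, catalog_b and merge are hypothetical.

```python
# Hypothetical contents of two catalog files after YAML loading.
catalog_a = {"{dataset_name}@spark": {"type": "spark.SomethingElse"}}
catalog_b = {"{root_namespace}.{dataset_name}@spark": {"type": "spark.SparkDataSet"}}


def merge(*configs):
    """Merge config dicts into one; later files overwrite duplicate keys."""
    merged = {}
    for config in configs:
        merged.update(config)
    return merged


merged_ab = merge(catalog_a, catalog_b)
merged_ba = merge(catalog_b, catalog_a)

# Same entries either way, but the iteration order depends purely on
# which file happened to be loaded first.
print(merged_ab == merged_ba)
print(list(merged_ab))
print(list(merged_ba))
```

In this toy version the merged dict carries no record of which file each pattern came from, so any "declaration order" it has is really just file load order.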

merelcht · 2023-05-23T08:41:03Z

The prototype has been finished and will be used in the full implementation for #2423.

merelcht added this to the Make `OmegaConfigLoader` ready for 0.19.0 milestone Apr 12, 2023

ankatiyar self-assigned this Apr 26, 2023

ankatiyar mentioned this issue May 5, 2023

[DRAFT] Dataset factory parsing rules demo #2559

Closed

5 tasks

merelcht closed this as completed May 23, 2023
