Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adjust how aliases are formatted #4750

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

max-sixty
Copy link
Member

@max-sixty max-sixty commented Jul 17, 2024

This proposes to adjust how aliases are formatted.

Currently we include a space around =, which means we get the quite reasonable:

sum_gross_cost = sum gross_cost

...but also the quite confusing:

join side:left manager = employees e.reports_to == manager.employee_id`

There, the least two bound items appear to be manager & employees, but in fact those are bound.

So this proposes to change these to:

  • Remove spaces around = in aliases
  • Add parentheses if the rvalue contains multiple items

So now we get:

join side:left manager=employees e.reports_to == manager.employee_id`

...while the standard case here becomes arguably a tiny bit worse, but I think still reasonable:

sum_gross_cost=(sum gross_cost)

What do folks think?

This proposes to adjust how aliases are formatted.

Currently we include a space around `=`, which means we get the quite reasonable:

```elm
sum_gross_cost = sum gross_cost
```

...but also the quite confusing:

```elm
join side:left manager = employees e.reports_to == manager.employee_id`
```

There, the _least_ two bound items appear to be `manager` & `employees`, but in fact those are bound.

So this proposes to change these to:
- Remove spaces around `=` in aliases
- Add parentheses if the rvalue contains multiple items

So now we get:

```elm
join side:left manager=employees e.reports_to == manager.employee_id`
```

...while the standard case here becomes arguable a bit worse, but still quite reasonable:

```elm
sum_gross_cost = (sum gross_cost)
```

What do folks think?
@snth
Copy link
Member

snth commented Jul 17, 2024

Is this just for PRQL code samples in the tests and docs? Presumably both versions will still compile fine?

I was quite surprised that the following actually compiles because I somehow always thought that the parentheses around the join conditions was mandatory:

join side:left manager=employees e.reports_to == manager.employee_id

Never really thought about this before and now that I have it's opened a bit of a Pandora's box for me and I have quite a few comments/questions. I'll put those in separate comments to make it easier to downvote/upvote them separately.

@snth
Copy link
Member

snth commented Jul 17, 2024

First thought

There's a difference between column aliases and relation aliases, should they be treated differently?

column aliases:

derive sum_gross_cost = sum gross_cost

relation references:

from worker=employees
join side:left manager=employees worker.reports_to == manager.employee_id

I don't actually have an opinion on this because I think it's deeper than that. See next comment.

@snth
Copy link
Member

snth commented Jul 17, 2024

Second thought

Column aliases

select and derive take a tuple as first argument and column aliases are a function of that (and "internal" to the tuple in a sense).

Relation aliases

from and join take a relation as first argument.

Q: Why is the aliasing in that case not a simple function argument alias, i.e. using : instead of =?

A: It's actually a bit different and there are a few reasons. Let's consider the following example:

from worker=employees
join side:left manager=employees worker.reports_to == manager.employee_id
  1. In the example above, worker and manager can be referenced in subsequent transforms/pipeline steps so they are not internal/local to the from and join transforms.
  2. Even if 1. above wasn't the case, we don't usually allow renaming arguments during a function call.
    I mean in Python you could do something like
    def join(conditions, that=None, alias="that", **kwargs):
        if that is None and alias!="that":
            that = kwargs[alias]
            conditions = _replace(alias, "that", conditions)
        return _join(that, conditions)
    join("worker.reports_to == manager.employee_id", alias="manager", manager=employees)
    This second point is probably a bit premature at this point of the discussion. I just wanted to capture that I think this part might be rescuable but I first want to address point 1.

Coming back to point 1., I believe this is a carryover from SQL and should be removed from PRQL because a) it is not a fundamental part of Relational Algebra, and b) breaks the local nature of PRQL transforms and puts in global dependencies between different pipeline steps. I've argued this before but I'm more convinced and resolved on this now. I think this would easily become clear if we did a simple implementation of PRQL on a backend of lists of tuples (without any SQL in between).

My conclusions from my investigations into Monads and Relational Algebra can best be summed up by saying that I believe PRQL should be a DSL for the List monad. By restricting ourselves to SQL backends we miss out a lot of potential application areas of PRQL and risk perpetuating mistakes from SQL (like the global scope of relation aliases). I'm not sure how this fits into the resolver rework that's currently going on but my suspicion is that it would probably greatly simplify the resolver as well. I think the only thing that stands in the way of this is join which I was surprised to learn recently isn't really a primary part of Relational Algebra.Of course this is a bigger topic and probably deserves a Github issue or discussion on its own. I'll try to write up my thoughts on this in more detail, especially how I think join should work.

tl;dr I think relation aliases area a mistake and should be removed from PRQL!

Copy link
Member

@snth snth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr @max-sixty I'm happy with how the codegen examples in this PR look and am in favour of this change.

@eitsupi
Copy link
Member

eitsupi commented Jul 17, 2024

I too feel that the disadvantages of complicating the syntax outweigh the advantages of being able to set aliases.

@kgutwin
Copy link
Collaborator

kgutwin commented Jul 17, 2024

The trouble about removing relation aliases in join (and possibly also an issue with from) is that it makes doing things like self joins much harder.

Extending your previous example slightly:

from worker=employees
join side:left manager=employees worker.reports_to == manager.employee_id
select { manager.employee_id, manager.name, worker_name=worker.name }

How else would you phrase the select to be able to specify which side of the join you were referring to? I suppose you could use try using this and that but those could get really confusing especially if you stack up additional joins.

@snth
Copy link
Member

snth commented Jul 17, 2024

@kgutwin You are quite right but I have a solution for that, at least semantically (the syntax would still need refinement). Unfortunately I don't have time to write it up right now but I'll try to do it as soon as I can. (There might actually already be a write up somewhere on here.) Should I put it in an "issue" or a "discussion"?

@aljazerzen
Copy link
Member

that I believe PRQL should be a DSL for the List monad

That is a big statement to make! I do feel the same way and the resolver rework does in fact make a few steps in this direction.

I've opened an issue about relational aliases: #4751

@aljazerzen
Copy link
Member

Let's move the discussion about relational aliases to the new issue and talk about the new code style here.


Am I correct that this change applies to "official PRQL code style" only, and not language itself? So no changes to the parser?

Regarding the new code style:

  • I really don't like the unnecessary parenthesis within tuples. I think that if they are not necessary for ambiguity, they shouldn't be considered the official code style of PRQL,
  • I see the need for the "compact" code style, the join is confusing.

We do use = in a few syntactical locations (function calls, tuple fields, import stmt). What about using the "compact" style only for functions calls and not tuple fields?

For example:

from x=employees
aggregate { sum_gross_cost = sum gross_cost }
derive y=sum_gross_cost

@max-sixty
Copy link
Member Author

max-sixty commented Jul 17, 2024

Am I correct that this change applies to "official PRQL code style" only, and not language itself? So no changes to the parser?

Correct!

We do use = in a few syntactical locations (function calls, tuple fields, import stmt). What about using the "compact" style only for functions calls and not tuple fields?

This is feasible. It's a bit more complicated but not intractable.

So to extend your example:

from x=employees
derive rounded=(round 2 gross_cost),
aggregate {
  sum_gross_cost = sum gross_cost,
  avg_gross_cost = average gross_cost,
  gross_cost_again = gross cost,
}
derive y=sum_gross_cost

What do folks think?

@aljazerzen
Copy link
Member

That looks good to me.

@max-sixty
Copy link
Member Author

(FYI this became a bit harder than I thought, so pausing on this in favor of trying the new fmt approach...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants