Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #496: support for joinCollections #500

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

pikatchu
Copy link
Contributor

@pikatchu pikatchu commented Nov 12, 2024

This diff implements the 'join' operation of two collections. It is assumed that:

  • both collection values are objects (keys can be of any type)
  • there never is more than 1 object on boths sides at any time unless the option allowMultipleValuesOnBothSides is passed (which should be rare).

If both those conditions are met, given a key, if the first collection defines an object of type V1 and the second collection defines an object V2 for the same key, the resulting collection will associate {...V1, ...V2} for that key. No value is associated if either side is missing.

}

class MergeJoinFields implements Mapper<JoinObject, JoinObject, Json, Json> {
constructor(private allowNN: boolean) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I can't figure out what "NN" stands for here -- if it's meant to be rare, shall we just expand that abbreviation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for N to N relations. But I renamed it to allowMultipleValuesOnBothSides.

implements Mapper<K, V, K, V>
{
mapEntry(key: K, values: NonEmptyIterator<V>): Iterable<[K, V]> {
return [[key, values.getUnique()]];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ever not either a no-op or an exception?

I'm not sure that the benefit of having a special UniqueEagerCollection type is worth the extra API surface and copying/allocation (though I suppose that will be minimal here due to sharing).

I'd propose either (a) checking uniqueness during the actual join operation and dropping this stuff or (b) making UniqueEagerCollection a more full-fledged component of the API, with its own getter/mapper types that take advantage of the uniqueness. Probably slightly prefer (a) for simplicity's sake, but open to arguments / the possibility that I'm misunderstanding something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I removed it. The join doesn't need it to operate, so no change needed there.

// Joins
/*****************************************************************************/

type JoinObject = { value: Json; side: "left" | "right" };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would JsonObject instead of Json here be more accurate and simplify the code and interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try (I thought I tried to add JsonObject and it didn't work, but it might have been somewhere else).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah, typescript refuses it. I don't understand why ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried and pushed the result to your branch. Seems to work, or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I see you pushed a diff that makes JsonObject bubble up all the way to joinCollections. I don't know that this is a good idea.

The reason is that when I tried to use joinCollections on inputs, Typescript did not complain even when the input collections where not objects. That is a pretty subtle "gotcha" that I would like to avoid.

Comment on lines 80 to 96
class AddJoinSideField implements Mapper<Json, Json, Json, JoinObject> {
constructor(private joinSide: "left" | "right") {}

mapEntry(
key: Json,
values: NonEmptyIterator<Json>,
): Iterable<[Json, JoinObject]> {
const result: [Json, JoinObject][] = [];
for (const v of values) {
if (typeof v !== "object") {
throw new Error(
"joinCollection only works on objects, not: " + JSON.stringify(v),
);
}
result.push([key, { value: v, side: this.joinSide }]);
}
return result;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something or could this be simplified by extending OneToOneMapper instead of implementing Mapper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

);
}
const keys1 = Object.keys(v1);
if (keys1.some((key) => key in v2)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these objects expected to be small, or might they be dictionaries or arrays? Asking since this is quadratic and doesn't strictly need to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. That's a very good point. Let me revisit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the objects are indeed expected to be small ... still, no need to keep something O(n^2) for no reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annoyingly, Object.keys is not sorted, so they would have to be first sorted and then zipped together a la mergesort.

>(
col1: EagerCollection<K, V1>,
col2: EagerCollection<K, V2>,
allowMultipleValuesOnBothSides: boolean = false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expectation is that the fields of the values in the left and right value types are disjoint, right? Or is it expected that idiomatic use cases will have non-disjoint types, but only per-key will disjointness will hold?

My understanding is that the simplest case is where both the left and right argument collections map keys to unique values. On the other hand, if the left and right argument collections map keys to potentially multiple values, then the result will map each key to the merge of each pair in the cartesian product of the left values and right values for that key.

Maybe this is just me, but I don't see why the intermediate cases, where only one of the argument collections is multi-valued, are grouped with the simple case rather than the general one. Would it be a simpler interface if there were two booleans, one stating that the left collection is allowed to be multi-valued and the other stating that the right collection is allowed to be multi-valued?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is my experience with the SQL engine (which might be flawed). I have seen multiple cases where either side had more than one value, but never something that required a cartesian product that was actually useful. Let's review them, let's assume we have 2 large collections (large in the sense that N*M is too large to be computed).

Example, one / many: A table of users, a table of addresses, a user can have multiple addresses => maximum size of the able = number of address
Example, one / one: A table of users, a table of social security numbers => maximum size of the table = min(users, ssn)
Example, many / many: ??? Usually a contrived example that is only useful to explain the behavior of a join

I understand the cartesian product in the many/many case is the natural extension of what the join should do. But I have yet to see anything in practice using it. In fact, more often than not, it's a mistake, the query explodes and the user does not know why.

When you think about it, it makes sense. A cartesian product is only going to work for very small sets ...

So I would rather make it explicit. It's not banned. You just have to be explicit about it.

input1: UniqueEagerCollection<number, JsonObject>;
input2: EagerCollection<number, JsonObject>;
}): EagerCollection<number, JsonObject> {
return joinCollections(cs.input1, cs.input2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO to add tests for the 1_to_N, N_to_1, and N_to_N cases?

import {
Sum,
joinCollections,
type UniqueEagerCollection,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gone now

@jberdine
Copy link
Contributor

Summarizing offline discussion: It would be possible to give joinCollections the type

export function joinCollections<
  K extends Json,
  V1 extends JsonObject,
  V2 extends JsonObject,
>(
  col1: EagerCollection<K, V1>,
  col2: EagerCollection<K, V2>,
  allowMultipleValuesOnBothSides: boolean = false,
): EagerCollection<K, V1 & V2>

and propagate JsonObject down, eliminating the checks that each value is an object.

BUT, since typescript unsoundly treats method parameters as bivariant, calling joinCollections e.g. from createGraph where inputCollections: Record<string, EagerCollection<Json, Json>> will not detect mistakes where general Json values are provided where JsonObjects are required. So this PR's dynamic checks on each value are still needed. Thanks for nothing typescript.

Comment on lines +101 to +130
throw new Error(
"Objects don't have distinct fields: " +
JSON.stringify(v1) +
" " +
JSON.stringify(v2),
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question on the error behavior: join in relational algebra drops pairs that do not agree on common fields, while here it is an error for there to be common fields (other than the key). No strong position, but it isn't immediately obvious to me that this choice is the one users will expect, or be the most useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, I wonder about just dropping the pairs with common fields other than the key, rather than throwing, not taking pairs that happen to have matching values for common fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know people know relational algebra well enough to expect that. I suspect they don't.
I am happy to add an option (dropOnCommonFields for example), but making that the default is not a good idea IMO. Imagine that is not the behavior you expect, now imagine you end up inserting new kinds of objects in a collection with a field that collides with another collection by mistake. The debugging is going to be pretty brutal ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok makes sense, the debugging angle is compelling. And we can wait to see if anyone asks to drop instead of error.

}
}

export function joinCollections<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an implementation point of view, I understand why this function is here. But from the API point of view, I think it would be better if this was a join operation in the EagerCollection interface. Also, if/when we reimplement this in skip, putting it in helpers will not be the right place anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I wish there was a clean way of extending a class in another file. You know something like:

extension class EagerCollection {
  join(...): ...
}

pikatchu added a commit to pikatchu/skdb that referenced this pull request Nov 13, 2024
- AddJoinSideField now uses OneToOneMapper
- Check of distinct fields is now O(n*log(n))
- Added tests for one to many and many to one use case
}
}
if (countLeft > 1 && countRight > 1) {
if (!this.allowMultipleValuesOnBothSides) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could move up to before let countLeft... to avoid the counting loop in the allow case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the common case, so I think it's fine. If allowMultipleValuesOnBothSides is true, your set better be very small to begin with ...

skipruntime-ts/helpers/src/utils.ts Outdated Show resolved Hide resolved
Comment on lines 175 to 176
V1 extends Json,
V2 extends Json,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if TS doesn't check it, I think it would be a documentation improvement to use JsonObject for the value types here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Done.

pikatchu and others added 8 commits November 14, 2024 16:53
This diff implements the 'join' operation of two collections.
It is assumed that:
- both collection values are objects (keys can be of any type)
- there never is more than 1 object on boths sides at any time
  unless the option allowNN is passed (which should be rare).

If both those conditions are met, given a key, if the first collection
defines an object of type V1 and the second collection defines an
object V2 for the same key, the resulting collection will associate
{...V1, ...V2} for that key. No value is associated if either side is
missing.
- AddJoinSideField now uses OneToOneMapper
- Check of distinct fields is now O(n*log(n))
- Added tests for one to many and many to one use case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants