-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #496: support for joinCollections #500
base: main
Are you sure you want to change the base?
Conversation
skipruntime-ts/helpers/src/utils.ts
Outdated
} | ||
|
||
class MergeJoinFields implements Mapper<JoinObject, JoinObject, Json, Json> { | ||
constructor(private allowNN: boolean) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can't figure out what "NN" stands for here -- if it's meant to be rare, shall we just expand that abbreviation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for N to N relations. But I renamed it to allowMultipleValuesOnBothSides.
skipruntime-ts/helpers/src/utils.ts
Outdated
implements Mapper<K, V, K, V> | ||
{ | ||
mapEntry(key: K, values: NonEmptyIterator<V>): Iterable<[K, V]> { | ||
return [[key, values.getUnique()]]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ever not either a no-op or an exception?
I'm not sure that the benefit of having a special UniqueEagerCollection
type is worth the extra API surface and copying/allocation (though I suppose that will be minimal here due to sharing).
I'd propose either (a) checking uniqueness during the actual join operation and dropping this stuff or (b) making UniqueEagerCollection a more full-fledged component of the API, with its own getter/mapper types that take advantage of the uniqueness. Probably slightly prefer (a) for simplicity's sake, but open to arguments / the possibility that I'm misunderstanding something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I removed it. The join doesn't need it to operate, so no change needed there.
// Joins | ||
/*****************************************************************************/ | ||
|
||
type JoinObject = { value: Json; side: "left" | "right" }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would JsonObject
instead of Json
here be more accurate and simplify the code and interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try (I thought I tried to add JsonObject and it didn't work, but it might have been somewhere else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So yeah, typescript refuses it. I don't understand why ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried and pushed the result to your branch. Seems to work, or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I see you pushed a diff that makes JsonObject bubble up all the way to joinCollections. I don't know that this is a good idea.
The reason is that when I tried to use joinCollections on inputs, Typescript did not complain even when the input collections where not objects. That is a pretty subtle "gotcha" that I would like to avoid.
skipruntime-ts/helpers/src/utils.ts
Outdated
class AddJoinSideField implements Mapper<Json, Json, Json, JoinObject> { | ||
constructor(private joinSide: "left" | "right") {} | ||
|
||
mapEntry( | ||
key: Json, | ||
values: NonEmptyIterator<Json>, | ||
): Iterable<[Json, JoinObject]> { | ||
const result: [Json, JoinObject][] = []; | ||
for (const v of values) { | ||
if (typeof v !== "object") { | ||
throw new Error( | ||
"joinCollection only works on objects, not: " + JSON.stringify(v), | ||
); | ||
} | ||
result.push([key, { value: v, side: this.joinSide }]); | ||
} | ||
return result; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing something or could this be simplified by extending OneToOneMapper instead of implementing Mapper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
skipruntime-ts/helpers/src/utils.ts
Outdated
); | ||
} | ||
const keys1 = Object.keys(v1); | ||
if (keys1.some((key) => key in v2)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these objects expected to be small, or might they be dictionaries or arrays? Asking since this is quadratic and doesn't strictly need to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. That's a very good point. Let me revisit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the objects are indeed expected to be small ... still, no need to keep something O(n^2) for no reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annoyingly, Object.keys is not sorted, so they would have to be first sorted and then zipped together a la mergesort.
>( | ||
col1: EagerCollection<K, V1>, | ||
col2: EagerCollection<K, V2>, | ||
allowMultipleValuesOnBothSides: boolean = false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expectation is that the fields of the values in the left and right value types are disjoint, right? Or is it expected that idiomatic use cases will have non-disjoint types, but only per-key will disjointness will hold?
My understanding is that the simplest case is where both the left and right argument collections map keys to unique values. On the other hand, if the left and right argument collections map keys to potentially multiple values, then the result will map each key to the merge of each pair in the cartesian product of the left values and right values for that key.
Maybe this is just me, but I don't see why the intermediate cases, where only one of the argument collections is multi-valued, are grouped with the simple case rather than the general one. Would it be a simpler interface if there were two booleans, one stating that the left collection is allowed to be multi-valued and the other stating that the right collection is allowed to be multi-valued?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is my experience with the SQL engine (which might be flawed). I have seen multiple cases where either side had more than one value, but never something that required a cartesian product that was actually useful. Let's review them, let's assume we have 2 large collections (large in the sense that N*M is too large to be computed).
Example, one / many: A table of users, a table of addresses, a user can have multiple addresses => maximum size of the able = number of address
Example, one / one: A table of users, a table of social security numbers => maximum size of the table = min(users, ssn)
Example, many / many: ??? Usually a contrived example that is only useful to explain the behavior of a join
I understand the cartesian product in the many/many case is the natural extension of what the join should do. But I have yet to see anything in practice using it. In fact, more often than not, it's a mistake, the query explodes and the user does not know why.
When you think about it, it makes sense. A cartesian product is only going to work for very small sets ...
So I would rather make it explicit. It's not banned. You just have to be explicit about it.
input1: UniqueEagerCollection<number, JsonObject>; | ||
input2: EagerCollection<number, JsonObject>; | ||
}): EagerCollection<number, JsonObject> { | ||
return joinCollections(cs.input1, cs.input2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a TODO to add tests for the 1_to_N, N_to_1, and N_to_N cases?
skipruntime-ts/tests/src/tests.ts
Outdated
import { | ||
Sum, | ||
joinCollections, | ||
type UniqueEagerCollection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gone now
Summarizing offline discussion: It would be possible to give
and propagate BUT, since typescript unsoundly treats method parameters as bivariant, calling |
throw new Error( | ||
"Objects don't have distinct fields: " + | ||
JSON.stringify(v1) + | ||
" " + | ||
JSON.stringify(v2), | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question on the error behavior: join in relational algebra drops pairs that do not agree on common fields, while here it is an error for there to be common fields (other than the key). No strong position, but it isn't immediately obvious to me that this choice is the one users will expect, or be the most useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, I wonder about just dropping the pairs with common fields other than the key, rather than throwing, not taking pairs that happen to have matching values for common fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know people know relational algebra well enough to expect that. I suspect they don't.
I am happy to add an option (dropOnCommonFields for example), but making that the default is not a good idea IMO. Imagine that is not the behavior you expect, now imagine you end up inserting new kinds of objects in a collection with a field that collides with another collection by mistake. The debugging is going to be pretty brutal ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok makes sense, the debugging angle is compelling. And we can wait to see if anyone asks to drop instead of error.
} | ||
} | ||
|
||
export function joinCollections< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From an implementation point of view, I understand why this function is here. But from the API point of view, I think it would be better if this was a join
operation in the EagerCollection
interface. Also, if/when we reimplement this in skip, putting it in helpers will not be the right place anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I wish there was a clean way of extending a class in another file. You know something like:
extension class EagerCollection {
join(...): ...
}
- AddJoinSideField now uses OneToOneMapper - Check of distinct fields is now O(n*log(n)) - Added tests for one to many and many to one use case
} | ||
} | ||
if (countLeft > 1 && countRight > 1) { | ||
if (!this.allowMultipleValuesOnBothSides) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could move up to before let countLeft...
to avoid the counting loop in the allow case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not the common case, so I think it's fine. If allowMultipleValuesOnBothSides is true, your set better be very small to begin with ...
skipruntime-ts/helpers/src/utils.ts
Outdated
V1 extends Json, | ||
V2 extends Json, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if TS doesn't check it, I think it would be a documentation improvement to use JsonObject for the value types here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Done.
This diff implements the 'join' operation of two collections. It is assumed that: - both collection values are objects (keys can be of any type) - there never is more than 1 object on boths sides at any time unless the option allowNN is passed (which should be rare). If both those conditions are met, given a key, if the first collection defines an object of type V1 and the second collection defines an object V2 for the same key, the resulting collection will associate {...V1, ...V2} for that key. No value is associated if either side is missing.
- AddJoinSideField now uses OneToOneMapper - Check of distinct fields is now O(n*log(n)) - Added tests for one to many and many to one use case
Co-authored-by: Josh Berdine <[email protected]>
This diff implements the 'join' operation of two collections. It is assumed that:
If both those conditions are met, given a key, if the first collection defines an object of type V1 and the second collection defines an object V2 for the same key, the resulting collection will associate {...V1, ...V2} for that key. No value is associated if either side is missing.