-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/4196 speed up search bundles #4273
Conversation
…4196-speed-up-search-bundles
…ix/4196-speed-up-search-bundles
Do you have any before/after timings that you can share? |
Sure thing! Examples of speedup (run on prod): Command: cl search .shared data_size=.sum Original query: Updated query: Command: cl search .after=2021-08-20 .count Original query: Updated query: |
@@ -0,0 +1,9 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need docker_config files? Or should we remove them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removed them. Sorry about that!
@@ -0,0 +1,26 @@ | |||
"""Remove is_superuser from User table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not part of speeding up search bundles, right? Ideally, would split the superuser thing into a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shoot. It is in a different PR that I submitted Thursday (still hasn't been merged b/c needs review). I must have branched off from my local branch I used for that PR when I created the branch for this PR. Sorry about that! Perhaps I should revert those changes in this branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can get that PR in first, since it's pretty straightforward, and then you don't have to undo anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll ping the chat for a review. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easiest to revert those changes in this branch, so it doesn't have to depend on that PR. Up to you though.
codalab/model/bundle_model.py
Outdated
shortcuts = {'type': 'bundle_type', 'size': 'data_size', 'worksheet': 'host_worksheet'} | ||
|
||
join_tables = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add type annotations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for the two arrays I'm using (joins and where_clause_conjuncts).
When I try to add type annotations for some of the other variables (like limit), I get a strange error from mypy:
Deferral trace:
codalab.model.bundle_model:409
codalab.model.bundle_model:412
codalab.model.bundle_model:414
codalab.model.bundle_model:417
codalab.model.bundle_model:419
codalab.model.bundle_model:433
codalab.model.bundle_model:438
codalab.model.bundle_model:438
codalab.model.bundle_model:438
codalab/model/bundle_model.py: error: INTERNAL ERROR: maximum semantic analysis iteration count reached
So, I've left those un-typed for now, but I'll look into this.
cl_group_bundle_permission.c.permission >= GROUP_OBJECT_PERMISSION_READ, | ||
) | ||
) | ||
add_join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember there were some subtleties with joins when the joined tables (e.g., permissions) were empty, whereas for subqueries, it was much more straightforward to reason about, so would be good to double check the edge cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. So, the queries I converted to inner joins above are all OK in this regard; I verified each of them individually. Basically, the subqueries (aside from the one you highlighted in a comment above and one other which I didn’t convert) are used as conjuncts in the final WHERE clause, aren’t negated, and require the existence of something in the table in the subquery to be true. In other words, each of those subqueries is automatically false if the table is empty anyway.
This is not true for the two other queries I didn't convert earlier (the one used for .floating and one used when the querying user is not root). For those, an inner join would not work for the reason you mentioned. For instance, for the negated query, it checks that the bundle_uuid is NOT in the Worksheet_Item table, and so using an INNER JOIN would be incorrect since the inner join would be empty even though the result we really want is for every bundle_uuid to be in the result since none belong to a worksheet (since worksheet_item is empty). For this query (and for the other I mentioned), we remedy this by using a LEFT OUTER JOIN instead of an INNER JOIN.
I verified that my changes work by adding test cases. I added some for .floating, and I added some cases for cl search queries used by the non-root user. I verified these work and that, as expected, they all fail when I switch the LEFT OUTER JOIN to an INNER JOIN.
Let me know if I didn't address your comment sufficiently and/or if I missed something. Thanks!
codalab/model/bundle_model.py
Outdated
cl_bundle.c.uuid == aliased_bundle_metadata.c.bundle_uuid, | ||
) | ||
|
||
where_clause = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more accurate, I'd call this where_clause_disjuncts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! (I think they're actually conjuncts in this case since we use and_ to connect them in line 623, but please correct me if I'm wrong!)
codalab/model/bundle_model.py
Outdated
) | ||
clause = or_(*clause) | ||
|
||
where_clause = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like you can just write or_(cl_bundle..., ..., ...)
directly instead of creating an array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
codalab/model/bundle_model.py
Outdated
if where_clause is not None: | ||
where_clauses.append(where_clause) | ||
|
||
where_clause = and_(*where_clauses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be confusing to reuse where_clause
for both the top-level clause and the conjuncts. One might consider creating helper function(s) to break up this big function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
codalab/model/bundle_model.py
Outdated
or_( # Join constraint (group) | ||
cl_group_bundle_permission.c.group_uuid | ||
== self.public_group_uuid, # Public group | ||
cl_group_bundle_permission.c.group_uuid.in_( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't replace this subquery with a join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. But see the comment above about INNER JOIN vs LEFT OUTER JOIN. I left this and the other subquery in the .floating case because they were harder to convert and subsequently forgot to convert them. Sorry about that!
codalab/model/bundle_model.py
Outdated
where_clause = and_(where_clause, or_(access_via_owner, access_via_group)) | ||
|
||
table = cl_bundle | ||
for join_table, join_condition in zip(join_tables, join_conditions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using parallel arrays, which might be error prone, consider making a new dataclass
with two fields (the table and the join).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high-level, logic looks good, and it's great that we get such dramatic speedups! I'd just be careful about the edge cases, and we should probably add some test cases for this. Consider also splitting up the function into more modular pieces (something we've wanted to do for a while) if you think it makes sense.
…d clean up variable names
…ating to be joins as well. Add tests for both cases.
Thanks so much for the thorough review! I added more test cases to address the edge cases you mentioned (see my comment about INNER JOIN and LEFT OUTER JOIN above). In terms of splitting the function, I do think that makes sense and gave it some thought throughout the day. However, I'm still not sure how to do it. I'll keep thinking, but in the meantime I addressed your other comments. Thanks! |
Amazing!! |
codalab/model/bundle_model.py
Outdated
subquery_index = [0] | ||
|
||
@dataclass | ||
class JoinTable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this class declaration outside the search_bundles
function? Eventually (in a later PR probably!) we'll need to do the same thing as this PR for the search_worksheets and will end up using JoinTable
for that, too (#4277).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On that point, it might be worth thinking about whether it makes sense to abstract this code out into something more general, so that it could be implemented in search_worksheets
(without having to copy and paste everything again). But not necessary for this PR, but if you do think of a good way to do it that works well in this PR, do add it in!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I have a potential idea for abstracting the logic, but I think it might be better to do it in a separate PR. The issue is that the there aren't many shared keys between search_worksheets and search_bundles (aside from the "special" keywords at the top of the function, which can be abstracted quite easily) and even some shared keys are dealt with a little differently (e.g. the key ''), which makes it hard to abstract.
codalab/model/bundle_model.py
Outdated
table: Table | ||
condition: Any | ||
left_outer_join: bool = False | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! However, I think it's a bit difficult to understand how the code is working by just reading through it. Can you add a few sentences at the beginning of the method that describes the general approach this code is taking (and the role played by where_class_conjuncts, add_join, etc.)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
check_equals('', _run_command([cl, 'search', '.shared'])) | ||
# Check search with non-root user. | ||
if not os.getenv('CODALAB_PROTECTED_MODE'): | ||
# This test does not work when protected_mode is True. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running cl search with non-root user yields a 401 Unauthorized when in protected mode. I could solve this also by using 'cl uedit --grant-access' with the root user before switching to the non-root user instead, though. Perhaps that's better.
…/codalab-worksheets into fix/4196-speed-up-search-bundles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Reasons for making this change
The search_bundles function was slow primarily because it relied on nested subqueries. We updated the function to use inner joins instead, which should be much faster.
Related issues
Fixes #4196. Also addresses #3787 (see example in comment below for that function).