Support large, batch importing of relationships #94

Closed
josephschorr opened this issue Mar 3, 2022 · 0 comments · Fixed by #120
Labels
area/CLI (Affects the command line), priority/3 low (This would be nice to have)

Comments

josephschorr (Member) commented Mar 3, 2022

zed import currently constructs a single WriteRelationships call, whose size limit is exceeded after ~5000 relationships. We should add support for batching the relationships to be imported into chunks, and for executing those chunks in parallel.

As mentioned in Discord (https://discord.com/channels/844600078504951838/844600078948630559/949071336574181457), there are a number of issues to address:

  1. The gRPC server has a limit on message sizes. This can easily be solved by batching the requests.
  2. Batching is tricky: serially executing each batch works to an extent, but it's not very scalable. This could be improved by executing each batch in a goroutine (see the sketches below). Of course, this comes with some risk, as there's roughly a 9k batch size limit (see 3). If a zed import is trying to import 6 million+ rows, that's about 650 connections each trying to shove through 9k tuples.
  3. Postgres and MySQL both have a limit on how many placeholders you can have in a single query. This appears to be 65535 for both Postgres and MySQL (https://stackoverflow.com/a/49379324, https://stackoverflow.com/a/24447922), which roughly translates to a maximum of 9362 relationship tuple writes (assuming each write requires 7 placeholders). I didn't initially hit this limitation because I was testing with an in-memory database.
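
A minimal sketch of the chunking step, assuming a batch size derived from the placeholder limit above (65535 / 7 ≈ 9362, rounded down for headroom). The package name, constant, and helper here are illustrative, not existing zed code:

```go
package importer

// maxBatchSize keeps each WriteRelationships call under the datastore's
// placeholder limit: 65535 placeholders / 7 per relationship ≈ 9362 rows,
// rounded down to leave headroom. The exact value is an assumption.
const maxBatchSize = 9000

// chunkRelationships splits the parsed updates into batches of at most
// batchSize elements; the final batch holds whatever remains.
func chunkRelationships[T any](updates []T, batchSize int) [][]T {
	var batches [][]T
	for start := 0; start < len(updates); start += batchSize {
		end := start + batchSize
		if end > len(updates) {
			end = len(updates)
		}
		batches = append(batches, updates[start:end])
	}
	return batches
}
```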

We'll need to set a reasonable limit on the number of parallelized write requests, display a progress bar, etc.
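
A rough sketch of bounded parallel writes with a coarse progress counter, using errgroup's SetLimit. The relationship type, writeBatchFunc callback, and importInParallel function are placeholders for whatever zed actually wraps around the WriteRelationships RPC, not its real API:

```go
package importer

import (
	"context"
	"fmt"
	"sync/atomic"

	"golang.org/x/sync/errgroup"
)

// relationship and writeBatchFunc are placeholders: the real import code
// would use the parsed relationship update type and a call that issues one
// WriteRelationships request per batch.
type relationship struct{}

type writeBatchFunc func(ctx context.Context, batch []relationship) error

// importInParallel writes batches concurrently, capping the number of
// in-flight requests at maxParallel, and prints coarse progress as each
// batch completes. Any batch error cancels the remaining work.
func importInParallel(ctx context.Context, batches [][]relationship, maxParallel int, writeBatch writeBatchFunc) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxParallel) // e.g. 16, instead of one unbounded goroutine per batch

	var completed int64
	for _, batch := range batches {
		batch := batch // capture the per-iteration value for the goroutine
		g.Go(func() error {
			if err := writeBatch(ctx, batch); err != nil {
				return err
			}
			fmt.Printf("\rwrote %d/%d batches", atomic.AddInt64(&completed, 1), len(batches))
			return nil
		})
	}
	return g.Wait()
}
```

With the earlier sketch, something like importInParallel(ctx, chunkRelationships(updates, maxBatchSize), 16, writeBatch) would keep roughly 16 requests in flight rather than the ~650 unbounded connections described above.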

@josephschorr added the area/CLI (Affects the command line) and priority/3 low (This would be nice to have) labels on Mar 3, 2022
@jzelinskie linked a pull request on Mar 10, 2022 that will close this issue