Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add order to limit and offset queries for deterministic results #182

Merged
merged 2 commits into from
Apr 18, 2024

Conversation

d33bs
Copy link
Member

@d33bs d33bs commented Apr 2, 2024

Description

This PR addresses #175 by adding ORDER BY ALL to SQL queries which are used to help parallelize data work through LIMIT and OFFSET. This comes as the result of various testing surrounding PyArrow, DuckDB, and Parsl and reading + experimentation on what happens during DuckDB queries which leverage LIMIT, OFFSET, and lists of Parquet files treated as a single dataset. DuckDB provides documentation supporting this change for determinism: "Note that while LIMIT can be used without an ORDER BY clause, the results might not be deterministic without the ORDER BY clause.". (link). The same stands for SQLite-based queries (link).

These additions required a small changes due to the new sorting order of results.

References #175
May influence work towards #85
Should be prioritized before and will influence further changes in #181

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

Copy link
Collaborator

@falquaddoomi falquaddoomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable change to me in terms of ensuring correctness, since as you observed LIMIT..OFFSET queries require a fixed ordering. From what I've read, ORDER BY ALL in DuckDB orders by all columns from first to last, which is the safest option since you'll always have a canonical ordering, but it could impact performance if the columns weren't indexed. I don't know much about DuckDB indexing, but from a cursory reading it seems the engine creates indices on all columns by default, so this may not be an issue.

In any case, I'd do some light checking on if or how much it would impact performance, but I don't see any reason not to approve this PR.

@d33bs d33bs requested a review from gwaybio April 17, 2024 21:55
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best served by a different PR, but I'm wondering if it is worth adding a test that ensures, in a large enough dataset, that there are no duplicate rows and all expected rows are present.

@d33bs d33bs added release-patch Creates a patch release (e.g. `v0.0.1`) and removed release-patch Creates a patch release (e.g. `v0.0.1`) labels Apr 18, 2024
@d33bs
Copy link
Member Author

d33bs commented Apr 18, 2024

Thank you @gwaybio and @falquaddoomi for your reviews! I've added #193 for future-facing large dataset testing focus.

In any case, I'd do some light checking on if or how much it would impact performance, but I don't see any reason not to approve this PR.

I'm uncertain of the impacts this will have towards performance but found this article covering a bit how sorting operates within DuckDB (I'm unsure if this extends into the Parquet-focused readers, as these may operate differently based on how Parquet is stored). I plan to follow up within #175 with additional external large data tests, including source, to monitor the impact of how sorting might impact things.

Merging this in to help address next steps.

@d33bs d33bs merged commit 496ff36 into cytomining:main Apr 18, 2024
11 checks passed
@d33bs d33bs deleted the sorting-out-indeterministic-results branch April 18, 2024 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants