Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase sorting scalability via CytoTable metadata columns #204

Merged
merged 15 commits into from
Jun 12, 2024

Conversation

d33bs
Copy link
Member

@d33bs d33bs commented May 1, 2024

Description

This PR seeks to refine #175 by increasing the performance through generated CytoTable metadata columns which are primarily beneficial during large join operations. Anecdotally, I noticed that ORDER BY ALL memory consumption for joined tables becomes very high when working with a larger dataset. Before this change, large join operations attempt to sort by all columns included in the join. After this change, only CytoTable metadata columns are used for sorting, decreasing the amount of processing required to create deterministic datasets.

I hope to further refine this work through #193 and #176, which would I feel provide additional insights concerning performance and best practice recommendations. I can also see how these might be required to validate things here, but didn't want to hold review comments (as these also might further inform efforts within those issues).

Closes #175

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@d33bs d33bs changed the title Increase performance through sorting scalability via CytoTable metadata columns Increase sorting scalability via CytoTable metadata columns May 1, 2024
@d33bs d33bs marked this pull request as ready for review May 2, 2024 22:31
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big PR @d33bs - tough for me (with my limited expertise) to go through! @falquaddoomi can you give a look when you get a chance.

In general, I have one concern. If we end up deciding (or duckdb fixes this) to go back to the old way (where we had thought but didn't end up having potential issues with sorting), it seems this will be super difficult to disentangle. Or maybe I'm not thinking about this correctly? Could it be that this increasing scalability is independent of the previous solution which reduced speed?

@gwaybio
Copy link
Member

gwaybio commented Jun 10, 2024

(some additional context @falquaddoomi - we are needing to solve this for an upcoming project that will use cytotable heavily. Thanks!)

Copy link
Collaborator

@falquaddoomi falquaddoomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine this PR is related to our discussion about ordering by a few columns versus ORDER BY ALL? If so, I think it's an improvement on the previous behavior; kudos.

On @gwaybio's point, I think I'm lacking some context; I recall that not ordering the intermediate joins caused some kind of problem, but perhaps not? If it's the case that it can complete without ordering, perhaps it makes sense to make at least the ordering by the metadata columns an option that the user can disable at runtime. I saw that you rely on the metadata columns for things other than sorting, so I think it's fine to just always add them and just provide the option to disable the ordering.

FYI, the comments I left were just about possibly disabling ordering; otherwise, the PR looks good to me!

cytotable/convert.py Outdated Show resolved Hide resolved
cytotable/convert.py Outdated Show resolved Hide resolved
cytotable/convert.py Outdated Show resolved Hide resolved
@d33bs
Copy link
Member Author

d33bs commented Jun 11, 2024

Thanks @gwaybio and @falquaddoomi for the reviews! I like the idea of an optional setting for this sorting mechanism, with a possible backup method which doesn't leverage CytoTable metadata.

Generally, I still feel that sorting should be required to guarantee no data loss with LIMIT and OFFSET because this aligns with both DuckDB's docs and general SQL guidance. A hypothesis about what was allowing this to succeed in earlier work: DuckDB may have successfully retained all data with LIMIT and OFFSET queries through low system process and thread competition. The failing tests for LIMIT and OFFSET I believe nearly always dealt with multithreaded behavior in moto, meaning procedures may have been subject to system scheduler decisions about which tasks to delay vs execute (or perhaps there were system thread or memory leaks of some kind).

While we plan to remove moto as a dependency by addressing #198, it feels fuzzy yet to me whether these challenges are all the same. For example, it could be that moto triggered a coincidental mutation test with regard to DuckDB thread behavior (giving us further software visibility through a mutated test state). It could have also been a "perfect storm" through a bug in DuckDB >0.10.x,<1.0.0 combined with moto's behavior in tests. Then again, this could all just be my imagination, I'm not sure!

@d33bs
Copy link
Member Author

d33bs commented Jun 12, 2024

Note: Initially failing tests for 4ffe9c1 appeared to have something to do with a Poetry (and not CytoTable) dependency failure (maybe fixed through a deploy by the time of a 3rd re-run?). I don't think these are related to CytoTable code as they were at the layer of Poetry installations.

Errors were:
AttributeError: '_CountedFileLock' object has no attribute 'thread_safe' from virtualenv and filelock site-packages.

Update: appears related to tox-dev/filelock#337

@d33bs
Copy link
Member Author

d33bs commented Jun 12, 2024

Thanks again @gwaybio and @falquaddoomi ! I've added some updates which make sorting optional through the use of parameters called sort_output. These changes retain the ability to keep output sorted and also an option to avoid it altogether (reverting to earlier CytoTable behavior). I've kept the default to sort_output=True as I feel this is the safest option for the time being, but understand there may be reasons to avoid it based on the data or performance desired.

Copy link
Collaborator

@falquaddoomi falquaddoomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see that you added the sort option! I anticipate somewhere down the line we could use that option as a means to compare the performance and correctness of sorting vs. not sorting.

@d33bs
Copy link
Member Author

d33bs commented Jun 12, 2024

Cheers, thanks @falquaddoomi ! Agreed on comparisons; it will be interesting to see the contrast, excited to learn more!

@d33bs d33bs merged commit fac4f56 into cytomining:main Jun 12, 2024
11 checks passed
@d33bs d33bs deleted the sorting-scalability branch June 12, 2024 21:34
@d33bs d33bs added the release-patch Creates a patch release (e.g. `v0.0.1`) label Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-patch Creates a patch release (e.g. `v0.0.1`)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Explore value sorting determinism (and possible changes)
3 participants