Increase sorting scalability via CytoTable metadata columns #204

d33bs · 2024-05-01T16:37:50Z

Description

This PR refines #175 by improving performance through generated CytoTable metadata columns, which are primarily beneficial during large join operations. Anecdotally, I noticed that the memory consumption of ORDER BY ALL on joined tables becomes very high when working with larger datasets. Before this change, large join operations attempted to sort by all columns included in the join. After this change, only CytoTable metadata columns are used for sorting, which decreases the amount of processing required to create deterministic datasets.
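As a rough illustration of the idea (not CytoTable's actual implementation), the sketch below sorts a joined result by a couple of small row-number columns instead of by every column in the join. It uses Python's built-in sqlite3 as a stand-in for DuckDB, and the `meta_image_rownum` and `meta_cells_rownum` column names are hypothetical placeholders rather than the metadata column names this PR generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE image (meta_image_rownum INTEGER, ImageNumber INTEGER, Intensity REAL);
    CREATE TABLE cells (meta_cells_rownum INTEGER, ImageNumber INTEGER, Area REAL);
    -- rows are inserted out of order on purpose
    INSERT INTO image VALUES (2, 2, 0.7), (1, 1, 0.4);
    INSERT INTO cells VALUES (3, 2, 12.0), (1, 1, 10.0), (2, 1, 11.0);
    """
)

# Sort the joined result by the (hypothetical) metadata columns only,
# rather than by every column that takes part in the join.
joined_sorted = con.execute(
    """
    SELECT image.ImageNumber, image.Intensity, cells.Area
    FROM image
    JOIN cells ON cells.ImageNumber = image.ImageNumber
    ORDER BY image.meta_image_rownum, cells.meta_cells_rownum
    """
).fetchall()

for row in joined_sorted:
    print(row)
```

Because each metadata value identifies a single source row, the pair of sort keys is enough to make the joined output deterministic without comparing the measurement columns at all.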

I hope to further refine this work through #193 and #176, which I feel would provide additional insights concerning performance and best-practice recommendations. I can also see how these might be required to validate things here, but I didn't want to hold up review comments (as these might also further inform efforts within those issues).

Closes #175

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

gwaybio

This is a big PR @d33bs - tough for me (with my limited expertise) to go through! @falquaddoomi can you take a look when you get a chance?

In general, I have one concern. If we end up deciding to go back to the old way (or DuckDB fixes this), where we thought we might have sorting issues but didn't end up having them, it seems this will be super difficult to disentangle. Or maybe I'm not thinking about this correctly? Could it be that this scalability increase is independent of the previous solution, which reduced speed?

gwaybio · 2024-06-10T21:35:49Z

(some additional context @falquaddoomi - we need to solve this for an upcoming project that will use CytoTable heavily. Thanks!)

falquaddoomi

I imagine this PR is related to our discussion about ordering by a few columns versus ORDER BY ALL? If so, I think it's an improvement on the previous behavior; kudos.

On @gwaybio's point, I think I'm lacking some context; I recall that not ordering the intermediate joins caused some kind of problem, but perhaps not? If it's the case that it can complete without ordering, perhaps it makes sense to make at least the ordering by the metadata columns an option that the user can disable at runtime. I saw that you rely on the metadata columns for things other than sorting, so I think it's fine to just always add them and just provide the option to disable the ordering.

FYI, the comments I left were just about possibly disabling ordering; otherwise, the PR looks good to me!

cytotable/convert.py

d33bs · 2024-06-11T20:00:30Z

Thanks @gwaybio and @falquaddoomi for the reviews! I like the idea of an optional setting for this sorting mechanism, with a possible backup method which doesn't leverage CytoTable metadata.

Generally, I still feel that sorting should be required to guarantee no data loss with LIMIT and OFFSET because this aligns with both DuckDB's docs and general SQL guidance. A hypothesis about what allowed this to succeed in earlier work: DuckDB may have retained all data with LIMIT and OFFSET queries because there was little competition among system processes and threads. I believe the failing tests for LIMIT and OFFSET nearly always involved multithreaded behavior in moto, meaning procedures may have been subject to system scheduler decisions about which tasks to delay vs. execute (or perhaps there were system thread or memory leaks of some kind).
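To make the LIMIT and OFFSET point concrete, here is a minimal sketch of chunked reads that pin the row order with ORDER BY, so every row lands in exactly one chunk. It uses Python's built-in sqlite3 rather than DuckDB, and the `read_in_chunks` helper, the `cells` table, and the `meta_rownum` column are all made up for illustration.

```python
import sqlite3


def read_in_chunks(con, table, order_columns, chunk_size):
    """Yield rows from `table` in LIMIT/OFFSET chunks under a fixed ORDER BY."""
    order_clause = ", ".join(order_columns)
    offset = 0
    while True:
        rows = con.execute(
            f"SELECT * FROM {table} ORDER BY {order_clause} LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk_size


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cells (meta_rownum INTEGER, Area REAL)")
# Insert rows in reverse order so the stored order differs from the sort order.
con.executemany(
    "INSERT INTO cells VALUES (?, ?)",
    [(i, float(i) * 1.5) for i in range(10, 0, -1)],
)

chunks = list(read_in_chunks(con, "cells", ["meta_rownum"], chunk_size=4))
all_rows = [row for chunk in chunks for row in chunk]
print(len(chunks), len(all_rows))
```

Without the ORDER BY, SQL makes no promise that successive LIMIT and OFFSET windows are cut from the same row ordering, which is how rows could in principle be skipped or repeated across chunks.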

While we plan to remove moto as a dependency by addressing #198, it still feels unclear to me whether these challenges are all the same. For example, it could be that moto triggered a coincidental mutation test with regard to DuckDB thread behavior (giving us further software visibility through a mutated test state). It could also have been a "perfect storm" of a bug in DuckDB >0.10.x,<1.0.0 combined with moto's behavior in tests. Then again, this could all just be my imagination, I'm not sure!

d33bs · 2024-06-12T18:54:28Z

Note: The initially failing tests for 4ffe9c1 appeared to stem from a Poetry (and not CytoTable) dependency failure (maybe fixed through a deploy by the time of the third re-run?). I don't think these are related to CytoTable code, as they occurred at the layer of Poetry installations.

Errors were:
AttributeError: '_CountedFileLock' object has no attribute 'thread_safe' from virtualenv and filelock site-packages.

Update: appears related to tox-dev/filelock#337

d33bs · 2024-06-12T20:07:10Z

Thanks again @gwaybio and @falquaddoomi! I've added some updates which make sorting optional through the use of a sort_output parameter. These changes retain the ability to keep output sorted and also add an option to avoid sorting altogether (reverting to earlier CytoTable behavior). I've kept the default as sort_output=True because I feel this is the safest option for the time being, but I understand there may be reasons to avoid it based on the data or performance desired.
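A minimal sketch of how such a toggle might shape a chunked query is below. Only the `sort_output` name and its `True` default come from this PR; the `build_chunk_query` helper, its other parameters, and the `meta_rownum` column are hypothetical and not taken from CytoTable's code.

```python
from typing import Optional, Sequence


def build_chunk_query(
    table: str,
    limit: int,
    offset: int,
    sort_output: bool = True,
    sort_columns: Optional[Sequence[str]] = None,
) -> str:
    """Build a chunked SELECT, adding ORDER BY only when sort_output is True."""
    query = f"SELECT * FROM {table}"
    if sort_output:
        if not sort_columns:
            raise ValueError("sort_columns are required when sort_output=True")
        query += " ORDER BY " + ", ".join(sort_columns)
    return f"{query} LIMIT {limit} OFFSET {offset}"


# Default: sorted output using the (hypothetical) metadata column.
sorted_query = build_chunk_query("cells", 1000, 0, sort_columns=["meta_rownum"])
# Opt out of sorting entirely.
unsorted_query = build_chunk_query("cells", 1000, 0, sort_output=False)
print(sorted_query)
print(unsorted_query)
```

Keeping the sorted path as the default means a caller has to opt out explicitly before giving up deterministic chunk boundaries.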

falquaddoomi

Nice to see that you added the sort option! I anticipate somewhere down the line we could use that option as a means to compare the performance and correctness of sorting vs. not sorting.

d33bs · 2024-06-12T21:34:22Z

Cheers, thanks @falquaddoomi ! Agreed on comparisons; it will be interesting to see the contrast, excited to learn more!

d33bs added 11 commits April 27, 2024 14:56

customize sorting capabilities

7af1606

for further performance in cytomining#175

simplify sql; exclude cytotable meta

482efd7

exclude duplicate columns

7742b13

updating tests

9b39c4d

fixing tests

add4050

simulate csv source by removing meta

8076723

update preset sql to use refined syntax

a9f3199

address mixed type queries and tests

9b04d01

Merge remote-tracking branch 'upstream/main' into sorting-scalability

84cd79e

simplify and further clarity in test

4908f6e

correcting comment

49a9d14

d33bs changed the title ~~Increase performance through sorting scalability via CytoTable metadata columns~~ Increase sorting scalability via CytoTable metadata columns May 1, 2024

d33bs requested review from falquaddoomi, gwaybio and kenibrewer May 2, 2024 22:31

d33bs marked this pull request as ready for review May 2, 2024 22:31

gwaybio reviewed Jun 10, 2024

falquaddoomi reviewed Jun 11, 2024

Three review comments on cytotable/convert.py (outdated, resolved)

d33bs added 4 commits June 12, 2024 10:40

make sorting optional

88e7fe6

fix existing tests

5def23d

further sorting options applied

8c40263

add a test for unsorted output

4ffe9c1

d33bs requested review from falquaddoomi and gwaybio June 12, 2024 20:07

falquaddoomi approved these changes Jun 12, 2024

d33bs merged commit fac4f56 into cytomining:main Jun 12, 2024
11 checks passed

d33bs deleted the sorting-scalability branch June 12, 2024 21:34

d33bs added the release-patch label (creates a patch release, e.g. `v0.0.1`) Jun 12, 2024
