Explore value sorting determinism (and possible changes) #175

Closed
d33bs opened this issue Mar 22, 2024 · 1 comment · Fixed by #204

Comments

@d33bs
Member

d33bs commented Mar 22, 2024

From #174:

I found that duckdb==0.10.1 sorts the values of joined data results less deterministically than before. We have previously relied on PyArrow to sort all columns and their values consistently for testing, and this may also need adjustment. As a result, I've added SQL ORDER BY ALL (which orders all values in all columns from left to right) to one test which consistently failed its comparisons. As a to-do, we should explore why PyArrow is unable to perform the same level of data organization for testing, and possibly raise an issue with that project if the results don't align with its design. Alternatively, moving to ORDER BY ALL or an equivalent may be necessary to replace the existing PyArrow-based data sorting for testing (otherwise we may see sporadic failures over time).

Because of the importance of this issue, we also need example cases where the fix has been validated with datasets larger than those used for testing.
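To make the "order all values in all columns from left to right" idea concrete, here is a minimal sketch. It uses the standard-library `sqlite3` module as a stand-in so that it runs without extra dependencies; SQLite has no `ORDER BY ALL`, so the helper emulates it with column ordinals (`ORDER BY 1, 2, ...`). The helper name `fetch_ordered_by_all` and the `image` / `cells` tables are hypothetical and are not taken from CytoTable.

```python
import sqlite3


def fetch_ordered_by_all(conn, query):
    """Run `query` and return its rows ordered by every column, left to right.

    Hypothetical helper: emulates DuckDB's ORDER BY ALL using column ordinals.
    """
    # LIMIT 0 returns no rows but still populates cursor.description,
    # which tells us how many columns the query produces.
    cursor = conn.execute(f"SELECT * FROM ({query}) LIMIT 0")
    n_cols = len(cursor.description)
    # SQLite treats integer constants in ORDER BY as result-column positions.
    ordinals = ", ".join(str(i) for i in range(1, n_cols + 1))
    return conn.execute(
        f"SELECT * FROM ({query}) ORDER BY {ordinals}"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE image (image_id INTEGER, plate TEXT);
    CREATE TABLE cells (image_id INTEGER, cell_id INTEGER, area REAL);
    INSERT INTO image VALUES (2, 'B'), (1, 'A');
    INSERT INTO cells VALUES (2, 1, 10.5), (1, 2, 7.0), (1, 1, 3.25);
    """
)

columns = "image.image_id, image.plate, cells.cell_id, cells.area"
join_sql = (
    f"SELECT {columns} FROM image "
    "JOIN cells ON image.image_id = cells.image_id"
)
# Same join written with the tables swapped; the engine may emit rows
# in a different order, but ordering by all columns removes that freedom.
alt_sql = (
    f"SELECT {columns} FROM cells "
    "JOIN image ON image.image_id = cells.image_id"
)

rows = fetch_ordered_by_all(conn, join_sql)
rows_alt = fetch_ordered_by_all(conn, alt_sql)
```

Because every column takes part in the ordering, two results with the same rows compare equal regardless of the order in which the join produced them. Rows that are exact duplicates are interchangeable, so they do not affect the comparison either.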

@d33bs d33bs added the bug Something isn't working label Mar 22, 2024
@d33bs d33bs changed the title from "Explore and make changes related to PyArrow table value sorting expectations" to "Explore PyArrow table value sorting expectations (and possible changes)" Mar 22, 2024
@d33bs
Member Author

d33bs commented Apr 1, 2024

After thinking about this more and exploring a bit, I'm outlining my thoughts and findings so far below. This appeared again in #181, so I'm focusing on working out why it might occur.

Patterns

When this happens, the following appear to be consistent patterns:

  • Python version: any (despite initially appearing after the python==3.12 work in #174, "Enable Python 3.12 compatibility through dependency-based updates")
  • DuckDB version: duckdb==0.10.1
  • PyArrow version: pyarrow==15.0.2
  • Input data: occurs with SQLite data from the NF1 project used for testing purposes (link)
  • Parsl config: occurs with both HTE (multiprocess) and TPE (multithreaded) Parsl configurations.
  • Pytest tests: appears only in newly added or modified tests (existing tests don't seem to exhibit the same behavior).
  • Run-to-run variation: the behavior is different each time; the values appear to be slightly different, and in different amounts, on each test run.
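Since the differences vary from run to run, it may help to separate "same rows in a different order" from "genuinely different rows" when comparing two runs. The following is a minimal sketch under that assumption; the function name `classify_difference` and the sample rows are hypothetical, and results are assumed to be available as lists of tuples.

```python
from collections import Counter


def classify_difference(rows_a, rows_b):
    """Describe how two result sets (lists of row tuples) differ."""
    if rows_a == rows_b:
        return "identical"
    # Counter compares the rows as multisets: order is ignored,
    # but the number of times each row appears still matters.
    if Counter(rows_a) == Counter(rows_b):
        return "same rows, different order"
    return "different content"


# Illustrative data only, not taken from the NF1 dataset.
run_1 = [(1, "A", 3.25), (2, "B", 10.5)]
run_2 = [(2, "B", 10.5), (1, "A", 3.25)]
run_3 = [(1, "A", 3.25), (2, "B", 10.5), (2, "B", 10.5)]

same = classify_difference(run_1, run_1)
reordered = classify_difference(run_1, run_2)
changed = classify_difference(run_1, run_3)
```

A result of "same rows, different order" would point towards sorting as the culprit, while "different content" (for example, extra or missing rows) would point towards the read or join step instead.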

Possible explanations

As a quick check, I tried verifying that PyArrow sorting works the way it should. It does appear to sort all values by all columns properly when implemented the way it is in the CytoTable tests. See here for code demonstrating this.

I feel there are several other possible explanations, which I'll work through in turn to verify what's happening.

  • DuckDB is inconsistently reading Parquet datasets (lists of files treated as a single table). (verified not the cause here)
  • DuckDB is inconsistently joining multiple Parquet datasets (perhaps an extension of the above or distinct from it) (verified not the cause here)
  • DuckDB threads are somehow causing challenges when used within multiple Parsl threads or processes on the same machine (perhaps overlapping one another or operating in a way that is unsafe). (verified not the cause here)
  • Parsl threads or processes are causing challenges with the completion of tasks associated with DuckDB or PyArrow work.
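As an illustration of the last possibility, the sketch below shows how the order in which concurrent tasks complete can differ from the order in which they were submitted. It uses the standard library's `concurrent.futures` rather than Parsl, and `process_chunk` is a hypothetical task, so it only demonstrates the general effect and is not a reproduction of the CytoTable behavior.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chunk(chunk_id):
    """Hypothetical task: return a few rows tagged with the chunk id."""
    # Later chunks sleep less, imitating uneven task durations.
    time.sleep(0.01 * (3 - chunk_id))
    return [(chunk_id, value) for value in range(2)]


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_chunk, i) for i in range(4)]
    # Rows gathered as tasks finish: the order depends on timing.
    completion_order = [
        row for future in as_completed(futures) for row in future.result()
    ]
    # Rows gathered in submission order: the order is fixed.
    submission_order = [row for future in futures for row in future.result()]

expected = [(i, value) for i in range(4) for value in range(2)]
```

Collecting results in submission order, or sorting the combined rows afterwards, gives a stable result even when individual tasks finish at different times.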

@d33bs d33bs self-assigned this Apr 1, 2024
@d33bs d33bs changed the title from "Explore PyArrow table value sorting expectations (and possible changes)" to "Explore value sorting determinism (and possible changes)" Apr 2, 2024
d33bs added a commit to d33bs/CytoTable that referenced this issue Apr 27, 2024
d33bs added a commit that referenced this issue Jun 12, 2024
* customize sorting capabilities

for further performance in #175

* simplify sql; exclude cytotable meta

* exclude duplicate columns

* updating tests

* fixing tests

* simulate csv source by removing meta

* update preset sql to use refined syntax

* address mixed type queries and tests

* simplify and further clarity in test

* correcting comment

* make sorting optional

* fix existing tests

* further sorting options applied

* add a test for unsorted output