Benchmark catalog v2 for single and multi stream #1085

Merged
cjdsellers merged 4 commits into develop from catalog_benchmark on May 8, 2023

Conversation

twitu
Collaborator

@twitu twitu commented Apr 25, 2023

Pull Request

Adds benchmarks for single and multi stream loading. The benchmarks measure how long the catalog takes to load a single stream of data and multiple streams of data. This is a good stress test for the stream-merging, DataFusion-based catalog.
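
For readers unfamiliar with stream merging, here is a minimal sketch of a k-way merge by timestamp using Python's `heapq.merge`. It is only an illustration of the idea, not the catalog's DataFusion-based implementation; the `Tick` type, its fields, and `merge_streams` are hypothetical names introduced for this example.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Tick:
    # Hypothetical record type; the real catalog's data types differ.
    ts: int          # event timestamp
    instrument: str
    price: float


def merge_streams(streams: Iterable[Iterable[Tick]]) -> Iterator[Tick]:
    """Lazily merge several streams, each already sorted by ts, into one."""
    # heapq.merge keeps one pending item per stream in a heap, so memory
    # use grows with the number of streams, not the number of ticks.
    return heapq.merge(*streams, key=lambda tick: tick.ts)


stream_a = [Tick(1, "A", 1.10), Tick(4, "A", 1.11)]
stream_b = [Tick(2, "B", 0.85), Tick(3, "B", 0.86), Tick(5, "B", 0.87)]

merged = list(merge_streams([stream_a, stream_b]))
for tick in merged:
    print(tick.ts, tick.instrument, tick.price)
```

Each input must already be sorted by timestamp; the merge only interleaves the streams and does not sort within them.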

The benchmarks were run on an AMD 5600U machine and gave the results below. To run them, place the bench_data directory in the root of the repository, then run cargo bench in the persistence crate for the Rust benchmarks and pytest -k stream_catalog_v2 for the Python catalog benchmarks.

| Test                         | Rust | Python |
|------------------------------|------|--------|
| single stream (10M)          | 2.4s | 3.2s   |
| multi stream (72M, 61 files) | 23s  | NA     |

The Python tests are expected to take longer because of the overhead of initializing Python objects and acquiring the GIL.

The multi-stream Python benchmark requires more than 8 GB of RAM and could not be run.

An appropriate measure for the catalog is its throughput, measured in ticks per second. On that basis we can compare with the previous iterations described in #705:

  1. Rust catalog v1 - arrow2 (single stream, only Rust structs, no stream merge logic): ~6M/s
  2. Rust catalog v2 - DataFusion and kmerge: ~3M/s
  3. Rust catalog v1 - arrow2 with Python merge logic (dropped because of the significant implementation effort): expected to be less than 3M/s
  4. Catalog - pure Python: ~100K/s
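
The timings in the results table can be converted to throughput by dividing the tick count by the wall-clock time. The sketch below does that arithmetic, assuming the 10M and 72M figures are tick counts; `throughput_m_per_s` is a hypothetical helper, not part of the repository.

```python
def throughput_m_per_s(ticks: int, seconds: float) -> float:
    """Return throughput in millions of ticks per second."""
    return ticks / seconds / 1_000_000


# (tick count, wall-clock seconds) as reported in the results table.
# Assumption: "10M" and "72M" denote numbers of ticks.
runs = {
    "rust, single stream": (10_000_000, 2.4),
    "python, single stream": (10_000_000, 3.2),
    "rust, multi stream": (72_000_000, 23.0),
}

for name, (ticks, seconds) in runs.items():
    print(f"{name}: {throughput_m_per_s(ticks, seconds):.1f}M ticks/s")
```

On these numbers the multi-stream Rust run works out to roughly 3.1M ticks/s, in line with the ~3M/s quoted for catalog v2.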

Type of change

  • Performance tests

How has this change been tested?

The benchmarks were run and passed.

@codecov

codecov bot commented Apr 25, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.27 ⚠️

Comparison is base (efbef94) 90.26% compared to head (1c9afab) 90.00%.

❗ Current head 1c9afab differs from pull request most recent head c4d7ed2. Consider uploading reports for the commit c4d7ed2 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1085      +/-   ##
===========================================
- Coverage    90.26%   90.00%   -0.27%     
===========================================
  Files          257      257              
  Lines        26936    26817     -119     
===========================================
- Hits         24314    24136     -178     
- Misses        2622     2681      +59     

see 42 files with indirect coverage changes


@twitu twitu marked this pull request as draft April 26, 2023 04:16
@twitu twitu marked this pull request as ready for review May 7, 2023 11:07
@twitu
Collaborator Author

twitu commented May 7, 2023

Closes #705

@twitu twitu linked an issue May 7, 2023 that may be closed by this pull request
@cjdsellers cjdsellers merged commit 91b3a2a into develop May 8, 2023
@cjdsellers cjdsellers deleted the catalog_benchmark branch May 8, 2023 08:15
Development

Successfully merging this pull request may close these issues.

Porting persistence layer to Rust
2 participants