
feat: Cassandra online store, concurrency in bulk write operations #3367

Conversation

hemidactylus (Collaborator) commented:

This implements the counterpart of #3356, but for writing. Using the native concurrency offered by the Cassandra drivers allows for much faster write operations to the online store, which is especially crucial during the materialization phase.

On a realistic setup (an EC2 instance running the materialization against a DB in the same region), speedups of roughly 12x are achieved.
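For context, the driver-level concurrency referred to here is exposed by the Python Cassandra driver's `cassandra.concurrent` helpers. The following is a minimal, illustrative sketch of that mechanism, not the exact code merged in this PR; the host, keyspace and table schema are made up for the example:

```python
from datetime import datetime

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

# Illustrative only: host, keyspace and table schema are hypothetical.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("feast_keyspace")

insert_cql = session.prepare(
    "INSERT INTO online_store_table"
    " (entity_key, feature_name, value, event_ts) VALUES (?, ?, ?, ?)"
)

rows = [
    ("user:123", "conv_rate", b"\x01", datetime(2022, 12, 1)),
    ("user:123", "acc_rate", b"\x02", datetime(2022, 12, 1)),
    # ... one tuple per row to be written
]

# The driver keeps up to `concurrency` requests in flight at once, which is
# where the speedup over sequential session.execute() calls comes from.
results = execute_concurrent_with_args(session, insert_cql, rows, concurrency=100)
```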

Similarly to the read optimization mentioned above, a new write_concurrency parameter is introduced (with a default value and full backward compatibility) to control the level of concurrency should it ever need tuning; the default should be fine in most cases. An illustrative configuration is shown below.
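For illustration, the new knob would sit alongside the other Cassandra online-store settings in a project's feature_store.yaml, roughly as follows (hosts, keyspace and the value 100 are placeholders; omitting the key falls back to the built-in default, per the backward-compatibility note above):

```yaml
project: my_project
registry: data/registry.db
provider: local
online_store:
    type: cassandra
    hosts:
        - 127.0.0.1
    keyspace: feast_keyspace
    # New in this PR: caps how many write requests are in flight at once
    # during materialization; omit it to keep the default behaviour.
    write_concurrency: 100
```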

To preserve the behaviour of the progress callbacks during writes to the online store, which keeps the progress bar accurate, online_write_batch builds an ad-hoc iterator (see unroll_insertion_tuples) that, while unfolding the whole set of rows to write, takes care of invoking progress once per entity (each entity generally maps to multiple rows in the DB table).
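As a rough sketch of that idea (names and signatures here are illustrative, not the exact code merged in this PR), the iterator could look like:

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

def unroll_insertion_tuples(
    entities_with_rows: Iterable[Tuple[object, Iterable[tuple]]],
    progress: Optional[Callable[[int], None]],
) -> Iterator[tuple]:
    """Yield every row to write, invoking `progress` once per entity.

    Each entity usually expands to several rows (one per feature), but the
    progress bar is meant to advance per entity, not per row.
    """
    for entity_key, rows in entities_with_rows:
        for row in rows:
            yield row
        # One callback per entity, after all of its rows have been yielded.
        if progress:
            progress(1)
```

This keeps the write path a single flat stream of rows for the concurrent executor while progress reporting stays at the entity granularity the rest of Feast expects.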

The documentation and the guided feast init -t cassandra procedure are also updated to reflect this.

write_concurrency parameter in configuration and bootstrap guided procedure

Signed-off-by: Stefano Lottini <[email protected]>
@adchia adchia (Collaborator) left a comment:


/lgtm

@feast-ci-bot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adchia, hemidactylus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [adchia,hemidactylus]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit eaf354c into feast-dev:master Dec 2, 2022
@hemidactylus hemidactylus deleted the sl-cassandra-optimize-materialize branch December 2, 2022 15:15
kevjumba pushed a commit that referenced this pull request Dec 5, 2022
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))