feat: Add support for DynamoDB online_read in batches #2371
Signed-off-by: Miguel Trejo <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #2371 +/- ##
==========================================
+ Coverage 83.54% 83.64% +0.09%
==========================================
Files 123 125 +2
Lines 10749 10820 +71
==========================================
+ Hits 8980 9050 +70
- Misses 1769 1770 +1
@adchia @vlin-lgtm I didn't find any tests for |
@TremaMiguel thanks for this. Would you mind changing your PR title to some kind of explanation of what the PR does? Something like "Add support for DynamoDB reads and writes in batches". This will become a part of our changelog. |
Another thing is to not use git pull and instead use eg
As far as tests go, technically there is a set of tests that call into DynamoDB (test_universal_online for example), though I don't think there's a test that does a batch fetch. moto looks interesting though. It might be a good idea to use that so we can let contributors run tests locally for AWS. Would love to see how that looks. |
@adchia added a test for
To run these tests, AWS credentials should be set:
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
I think we can open an issue to improve tests for |
Can we hide batch_size from users? So something like this can work:
batch_size = 40
entity_ids = ...
batches = []
for i in range(0, len(entity_ids), batch_size):
    batches.append(entity_ids[i:i + batch_size])
for batch in batches:
    batch_entity_ids = ...
    ...
    ...
    ...
    response = dynamodb_resource.batch_get_item(RequestItems=batch_entity_ids)
    ...
So this always tries to send at most 40 entities at a time. It is better to make each batch as large as possible so there are fewer network calls to DynamoDB.
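The batching step suggested above can be sketched on its own, without any AWS call. This is only an illustration: `iter_batches` and the `driver_N` ids are hypothetical names, not part of the Feast code base, and the default of 40 simply mirrors the value discussed in this PR.

```python
from typing import Iterator, List, Sequence


def iter_batches(entity_ids: Sequence[str], batch_size: int = 40) -> Iterator[List[str]]:
    """Yield consecutive slices of entity_ids, each with at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(entity_ids), batch_size):
        yield list(entity_ids[start:start + batch_size])


# Made-up ids, only to show how 95 entities are split.
entity_ids = [f"driver_{i}" for i in range(95)]
batch_lengths = [len(batch) for batch in iter_batches(entity_ids)]
print(batch_lengths)  # [40, 40, 15]
```

Each yielded batch would then be turned into one `batch_get_item` request, so 95 entities cost three round trips instead of 95.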
@adchia could you provide guidance on how to run integration tests locally for AWS? It appears there's something missing in
|
Hey! So what I'd actually do is make a new repo_configuration that uses a local offline store + dynamo as the online store. That's configured at https://github.com/feast-dev/feast/blob/master/sdk/python/tests/integration/feature_repos/repo_configuration.py#L88-L88. You can basically just ignore all these and set
Normally to run everything, you'd run
FEAST_USAGE=False IS_TEST=True python -m pytest -n 8 --integration --universal sdk/python/tests/integration/online_store/ |
@adchia some integration tests expect results from the DynamoDB query to have the same order as the input. I added a
Do you have any suggestions to improve the implementation of |
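DynamoDB's BatchGetItem does not return items in any particular order, which is why the response has to be realigned with the request. A minimal sketch of one way to do that follows. The helper name `sort_items_by_request_order` and the `entity_id` key are assumptions for illustration, not the names used in this PR.

```python
from typing import Any, Dict, List, Optional, Sequence


def sort_items_by_request_order(
    entity_ids: Sequence[str],
    items: Sequence[Dict[str, Any]],
    key_name: str = "entity_id",  # hypothetical key attribute
) -> List[Optional[Dict[str, Any]]]:
    """Return one entry per requested id, in request order; None if the id was not found."""
    by_id = {item[key_name]: item for item in items}
    return [by_id.get(entity_id) for entity_id in entity_ids]


requested = ["5001", "5002", "5003"]
# Simulated response: out of order, and "5002" is missing.
returned = [
    {"entity_id": "5003", "value": 0.3},
    {"entity_id": "5001", "value": 0.1},
]
ordered = sort_items_by_request_order(requested, returned)
```

Building a dictionary first keeps the reordering linear in the number of items, and missing entities show up as `None` at the position where they were requested.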
@adchia test_online_store_cleanup with the case
After doing some digging, there's a difference in the result of
metadata {
  feature_names {
    val: "driver_id"
    val: "value"
  }
}
results {
  values {
    int64_val: 5001
  }
  values {
    float_val: 0.03799890726804733
  }
  statuses: PRESENT
  statuses: PRESENT
  event_timestamps {
  }
  event_timestamps {
    seconds: 1647104400
  }
}
results {
  values {
    int64_val: 5002
  }
  values {
    float_val: 0.8108303546905518
  }
  statuses: PRESENT
  statuses: PRESENT
  event_timestamps {
  }
  event_timestamps {
    seconds: 1647104400
  }
}

metadata {
  feature_names {
    val: "driver_id"
    val: "value"
  }
}
results {
  values {
    int64_val: 5001
  }
  values {
  }
  statuses: PRESENT
  statuses: NOT_FOUND
  event_timestamps {
  }
  event_timestamps {
  }
}
results {
  values {
    int64_val: 5002
  }
  statuses: PRESENT
  event_timestamps {
  }
}
results {
  values {
    int64_val: 5003
  }
  statuses: PRESENT
  event_timestamps {
  }
}
Do you have any idea why this is happening? |
@adchia just to follow up on the conversation, is there anything you'd like to add? |
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: adchia, TremaMiguel
I'm late to the game, but I'd like to thank you @TremaMiguel very much for contributing this! This is going to improve our feature serving latency by a great deal! ❤️ |
Thanks @vlin-lgtm, I believe there's room for improvement in unit testing the DynamoDB online store module. We could use something like moto for this, as we've done with the online_read method. Having these unit tests plus the local integration tests would definitely help make the code more robust and facilitate future changes. |
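As a rough illustration of the kind of local unit test discussed here, the sketch below checks batching behaviour without AWS credentials. It deliberately uses `unittest.mock` from the standard library as a stand-in rather than moto, so it only verifies how the client is called, not real DynamoDB semantics. `read_in_batches`, the `drivers` table and the `entity_id` key are hypothetical and not taken from the Feast code base.

```python
from unittest.mock import MagicMock


def read_in_batches(client, table_name, entity_ids, batch_size=40):
    """Hypothetical helper: issue one batch_get_item call per batch of keys.

    For brevity this sketch ignores UnprocessedKeys, which real code should retry.
    """
    items = []
    for start in range(0, len(entity_ids), batch_size):
        chunk = entity_ids[start:start + batch_size]
        keys = [{"entity_id": {"S": entity_id}} for entity_id in chunk]
        response = client.batch_get_item(RequestItems={table_name: {"Keys": keys}})
        items.extend(response["Responses"].get(table_name, []))
    return items


def fake_batch_get_item(RequestItems):
    # Echo every requested key back as if it were a stored item.
    return {
        "Responses": {table: list(spec["Keys"]) for table, spec in RequestItems.items()}
    }


client = MagicMock()
client.batch_get_item.side_effect = fake_batch_get_item

entity_ids = [str(i) for i in range(95)]
items = read_in_batches(client, "drivers", entity_ids)
call_count = client.batch_get_item.call_count  # 3 calls for 95 ids with batch_size=40
```

A moto-based test would replace the `MagicMock` with a real boto3 client talking to an in-memory DynamoDB, which also exercises table creation and item serialization.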
# [0.20.0](v0.19.0...v0.20.0) (2022-04-14)

### Bug Fixes

* Add inlined data sources to the top level registry ([#2456](#2456)) ([356788a](356788a))
* Add new value types to types.ts for web ui ([#2463](#2463)) ([ad5694e](ad5694e))
* Add PushSource proto and Python class ([#2428](#2428)) ([9a4bd63](9a4bd63))
* Add spark to lambda dockerfile ([#2480](#2480)) ([514666f](514666f))
* Added private_key auth for Snowflake ([#2508](#2508)) ([c42c9b0](c42c9b0))
* Added Redshift and Spark typecheck to data_source event_timestamp_col inference ([#2389](#2389)) ([04dea73](04dea73))
* Building of go extension fails ([#2448](#2448)) ([7d1efd5](7d1efd5))
* Bump the number of versions bumps expected to 27 ([#2549](#2549)) ([ecc9938](ecc9938))
* Create __init__ files for the proto-generated python dirs ([#2410](#2410)) ([e17028d](e17028d))
* Don't prevent apply from running given duplicate empty names in data sources. Also fix repeated apply of Spark data source. ([#2415](#2415)) ([b95f441](b95f441))
* Dynamodb deduplicate batch write request by partition keys ([#2515](#2515)) ([70d4a13](70d4a13))
* Ensure that __init__ files exist in proto dirs ([#2433](#2433)) ([9b94f7b](9b94f7b))
* Fix DataSource constructor to unbreak custom data sources ([#2492](#2492)) ([712653e](712653e))
* Fix default feast apply path without any extras ([#2373](#2373)) ([6ba7fc7](6ba7fc7))
* Fix definitions.py with new definition ([#2541](#2541)) ([eefc34a](eefc34a))
* Fix entity row to use join key instead of name ([#2521](#2521)) ([c22fa2c](c22fa2c))
* Fix Java Master ([#2499](#2499)) ([e083458](e083458))
* Fix registry proto ([#2435](#2435)) ([ea6a9b2](ea6a9b2))
* Fix some inconsistencies in the docs and comments in the code ([#2444](#2444)) ([ad008bf](ad008bf))
* Fix spark docs ([#2382](#2382)) ([d4a606a](d4a606a))
* Fix Spark template to work correctly on feast init -t spark ([#2393](#2393)) ([ae133fd](ae133fd))
* Fix the feature repo fixture used by java tests ([#2469](#2469)) ([32e925e](32e925e))
* Fix unhashable Snowflake and Redshift sources ([cd8f1c9](cd8f1c9))
* Fixed bug in passing config file params to snowflake python connector ([#2503](#2503)) ([34f2b59](34f2b59))
* Fixing Spark template to include source name ([#2381](#2381)) ([a985f1d](a985f1d))
* Make name a keyword arg for the Entity class ([#2467](#2467)) ([43847de](43847de))
* Making a name for data sources not a breaking change ([#2379](#2379)) ([71d7ae2](71d7ae2))
* Minor link fix in `CONTRIBUTING.md` ([#2481](#2481)) ([2917e27](2917e27))
* Preserve ordering of features in _get_column_names ([#2457](#2457)) ([495b435](495b435))
* Relax click python requirement to >=7 ([#2450](#2450)) ([f202f92](f202f92))
* Remove date partition column field from datasources that don't s… ([#2478](#2478)) ([ce35835](ce35835))
* Remove docker step from unit test workflow ([#2535](#2535)) ([6f22f22](6f22f22))
* Remove spark from the AWS Lambda dockerfile ([#2498](#2498)) ([6abae16](6abae16))
* Request data api update ([#2488](#2488)) ([0c9e5b7](0c9e5b7))
* Schema update ([#2509](#2509)) ([cf7bbc2](cf7bbc2))
* Simplify DataSource.from_proto logic ([#2424](#2424)) ([6bda4d2](6bda4d2))
* Snowflake api update ([#2487](#2487)) ([1181a9e](1181a9e))
* Support passing batch source to streaming sources for backfills ([#2523](#2523)) ([90db1d1](90db1d1))
* Timestamp update ([#2486](#2486)) ([bf23111](bf23111))
* Typos in Feast UI error message ([#2432](#2432)) ([e14369d](e14369d))
* Update feature view APIs to prefer keyword args ([#2472](#2472)) ([7c19cf7](7c19cf7))
* Update file api ([#2470](#2470)) ([83a11c6](83a11c6))
* Update Makefile to cd into python dir before running commands ([#2437](#2437)) ([ca32155](ca32155))
* Update redshift api ([#2479](#2479)) ([4fa73a9](4fa73a9))
* Update some fields optional in UI parser ([#2380](#2380)) ([cff7ac3](cff7ac3))
* Use a single version of jackson libraries and upgrade to 2.12.6.1 ([#2473](#2473)) ([5be1cc6](5be1cc6))
* Use dateutil parser to parse materialization times ([#2464](#2464)) ([6c55e49](6c55e49))
* Use the correct dockerhub image tag when building feature servers ([#2372](#2372)) ([0d62c1d](0d62c1d))

### Features

* Add `/write-to-online-store` method to the python feature server ([#2423](#2423)) ([d2fb048](d2fb048))
* Add description, tags, owner fields to all feature view classes ([#2440](#2440)) ([ed5e928](ed5e928))
* Add DQM Logging on GRPC Server with FileLogStorage for Testing ([#2403](#2403)) ([57a97d8](57a97d8))
* Add Feast types in preparation for changing type system ([#2475](#2475)) ([4864252](4864252))
* Add Field class ([#2500](#2500)) ([1279612](1279612))
* Add support for DynamoDB online_read in batches ([#2371](#2371)) ([702ec49](702ec49))
* Add Support for DynamodbOnlineStoreConfig endpoint_url parameter ([#2485](#2485)) ([7b863d1](7b863d1))
* Add templating for dynamodb table name ([#2394](#2394)) ([f591088](f591088))
* Allow local feature server to use Go feature server if enabled ([#2538](#2538)) ([a2ef375](a2ef375))
* Allow using entity's join_key in get_online_features ([#2420](#2420)) ([068c765](068c765))
* Data Source Api Update ([#2468](#2468)) ([6b96b21](6b96b21))
* Go server ([#2339](#2339)) ([d12e7ef](d12e7ef)), closes [#2354](#2354) [#2361](#2361) [#2332](#2332) [#2356](#2356) [#2363](#2363) [#2349](#2349) [#2355](#2355) [#2336](#2336) [#2361](#2361) [#2363](#2363) [#2344](#2344) [#2354](#2354) [#2347](#2347) [#2350](#2350) [#2356](#2356) [#2355](#2355) [#2349](#2349) [#2352](#2352) [#2341](#2341) [#2336](#2336) [#2373](#2373) [#2315](#2315) [#2372](#2372) [#2332](#2332) [#2349](#2349) [#2336](#2336) [#2361](#2361) [#2363](#2363) [#2344](#2344) [#2354](#2354) [#2347](#2347) [#2350](#2350) [#2356](#2356) [#2355](#2355) [#2349](#2349) [#2352](#2352) [#2341](#2341) [#2336](#2336) [#2373](#2373) [#2379](#2379) [#2380](#2380) [#2382](#2382) [#2364](#2364) [#2366](#2366) [#2386](#2386)
* Graduate write_to_online_store out of experimental status ([#2426](#2426)) ([e7dd4b7](e7dd4b7))
* Make feast PEP 561 compliant ([#2405](#2405)) ([3c41f94](3c41f94)), closes [#2420](#2420) [#2418](#2418) [#2425](#2425) [#2426](#2426) [#2427](#2427) [#2431](#2431) [#2433](#2433) [#2420](#2420) [#2418](#2418) [#2425](#2425) [#2426](#2426) [#2427](#2427) [#2431](#2431) [#2433](#2433)
* Makefile for contrib for Issue [#2364](#2364) ([#2366](#2366)) ([a02325b](a02325b))
* Support on demand feature views in go feature server ([#2494](#2494)) ([6edd274](6edd274))
* Switch from Feature to Field ([#2514](#2514)) ([6a03bed](6a03bed))
* Use a daemon thread to monitor the go feature server exclusively ([#2391](#2391)) ([0bb5e8c](0bb5e8c))
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #2247
Context
Given a list of entity_ids, set a batch_size parameter to determine the number of items to send in a batch_get_item request to DynamoDB. A single request can retrieve up to 16 MB of data and the record size limit is 400 KB, so the batch_size value can be at most 40.
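The bound of 40 follows from dividing the response size limit by the maximum record size. The short calculation below reproduces it, assuming binary units (1 MB = 1024 KB); the constant names are illustrative only.

```python
# Limits quoted in the PR description, expressed in bytes (binary units assumed).
MAX_RESPONSE_BYTES = 16 * 1024 * 1024  # 16 MB per batch_get_item response
MAX_ITEM_BYTES = 400 * 1024            # 400 KB per record

# Worst case: every record is at the size limit, so this many always fit in one response.
max_batch_size = MAX_RESPONSE_BYTES // MAX_ITEM_BYTES
print(max_batch_size)  # 40
```

This is a worst-case figure: when records are much smaller than 400 KB, a larger batch would still fit within the 16 MB response limit.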