
Peng precommit update and new notebooks #72

Merged
3 commits merged into main on Dec 1, 2022

Conversation

pzdkn
Contributor

@pzdkn pzdkn commented Nov 28, 2022

Description

We add two new notebooks demonstrating how Squirrel can be combined with Spark for the following tasks:

  1. Notebook 11: Shows how to load time-series data
  2. Notebook 12: Shows how to split data into different stores based on a categorical label
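As a rough, hypothetical sketch of the grouping step Notebook 12 builds on (function name, `label_key` parameter, and record layout are assumptions for illustration, not the notebook's actual code):

```python
from collections import defaultdict

def split_by_label(records, label_key="category"):
    """Group records into one bucket per categorical label.

    Each bucket would then be written to its own store; this toy
    version just returns the buckets as a dict (illustration only).
    """
    buckets = defaultdict(list)
    for record in records:
        # assume each record is a dict carrying the categorical label
        buckets[record[label_key]].append(record)
    return dict(buckets)
```
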

Fixes # issue

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external contributors only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files

@pzdkn pzdkn force-pushed the peng-precommit-update-and-new-notebooks branch from 1234883 to 2a7ef39 on November 28, 2022 13:32
```python
def save_iterable_as_shard(it, store, pad_len=10) -> None:
```
Contributor

Can we use `zip_index` here and avoid `it_list = list(it)`?

```python
IterableSource(it).batched(N_SHARDS).zip_index(pad_length=9).map(lambda x: store.set(key=x[0], value=x[1])).join()
```

Contributor Author


Actually, the key can't just be an index; it should be the smallest timestamp. But I'll come up with something to avoid the list call.

Contributor Author


I'm not sure why we'd do that. We have already sharded the data by partitioning it: the function is called once per partition/shard, and then we need to serialize the iterator somehow to store it, so we do need the list call.

Unless we want to further shard the data inside the function, but I'm not sure what the additional benefit of that would be.

Contributor


> We have already sharded the data by partitioning it.

My bad, sorry. In that case, yes, we should call `list()`.
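For readers following along, a minimal sketch of the pattern the thread settles on: materialize the partition with `list()` and key the shard by its smallest timestamp. The record layout (dicts with a `"timestamp"` field) and the dict-backed `DictStore` standing in for a real Squirrel store are assumptions for illustration, not the notebook's actual code:

```python
def save_iterable_as_shard(it, store, pad_len=10) -> None:
    """Serialize one Spark partition into a single shard.

    The iterator is materialized with list() because the whole partition
    is written as one value; the shard key is the partition's smallest
    timestamp, zero-padded so keys sort lexicographically.
    """
    it_list = list(it)  # partition == shard, so one list() call is needed
    if not it_list:
        return  # empty partitions produce no shard
    # assume each record is a dict with a "timestamp" field
    min_ts = min(record["timestamp"] for record in it_list)
    store.set(key=str(min_ts).zfill(pad_len), value=it_list)


class DictStore:
    """Toy in-memory stand-in for a Squirrel store (illustration only)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
```
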

Resolved review comments (outdated):

  • examples/11.TimeSeries_With_Spark_and_Squirrel.ipynb
  • examples/12.Split_Data_Into_Different_Stores.ipynb
Contributor

@AlirezaSohofi AlirezaSohofi left a comment


LGTM 👍

@pzdkn pzdkn merged commit 82dbae2 into main Dec 1, 2022
@pzdkn pzdkn deleted the peng-precommit-update-and-new-notebooks branch December 1, 2022 19:36
@github-actions github-actions bot locked and limited conversation to collaborators Dec 1, 2022