
Peng precommit update and new notebooks #72

Merged
3 commits merged into main on Dec 1, 2022

Conversation

pzdkn
Contributor

@pzdkn pzdkn commented Nov 28, 2022

Description

We add two new notebooks demonstrating how Squirrel can be combined with Spark for the following tasks:

  1. Notebook 11: Shows how to load time-series data
  2. Notebook 12: Shows how to split data into different stores based on a categorical label
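As a rough, hypothetical sketch of the grouping step Notebook 12 builds on (function name, `label_key` parameter, and record layout are assumptions for illustration, not the notebook's actual code):

```python
from collections import defaultdict

def split_by_label(records, label_key="category"):
    """Group records into one bucket per categorical label.

    Each bucket would then be written to its own store; this toy
    version just returns the buckets as a dict (illustration only).
    """
    buckets = defaultdict(list)
    for record in records:
        # assume each record is a dict carrying the categorical label
        buckets[record[label_key]].append(record)
    return dict(buckets)
```
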

Fixes # issue

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external contributors only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files

@pzdkn pzdkn force-pushed the peng-precommit-update-and-new-notebooks branch from 1234883 to 2a7ef39 on November 28, 2022 13:32
```python
def save_iterable_as_shard(it, store, pad_len=10) -> None:
```
Contributor

Can we use `zip_index` here and avoid `it_list = list(it)`?

```python
IterableSource(it).batched(N_SHARDS).zip_index(pad_length=9).map(lambda x: store.set(key=x[0], value=x[1])).join()
```

Contributor Author


Actually, the key can't just be an index; it should be the smallest timestamp. But I'll come up with something to avoid the list call.

Contributor Author


I'm not sure why we'd do that. We have already sharded the data by partitioning it: the function is called once per partition/shard, and then we need to serialize the iterator somehow to store it, so we do need the list call.

Unless we want to further shard the data inside the function, but I'm not sure what the additional benefit of that would be.

Contributor


> We have already sharded the data by partitioning it.

My bad, sorry. In that case, yes, we should call `list()`.
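For readers following along, a minimal sketch of the pattern the thread settles on: materialize the partition with `list()` and key the shard by its smallest timestamp. The record layout (dicts with a `"timestamp"` field) and the dict-backed `DictStore` standing in for a real Squirrel store are assumptions for illustration, not the notebook's actual code:

```python
def save_iterable_as_shard(it, store, pad_len=10) -> None:
    """Serialize one Spark partition into a single shard.

    The iterator is materialized with list() because the whole partition
    is written as one value; the shard key is the partition's smallest
    timestamp, zero-padded so keys sort lexicographically.
    """
    it_list = list(it)  # partition == shard, so one list() call is needed
    if not it_list:
        return  # empty partitions produce no shard
    # assume each record is a dict with a "timestamp" field
    min_ts = min(record["timestamp"] for record in it_list)
    store.set(key=str(min_ts).zfill(pad_len), value=it_list)


class DictStore:
    """Toy in-memory stand-in for a Squirrel store (illustration only)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
```
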

Resolved review comments (outdated):

  • examples/11.TimeSeries_With_Spark_and_Squirrel.ipynb
  • examples/12.Split_Data_Into_Different_Stores.ipynb
Contributor

@AlirezaSohofi AlirezaSohofi left a comment


LGTM 👍

@pzdkn pzdkn merged commit 82dbae2 into main Dec 1, 2022
@pzdkn pzdkn deleted the peng-precommit-update-and-new-notebooks branch December 1, 2022 19:36
@github-actions github-actions bot locked and limited conversation to collaborators Dec 1, 2022