-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Peng precommit update and new notebooks #72
Conversation
1234883
to
2a7ef39
Compare
}, | ||
"outputs": [], | ||
"source": [ | ||
"def save_iterable_as_shard(it, store, pad_len=10) -> None:\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use zip_index
here and avoid it_list = list(it)
?
IterableSource(it).batched(N_SHARDS).zip_index(pad_length=9).map(lambda x: store.set(key=x[0], value=x[1])).join()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, the key can't be just an index but should be the smallest timestamp, but i'll come up with something to avoid the list call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not so sure why we are doing it. We have already sharded the data by partitioning it. The function is called for each partition/shard, then we need to somehow serialize the iterator to store it. So we kind of need the list call.
Except we want to further shard the data inside the function, but I am not sure what the additional benefit is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have already sharded the data by partitioning it.
My bad sorry, in that case yes we should calllist()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Description
We add two new notebooks to demonstrate how Squirrel can be combined with Spark to do the following tasks:
Fixes # issue
Type of change
Checklist: