Allow easy paging for list operations #895
Sorry about that. In short, I want to write:

```python
for topic in client.list_topics(page_size=50, limit=5000):
    do_something_with(topic)
```

with the ability to short-circuit out.

---
@tseaver Do you want to take a stab at this?

---
Should be pretty easy now that we have `gcloud.iterator.Iterator`.

---
Indeed. I am moving the iterator in …

---
Is this not addressed?

---
I just looked at the documentation for pubsub and storage; both seem to use the iterator, so I'm happy. I'll open a separate issue if I see anything else.

---
Yea, this isn't actually fixed. I think we're making a mistake returning a tuple here, and instead should return an iterator. This would mean that we can treat it as a list (not giving a shit about pagination -- aka, good for scripts that need to go through all topics) as well as a paged resource if needed.

**Infiniscroll list**

```python
for topic in client.list_topics():
    do_something_with_topic(topic)
```

**Explicitly paginated**

```python
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
```

The reason I care so much about this is that the first case is incredibly common when people first start using this library... so I want to make the common (intro) case fast. Compared to today:

**Infiniscroll list**

```python
token = None
while True:
    topics, token = client.list_topics(next_page_token=token)
    for topic in topics:
        do_something_with_topic(topic)
    if token is None:
        break

# or maybe...
token = True
while token is not None:
    topics, token = client.list_topics(next_page_token=token)
    for topic in topics:
        do_something_with_topic(topic)
```

(correct me if there's a better way?)

**Explicitly paginated**

```python
token = request.GET.get('page_token', None)
topics, token = client.list_topics(next_page_token=token)
html = template.render({'page_token': token, 'topics': topics})
response.write(html)
```

Note that the paginated version doesn't really change all that much, but the infiniscroll version looks way worse off... It forces people to learn about page tokens when they aren't interested in paginating anything themselves. They just want to iterate over all the stuff.

---
@jgeewax #1699 added a much more natural spelling for the "just give me all the results of a method" crowd:

```python
>>> from gcloud.iterator import MethodIterator
>>> topics = list(MethodIterator(client.list_topics))
```

(We can rename it.) We do make those users explicitly choose the "iterate everything" option, but they don't have to deal with paging / tokens. In exchange, we make it possible for those who care about distributed-systems performance, or their API budgets, to do their thing cleanly: it is a "good API smell" when the "convenience" layer uses the "less convenient but more controlled" layer. I actually mean to revise the existing …

---
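For readers following along, here is a minimal, self-contained sketch of what a `MethodIterator`-style wrapper does (hypothetical code, not the actual `gcloud.iterator` internals): it repeatedly calls a paged method and yields items until the page token runs out, so the caller never touches tokens.

```python
def method_iterator(list_method, page_token=None):
    """Yield every item from a paged method.

    Assumes `list_method(next_page_token=...)` returns an
    (items, next_page_token) tuple, with next_page_token=None
    on the final page -- the convention used in this thread.
    """
    while True:
        items, page_token = list_method(next_page_token=page_token)
        for item in items:
            yield item
        if page_token is None:
            break


# A fake paged API standing in for client.list_topics:
def fake_list_topics(next_page_token=None):
    pages = {None: (['a', 'b'], 't1'), 't1': (['c'], None)}
    return pages[next_page_token]


topics = list(method_iterator(fake_list_topics))  # ['a', 'b', 'c']
```

The wrapper is lazy: a caller who breaks out of the loop early never triggers the remaining API calls.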
So you're saying that

```python
for topic in client.list_topics():
    do_something_with_topic(topic)
```

And

```python
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
```

? The example of using …

---
No,

```python
for topic in MethodIterator(client.list_topics):
    do_something_with(topic)
```

---
OK, I really don't like that. It feels like we're just being jerks... Why wouldn't we just return a thing that is both iterable and gives you access to the tuple?

---
I'd say it's the opposite of jerks: encouraging paging discourages behavior that could accidentally use too many resources. Though we do fine with a hybrid approach / offer both in datastore.

---
I'm proposing a way to do this where both cases ("let me iterate over everything" and "I want to manually page through") are intuitive, easy, and simple. The answer being "import a special class, and wrap a method with it" doesn't seem like the right one to me. It seems like we're saying "we can make this easier for you, but we won't". I can't imagine why the result of …

Can someone give me a great reason why we wouldn't just return an iterable class that exposes the batch and the next page token if people want it? In other words, why shouldn't the code I write be:

```python
# I want to iterate through everything...
for topic in client.list_topics():
    do_something_with_topic(topic)
```

And

```python
# I want to go through things one page at a time...
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
```

---
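The hybrid object proposed in this comment -- iterable across all pages, while also exposing the current page and token -- is easy to sketch. The class and attribute names below are hypothetical, not the library's actual API:

```python
class TopicList:
    """Iterable over all results, with manual paging also exposed.

    `fetch` is any callable taking a page token and returning an
    (items, next_page_token) tuple, None token meaning the last page.
    """

    def __init__(self, fetch, next_page_token=None):
        self._fetch = fetch
        # Eagerly fetch the first page so .page / .next_page_token work.
        self.page, self.next_page_token = fetch(next_page_token)

    def __iter__(self):
        while True:
            for item in self.page:
                yield item
            if self.next_page_token is None:
                return
            self.page, self.next_page_token = self._fetch(self.next_page_token)


# A fake paged API for demonstration:
def fake_fetch(token=None):
    pages = {None: (['a', 'b'], 'p2'), 'p2': (['c', 'd'], None)}
    return pages[token]


topics = TopicList(fake_fetch)
first_page = list(topics.page)   # manual paging: just the first batch
everything = list(topics)        # iteration: all pages, transparently
```

Both use cases from the comment work without the caller ever constructing a wrapper class.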
I suspect there is also a bit of concern about doing the equivalent of a …

```python
# I just want the first page:
for topic in client.list_topics().page:
    do_something_with(topic)
```

```python
# Please only give me a few items, not everything I've ever added...
for topic in client.list_topics(limit=100):
    do_something_with(topic)
```

---
/cc @daspecster

---
@daspecster The issue is that we would be encouraging users to naively loop over all results of a paged query API. Such looping cannot be best practice in a distributed system where users are charged (or exhaust quotas) based on API usage: it hides when / where the API calls are made.

---
@tseaver, ahhh, I think I see. Would renaming the method help clarify this? I personally would expect a method called … Calling it … Feedback?

UPDATE: …

---
Also, @stephenplusplus passed this on to me from previous discussions of a similar topic. Hopefully it's helpful.

---
TL;DR: It's time to use the iterator with a reasonable default limit.

The idea that we're making it clear where the API calls are made doesn't actually fix the problem. If you give me back chunks and page tokens, I'll write a loop and then we're back where we started. If I forget my exit condition, we're in the same boat I was before, except you made me write more code to get there:

```python
token = True
api_call_count = 0
while token is not None and api_call_count < 10:
    topics, token = client.list_topics(next_page_token=token)
    api_call_count += 1
    for topic in topics:
        do_something_with_topic(topic)
```

This sucks for two reasons: …

The right answer for me is being able to say "don't give me more than N topics", and understanding that this could mean a varying number of API calls:

```python
token = True
topic_count = 0
while token is not None and topic_count < 500:
    topics, token = client.list_topics(next_page_token=token)
    for topic in topics:
        topic_count += 1
        do_something_with_topic(topic)
```

That seems more reasonable to me... however I'm still writing more code, and I still have to remember the exit condition. As a matter of fact, it's more difficult for me to do what I want and just as easy for me to mess up and loop forever over billions of topics. There's no way to get multiple pages with this model without writing my own loop and putting in a manual exit condition. In other words, if the problem we're trying to solve is "don't make people stupidly loop over everything", the right answer is likely:

```python
class Client:
    def list_topics(self, ..., limit=1000):
        return Iterator(...)
```

```python
for topic in client.list_topics():
    do_something_with(topic)
```

This means the "loop forever" call would be...

```python
for topic in client.list_topics(limit=None):
    do_something_with(topic)
```

IIRC, we had the same discussion before with listing the objects in a bucket, and it currently returns an iterator of sorts. I believe we had decided to have a reasonable default on how much listing we'd do, however it seems we only implemented that for the recursive delete and the global …

Regardless of all of this, based on our pricing, I could understand the argument for Datastore; however, that goes out the window due to Datastore's … I think it's unreasonable to leak this paging abstraction, which makes the API obnoxious to use, simply to avoid someone spending $0.40. If you want the library to show you exactly where your API calls are made, you should be using https://github.com/google/google-api-python-client, which is 1:1 with the API...

---
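The "don't give me more than N topics" behavior argued for above can be sketched as a limit-aware generator. This is illustrative only -- the function and parameter names are assumptions, not the library's actual API:

```python
def iterate_with_limit(list_method, limit=1000):
    """Yield items from a paged method, stopping after `limit` items.

    limit=None means iterate everything. The number of underlying
    API calls varies with the server's page sizes, which is exactly
    the trade-off described in the comment above.
    """
    token, count = None, 0
    while True:
        items, token = list_method(next_page_token=token)
        for item in items:
            if limit is not None and count >= limit:
                return
            count += 1
            yield item
        if token is None:
            return


# A fake paged API standing in for client.list_topics:
def fake_list_topics(next_page_token=None):
    pages = {None: (['t1', 't2', 't3'], 'a'), 'a': (['t4', 't5'], None)}
    return pages[next_page_token]


limited = list(iterate_with_limit(fake_list_topics, limit=4))
everything = list(iterate_with_limit(fake_list_topics, limit=None))
```

The caller's exit condition lives in the library, so the "forgot to stop" failure mode disappears while `limit=None` keeps the unbounded case available and explicit.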
Limiting the number of items by default seems arbitrary and incorrect to me.

---
I agree that the limit should be set manually; however, if the argument against returning an iterator is "it might loop forever! we can't do that to our users!", an arbitrary limit (not many people will have 1k topics) is one way to address it. Regardless of that arbitrary limit, …

---
Here's my input, for what it's worth. I agree that it is important to provide a simple interface that doesn't require the user to deal with page tokens. If we don't, people will often choose to fetch just the first page. I know I've done that. Whether it is better to return a list or an iterator will depend on the API, as will whether it makes sense to offer a …

Default limits are rarely going to be appropriate. A default limit on PubSub topics would surely be bad, for instance. Failing to override such a default with …

If listing PubSub topics returns an iterator, I actually don't think it makes sense to offer a …

---
For comparison, boto's `sns.get_all_topics()` returns 100 at a time with a …

---
That's a fair point... The difference is that our answer will be: …

---
I agree that we should have a simple iterator that iterates over everything; however, I want to make sure we don't remove the ability to page manually. This use case is extremely important for Datastore, where pagination can happen across multiple user requests.

---
My $0.02 on the strategy: …

---
FYI, I am really close to done.

---
@tseaver There are still some …

---
Taking the comment on https://github.com/GoogleCloudPlatform/gcloud-python/pull/889/files#r31042158 over here.

Requiring explicit paging for topics is really ugly; there must be a better way. If our concern is "people could make a boatload of requests and they don't realize it", we could always limit this like we do with the recursive deleting in storage. We can do a bunch of this with the `page_size` parameter, but that might mean that I have to wait for everything to come back before starting any work, which seems kind of ridiculous. It'd be really nice if both the limit and the page-size knobs were there, so it's easy to do things like "I want the first 5000 topics, and I want to pull them from the server in chunks of 50".

To add a bit more context, I'd like to toss out: what if we made all of our list operations return iterators? The use cases I see here are...

- `for topic in list_topics()`
- `for topic in list_topics(limit=100)`
- `for topic in list_topics(): if topic.name == 'foo': break`
- `offset`
- `for topic in list_topics(page_token=token, limit=100)`

The "let's just always return `page, page_token`" thing doesn't really make all of those use cases all that fun... but if we always return iterators, they are all easy.

Further, let's say I have a weird case where I just want one page worth of stuff... `list_topics().get_current_page()` could return what you want, no?