
Allow easy paging for list operations #895

Closed. jgeewax opened this issue May 26, 2015 · 32 comments.

@jgeewax (Contributor) commented May 26, 2015

Bringing the comment from https://github.com/GoogleCloudPlatform/gcloud-python/pull/889/files#r31042158 over here.

Requiring explicit paging for topics is really ugly; there must be a better way than:

all_topics = []

topics, token = client.list_topics()
all_topics.extend(topics)
while token is not None:
  topics, token = client.list_topics(page_token=token)
  all_topics.extend(topics)

If our concern is "people could make a boatload of requests and they don't realize it", we could always limit this like we do with the recursive deleting in storage?

topics = client.list_topics(limit=100)

We can do a bunch of this with the page_size parameter, but that might mean that I have to wait for everything to come back before starting any work, which seems kind of ridiculous.

It'd be really nice if that limit and page-size stuff was both there, so it's easy to do things like "I want the first 5000 topics, and I want to pull them from the server in chunks of 50":

for topic in client.list_topics(page_size=50, limit=5000):
  push_work_to_other_system(topic)

To add a bit more context, I'd like to toss out: what if we made all of our list operations return iterators?

The use cases I see here are...

  1. I want everything, give it all to me (for topic in list_topics())
  2. I want up to N things, stop giving them to me then (for topic in list_topics(limit=100))
  3. I don't know how many I want, but I'll know when I want to stop (for topic in list_topics(): if topic.name == 'foo': break)
  4. Combination of the previous two (I don't know when I want to stop, but don't let me go on forever, kill it at say... 1000)
  5. I want to pick up where I left off, I saved a token somewhere (sort of like offset)! (for topic in list_topics(page_token=token, limit=100))

The "let's just always return page, page_token" thing doesn't really make all of those use-cases all that fun... But if we always return iterators, they are all easy.

Further, let's say I have a weird case where I just want one page's worth of stuff... list_topics().get_current_page() could return what you want, no?
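To make this concrete, here's a rough sketch of the shape I'm imagining. Everything in it is made up for illustration (the TopicIterator name, the api_request helper), not existing code:

class TopicIterator(object):
    """Iterate over topics, fetching pages lazily from the server."""

    def __init__(self, client, page_size=None, limit=None, page_token=None):
        self.client = client
        self.page_size = page_size
        self.limit = limit
        self.page_token = page_token
        self._current_page = None

    def get_current_page(self):
        """Return just the current page of results (one API call)."""
        if self._current_page is None:
            self._current_page, self.page_token = self.client.api_request(
                page_size=self.page_size, page_token=self.page_token)
        return self._current_page

    def __iter__(self):
        count = 0
        while True:
            for topic in self.get_current_page():
                if self.limit is not None and count >= self.limit:
                    return
                count += 1
                yield topic
            if self.page_token is None:
                return
            self._current_page = None  # forces a fetch of the next page

list_topics(**kwargs) would then just return TopicIterator(client, **kwargs): iterating covers use cases 1 through 5 above, and get_current_page() covers the one-page case.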

@jgeewax added the labels api: datastore, api: pubsub, api: storage, and type: feature request on May 26, 2015
@jgeewax added this to the Core Future milestone on May 26, 2015
@dhermes (Contributor) commented Jun 6, 2015

In general I like the idea, but I was likely intimidated by the sheer length of what you wrote.

Recall #90 was in place just in case this was necessary. We have a nice Iterator matching most of this pattern in datastore.query.

@jgeewax (Contributor, Author) commented Jun 6, 2015

Sorry about that. In short, I want to write:

for topic in client.list_topics(page_size=50, limit=5000):
  do_something_with(topic)

with the ability to short-circuit out (break or return).

@dhermes (Contributor) commented Jul 7, 2015

@tseaver Do you want to take a stab at this in pubsub?

@jgeewax (Contributor, Author) commented Jul 7, 2015 via email

@dhermes (Contributor) commented Jul 7, 2015

Indeed. I am moving the iterator in storage into the client right now (for #952) and noticed we had no iterator in pubsub.

@dhermes (Contributor) commented Aug 27, 2015

Is this not addressed?

@jgeewax (Contributor, Author) commented Aug 28, 2015

I just looked at the documentation for pubsub and storage; both seem to use the iterator, so I'm happy.

I'll open a separate issue if I see anything else.

@jgeewax closed this as completed on Aug 28, 2015
@jgeewax reopened this on Apr 5, 2016
@jgeewax (Contributor, Author) commented Apr 5, 2016

Yea, this isn't actually fixed.

See https://github.com/GoogleCloudPlatform/gcloud-python/blob/188515f192e3823ce21a47df2d4742995d08f109/gcloud/pubsub/client.py#L47

I think we're making a mistake returning a tuple here, and instead should return an iterator.

This would mean that we can treat it as a list (not giving a shit about pagination -- aka, good for scripts that need to go through all topics) as well as a paged-resource if needed.

Infiniscroll list

for topic in client.list_topics():
  do_something_with_topic(topic)

Explicitly paginated

token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)

The reason I care so much about this is that the first case is incredibly common when people first start using this library... So I want to make the common (intro) case fast.

Compared to today:

Infiniscroll list

token = None
while True:
  topics, token = client.list_topics(next_page_token=token)
  for topic in topics:
    do_something_with_topic(topic)
  if token is None:
    break

# or maybe...

token = True  # sentinel so the loop runs at least once
while token is not None:
  topics, token = client.list_topics(
      next_page_token=None if token is True else token)
  for topic in topics:
    do_something_with_topic(topic)

(correct me if there's a better way?)

Explicitly paginated

token = request.GET.get('page_token')
topics, token = client.list_topics(next_page_token=token)
html = template.render({'page_token': token, 'topics': topics})
response.write(html)

Note that the paginated version doesn't really change all that much, but the infiniscroll version comes out far worse... It forces people to learn about page tokens when they aren't interested in paginating anything themselves. They just want to iterate over all the stuff.

@tseaver (Contributor) commented Apr 5, 2016

@jgeewax #1699 added a much more natural spelling for the "just give me all the results of a method" crowd:

>>> from gcloud.iterator import MethodIterator
>>> topics = list(MethodIterator(client.list_topics))

(We can rename it to just Iterator later, after we finish removing all the subclasses of the current iterator.)

We do make those users explicitly choose the "iterate everything" behavior, but they don't have to deal with paging / tokens. In exchange, we make it possible for those who care about distributed-systems performance, or their API budgets, to do their thing cleanly: it is a "good API smell" when the "convenience" layer uses the "less convenient but more controlled" layer.

I actually mean to revise the existing list_foo() methods (e.g., Bucket.list_blobs) to emulate this pattern.
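For reference, a wrapper in this spirit can be tiny. A sketch only, assuming the wrapped method accepts a page_token keyword and returns a (batch, token) pair the way our list_* methods do today:

class MethodIterator(object):
    """Iterate over all results of a paged list method."""

    def __init__(self, method, **kwargs):
        self._method = method
        self._kwargs = kwargs

    def __iter__(self):
        token = None
        while True:
            batch, token = self._method(page_token=token, **self._kwargs)
            for item in batch:
                yield item
            if token is None:
                return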

@jgeewax (Contributor, Author) commented Apr 5, 2016

So you're saying that .list_foo() will look like the example above?

for topic in client.list_topics():
  do_something_with_topic(topic)

And

token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)

?

The example of using MethodIterator and passing in a method (client.list_topics) seems... not very friendly to me.

@tseaver (Contributor) commented Apr 5, 2016

No, list_topics won't return an iterable: it will return the (batch, token) pair (as it does now). If you don't care about when the API calls are made, or how many of them there are, you use the MethodIterator wrapper, which is an iterable:

for topic in MethodIterator(client.list_topics):
    do_something_with(topic)

@jgeewax (Contributor, Author) commented Apr 5, 2016

OK, I really don't like that. It feels like we're just being jerks...

Why wouldn't we just return a thing that is both iterable, and gives you access to the tuple (batch, token) if you want it...?

@dhermes (Contributor) commented Apr 5, 2016

I'd say it's the opposite of jerks. Encouraging paging is discouraging behavior that could accidentally use too many resources. Though we do fine with a hybrid approach / offer both in datastore Query.

@jgeewax (Contributor, Author) commented Apr 6, 2016

I'm proposing a way to do this where both cases ("let me iterate over everything" and "I want to manually page through") are intuitive, easy, and simple. The answer being "import a special class, and wrap a method with it" doesn't seem like the right one to me. It seems like we're saying "we can make this easier for you, but we won't".

I can't imagine why the result of list_something() wouldn't be iterable... It just seems wrong.

Can someone give me a great reason why we wouldn't just return an iterable class that exposes the batch and the next page token if people want it? In other words, why the code I write shouldn't be:

# I want to iterate through everything...
for topic in client.list_topics():
  do_something_with_topic(topic)

And

# I want to go through things one page at a time...
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
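Concretely, the returned object could be something like this. A sketch only: _list_topics_page is a made-up internal helper that performs exactly one API call and returns a (topics, token) pair:

class TopicList(object):
    """Iterable over all topics; also exposes manual paging."""

    def __init__(self, client, next_page_token=None):
        self._client = client
        # Fetch the first page eagerly so .page / .next_page_token work.
        self.page, self.next_page_token = client._list_topics_page(
            page_token=next_page_token)

    def __iter__(self):
        page, token = self.page, self.next_page_token
        while True:
            for topic in page:
                yield topic
            if token is None:
                return
            page, token = self._client._list_topics_page(page_token=token)

Both snippets above then work unchanged: the for loop never sees a token, and the template code reads .page and .next_page_token directly.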

@jgeewax (Contributor, Author) commented Apr 6, 2016

I suspect there is also a bit of concern about doing the equivalent of a HEAD request, as well as the "I only want a few items back, please don't loop forever" situation. In those cases, here's the code I'm proposing:

# I just want the first page:
for topic in client.list_topics().page:
  do_something_with(topic)

# Please only give me a few items, not everything I've ever added...
for topic in client.list_topics(limit=100):
  do_something_with(topic)

@jgeewax modified the milestones (Core Stable, Core Future) on Apr 7, 2016
@jgeewax (Contributor, Author) commented Apr 7, 2016

@tseaver @dhermes: Any thoughts here?

@jgeewax (Contributor, Author) commented Apr 11, 2016

/cc @daspecster

@daspecster (Contributor) commented:

Hello!

From the discussion I've read here, I'm not sure what's holding us back from making this return an iterator. I could be missing something, but that seems like the natural result to me.

@dhermes, @tseaver are there more implications to this that I'm not seeing?

@tseaver (Contributor) commented Apr 11, 2016

@daspecster the issue is that we will be encouraging users to naively loop over all results of a paged query API. Such looping cannot be best-practice in a distributed system where the users get charged (or exhaust quotas) based on API usage: it hides when / where the API calls are made.

@daspecster (Contributor) commented:

@tseaver, ahhh, I think I see.
So having the API call abstracted down one level forces end users to dig deeper in order to confirm that they want to make an API call and use some of their quota?

Would renaming the method help clarify this? I personally would expect a method called list_topics to return an iterator.

Calling it client.get_topics could help imply quota usage. I'm a fan of very descriptive method names so I might even propose client.get_list_of_topics().

Feedback?

UPDATE:
Sorry, typo. I meant that I would expect an iterator from a method called list_topics. Corrected.

@daspecster (Contributor) commented:

Also, @stephenplusplus passed this on to me from previous discussions of a similar topic. Hopefully it's helpful.

googleapis/google-cloud-node#692 (comment)

@jgeewax (Contributor, Author) commented Apr 12, 2016

TL;DR: It's time to use the iterator with a reasonable default limit.

Making it clear where the API calls happen doesn't actually fix the problem. If you give me back chunks and page tokens, I'll write a loop and then we're back where we started. If I forget my exit condition, we're in the same boat as before, except you made me write more code to get there:

token = True  # sentinel so the loop runs at least once
api_call_count = 0
while token is not None and api_call_count < 10:
  topics, token = client.list_topics(
      next_page_token=None if token is True else token)
  api_call_count += 1
  for topic in topics:
    do_something_with_topic(topic)

This sucks for two reasons:

  1. I don't really care about the number of API calls here (they cost $0.0000004 each), I care about the number of topics I get back to make my app work.
  2. There's no guarantee I'll get N items per page, so depending on how PubSub feels today, I'll get a different number of topics as a result.

The right answer for me is being able to say "don't give me more than N topics", and understanding that this could mean a varying number of API calls:

token = True  # sentinel so the loop runs at least once
topic_count = 0
while token is not None and topic_count < 500:
  topics, token = client.list_topics(
      next_page_token=None if token is True else token)
  for topic in topics:
    topic_count += 1
    do_something_with_topic(topic)

That seems more reasonable to me; however, I'm still writing more code, and I still have to remember the exit condition. As a matter of fact, it's more difficult for me to do what I want and just as easy for me to mess up and loop forever over billions of topics. There's no way to get multiple pages with this model without writing my own loop and putting in a manual exit condition.

In other words, if the problem we're trying to solve is "don't make people stupidly loop over everything", the right answer is likely:

class Client:
  def list_topics(self, ..., limit=1000):
    return Iterator(...)

for topic in client.list_topics():
  do_something_with(topic)

This means the "loop forever" call would be...

for topic in client.list_topics(limit=None):
  do_something_with(topic)
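The limit enforcement inside such an Iterator is trivial. A sketch with a hypothetical helper name:

def _items_with_limit(pages, limit=1000):
    """Yield items from successive pages, stopping after ``limit`` items.

    ``pages`` is any iterable of batches (one batch per API call);
    ``limit=None`` means keep going until the server runs out.
    """
    count = 0
    for page in pages:
        for item in page:
            if limit is not None and count >= limit:
                return
            count += 1
            yield item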

IIRC, we had the same discussion before with listing the objects in a bucket, and it currently returns an iterator of sorts (https://github.com/GoogleCloudPlatform/gcloud-python/blob/cefff49fb57eea68e15e3a64f1a71c1ca107f44d/gcloud/storage/bucket.py#L245).

I believe that we had decided to have a reasonable default (https://github.com/GoogleCloudPlatform/gcloud-python/blob/cefff49fb57eea68e15e3a64f1a71c1ca107f44d/gcloud/storage/bucket.py#L84) on how much listing we'd do; however, it seems we only implemented that for the recursive delete and the global make_public.

Regardless of all of this, based on our pricing, I could understand the argument for Datastore; however, that goes out the window due to Datastore's LIMIT operator (aka, you're expected to say how many results you want back). For PubSub, if you created 1M topics and looped over them, it'd take hours and cost you a whole 40 cents.

I think it's unreasonable to leak this paging abstraction, which makes the API obnoxious to use, simply to avoid someone spending $0.40. If you want the library to show you exactly where your API calls are made, you should be using https://github.com/google/google-api-python-client, which is 1:1 with the API...

@rimey (Contributor) commented Apr 12, 2016

Limiting the number of items by default seems arbitrary and incorrect to me, except when you are listing items like log entries in reverse time order. If the ordering is not defined, limiting is useless.


@jgeewax (Contributor, Author) commented Apr 12, 2016

I agree that the limit should be set manually; however, if the argument against returning an iterator is "it might loop forever! we can't do that to our users!", an arbitrary limit (not many people will have 1k topics) is one way to do it.

Regardless of that arbitrary limit, list_* methods should return something I can iterate over rather than a (batch, token) tuple.

@rimey (Contributor) commented Apr 12, 2016

Here's my input, for what it's worth.

I agree that it is important to provide a simple interface that doesn't require the user to deal with page tokens. If we don't, people will often choose to fetch just the first page. I know I've done that.

Whether it is better to return a list or an iterator will depend on the API, as will whether it makes sense to offer a limit parameter.

Default limits are rarely going to be appropriate. A default limit on PubSub topics would surely be bad, for instance. Failing to override such a default with limit=None would almost always be an insidious bug.

If listing PubSub topics returns an iterator, I actually don't think it makes sense to offer a limit parameter at all. I can't think of a use-case for it. If the user really did want to retrieve some limited number of arbitrary topics, they could do that by terminating the iteration early.
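For example, assuming list_topics() returns a plain iterator as discussed, "at most 100 topics" is already one line of standard library:

import itertools

# Take at most 100 topics; only as many API calls are made as needed.
for topic in itertools.islice(client.list_topics(), 100):
    do_something_with(topic)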

@daspecster (Contributor) commented:

For comparison, boto's sns.get_all_topics() returns 100 at a time with a next_token in the results. However, this limit isn't documented very well, and questions came up on Stack Overflow from people who knew they had over 100 topics (http://stackoverflow.com/a/32019553).

@jgeewax (Contributor, Author) commented Apr 12, 2016

That's a fair point... The difference is that our answer will be:

use for topic in client.list_topics(limit=None) if you really want ALL of them. Otherwise, you'll get a maximum of 1000 (or some sufficiently large default).

@theacodes (Contributor) commented:

I agree that we should have a simple iterator that iterates over everything; however, I want to make sure we don't remove the ability to page manually. That use case is extremely important for Datastore, where pagination can happen across multiple user requests.

@tseaver (Contributor) commented Oct 3, 2016

My $0.02 on the strategy:

  • Rename the current paging methods, from list_foo -> list_foo_paged.
  • Add sibling list_foo methods which just return an iterator already wrapped around list_foo_paged (see the sketch after this list). These list_foo methods do not take parameters for every possible iterator property; instead, users who care can configure the iterator before beginning to iterate.
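In sketch form (the *_paged spelling is the proposal here, not shipped code; bodies elided):

class Client(object):

    def list_topics_paged(self, page_size=None, page_token=None):
        """One API call: return a (topics, next_page_token) pair."""
        ...  # the current list_topics implementation, renamed

    def list_topics(self):
        """All topics, lazily: just wrap the paged method."""
        return MethodIterator(self.list_topics_paged)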

@dhermes (Contributor) commented Oct 25, 2016

FYI I am really close to done

@tseaver (Contributor) commented Nov 15, 2016

The 0.21 releases include the new iterators.
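If I recall the new shape correctly (double-check the 0.21 docs for the exact spellings), the common case now reads:

topics = client.list_topics()  # returns an iterator; no API call yet

for topic in topics:           # pages are fetched lazily as you go
    do_something_with(topic)

Manual paging remains available through the iterator itself rather than a (batch, token) tuple.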

@tseaver closed this as completed on Nov 15, 2016
@dhermes (Contributor) commented Nov 15, 2016

@tseaver There are still some list_ type methods in BigQuery and Monitoring that need to be changed over.
