
Allow easy paging for list operations #895

Closed. jgeewax opened this issue May 26, 2015 · 32 comments.

@jgeewax (Contributor) commented May 26, 2015

Bringing the comment from https://github.com/GoogleCloudPlatform/gcloud-python/pull/889/files#r31042158 over here.

Requiring explicit paging for topics is really ugly; there must be a better way than:

all_topics = []

topics, token = client.list_topics()
all_topics.extend(topics)
while token is not None:
  topics, token = client.list_topics(page_token=token)
  all_topics.extend(topics)

If our concern is "people could make a boatload of requests and they don't realize it", we could always limit this like we do with the recursive deleting in storage?

topics = client.list_topics(limit=100)

We can do a bunch of this with the page_size parameter, but that might mean that I have to wait for everything to come back before starting any work, which seems kind of ridiculous.

It'd be really nice if that limit and page-size stuff was both there, so it's easy to do things like "I want the first 5000 topics, and I want to pull them from the server in chunks of 50":

for topic in client.list_topics(page_size=50, limit=5000):
  push_work_to_other_system(topic)

To add a bit more context, I'd like to toss out: what if we made all of our list operations return iterators?

The use cases I see here are...

  1. I want everything, give it all to me (for topic in list_topics())
  2. I want up to N things, stop giving them to me then (for topic in list_topics(limit=100))
  3. I don't know how many I want, but I'll know when I want to stop (for topic in list_topics(): if topic.name == 'foo': break)
  4. Combination of the previous two (I don't know when I want to stop, but don't let me go on forever, kill it at say... 1000)
  5. I want to pick up where I left off, I saved a token somewhere (sort of like offset)! (for topic in list_topics(page_token=token, limit=100))

The "let's just always return page, page_token" thing doesn't really make all of those use-cases all that fun... But if we always return iterators, they are all easy.

Further, let's say I have a weird case where I just want one page's worth of stuff... list_topics().get_current_page() could return what you want, no?
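To make this concrete, here's a rough sketch of the shape I'm imagining. Everything in it is made up for illustration (the TopicIterator name, the api_request helper), not existing code:

class TopicIterator(object):
    """Iterate over topics, fetching pages lazily from the server."""

    def __init__(self, client, page_size=None, limit=None, page_token=None):
        self.client = client
        self.page_size = page_size
        self.limit = limit
        self.page_token = page_token
        self._current_page = None

    def get_current_page(self):
        """Return just the current page of results (one API call)."""
        if self._current_page is None:
            self._current_page, self.page_token = self.client.api_request(
                page_size=self.page_size, page_token=self.page_token)
        return self._current_page

    def __iter__(self):
        count = 0
        while True:
            for topic in self.get_current_page():
                if self.limit is not None and count >= self.limit:
                    return
                count += 1
                yield topic
            if self.page_token is None:
                return
            self._current_page = None  # forces a fetch of the next page

list_topics(**kwargs) would then just return TopicIterator(client, **kwargs): iterating covers use cases 1 through 5 above, and get_current_page() covers the one-page case.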

@jgeewax added the labels api: datastore, api: pubsub, api: storage, and type: feature request on May 26, 2015
@jgeewax added this to the Core Future milestone on May 26, 2015
@dhermes (Contributor) commented Jun 6, 2015

In general I like the idea, but I was likely intimidated by the sheer length of what you wrote.

Recall #90 was in place just in case this was necessary. We have a nice Iterator matching most of this pattern in datastore.query.

@jgeewax (Contributor, Author) commented Jun 6, 2015

Sorry about that. In short, I want to write:

for topic in client.list_topics(page_size=50, limit=5000):
  do_something_with(topic)

with the ability to short-circuit out (break or return).

@dhermes (Contributor) commented Jul 7, 2015

@tseaver Do you want to take a stab at this in pubsub?

@jgeewax (Contributor, Author) commented Jul 7, 2015 via email

@dhermes (Contributor) commented Jul 7, 2015

Indeed. I am moving the iterator in storage into the client right now (for #952) and noticed we had no iterator in pubsub.

@dhermes (Contributor) commented Aug 27, 2015

Is this not addressed?

@jgeewax (Contributor, Author) commented Aug 28, 2015

I just looked at the documentation for pubsub and storage; both seem to use the iterator, so I'm happy.

I'll open a separate issue if I see anything else.

@jgeewax closed this as completed on Aug 28, 2015
@jgeewax reopened this on Apr 5, 2016
@jgeewax (Contributor, Author) commented Apr 5, 2016

Yea, this isn't actually fixed.

See https://github.com/GoogleCloudPlatform/gcloud-python/blob/188515f192e3823ce21a47df2d4742995d08f109/gcloud/pubsub/client.py#L47

I think we're making a mistake returning a tuple here, and instead should return an iterator.

This would mean that we can treat it as a list (not giving a shit about pagination -- aka, good for scripts that need to go through all topics) as well as a paged-resource if needed.

Infiniscroll list

for topic in client.list_topics():
  do_something_with_topic(topic)

Explicitly paginated

token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)

The reason I care so much about this is that the first case is incredibly common when people first start using this library... So I want to make the common (intro) case fast.

Compared to today:

Infiniscroll list

token = None
while True:
  topics, token = client.list_topics(next_page_token=token)
  for topic in topics:
    do_something_with_topic(topic)
  if token is None:
    break

# or maybe...

token = True  # sentinel so the loop runs at least once
while token is not None:
  topics, token = client.list_topics(
      next_page_token=None if token is True else token)
  for topic in topics:
    do_something_with_topic(topic)

(correct me if there's a better way?)

Explicitly paginated

token = request.GET.get('page_token')
topics, token = client.list_topics(next_page_token=token)
html = template.render({'page_token': token, 'topics': topics})
response.write(html)

Note that the paginated version doesn't really change all that much, but the infiniscroll version comes out far worse... It forces people to learn about page tokens when they aren't interested in paginating anything themselves. They just want to iterate over all the stuff.

@tseaver (Contributor) commented Apr 5, 2016

@jgeewax #1699 added a much more natural spelling for the "just give me all the results of a method" crowd:

>>> from gcloud.iterator import MethodIterator
>>> topics = list(MethodIterator(client.list_topics))

(We can rename it to just Iterator later, after we finish removing all the subclasses of the current iterator.)

We do make those users explicitly choose the "iterate everything" behavior, but they don't have to deal with paging / tokens. In exchange, we make it possible for those who care about distributed-systems performance, or their API budgets, to do their thing cleanly: it is a "good API smell" when the "convenience" layer uses the "less convenient but more controlled" layer.

I actually mean to revise the existing list_foo() methods (e.g., Bucket.list_blobs) to emulate this pattern.
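For reference, a wrapper in this spirit can be tiny. A sketch only, assuming the wrapped method accepts a page_token keyword and returns a (batch, token) pair the way our list_* methods do today:

class MethodIterator(object):
    """Iterate over all results of a paged list method."""

    def __init__(self, method, **kwargs):
        self._method = method
        self._kwargs = kwargs

    def __iter__(self):
        token = None
        while True:
            batch, token = self._method(page_token=token, **self._kwargs)
            for item in batch:
                yield item
            if token is None:
                return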

@jgeewax (Contributor, Author) commented Apr 5, 2016

So you're saying that .list_foo() will look like the example above?

for topic in client.list_topics():
  do_something_with_topic(topic)

And

token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)

?

The example of using MethodIterator and passing in a method (client.list_topics) seems... not very friendly to me.

@tseaver (Contributor) commented Apr 5, 2016

No, list_topics won't return an iterable: it will return the (batch, token) pair (as it does now). If you don't care about when the API calls are made, or how many of them there are, you use the MethodIterator wrapper, which is an iterable:

for topic in MethodIterator(client.list_topics):
    do_something_with(topic)

@jgeewax (Contributor, Author) commented Apr 5, 2016

OK, I really don't like that. It feels like we're just being jerks...

Why wouldn't we just return a thing that is both iterable, and gives you access to the tuple (batch, token) if you want it...?

@dhermes (Contributor) commented Apr 5, 2016

I'd say it's the opposite of jerks. Encouraging paging is discouraging behavior that could accidentally use too many resources. Though we do fine with a hybrid approach / offer both in datastore Query.

@jgeewax (Contributor, Author) commented Apr 6, 2016

I'm proposing a way to do this where both cases ("let me iterate over everything" and "I want to manually page through") are intuitive, easy, and simple. The answer being "import a special class, and wrap a method with it" doesn't seem like the right one to me. It seems like we're saying "we can make this easier for you, but we won't".

I can't imagine why the result of list_something() wouldn't be iterable... It just seems wrong.

Can someone give me a great reason why we wouldn't just return an iterable class that exposes the batch and the next page token if people want it? In other words, why the code I write shouldn't be:

# I want to iterate through everything...
for topic in client.list_topics():
  do_something_with_topic(topic)

And

# I want to go through things one page at a time...
token = request.GET['token']
topics = client.list_topics(next_page_token=token)
html = template.render({'token': topics.next_page_token, 'topics': topics.page})
response.write(html)
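Concretely, the returned object could be something like this. A sketch only: _list_topics_page is a made-up internal helper that performs exactly one API call and returns a (topics, token) pair:

class TopicList(object):
    """Iterable over all topics; also exposes manual paging."""

    def __init__(self, client, next_page_token=None):
        self._client = client
        # Fetch the first page eagerly so .page / .next_page_token work.
        self.page, self.next_page_token = client._list_topics_page(
            page_token=next_page_token)

    def __iter__(self):
        page, token = self.page, self.next_page_token
        while True:
            for topic in page:
                yield topic
            if token is None:
                return
            page, token = self._client._list_topics_page(page_token=token)

Both snippets above then work unchanged: the for loop never sees a token, and the template code reads .page and .next_page_token directly.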

@jgeewax (Contributor, Author) commented Apr 6, 2016

I suspect there is also a bit of concern about doing the equivalent of a HEAD request, as well as the "I only want a few items back, please don't loop forever" situation. In those cases, here's the code I'm proposing:

# I just want the first page:
for topic in client.list_topics().page:
  do_something_with(topic)

# Please only give me a few items, not everything I've ever added...
for topic in client.list_topics(limit=100):
  do_something_with(topic)

@jgeewax modified the milestones (Core Stable, Core Future) on Apr 7, 2016
@jgeewax (Contributor, Author) commented Apr 7, 2016

@tseaver @dhermes: Any thoughts here?

@jgeewax (Contributor, Author) commented Apr 11, 2016

/cc @daspecster

@daspecster (Contributor) commented:

Hello!

From the discussion I've read here, I'm not sure what's holding us back from making this return an iterator. I could be missing something, but that seems like the natural result to me.

@dhermes, @tseaver are there more implications to this that I'm not seeing?

@tseaver (Contributor) commented Apr 11, 2016

@daspecster the issue is that we will be encouraging users to naively loop over all results of a paged query API. Such looping cannot be best-practice in a distributed system where the users get charged (or exhaust quotas) based on API usage: it hides when / where the API calls are made.

@daspecster (Contributor) commented:

@tseaver, ahhh, I think I see.
So having the API call abstracted down one level forces end users to dig deeper in order to confirm that they want to make an API call and use some of their quota?

Would renaming the method help clarify this? I personally would expect a method called list_topics to return an iterator.

Calling it client.get_topics could help imply quota usage. I'm a fan of very descriptive method names so I might even propose client.get_list_of_topics().

Feedback?

UPDATE:
Sorry, typo. I meant that I would expect an iterator from a method called list_topics. Corrected.

@daspecster (Contributor) commented:

Also, @stephenplusplus passed this on to me from previous discussions of a similar topic. Hopefully it's helpful.

googleapis/google-cloud-node#692 (comment)

@jgeewax (Contributor, Author) commented Apr 12, 2016

TL;DR: It's time to use the iterator with a reasonable default limit.

Making it clear where the API calls happen doesn't actually fix the problem. If you give me back chunks and page tokens, I'll write a loop and then we're back where we started. If I forget my exit condition, we're in the same boat as before, except you made me write more code to get there:

token = True  # sentinel so the loop runs at least once
api_call_count = 0
while token is not None and api_call_count < 10:
  topics, token = client.list_topics(
      next_page_token=None if token is True else token)
  api_call_count += 1
  for topic in topics:
    do_something_with_topic(topic)

This sucks for two reasons:

  1. I don't really care about the number of API calls here (they cost $0.0000004 each), I care about the number of topics I get back to make my app work.
  2. There's no guarantee I'll get N items per page, so depending on how PubSub feels today, I'll get a different number of topics as a result.

The right answer for me is being able to say "don't give me more than N topics", and understanding that this could mean a varying number of API calls:

token = True  # sentinel so the loop runs at least once
topic_count = 0
while token is not None and topic_count < 500:
  topics, token = client.list_topics(
      next_page_token=None if token is True else token)
  for topic in topics:
    topic_count += 1
    do_something_with_topic(topic)

That seems more reasonable to me; however, I'm still writing more code, and I still have to remember the exit condition. As a matter of fact, it's more difficult for me to do what I want and just as easy for me to mess up and loop forever over billions of topics. There's no way to get multiple pages with this model without writing my own loop and putting in a manual exit condition.

In other words, if the problem we're trying to solve is "don't make people stupidly loop over everything", the right answer is likely:

class Client:
  def list_topics(self, ..., limit=1000):
    return Iterator(...)

for topic in client.list_topics():
  do_something_with(topic)

This means the "loop forever" call would be...

for topic in client.list_topics(limit=None):
  do_something_with(topic)
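The limit enforcement inside such an Iterator is trivial. A sketch with a hypothetical helper name:

def _items_with_limit(pages, limit=1000):
    """Yield items from successive pages, stopping after ``limit`` items.

    ``pages`` is any iterable of batches (one batch per API call);
    ``limit=None`` means keep going until the server runs out.
    """
    count = 0
    for page in pages:
        for item in page:
            if limit is not None and count >= limit:
                return
            count += 1
            yield item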

IIRC, we had the same discussion before with listing the objects in a bucket, and it currently returns an iterator of sorts (https://github.com/GoogleCloudPlatform/gcloud-python/blob/cefff49fb57eea68e15e3a64f1a71c1ca107f44d/gcloud/storage/bucket.py#L245).

I believe that we had decided to have a reasonable default (https://github.com/GoogleCloudPlatform/gcloud-python/blob/cefff49fb57eea68e15e3a64f1a71c1ca107f44d/gcloud/storage/bucket.py#L84) on how much listing we'd do; however, it seems we only implemented that for the recursive delete and the global make_public.

Regardless of all of this, based on our pricing, I could understand the argument for Datastore; however, that goes out the window due to Datastore's LIMIT operator (aka, you're expected to say how many results you want back). For PubSub, if you created 1M topics and looped over them, it'd take hours and cost you a whole 40 cents.

I think it's unreasonable to leak this paging abstraction, which makes the API obnoxious to use, simply to avoid someone spending $0.40. If you want the library to show you exactly where your API calls are made, you should be using https://github.com/google/google-api-python-client, which is 1:1 with the API...

@rimey (Contributor) commented Apr 12, 2016

Limiting the number of items by default seems arbitrary and incorrect to me, except when you are listing items like log entries in reverse time order. If the ordering is not defined, limiting is useless.


@jgeewax (Contributor, Author) commented Apr 12, 2016

I agree that the limit should be set manually; however, if the argument against returning an iterator is "it might loop forever! we can't do that to our users!", an arbitrary limit (not many people will have 1k topics) is one way to do it.

Regardless of that arbitrary limit, list_* methods should return something I can iterate over rather than a (batch, token) tuple.

@rimey (Contributor) commented Apr 12, 2016

Here's my input, for what it's worth.

I agree that it is important to provide a simple interface that doesn't require the user to deal with page tokens. If we don't, people will often choose to fetch just the first page. I know I've done that.

Whether it is better to return a list or an iterator will depend on the API, as will whether it makes sense to offer a limit parameter.

Default limits are rarely going to be appropriate. A default limit on PubSub topics would surely be bad, for instance. Failing to override such a default with limit=None would almost always be an insidious bug.

If listing PubSub topics returns an iterator, I actually don't think it makes sense to offer a limit parameter at all. I can't think of a use-case for it. If the user really did want to retrieve some limited number of arbitrary topics, they could do that by terminating the iteration early.
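For example, assuming list_topics() returns a plain iterator as discussed, "at most 100 topics" is already one line of standard library:

import itertools

# Take at most 100 topics; only as many API calls are made as needed.
for topic in itertools.islice(client.list_topics(), 100):
    do_something_with(topic)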

@daspecster (Contributor) commented:

For comparison, boto's sns.get_all_topics() returns 100 at a time with a next_token in the results. However, this limit isn't documented very well, and questions came up on Stack Overflow from people who knew they had over 100 topics (http://stackoverflow.com/a/32019553).

@jgeewax (Contributor, Author) commented Apr 12, 2016

That's a fair point... The difference is that our answer will be:

use for topic in client.list_topics(limit=None) if you really want ALL of them. Otherwise, you'll get a maximum of 1000 (or some sufficiently large default).

@theacodes (Contributor) commented:

I agree that we should have a simple iterator that iterates over everything; however, I want to make sure we don't remove the ability to page manually. That use case is extremely important for Datastore, where pagination can happen across multiple user requests.

@tseaver (Contributor) commented Oct 3, 2016

My $0.02 on the strategy:

  • Rename the current paging methods, from list_foo -> list_foo_paged.
  • Add sibling list_foo methods which just return an iterator already wrapped around list_foo_paged (see the sketch after this list). These list_foo methods do not take parameters for every possible iterator property; instead, users who care can configure the iterator before beginning to iterate.
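In sketch form (the *_paged spelling is the proposal here, not shipped code; bodies elided):

class Client(object):

    def list_topics_paged(self, page_size=None, page_token=None):
        """One API call: return a (topics, next_page_token) pair."""
        ...  # the current list_topics implementation, renamed

    def list_topics(self):
        """All topics, lazily: just wrap the paged method."""
        return MethodIterator(self.list_topics_paged)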

@dhermes (Contributor) commented Oct 25, 2016

FYI I am really close to done

@tseaver (Contributor) commented Nov 15, 2016

The 0.21 releases include the new iterators.
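If I recall the new shape correctly (double-check the 0.21 docs for the exact spellings), the common case now reads:

topics = client.list_topics()  # returns an iterator; no API call yet

for topic in topics:           # pages are fetched lazily as you go
    do_something_with(topic)

Manual paging remains available through the iterator itself rather than a (batch, token) tuple.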

@tseaver closed this as completed on Nov 15, 2016
@dhermes (Contributor) commented Nov 15, 2016

@tseaver There are still some list_ type methods in BigQuery and Monitoring that need to be changed over.
