Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiencing Extreme (10s+) Latencies with Pub/Sub #1974

Closed
MarkHerhold opened this issue Feb 8, 2017 · 27 comments
Closed

Experiencing Extreme (10s+) Latencies with Pub/Sub #1974

MarkHerhold opened this issue Feb 8, 2017 · 27 comments
Assignees
Labels
api: pubsub Issues related to the Pub/Sub API. priority: p2 Moderately-important priority. Fix may not be included in next release.

Comments

@MarkHerhold
Copy link

I described my situation in this GCP Pub/Sub forum post:
https://groups.google.com/forum/#!topic/cloud-pubsub-discuss/xhDvw63IVI8

I'm using (or trying to) GCP Pub/Sub by publishes a message on an existing subscription and topic every 5 minute and then acknowledging it on the same topic. I get monitoring alerts that the process is failing after waiting for 10 seconds and timing out a few times a day. My monitoring reports that there have been 109 timeouts over the past 24 hours. I think 10 seconds is a very high threshold for waiting for a message and would love to know what is causing this.

Kamal Aboul-Hosn said to create an issue on this repo and commented that it may be a perf issue. Here's what he said:

It looks like the Node.js Pub/Sub library has a performance issue. The way it looks to work is that it creates a pull request, waits for a response for that request, and then 10 milliseconds later creates a new pull request. This is not going to be the best way to minimize message end-to-end latency. To minimize that, one needs to have multiple outstanding pull requests simultaneously. As soon as a response comes, another request needs to be created. I recommend entering an issue in the google-cloud-node Github repository.

Even with that change, I expect that you will still occasionally see 10-second latencies for some messages.

Environment details

  • OS: Debian 8
  • Node.js version: 6.9.5
  • npm version: 4.x.x
  • google-cloud-node version: 0.8.0

Steps to reproduce

  1. require @google-cloud/pubsub
  2. create topic
  3. create subscription
  4. send a message on topic created above (this happens every 5 minutes)
  5. immediately try to receive the same message on the same host
  6. randomly experience 10s+ timeouts

Image from monitoring tool showing frequency of timeouts over the past 24 hours:
image

@stephenplusplus stephenplusplus added perf api: pubsub Issues related to the Pub/Sub API. labels Feb 8, 2017
@stephenplusplus
Copy link
Contributor

one needs to have multiple outstanding pull requests simultaneously.

@tmatsuo can you confirm we should be firing multiple pull requests when a user registers a message event (subscription.on('message', function(msg) {}))? And if so, is there any additional logic to consider, e.g. the amount of simultaneous requests?

As soon as a response comes, another request needs to be created.

This is the current behavior, with the exception of the 10 millisecond timeout. This can be changed:

var subscription = topic.subscription('sub-name', { timeout: 0 })

@kamalaboulhosn
Copy link

can you confirm we should be firing multiple pull requests when a user registers a message event (subscription.on('message', function(msg) {}))? And if so, is there any additional logic to consider, e.g. the amount of simultaneous requests?

Developer on the Cloud Pub/Sub team here. I can confirm that multiple outstanding pull requests is the correct behavior. The number of concurrent requests would have to grow fairly large before it would be an issue on the server side. 5-10 simultaneous pull requests would be reasonable.

@MarkHerhold
Copy link
Author

This is still a significant problem for us (beyond performance). Is there an ETA on a fix? I'm not sure how GCP bugs are prioritized and do realize the API is Alpha quality.

@stephenplusplus stephenplusplus self-assigned this Feb 17, 2017
@stephenplusplus
Copy link
Contributor

On it!

@stephenplusplus
Copy link
Contributor

@MarkHerhold could you give this a try:

$ npm install --save @google-cloud/[email protected]

@MarkHerhold
Copy link
Author

The good news is that Pub/Sub is way more consistent now. The bad news is that nearly everything is timing out. I'll poke around the code to see if we are doing something wrong.

image

@MarkHerhold
Copy link
Author

@stephenplusplus Yeah, it breaks after a few sequential publish/subscribes. This isn't the most elegant test script, but it reproduces the problem for me, usually sometime after the 10th iteration. 🤔

const pubsub = require('@google-cloud/pubsub');
const client = pubsub(/* secret here */);

async function go() {

    const topic = (await client.createTopic('sometopic' + Math.floor(Math.random() * 2000)))[0];
    const subscription = (await topic.subscribe({
            ackDeadlineSeconds: 10,
            autoAck: false
        }))[0];

    for (let i = 0; i < 20; i++) {
        const start = Date.now();
        await new Promise(function(resolve, reject) {
            function onMessage(message) {
                if (message.data === 'test') {
                    subscription.removeListener('message', onMessage);
                    message.ack().then(resolve).catch(reject);
                }
            }

            subscription.on('message', onMessage);

            topic.publish('test');

            setTimeout(function() {
                subscription.removeListener('message', onMessage);
                reject('Pub/Sub timed out after ' + 10000 + ' ms');
            }, 10000);
        }).then(function() {
            console.log(`Pub/Sub resolved in ${Date.now() - start} ms`);
        }).catch(function(err) {
            console.error(err);
        });
    }
}

try {
    go();
} catch (err) {
    console.error(err);
}

@askmon
Copy link

askmon commented Feb 20, 2017

@MarkHerhold I think that maybe your issue is similar to mine. The message counts as pulled, but can't be acked for some reason, then you have to wait until the ackDeadlineSecond time passes in order to try to receive it again...

@danoscarmike danoscarmike added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. Type: Enhancement labels Feb 27, 2017
@MarkHerhold
Copy link
Author

Can I have a rough ballpark estimate of when this can be looked at/fixed? An estimate would really help me plan my own features. Thank you!

@stephenplusplus
Copy link
Contributor

Thanks for your patience on this. The Pub/Sub team reviewed our API and had a lot of feedback for us, which could see us re-designing big parts of how our API works. These changes will probably roll out over the next few weeks (no guarantee), but would likely address this issue at the same time.

Given the big changes ahead, I'm hesitant to perfect a solution (i.e., what #2006 was meant to evolve into), when the feedback from the API team could squash it. If this isn't explicitly covered from their feedback, we will make sure it's addressed after the bigger changes we make to the API.

@MarkHerhold
Copy link
Author

@stephenplusplus Ok, thanks for the insight!

@Splaktar
Copy link
Contributor

OK, that's really good to know. We've been having a lot of issues with PubSub, especially when writing tests for it. I'm looking forward to major improvements to the NodeJS lib for it :)

@stephenplusplus
Copy link
Contributor

Thanks to both of you for your help along the way!

@caseyduquettesc
Copy link

caseyduquettesc commented Apr 4, 2017

@stephenplusplus Hey, was there a change last night? My latency for first request went from 70s+ to 2s+

Edit:
Full disclosure, last night I also blew away my node_modules folder and reinstalled all my dependencies.

@lukesneeringer
Copy link
Contributor

No change last night. I would love to know what your new node modules are.
In particular, did your grpc dependencies update?

@AVVS
Copy link
Contributor

AVVS commented Apr 8, 2017

Hey, any ETA on updates to pubsub here?

@lukesneeringer
Copy link
Contributor

I am pulling together a design proposal, and will be assigning it sometime in the next couple weeks.

@MarkHerhold
Copy link
Author

@lukesneeringer @stephenplusplus
Any progress or new estimates?

@stephenplusplus
Copy link
Contributor

A redesign is currently in progress. Expect a PR within the coming week or so.

@stephenplusplus
Copy link
Contributor

I'm going to close as the new PR is about to be sent. The Pub/Sub team was behind the entire redesign, so it will be using the best practices, following their recommendations.

@lukesneeringer lukesneeringer removed the priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. label Jun 12, 2017
@stephenplusplus
Copy link
Contributor

The PR has been sent: #2380. Thanks for your patience, everyone.

@Splaktar
Copy link
Contributor

Awesome that the PR is here! But it seems premature to close this as the PR still has a number of TODOs and a 'do not merge' tag. Thanks for your hard work on this issue and it's great news that the Pub/Sub team was involved!

@MarkHerhold
Copy link
Author

I agree with @Splaktar . It is unfair to customers when issues get closed before they are actually resolved. Please consider keeping this open until the PR is merged for tracking and visibility purposes.

@stephenplusplus
Copy link
Contributor

Thanks for the feedback! I will re-open, with a quick explanation. Our APIs are shifting towards using a generated layer, meaning it would be the fault of that, or a child layer if this problem persisted. Keeping this open makes a lot of sense, and closing it wouldn't guarantee a resolution. Thanks for keeping us doing the right things-- always feel free to speak your mind if we head down the wrong road again.

@evaogbe evaogbe added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Aug 7, 2017
@bgabler
Copy link

bgabler commented Aug 16, 2017

@MarkHerhold we were seeing the same exact behavior in our pub/sub POC. I can confirm that so far #2380 has resolved the random 10s+ delay.

@stephenplusplus
Copy link
Contributor

The new release is out! Please upgrade to @google-cloud/[email protected] if you haven't already. If there are any issues remaining, please let us know.

@dskdas
Copy link

dskdas commented Apr 30, 2018

@stephenplusplus If i was using subscription.pull({ returnImmediately: true }) earlier, is there some guidance on how i should do it instead with the api re-design. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the Pub/Sub API. priority: p2 Moderately-important priority. Fix may not be included in next release.
Projects
None yet
Development

No branches or pull requests