Issue with /v1/status/leader API #1560

Closed
sebi-hgdata opened this issue Jan 4, 2016 · 3 comments

@sebi-hgdata

I have a Consul 0.5.2 setup with 3 servers and 1 agent that is used mainly for supporting docker overlay networks. I have the following upstart script for consul:

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!12345]
respawn
respawn limit 10 10

setuid ubuntu
setgid ubuntu
script
  /home/ubuntu/hgdata/deployments/consul/consul agent -ui-dir=/home/ubuntu/hgdata/deployments/consul/web_ui -config-dir=/home/ubuntu/hgdata/deployments/consul/config/server -data-dir=/home/ubuntu/hgdata/deployments/consul/data -bootstrap-expect=3 -node=api -client 0.0.0.0
end script
post-start script
   while ! curl  http://localhost:8500/v1/status/leader 2>&1|grep 8300; do echo '[UPSTART] wait till cluster has a leader'; sleep 1; done
end script

The Docker service starts only after Consul.

I'm running disaster recovery tests in which I reboot all 4 machines at the same time and check that the Docker containers restart properly. I observed that the post-start script does not work as expected: it reports that a leader has been elected before one actually has. That triggers the Docker service startup, which then hits "No cluster leader" errors, and the containers do not restart properly.

Here are the logs :

==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'api'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.3.0.226 (LAN: 8301, WAN: 8302)
    Gossip encrypt: true, RPC-TLS: false, TLS-Incoming: false
             Atlas: <disabled>

==> Log data will now stream in as it occurs:

    2016/01/04 09:41:35 [INFO] serf: EventMemberJoin: api 10.3.0.226
    2016/01/04 09:41:35 [INFO] serf: EventMemberJoin: api.dc1 10.3.0.226
    2016/01/04 09:41:35 [WARN] Service name "api_bearfist_v4" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2016/01/04 09:41:35 [WARN] Service name "logstash_bearfist" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2016/01/04 09:41:35 [INFO] raft: Node at 10.3.0.226:8300 [Follower] entering Follower state
    2016/01/04 09:41:35 [INFO] serf: Attempting re-join to previously known node: httpd: 10.3.0.216:8301
    2016/01/04 09:41:35 [WARN] serf: Failed to re-join any previously known node
    2016/01/04 09:41:35 [INFO] consul: adding server api (Addr: 10.3.0.226:8300) (DC: dc1)
    2016/01/04 09:41:35 [INFO] consul: adding server api.dc1 (Addr: 10.3.0.226:8300) (DC: dc1)
    2016/01/04 09:41:35 [ERR] agent: failed to sync remote state: No cluster leader
[UPSTART] wait till cluster has a leader
    2016/01/04 09:41:36 [WARN] raft: Rejecting vote from 10.3.0.227:8300 since we have a leader: 10.3.0.227:8300
"10.3.0.227:8300"
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes?recurse=&wait=15000ms, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
.........
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/network/v1.0/network/0cf6c24801e0d432959cc71588d6b1c9485119a566cb6b0a985be917bd03a317/?consistent=, error: No cluster leader
==> Newer Consul version available: 0.6.0
    2016/01/04 09:41:41 [WARN] agent: Check 'service:api_bearfist_v4' is now critical
    2016/01/04 09:41:41 [INFO] serf: EventMemberJoin: ldap 10.3.0.227
    2016/01/04 09:41:41 [INFO] consul: adding server ldap (Addr: 10.3.0.227:8300) (DC: dc1)
    2016/01/04 09:41:41 [INFO] consul: New leader elected: ldap
    2016/01/04 09:41:42 [INFO] agent: Synced service 'consul'
    2016/01/04 09:41:42 [INFO] agent: Synced service 'api_bearfist_v4'
    2016/01/04 09:41:42 [INFO] agent: Synced service 'logstash_bearfist'
    2016/01/04 09:41:44 [INFO] serf: EventMemberJoin: httpd 10.3.0.216
    2016/01/04 09:41:44 [INFO] consul: adding server httpd (Addr: 10.3.0.216:8300) (DC: dc1)
    2016/01/04 09:41:45 [WARN] raft: Rejecting vote from 10.3.0.216:8300 since we have a leader: 10.3.0.227:8300
    2016/01/04 09:41:45 [INFO] serf: Attempting re-join to previously known node: focus: 10.3.0.217:8301
    2016/01/04 09:41:45 [INFO] serf: EventMemberJoin: focus 10.3.0.217
    2016/01/04 09:41:45 [INFO] serf: Re-joined to previously known node: focus: 10.3.0.217:8301

Note the "[UPSTART] wait till cluster has a leader" line followed by the output "10.3.0.227:8300": the leader check passes before Docker queries the KV store, yet those queries still fail with "No cluster leader".

@slackpad
Contributor

slackpad commented Jan 5, 2016

Hi @sebi-hgdata - I'll have to take a deeper look on the Raft side, but I think you might be seeing some startup behavior that's allowed by Raft but noisy for your gating check. You might want to try polling the https://www.consul.io/docs/agent/http/status.html#status_peers endpoint and looking for that to have 3 entries (you could pipe through jq or similar). That should give you a good view of everything once elections have settled down and all the servers are joined.
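
For anyone gating on the peers list as suggested above, here is a minimal shell sketch. It assumes `/v1/status/peers` returns a JSON array of quoted `address:port` strings and counts them with `grep` rather than `jq` to avoid the extra dependency. The names `count_peers`, `wait_for_peers`, and `CONSUL_ADDR` are hypothetical and do not come from this thread.

```shell
# Count the entries in the JSON array returned by /v1/status/peers.
# Reads the response body on stdin and prints the number of quoted strings.
count_peers() {
  grep -o '"[^"]*"' | wc -l | tr -d ' '
}

# Block until the local agent reports at least $1 Raft peers.
# CONSUL_ADDR is an assumed variable; the default is the address used in the issue.
wait_for_peers() {
  expected="$1"
  addr="${CONSUL_ADDR:-http://localhost:8500}"
  while :; do
    # An unreachable agent yields an empty body, which counts as 0 peers.
    n=$(curl -s "$addr/v1/status/peers" | count_peers)
    [ "$n" -ge "$expected" ] && break
    echo "[UPSTART] waiting for $expected raft peers, have $n"
    sleep 1
  done
}
```

In the Upstart job this would replace the `grep 8300` loop, for example by calling `wait_for_peers 3` inside the `post-start script` stanza. Counting quoted strings is cruder than parsing the JSON, so `jq length` is probably the better choice where `jq` is installed.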

@sebi-hgdata
Author

@slackpad Thanks for the quick response.
I changed the post-start script to request a nonexistent key and check that the response is an empty string (while the cluster is not ready, it returns 'No cluster servers' and/or 'No cluster leader'), and that seems to do the job for me. Still, I assumed that the KV and leader APIs would return consistent responses, but they don't. Could there be some inconsistent leader checks?
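
The exact script was not posted, so the following is only a sketch of what such a KV-based gate might look like. It treats the cluster as ready when `curl` itself succeeds and the body for a missing key is empty, as described above. The key `upstart/readiness-probe`, the variable `CONSUL_ADDR`, and the function names are all hypothetical.

```shell
# Decide readiness from one probe of a key that is never written.
# $1: curl exit status, $2: response body.
# Ready only if curl reached the agent and the body is empty; an error such as
# "No cluster leader" gives a non-empty body and counts as not ready.
kv_response_ready() {
  [ "$1" -eq 0 ] && [ -z "$2" ]
}

# Poll the KV endpoint once per second until a probe looks ready.
wait_for_kv() {
  url="${CONSUL_ADDR:-http://localhost:8500}/v1/kv/${1:-upstart/readiness-probe}"
  while :; do
    # Capture curl's exit status so a refused connection is not mistaken
    # for an empty (ready) response.
    body=$(curl -s "$url") && rc=0 || rc=$?
    kv_response_ready "$rc" "$body" && break
    echo '[UPSTART] wait till KV requests succeed'
    sleep 1
  done
}
```

Checking the `curl` exit status matters here: without it, an agent that is not listening yet would also produce an empty string and the gate would open too early.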

@slackpad
Contributor

slackpad commented Jan 5, 2016

@sebi-hgdata I think there are times where multiple nodes think they are the leader, but only one will be able to perform writes, so looking at the leader endpoint for this application isn't a super reliable method to wait for Consul to get into a good state. Please take a look at the conversation on #1562 which has a way to check using the peers list.
