Issue with /v1/status/leader API #1560

sebi-hgdata · 2016-01-04T11:22:38Z

I have a Consul 0.5.2 setup with 3 servers and 1 agent that is used mainly for supporting docker overlay networks. I have the following upstart script for consul:

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!12345]
respawn
respawn limit 10 10

setuid ubuntu
setgid ubuntu
script
  /home/ubuntu/hgdata/deployments/consul/consul agent -ui-dir=/home/ubuntu/hgdata/deployments/consul/web_ui -config-dir=/home/ubuntu/hgdata/deployments/consul/config/server -data-dir=/home/ubuntu/hgdata/deployments/consul/data -bootstrap-expect=3 -node=api -client 0.0.0.0
end script
post-start script
   while ! curl  http://localhost:8500/v1/status/leader 2>&1|grep 8300; do echo '[UPSTART] wait till cluster has a leader'; sleep 1; done
end script

and the docker service starts only after consul.

I'm running disaster-recovery tests in which I reboot all 4 machines at the same time and check that the docker containers are restarted properly. I observed that the post-start script does not work as expected: it reports that a leader has been elected before one really has, which triggers the docker service startup and, in consequence, the errors that follow about not having a cluster leader and the containers not restarting properly.

Here are the logs :

==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'api'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.3.0.226 (LAN: 8301, WAN: 8302)
    Gossip encrypt: true, RPC-TLS: false, TLS-Incoming: false
             Atlas: <disabled>

==> Log data will now stream in as it occurs:

    2016/01/04 09:41:35 [INFO] serf: EventMemberJoin: api 10.3.0.226
    2016/01/04 09:41:35 [INFO] serf: EventMemberJoin: api.dc1 10.3.0.226
    2016/01/04 09:41:35 [WARN] Service name "api_bearfist_v4" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2016/01/04 09:41:35 [WARN] Service name "logstash_bearfist" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
    2016/01/04 09:41:35 [INFO] raft: Node at 10.3.0.226:8300 [Follower] entering Follower state
    2016/01/04 09:41:35 [INFO] serf: Attempting re-join to previously known node: httpd: 10.3.0.216:8301
    2016/01/04 09:41:35 [WARN] serf: Failed to re-join any previously known node
    2016/01/04 09:41:35 [INFO] consul: adding server api (Addr: 10.3.0.226:8300) (DC: dc1)
    2016/01/04 09:41:35 [INFO] consul: adding server api.dc1 (Addr: 10.3.0.226:8300) (DC: dc1)
    2016/01/04 09:41:35 [ERR] agent: failed to sync remote state: No cluster leader
[UPSTART] wait till cluster has a leader
    2016/01/04 09:41:36 [WARN] raft: Rejecting vote from 10.3.0.227:8300 since we have a leader: 10.3.0.227:8300
"10.3.0.227:8300"
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes?recurse=&wait=15000ms, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/nodes/10.3.0.226:2375, error: No cluster leader
.........
    2016/01/04 09:41:37 [ERR] http: Request /v1/kv/docker/network/v1.0/network/0cf6c24801e0d432959cc71588d6b1c9485119a566cb6b0a985be917bd03a317/?consistent=, error: No cluster leader
==> Newer Consul version available: 0.6.0
    2016/01/04 09:41:41 [WARN] agent: Check 'service:api_bearfist_v4' is now critical
    2016/01/04 09:41:41 [INFO] serf: EventMemberJoin: ldap 10.3.0.227
    2016/01/04 09:41:41 [INFO] consul: adding server ldap (Addr: 10.3.0.227:8300) (DC: dc1)
    2016/01/04 09:41:41 [INFO] consul: New leader elected: ldap
    2016/01/04 09:41:42 [INFO] agent: Synced service 'consul'
    2016/01/04 09:41:42 [INFO] agent: Synced service 'api_bearfist_v4'
    2016/01/04 09:41:42 [INFO] agent: Synced service 'logstash_bearfist'
    2016/01/04 09:41:44 [INFO] serf: EventMemberJoin: httpd 10.3.0.216
    2016/01/04 09:41:44 [INFO] consul: adding server httpd (Addr: 10.3.0.216:8300) (DC: dc1)
    2016/01/04 09:41:45 [WARN] raft: Rejecting vote from 10.3.0.216:8300 since we have a leader: 10.3.0.227:8300
    2016/01/04 09:41:45 [INFO] serf: Attempting re-join to previously known node: focus: 10.3.0.217:8301
    2016/01/04 09:41:45 [INFO] serf: EventMemberJoin: focus 10.3.0.217
    2016/01/04 09:41:45 [INFO] serf: Re-joined to previously known node: focus: 10.3.0.217:8301

Note the "[UPSTART] wait till cluster has a leader" line and the "10.3.0.227:8300" output from the leader endpoint, both of which appear before docker queries the KV store and gets "No cluster leader".


slackpad · 2016-01-05T03:46:07Z

Hi @sebi-hgdata - I'll have to take a deeper look on the Raft side, but I think you might be seeing some startup behavior that's allowed by Raft but noisy for your gating check. You might want to try polling the https://www.consul.io/docs/agent/http/status.html#status_peers endpoint and looking for that to have 3 entries (you could pipe through jq or similar). That should give you a good view of everything once elections have settled down and all the servers are joined.
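A minimal sketch of such a peers-based gate, written as shell functions that could be called from the upstart post-start stanza. It counts the quoted entries in the JSON array with grep instead of jq so that it needs no extra tools. The function names and the CONSUL_ADDR and EXPECTED_PEERS variables are illustrative assumptions, not part of Consul:

```shell
#!/bin/sh
# Illustrative settings (assumptions, adjust to your cluster).
CONSUL_ADDR="${CONSUL_ADDR:-http://localhost:8500}"
EXPECTED_PEERS="${EXPECTED_PEERS:-3}"

# Count the quoted entries of a JSON array of strings read from stdin,
# e.g. ["10.3.0.226:8300","10.3.0.227:8300"] prints 2.
count_peers() {
  grep -o '"[^"]*"' | wc -l | tr -d ' '
}

# Succeed when the peer list on stdin has at least EXPECTED_PEERS entries.
peers_ready() {
  [ "$(count_peers)" -ge "$EXPECTED_PEERS" ]
}

# Poll /v1/status/peers once per second until enough servers are listed.
wait_for_peers() {
  until curl -s "$CONSUL_ADDR/v1/status/peers" | peers_ready; do
    echo '[UPSTART] wait till all servers are in the peer set'
    sleep 1
  done
}
```

Calling `wait_for_peers` in place of the leader check would hold back the docker service until all three servers appear in the peer list. If the agent is not reachable yet, curl prints nothing, the count is 0, and the loop keeps waiting.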

sebi-hgdata · 2016-01-05T09:12:05Z

@slackpad Thanks for the quick response.
I changed the post-start script to request a nonexistent key and check that the response is an empty string (while the cluster is not ready it returns 'No cluster servers' and/or 'No cluster leader' instead), and it seems to do the job for me. Anyway, I assumed that the KV and the leader APIs would return consistent responses, but they don't. Might there be some inconsistent leader checks?
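A rough sketch of that kind of KV probe, not the exact script used here. It assumes that a GET for a missing key returns HTTP 404 with an empty body once a leader is available, and that any other outcome (an error body such as 'No cluster leader', or no connection at all) means "keep waiting". The key name, variable names, and function names are hypothetical:

```shell
#!/bin/sh
# Illustrative settings (assumptions, adjust to your cluster).
CONSUL_ADDR="${CONSUL_ADDR:-http://localhost:8500}"
PROBE_KEY="${PROBE_KEY:-upstart/leader-probe}"  # hypothetical key, assumed never written

# $1: HTTP status code, $2: response body.
# Prints "ready" only for an empty 404 response, otherwise "waiting".
classify_kv_probe() {
  if [ "$1" = "404" ] && [ -z "$2" ]; then
    echo ready
  else
    echo waiting
  fi
}

# Poll the KV store once per second until the probe reports "ready".
wait_for_kv() {
  while :; do
    # Body first, then the status code on its own last line.
    resp=$(curl -s -w '\n%{http_code}' "$CONSUL_ADDR/v1/kv/$PROBE_KEY")
    status=$(printf '%s\n' "$resp" | tail -n 1)
    body=$(printf '%s\n' "$resp" | sed '$d')
    [ "$(classify_kv_probe "$status" "$body")" = "ready" ] && break
    echo '[UPSTART] wait till the KV store has a leader'
    sleep 1
  done
}
```

Checking the status code as well as the body keeps a refused connection (which also produces an empty body) from being mistaken for a healthy cluster.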

slackpad · 2016-01-05T18:21:27Z

@sebi-hgdata I think there are times where multiple nodes think they are the leader, but only one will be able to perform writes, so looking at the leader endpoint for this application isn't a super reliable method to wait for Consul to get into a good state. Please take a look at the conversation on #1562 which has a way to check using the peers list.

slackpad mentioned this issue Jan 5, 2016

consul members and v1/status/peers inconsistent #1562

Closed

slackpad closed this as completed Apr 12, 2017

