How to add a new backend host to a fnv1a_ch cluster with a replication factor of 1? #283

Closed
mwtzzz-zz opened this issue Jun 25, 2017 · 35 comments

Comments

@mwtzzz-zz
mwtzzz-zz commented Jun 25, 2017

Hi, my understanding (admittedly without spending much time thinking about it) was that I could simply add an additional host to the fnv1a_ch cluster definition (replication factor 1) and (a) existing metrics would still be routed to the same hosts as before, while (b) brand-new metrics would start to make their way to the new host. I assumed that the hashing mechanism would continue to route existing metrics as before.

But this isn't the case. I added a new host, and it immediately started receiving an even spread of metrics that were previously going to the other hosts. Of course, this resulted in our frontend giving unpredictable and incorrect graphs.

My question is: what is the proper way to add a new backend host?

I'm using instance ids. My cluster definition looks like this:

cluster radar122
  fnv1a_ch
    radar122-a:1905=a
    radar122-b:1905=b
    radar122-c:1905=c
    radar122-d:1905=d
    ....
    ;
@mwtzzz-zz
Author

Looking at the man page, the following jumps out at me:

When using the fnv1a_ch cluster, this instance overrides the hash key in use.

Is this the reason for the behavior I observed?

@deniszh

deniszh commented Jun 25, 2017

Hello @mwtzzz,
Unfortunately, your initial assumption is wrong. If you use any consistent hashing (carbon_ch, fnv1a_ch or jump_fnv1a_ch, it doesn't matter), then after adding a host to a cluster of N nodes and K metrics, the routing of roughly K/N metrics will change. That is still good: with normal ("non-consistent") hashing, adding a host changes the routing of almost all K metrics (see wiki).
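
To make the K/N point concrete, here is a minimal sketch that compares a generic consistent-hash ring with plain modulo hashing when a fifth node is added. It is not carbon-c-relay's actual fnv1a_ch implementation: the virtual-node naming (`<node>-<i>`), the 100 points per node, and the metric names are all assumptions made for illustration. Only the FNV-1a 32-bit hash itself is standard.

```python
import bisect


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h


class Ring:
    """Generic consistent-hash ring (illustrative, not the relay's own layout)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` hashed positions.
        self.points = sorted(
            (fnv1a_32(f"{node}-{i}".encode()), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.points]

    def owner(self, key: str) -> str:
        # The first ring position at or after the key's hash owns the key,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_left(self.hashes, fnv1a_32(key.encode()))
        return self.points[idx % len(self.points)][1]


def modulo_owner(nodes, key: str) -> str:
    """Plain "non-consistent" hashing: hash modulo the node count."""
    return nodes[fnv1a_32(key.encode()) % len(nodes)]


old_nodes = ["a", "b", "c", "d"]
new_nodes = old_nodes + ["e"]
keys = [f"servers.host{i}.cpu.load" for i in range(10000)]  # made-up metrics

old_ring, new_ring = Ring(old_nodes), Ring(new_nodes)

ring_moved = sum(old_ring.owner(k) != new_ring.owner(k) for k in keys) / len(keys)
mod_moved = sum(
    modulo_owner(old_nodes, k) != modulo_owner(new_nodes, k) for k in keys
) / len(keys)

print(f"consistent ring: {ring_moved:.1%} of metrics re-routed")
print(f"modulo hashing:  {mod_moved:.1%} of metrics re-routed")
```

With the ring, every re-routed metric lands on the new node and the rest stay where they were; with modulo hashing, most metrics change owner. Either way, some existing metrics do move, which is why a rebalance is needed.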

So, you should rebalance your cluster after adding a node. You can use carbonate (but it supports only carbon hashing) or buckytools (supports carbon_ch, fnv1a_ch or jump_fnv1a_ch)
Please also note that modern Graphite (0.9.15/16 or 1.0.x) can "merge" metrics on the fly (set REMOTE_STORE_MERGE_RESULTS=True), so, you should get consistent graphs even if you didn't rebalance your cluster (but only for carbon_ch and fnv1a_ch hashes, not jump one).
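
For reference, a minimal sketch of what the merge setting might look like in graphite-web's `local_settings.py`. `REMOTE_STORE_MERGE_RESULTS` and `CLUSTER_SERVERS` are graphite-web settings; the host names and web port below are placeholders, and this assumes the frontend queries the storage hosts through `CLUSTER_SERVERS`, which the thread does not confirm.

```python
# local_settings.py (graphite-web) fragment.
# Host names and ports are placeholders, not taken from the reporter's setup.
CLUSTER_SERVERS = [
    "radar122-a:80",
    "radar122-b:80",
    "radar122-c:80",
]

# Merge series returned by several backends instead of using only the first
# result, so a metric split across old and new hosts still renders as one
# graph.
REMOTE_STORE_MERGE_RESULTS = True
```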

@mwtzzz-zz
Author

@deniszh Thanks for your quick response and for providing these two suggestions. I think I'm going to look at the Merge option. If I go with Merge, should I still rebalance?

@deniszh

deniszh commented Jun 25, 2017

It depends. I think too much merging will make rendering slow after some point. Also, note that whisper file size is fixed (if you are not using sparse files), so adding a new host will cause the creation of about K/N new whisper files, which will consume disk space.
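
A rough back-of-the-envelope sketch of that disk cost, assuming every whisper file has the same fixed size. The metric count and file size are made-up example values, and the function name is hypothetical.

```python
def new_whisper_files(total_metrics: int, nodes_after_add: int) -> int:
    """Approximate number of metrics re-routed to the new host (about K/N)."""
    return total_metrics // nodes_after_add


# Example values only: 1,000,000 metrics, 12 hosts growing to 13,
# and a hypothetical fixed whisper file size of 1 MiB.
files = new_whisper_files(1_000_000, 13)
extra_bytes = files * 1024 * 1024

print(f"{files} new whisper files, about {extra_bytes / 1024**3:.1f} GiB")
```

The old copies of those files stay on the original hosts until the cluster is rebalanced or cleaned up, so this space is consumed in addition to what is already used.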

@mwtzzz-zz
Author

mwtzzz-zz commented Jun 26, 2017

I'm testing buckytools, but it says it doesn't support fnv1a_ch:

setuidgid uuu ./buckyd -node ec2-xxx.compute-1.amazonaws.com -hash fnv1a_ch
2017/06/25 20:16:19 Invalide hash type.  Supported types: [carbon jump_fnv1a]

Is there another way to rebalance the cluster?

@grobian
Owner

grobian commented Jun 26, 2017

@jjneely: fnv1a_ch indeed seems unimplemented, would you accept a patch adding it? Looks as if the change would mostly be adding code to hashing.go, I could try adding it.

@mwtzzz-zz
Author

@grobian Thanks for working on it. If you make a patch, I'll test it on my cluster of 12 hosts, each of which has about 600GB of metric data.

@mwtzzz-zz
Author

Hi @grobian any success with a patch?

@grobian
Owner

grobian commented Jun 29, 2017

haven't got the spare cycles to look into it yet, sorry

@deniszh

deniszh commented Jun 29, 2017

@mwtzzz: you can migrate to jump_fnv1a_ch. You will need to move more data, of course, but only once.

@mwtzzz-zz
Author

@deniszh What are the steps to migrate from fnv1a_ch to jump_fnv1a_ch ?

@mwtzzz-zz
Author

@grobian Thanks for working on it. Do you have a patch I can test out?

@deniszh

deniszh commented Jul 8, 2017

@mwtzzz: sorry, disregard my advice. graphite-web doesn't support jump_fnv1a_ch, so you'll need to either migrate to carbon_ch or use something like go-carbon + carbonzipper.

@grobian
Owner

grobian commented Jul 10, 2017

I can't seem to build buckytools (my Go is too new or something?), so no patch. It seems like it's not necessary either if you don't use carbonzipper.

@mwtzzz-zz
Author

Thanks for working on it, I'll think about what my next steps will be.

@grobian
Owner

grobian commented Jul 11, 2017

Adding it to buckytools is not that trivial, because the port is currently ignored there, and the fnv1a_ch hash type needs it.

@deniszh

deniszh commented Jul 11, 2017

Yep, I tried to add it to buckytools too, but got lost.
I added support to the latest carbonate, but it will need support from the latest carbon too.
So, maybe carbonzipper will be the best option for you.

@grobian
Owner

grobian commented Jul 12, 2017

@mwtzzz-zz
Author

Ok, I'll test it out soon.

@mwtzzz-zz
Author

I installed your patch and am running buckyd on each of our 12 hosts as follows:
./buckyd -node radar122-X.mgmt -p /media/ephemeral0/carbon/storage/whisper/ -hash fnv1a radar122-{a..l}.mgmt
But it doesn't seem to be working:

[root@ec2-xxx radar122 bin]$ ./bucky inconsistent
2017/07/16 06:23:28 Results from radar122-a.mgmt:4242 not available. Sleeping.
2017/07/16 06:23:28 Results from radar122-i.mgmt:4242 not available. Sleeping.
...
[root@ec2-xxx radar122 bin]$  ./bucky list -r '^carbon\.'                                    
2017/07/16 06:24:58 Results from radar122-i.mgmt:4242 not available. Sleeping.
2017/07/16 06:24:58 Results from radar122-g.mgmt:4242 not available. Sleeping.
...

Am I running it correctly?

@deniszh

deniszh commented Jul 16, 2017

I think radar122-{a..l}.mgmt will not work. You need to enter all hosts, space separated.
Like radar122-a.mgmt radar122-b.mgmt radar122-c.mgmt radar122-d.mgmt radar122-e.mgmt radar122-f.mgmt radar122-g.mgmt radar122-h.mgmt radar122-i.mgmt radar122-j.mgmt radar122-k.mgmt radar122-l.mgmt

@deniszh

deniszh commented Jul 16, 2017

Also, please note that if you're using a non-2003 port and/or instance names, they also need to be included, like radar122-a.mgmt:2103:a radar122-b.mgmt:2103:a radar122-c.mgmt:2103:a ...
But it depends on how it's configured in relay.conf, of course.
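
As an illustration, the `host:port:instance` form suggested above can be derived mechanically from relay entries written as `host:port=instance`. This is a minimal sketch: the function name is hypothetical, it only handles plain host names (no IPv6 literals), and whether buckyd accepts the resulting form depends on the buckytools build in use.

```python
def relay_entry_to_bucky_member(entry: str) -> str:
    """Convert 'host[:port][=instance]' into 'host[:port[:instance]]'."""
    entry = entry.strip().rstrip(";").strip()
    hostport, _, instance = entry.partition("=")
    host, _, port = hostport.partition(":")
    parts = [host]
    if port or instance:
        # 2003 is the default port referred to above; it is made explicit
        # here so that the instance always ends up in the third field.
        parts.append(port or "2003")
    if instance:
        parts.append(instance)
    return ":".join(parts)


relay_entries = [
    "radar122-a.mgmt:1905=a",
    "radar122-b.mgmt:1905=b",
    "radar122-c.mgmt:1905=c",
]
members = [relay_entry_to_bucky_member(e) for e in relay_entries]
print(" ".join(members))
```

The same, complete member list would then be passed to buckyd on every host.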

@mwtzzz-zz
Author

My relay config looks like this:

cluster radar122
  fnv1a_ch 
    radar122-a.mgmt:1905=a 
    radar122-b.mgmt:1905=b 
    radar122-c.mgmt:1905=c 
  ...

I took your suggestion and tried running buckyd like this:
/tmp/buckyd -node radar122-b.mgmt -p /media/ephemeral0/carbon/storage/whisper/ -hash fnv1a radar122-a.mgmt:1905:a radar122-b.mgmt:1905:b radar122-c.mgmt:1905:c
Port 4242 is reachable from all the hosts, but I still see the following messages:

[root@ec2- radar122 bin]$ ./bucky list -h radar122-b.mgmt:4242                    
2017/07/16 20:19:53 Results from radar122-c.mgmt:4242 not available. Sleeping.
2017/07/16 20:19:53 Results from radar122-a.mgmt:4242 not available. Sleeping.

@deniszh

deniszh commented Jul 17, 2017

What do the buckyd logs on stdout/stderr say?

@mwtzzz-zz
Author

Here are the buckyd stdout/stderr logs:

2017/07/18 20:02:17 Starting server on 0.0.0.0:4242
2017/07/18 20:04:24 172.17.35.131:58099 - - GET /hashring
2017/07/18 20:04:24 172.17.35.131:58099 - - GET /metrics?force=true
2017/07/18 20:04:24 Scaning /media/ephemeral0/carbon/storage/whisper/ for metrics...
2017/07/18 20:04:24 172.17.35.131:58129 - - GET /metrics?force=true
2017/07/18 20:04:25 172.17.35.131:58149 - - GET /metrics?force=true
2017/07/18 20:04:26 172.17.35.131:58191 - - GET /metrics?force=true
2017/07/18 20:04:38 172.17.35.131:58203 - - GET /hashring
2017/07/18 20:04:39 172.17.35.131:58203 - - GET /metrics?force=true
2017/07/18 20:04:39 172.17.35.131:58237 - - GET /metrics?force=true
2017/07/18 20:04:40 172.17.35.131:58259 - - GET /metrics?force=true
2017/07/18 20:04:41 172.17.35.131:58277 - - GET /metrics?force=true
2017/07/18 20:04:43 172.17.35.131:58355 - - GET /metrics?force=true
2017/07/18 20:04:49 172.17.35.131:58403 - - GET /hashring
2017/07/18 20:04:49 172.17.35.131:58403 - - GET /metrics?force=true
2017/07/18 20:04:49 172.17.35.131:58439 - - GET /metrics?force=true
2017/07/18 20:04:50 172.17.35.131:58463 - - GET /metrics?force=true
2017/07/18 20:04:51 172.17.35.131:58493 - - GET /metrics?force=true
2017/07/18 20:04:53 172.17.35.131:58519 - - GET /metrics?force=true
2017/07/18 20:04:56 172.17.35.131:58539 - - GET /metrics?force=true
2017/07/18 20:05:01 172.17.35.131:58571 - - GET /metrics?force=true
2017/07/18 20:05:09 172.17.35.131:58625 - - GET /metrics?force=true

@grobian
Owner

grobian commented Aug 2, 2017

let's move this to the buckytools issue.

@grobian grobian closed this as completed Aug 2, 2017
@mwtzzz-zz
Author

If we can get buckytools working with fnv1a_ch, that would be fantastic.

@grobian
Owner

grobian commented Aug 2, 2017

jjneely/buckytools#17

@mwtzzz-zz
Author

mwtzzz-zz commented Aug 5, 2017

@grobian I'll try your patch again this weekend. Maybe I'm not running buckyd correctly. I'll experiment with different ways of specifying the members of the ring on the command line.

@grobian
Owner

grobian commented Aug 6, 2017

I'm no expert on buckytools; if I find some cycles, I'll try it myself.

@mwtzzz-zz
Author

That would be great

@mwtzzz-zz
Author

Hi @grobian, I'm just getting back to looking at this issue. I got pulled away on other things at work, but now I need to take a look again.

Have you had a chance to try making a patch?

@grobian
Owner

grobian commented Oct 3, 2017

I thought we concluded in jjneely/buckytools#17 :)

@mwtzzz-zz
Author

Oh wow, I missed that! Excellent, let me try it out today. Thanks!

@mwtzzz-zz
Author

@grobian I'm having issues with version 0.40. Would you mind taking a look at my comment in jjneely/buckytools#17 ?
