Couldn't take snapshot error with replicas #2266

sameervitian · 2018-03-27T05:33:19Z


What version of Dgraph are you using?
Dgraph version : v1.0.4-dev
Commit SHA-1 : 807976c
Commit timestamp : 2018-03-22 14:55:24 +1100
Branch : HEAD
Have you tried reproducing the issue with latest release?
Yes
What is the hardware spec (RAM, OS)?
ubuntu 14.04 / 16 core 32GB
Steps to reproduce the issue (command/config used to run Dgraph).
Config for dgraph:

export: export
gentlecommit: 0.33
idx: 1
memory_mb: 16087.0
trace: 0.33
postings: /data/dgraph/p
wal: /data/dgraph/w
debugmode: False
bindall: True
my: "<server_ip>:7080"
zero: "<zero_ip>:5080"

3 Dgraph servers are running in a cluster with replica 3. I see from /state that all nodes are in groupId 1. The following lines have started appearing regularly in the logs, so something seems to be wrong.

2018/03/27 13:04:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644079]
2018/03/27 13:04:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644477]
2018/03/27 13:05:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644859]
2018/03/27 13:05:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [645205]
2018/03/27 13:06:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [645612]
2018/03/27 13:06:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [646052]
2018/03/27 13:07:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [646430]
2018/03/27 13:07:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [646890]
2018/03/27 13:08:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [647264]
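
One way to confirm the pattern in these lines is to extract both watermarks and compare them over time. Below is a minimal Python sketch, assuming the log format quoted above; the names `parse_watermarks` and `txn_watermark_stuck` and the regex are illustrative and not part of Dgraph.

```python
import re

# Pattern written against the log lines quoted above; the exact format is
# an assumption and may differ in other Dgraph versions.
LINE_RE = re.compile(
    r"Couldn't take snapshot, txn watermark: \[(\d+)\], "
    r"applied watermark: \[(\d+)\]"
)

def parse_watermarks(lines):
    """Return a list of (txn_watermark, applied_watermark) tuples."""
    result = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            result.append((int(match.group(1)), int(match.group(2))))
    return result

def txn_watermark_stuck(pairs):
    """True if the txn watermark never moves while the applied one grows."""
    if len(pairs) < 2:
        return False
    txn_values = {txn for txn, _ in pairs}
    applied = [a for _, a in pairs]
    return len(txn_values) == 1 and applied[-1] > applied[0]

# First and last log lines from the report above.
sample = [
    "2018/03/27 13:04:23 draft.go:682: Couldn't take snapshot, "
    "txn watermark: [4604], applied watermark: [644079]",
    "2018/03/27 13:08:23 draft.go:682: Couldn't take snapshot, "
    "txn watermark: [4604], applied watermark: [647264]",
]
pairs = parse_watermarks(sample)
print(pairs, txn_watermark_stuck(pairs))
```

In the lines quoted here, the txn watermark stays at 4604 while the applied watermark climbs from 644079 to 647264.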

Along with that, I see that the load average is in the range 6-13 on all servers. I am running this in production. CPU utilization is very low, and the data is on SSD.
The following are the CPU metrics from top:

%Cpu(s):  4.2 us,  0.7 sy,  0.0 ni, 94.8 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st

This is what I see in vmstat:

r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 219784 201912 12934544    0    0     2    37    5    7  2  0 98  0  0
 2  0      0 219608 201912 12934552    0    0     0   592 5106 8252 12  1 87  0  0
 0  0      0 219448 201912 12934560    0    0     0   704 4640 7944  2  0 97  0  0
 0  1      0 219672 201912 12934568    0    0     0   300 2609 4504  1  0 99  0  0
 0  0      0 219736 201912 12934568    0    0     0   200 2193 3796  0  0 99  0  0
 0  0      0 219576 201912 12934588    0    0     4   688 4512 7827  1  1 97  0  0
 0  0      0 219576 201912 12934588    0    0     0   208 3463 5337  8  1 91  0  0
 0  0      0 219640 201912 12934604    0    0     8   480 3902 6796  1  0 98  0  0
 1  0      0 219320 201912 12934612    0    0     4   660 4703 8094  2  1 96  1  0
129  0      0 219704 201920 12934612    0    0     0   108 3388 2999 91  0  8  0  0

iostat result:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.24    0.00    0.62    0.37    0.00   95.77

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               0.00         0.00         0.00          0          0
vdb             100.00         0.00       416.00          0        416
dm-0            103.00         0.00       416.00          0        416

Expected behaviour and actual result.

As CPU idle time is high and wait time is low, I expect the load average to be low. Also, is the frequent appearance of these log lines alarming?
Could someone check what is wrong here?


pawanrawal · 2018-03-27T05:42:19Z

Hey @sameervitian

The below error is an interesting one and linked to #2254. I am investigating this. What about the other replicas, do they also have similar logs?

2018/03/27 13:04:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644079]
2018/03/27 13:04:53 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644477]
2018/03/27 13:05:23 draft.go:682: Couldn't take snapshot, txn watermark: [4604], applied watermark: [644859]

Is the high load average causing any performance degradation?

sameervitian · 2018-03-27T10:23:30Z

@pawanrawal yes, others also have similar logs.

I don't have a reference point to measure whether performance has degraded. I can see that our read calls take ~200ms on average (the /p folder is 15GB). The machine is 32GB with 16 cores, so I would definitely want to downgrade it to cut costs.

pawanrawal · 2018-03-28T03:11:25Z

I was able to reproduce the Couldn't take snapshot issue with a cluster which has replicas and am working on a fix.

sameervitian · 2018-03-28T05:30:10Z

@pawanrawal is the Couldn't take snapshot issue also the reason behind the high load avg?

pawanrawal · 2018-03-28T05:58:27Z

I am not sure. How do you get this load average value? The Couldn't take snapshot issue could have caused OOM issues. It has been fixed in master with f66c7df, and the nightly binary should be updated soon. Also, since you have a 16-core machine, a load average below 16 should be fine.
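
The comparison of load average to core count can be checked with a few lines of Python. This is only a rough sketch: `load_per_core` is a hypothetical helper, and the "below 1.0 per core" reading is the usual rule of thumb rather than anything specific to Dgraph.

```python
import os

def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores

# Figures from this thread: load average of 6-13 on a 16-core machine.
for load in (6.0, 13.0):
    print(f"load {load} on 16 cores -> {load_per_core(load, 16):.2f} per core")

# On a live Unix host, the same check can use the standard library.
# os.getloadavg is unavailable on some platforms, hence the guards.
if hasattr(os, "getloadavg"):
    try:
        one_min, _, _ = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"current 1-min load per core: {load_per_core(one_min, cores):.2f}")
    except OSError:
        print("load average not obtainable on this host")
```

Both reported values work out to less than 1.0 per core on 16 cores.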

pawanrawal · 2018-03-29T00:35:40Z

Closing this, as the fix for the main issue, which was a bug, has been merged.

sameervitian changed the title ~~Dgraph high load~~ Dgraph high load avg Mar 27, 2018

pawanrawal changed the title ~~Dgraph high load avg~~ Couldn't take snapshot error with replicas Mar 28, 2018

pawanrawal added the kind/bug Something is broken. label Mar 28, 2018

pawanrawal self-assigned this Mar 28, 2018

pawanrawal mentioned this issue Mar 28, 2018

Dont mark entry as done in Oracle map if we didn't propose it yet. #2274

Closed

dgraph-bot added the kind/bug Something is broken. label Mar 28, 2018

manishrjain added this to the Sprint-000 milestone Mar 28, 2018

manishrjain added the backlog label Mar 29, 2018

pawanrawal closed this as completed Mar 29, 2018

manishrjain removed the backlog label Mar 29, 2018

This was referenced Apr 3, 2018

DGraph server killed due to OOM #2301

Closed

Hanging server issue via gRPC #2231

Closed

Dgraph unresponsive on read & mutation both #2311

Closed
