Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CRASH] The replicas are crashing after the master is restarted with replicas and multi-master enabled #849

Open
ramonhpr opened this issue Jul 4, 2024 · 0 comments

Comments

@ramonhpr
Copy link

ramonhpr commented Jul 4, 2024

Crash report

Paste the complete crash log between the quotes below. Please include a few lines from the log preceding the crash report to provide some context.

1:12:S 04 Jul 2024 12:07:57.915 # Connection with master lost.


=== KEYDB BUG REPORT START: Cut & paste starting from here ===
1:12:S 04 Jul 2024 12:07:57.915 # === ASSERTION FAILED ===
1:12:S 04 Jul 2024 12:07:57.915 # ==> replication.cpp:4324 'mi->master != NULL && mi->cached_master == NULL' is not true

------ STACK TRACE ------

Backtrace:
keydb-server *:6379(replicationCacheMaster(redisMaster*, client*)+0x269) [0x55fb3ba05cb9]
keydb-server *:6379(freeClient(client*)+0x62c) [0x55fb3ba2af9c]
keydb-server *:6379(freeClientsInAsyncFreeQueue(int)+0x175) [0x55fb3ba2d675]
keydb-server *:6379(beforeSleep(aeEventLoop*)+0x3cf) [0x55fb3ba494ff]
keydb-server *:6379(aeProcessEvents+0x3d0) [0x55fb3ba54750]
keydb-server *:6379(aeMain+0x3a) [0x55fb3ba58cda]
keydb-server *:6379(workerThreadMain(void*)+0x7e) [0x55fb3ba40ade]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1f7dc62609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f1f7db87133]

------ INFO OUTPUT ------
# Server
redis_version:6.3.4
redis_git_sha1:7e7e5e57
redis_git_dirty:1
redis_build_id:bb8fc59400781b64
redis_mode:standalone
os:Linux 5.4.210-yocto-standard x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:9.4.0
process_id:1
process_supervised:no
run_id:16e7ea1e8ddcf8481d68a53274f1edd69a9955c6
tcp_port:6379
server_time_usec:1720094877917040
uptime_in_seconds:62833
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:8818845
executable:/data/keydb-server
config_file:/etc/keydb/keydb.conf
availability_zone:
features:cluster_mget

# Clients
connected_clients:3
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:48
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
current_client_thread:0
thread_0_clients:4

# Memory
used_memory:3125960
used_memory_human:2.98M
used_memory_rss:21667840
used_memory_rss_human:20.66M
used_memory_peak:3268608
used_memory_peak_human:3.12M
used_memory_peak_perc:95.64%
used_memory_overhead:2883768
used_memory_startup:1711312
used_memory_dataset:242192
used_memory_dataset_perc:17.12%
allocator_allocated:3755208
allocator_active:4468736
allocator_resident:7700480
total_system_memory:16794349568
total_system_memory_human:15.64G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.19
allocator_frag_bytes:713528
allocator_rss_ratio:1.72
allocator_rss_bytes:3231744
rss_overhead_ratio:2.81
rss_overhead_bytes:13967360
mem_fragmentation_ratio:7.07
mem_fragmentation_bytes:18604336
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:20512
mem_clients_normal:102616
mem_aof_buffer:0
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
storage_provider:none
available_system_memory:unavailable

# Persistence
loading:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:26
rdb_bgsave_in_progress:0
rdb_last_save_time:1720094864
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:1060864
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:7
total_commands_processed:77123
instantaneous_ops_per_sec:2
total_net_input_bytes:2199410
total_net_output_bytes:3064275
instantaneous_input_kbps:0.08
instantaneous_output_kbps:0.15
rejected_connections:0
sync_full:3
sync_partial_ok:0
sync_partial_err:0
expired_keys:221
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:3148
evicted_keys:0
keyspace_hits:62886
keyspace_misses:217
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:1429
total_forks:20
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:2
dump_payload_sanitizations:0
total_reads_processed:70187
total_writes_processed:126027
instantaneous_lock_contention:1
avg_lock_contention:0.968750
storage_provider_read_hits:0
storage_provider_read_misses:0

# Replication
role:active-replica
master_global_link_status:down
connected_masters:2
master_host:192.168.128.52
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_read_repl_offset:1446031
slave_repl_offset:1446031
master_link_down_since_seconds:99
master_1_host:my-host
master_1_port:6379
master_1_link_status:up
master_1_last_io_seconds_ago:4
master_1_sync_in_progress:0
slave_read_repl_offset:787049
slave_repl_offset:787049
master_2_host:192.168.128.43
master_2_port:6379
master_2_link_status:up
master_2_last_io_seconds_ago:4
master_2_sync_in_progress:0
slave_read_repl_offset:21628
slave_repl_offset:21628
master_3_host:192.168.128.45
master_3_port:6379
master_3_link_status:down
master_3_last_io_seconds_ago:-1
master_3_sync_in_progress:0
slave_read_repl_offset:787049
slave_repl_offset:787049
master_3_link_down_since_seconds:0
slave_priority:100
slave_read_only:0
replica_announced:1
connected_slaves:1
slave0:ip=192.168.128.43,port=6379,state=online,offset=1465740,lag=0
master_failover_state:no-failover
master_replid:a46bf7931fe37841aabf64269d17d86bd89ca1b9
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1465740
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:417165
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:118.894166
used_cpu_user:215.689913
used_cpu_sys_children:0.149172
used_cpu_user_children:0.049788
server_threads:1
long_lock_waits:105
used_cpu_sys_main_thread:81.137745
used_cpu_user_main_thread:160.760466

# Modules

# Commandstats
cmdstat_set:calls=438,usec=5727,usec_per_call=13.08,rejected_calls=0,failed_calls=0
cmdstat_replconf:calls=273,usec=515,usec_per_call=1.89,rejected_calls=0,failed_calls=0
cmdstat_rreplay:calls=6981,usec=142151,usec_per_call=20.36,rejected_calls=0,failed_calls=0
cmdstat_ping:calls=6250,usec=4429,usec_per_call=0.71,rejected_calls=0,failed_calls=0
cmdstat_replicaof:calls=3,usec=247,usec_per_call=82.33,rejected_calls=0,failed_calls=0
cmdstat_replping:calls=3,usec=5,usec_per_call=1.67,rejected_calls=0,failed_calls=0
cmdstat_get:calls=63103,usec=599477,usec_per_call=9.50,rejected_calls=0,failed_calls=0
cmdstat_KEYDB.MVCCRESTORE:calls=67,usec=235,usec_per_call=3.51,rejected_calls=0,failed_calls=0
cmdstat_psync:calls=3,usec=5215,usec_per_call=1738.33,rejected_calls=0,failed_calls=0
cmdstat_client:calls=2,usec=27,usec_per_call=13.50,rejected_calls=0,failed_calls=2

# Errorstats
errorstat_ERR:count=2

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=13,expires=13,avg_ttl=1983715,cached_keys=13

# KeyDB
mvcc_depth:0

------ CLIENT LIST OUTPUT ------
id=18 addr=192.168.128.43:32790 laddr=192.168.128.51:6379 fd=18 name= age=13 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20512 events=r cmd=replconf user=default redir=-1
id=4 addr=127.0.0.1:40658 laddr=127.0.0.1:6379 fd=0 name= age=62833 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20504 events=r cmd=get user=default redir=-1
id=6 addr=192.168.56.105:6379 laddr=192.168.128.51:52254 fd=13 name= age=62770 idle=4 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61480 events=r cmd=rreplay user=(superuser) redir=-1
id=17 addr=192.168.128.43:6379 laddr=192.168.128.51:35276 fd=15 name= age=28 idle=4 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20528 events=r cmd=rreplay user=(superuser) redir=-1

------ MODULES INFO OUTPUT ------

------ FAST MEMORY TEST ------
1:12:S 04 Jul 2024 12:07:57.919 # main thread terminated
1:12:S 04 Jul 2024 12:07:57.919 # Bio thread for job type #0 terminated
1:12:S 04 Jul 2024 12:07:57.921 # Bio thread for job type #1 terminated
1:12:S 04 Jul 2024 12:07:57.922 # Bio thread for job type #2 terminated

Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.

=== KEYDB BUG REPORT END. Make sure to include from START to END. ===

       Please report the crash by opening an issue on github:

           https://github.com/JohnSully/KeyDB/issues

  Suspect RAM error? Use keydb-server --test-memory to verify it.
2024-07-04T12:07:57.922356834Z

Aditional information

  1. Running in a kubernetes environment using the image available on dockerhub
  2. Setup 3 instances of keydb enabling multi-master and replication
  3. Make sure that each replica is a replicaof from another (a mesh topology)
  4. After restarting master the other replicas crash with this error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant