Start replica server failed due to incomplete created RocksDB directory #1450

Closed
acelyc111 opened this issue Apr 17, 2023 · 2 comments
Labels
type/bug This issue reports a bug.

acelyc111 commented Apr 17, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?

Constructed an incomplete RocksDB directory: the server crashed before the column families had been fully created.

  2. What did you expect to see?

The replica server should start normally even if the RocksDB directory is incomplete.

  3. What did you see instead?

The replica server failed to start, with error logs like:

I2023-04-17 15:46:15.325 (1681746375325177743 1122982) replica.replica0.0301000000000003: pegasus_server_impl.cpp:1511:start(): [[email protected]:34801] start to open app /home/laiyingchun/data/pegasus/onebox/replica1/data/replica/reps/1.3.pegasus/data
I2023-04-17 15:46:15.325 (1681746375325185037 1122982) replica.replica0.0301000000000003: pegasus_server_impl.cpp:1555:start(): [[email protected]:34801] rdb is already exist, path = /home/laiyingchun/data/pegasus/onebox/replica1/data/replica/reps/1.3.pegasus/data/rdb
I2023-04-17 15:46:15.325 (1681746375325196529 1122982) replica.replica0.0301000000000003: pegasus_server_impl.cpp:1611:start(): [[email protected]:34801] start to open rocksDB's rdb(/home/laiyingchun/data/pegasus/onebox/replica1/data/replica/reps/1.3.pegasus/data/rdb)
F2023-04-17 15:46:15.325 (1681746375325286117 1122982) replica.replica0.0301000000000003: pegasus_server_impl.cpp:1626:start(): assertion expression: !missing_meta_cf
F2023-04-17 15:46:15.325 (1681746375325299873 1122982) replica.replica0.0301000000000003: pegasus_server_impl.cpp:1626:start(): [[email protected]:34801] You must upgrade Pegasus server from 2.0

  4. What version of Pegasus are you using?

The master branch.
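
For readers unfamiliar with the failing assertion, here is a minimal sketch of the kind of check that `!missing_meta_cf` appears to guard. It assumes that `missing_meta_cf` means the `pegasus_meta_cf` column family (the name printed in this issue's logs) is absent from the column families found in the RocksDB directory. The helper name `is_meta_cf_missing` is hypothetical and is not the Pegasus implementation.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Column family name as it appears in the logs of this issue.
const std::string kMetaCfName = "pegasus_meta_cf";

// Hypothetical helper: returns true when the meta column family is absent
// from the list of column family names found in a RocksDB directory.
bool is_meta_cf_missing(const std::vector<std::string> &column_families)
{
    return std::find(column_families.begin(), column_families.end(), kMetaCfName) ==
           column_families.end();
}
```

Under this assumption, a directory that was only partially created could look the same to the server as a directory written by a release that predates the meta column family. That would explain the misleading "You must upgrade Pegasus server from 2.0" message.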

@acelyc111 acelyc111 added the type/bug This issue reports a bug. label Apr 17, 2023

acelyc111 commented Apr 17, 2023

This issue was found while I was trying to fix #1383. I injected a write error into the write path of a replica server. After the replicas hit by the injected corruption are closed automatically, the replica server tries to recover them (i.e. open new RocksDB instances). If the server crashes while some RocksDB instances are still being created, those instances may be left incomplete, which reproduces this issue.


acelyc111 commented Apr 18, 2023

Got another incomplete RocksDB instance, and the server crashed on it too. Logs:

I2023-04-18 15:31:05.592 (1681831865592605498 1384110) replica.replica0.030100000000000b: pegasus_server_impl.cpp:1501:start(): [[email protected]:34801] start to open app /home/laiyingchun/data/pegasus/onebox/replica1/data/replica/reps/2.1.pegasus/data
I2023-04-18 15:31:05.592 (1681831865592610397 1384110) replica.replica0.030100000000000b: pegasus_server_impl.cpp:1545:start(): [[email protected]:34801] rdb is already exist, path = /home/laiyingchun/data/pegasus/onebox/replica1/data/replica/reps/2.1.pegasus/data/rdb
I2023-04-18 15:31:05.592 (1681831865592611559 1384110) replica.replica0.030100000000000b: pegasus_server_impl.cpp:1601:start(): [[email protected]:34801] start to open rocksDB's rdb(/home/laiyingchun/data/pegasus/onebox/replica1/data/replica/reps/2.1.pegasus/data/rdb)
E2023-04-18 15:31:05.592 (1681831865592652015 1384110) replica.replica0.030100000000000b: pegasus_server_impl.cpp:3181:check_column_families(): [[email protected]:34801] column family name: default
E2023-04-18 15:31:05.592 (1681831865592656654 1384110) replica.replica0.030100000000000b: pegasus_server_impl.cpp:3181:check_column_families(): [[email protected]:34801] column family name: pegasus_meta_cf
F2023-04-18 15:31:05.608 (1681831865608720349 1384110) replica.replica0.030100000000000b: meta_store.cpp:51:get_last_flushed_decree(): [[email protected]:34801] ERR_OK vs ERR_OBJECT_NOT_FOUND

@acelyc111 acelyc111 changed the title Start replica server failed due to imcomplete RocksDB directory Start replica server failed due to incomplete created RocksDB directory Apr 21, 2023
empiredan pushed a commit that referenced this issue Apr 21, 2023
…#1451)

#1450

If a replica server attempts to open an incomplete RocksDB instance (possibly caused
by a previous crash), it crashes before moving the incomplete path to the ".err"
trash path, and it crashes again when the server is restarted.

This patch avoids crashing before the incomplete RocksDB path is moved to the ".err" path,
so the replica has an opportunity to recover automatically without the
incomplete RocksDB path having to be moved manually.
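
The behavior this commit message describes can be sketched roughly as below. It is only an illustration under stated assumptions, not the code from the patch: `open_result`, `open_or_quarantine`, and the `try_open` callback (a stand-in for opening and validating RocksDB) are hypothetical names, and appending a plain ".err" suffix is an assumed naming scheme.

```cpp
#include <filesystem>
#include <functional>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical outcome type, not a Pegasus type.
enum class open_result { kOpened, kMovedToErr, kFailed };

// Try to open the RocksDB directory. On failure, move the directory aside
// with an ".err" suffix instead of aborting the process, so that the caller
// can recover the replica later.
open_result open_or_quarantine(const fs::path &rdb_dir,
                               const std::function<bool(const fs::path &)> &try_open)
{
    if (try_open(rdb_dir)) {
        return open_result::kOpened;
    }

    // Build "<rdb_dir>.err" (assumed naming, for illustration only).
    fs::path err_dir = rdb_dir;
    err_dir += ".err";

    // Use the non-throwing overload so that a failed rename is reported
    // to the caller rather than crashing.
    std::error_code ec;
    fs::rename(rdb_dir, err_dir, ec);
    return ec ? open_result::kFailed : open_result::kMovedToErr;
}
```

The key design point is that a failed open becomes a return value rather than a fatal assertion, so the step that moves the bad directory aside always gets a chance to run.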
GehaFearless pushed a commit to GehaFearless/incubator-pegasus that referenced this issue Feb 28, 2024
…B instance (apache#1451)

Corresponding community commit: https://github.com/apache/incubator-pegasus/pull/1451/files

Note: because the unit-test changes are fairly large, they were not merged this time.

apache#1450

If a replica server attempts to open an incomplete RocksDB instance (possibly caused
by a previous crash), it crashes before moving the incomplete path to the ".err"
trash path, and it crashes again when the server is restarted.

This patch avoids crashing before the incomplete RocksDB path is moved to the ".err" path,
so the replica has an opportunity to recover automatically without the
incomplete RocksDB path having to be moved manually.