Compactions should not cause corruption #6435

Open
sakridge opened this issue Feb 19, 2020 · 6 comments

Comments

@sakridge

sakridge commented Feb 19, 2020


Expected behavior

When doing an insert, errors are returned from RocksDB and the values are not inserted:

RocksDb(Error { message: "Corruption: block checksum mismatch: expected 1837325370, got 1292688122  in /home/ubuntu/validator-ledger/rocksdb/039077.sst offset 13991299 size 3758" }))
2020/02/19-11:23:44.797082 7f958effd700 (Original Log Time 2020/02/19-11:23:44.793153) [db/memtable_list.cc:386] [code_shred] Level-0 commit table #39137 started
2020/02/19-11:23:44.797085 7f958effd700 (Original Log Time 2020/02/19-11:23:44.796789) [db/memtable_list.cc:434] [code_shred] Level-0 commit table #39137: memtable #1 done
2020/02/19-11:23:44.797087 7f958effd700 (Original Log Time 2020/02/19-11:23:44.796850) EVENT_LOG_v1 {"time_micros": 1582140224796817, "job": 20, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [3, 2, 0, 0, 0, 0, 0], "immutable_memtables": 0}
2020/02/19-11:23:44.797088 7f958effd700 (Original Log Time 2020/02/19-11:23:44.796959) [db/db_impl_compaction_flush.cc:201] [code_shred] Level summary: files[3 2 0 0 0 0 0] max score 0.75
2020/02/19-11:23:44.797200 7f9021bed700 [db/db_impl_compaction_flush.cc:1453] [code_shred] Manual compaction starting
2020/02/19-11:23:44.797858 7f9599ffb700 (Original Log Time 2020/02/19-11:23:44.797807) [db/db_impl_compaction_flush.cc:2428] [code_shred] Manual compaction from level-0 to level-1 from ' .. '; will stop at (end)
2020/02/19-11:23:44.797865 7f9599ffb700 [db/compaction_job.cc:1650] [code_shred] [JOB 23] Compacting 3@0 + 2@1 files to L1, score -1.00
2020/02/19-11:23:44.797876 7f9599ffb700 [db/compaction_job.cc:1654] [code_shred] Compaction start summary: Base version 37 Base level 0, inputs: [39137(483KB) 39119(5848B) 39107(191KB)], [39077(103MB) 39078(17MB)]
2020/02/19-11:23:44.798038 7f9599ffb700 EVENT_LOG_v1 {"time_micros": 1582140224797981, "job": 23, "event": "compaction_started", "compaction_reason": "ManualCompaction", "files_L0": [39137, 39119, 39107], "files_L1": [39077, 39078], "score": -1, "input_data_size": 127316315}
2020/02/19-11:23:44.865719 7f9599ffb700 [WARN] [db/db_impl_compaction_flush.cc:2721] Compaction error: Corruption: block checksum mismatch: expected 1837325370, got 1292688122  in /home/ubuntu/validator-ledger/rocksdb/039077.sst offset 13991299 size 3758
2020/02/19-11:23:44.865874 7f9599ffb700 (Original Log Time 2020/02/19-11:23:44.865499) [db/compaction_job.cc:771] [code_shred] compacted to: files[3 2 0 0 0 0 0] max score 0.00, MB/sec: 1889.9 rd, 0.0 wr, level 1, files in(3, 2) out(0) MB in(0.7, 120.8) out(0.0), read-write-amplify(182.8) write-amplify(0.0) Corruption: block checksum mismatch: expected 1837325370, got 1292688122  in /home/ubuntu/validator-ledger/rocksdb/039077.sst offset 13991299 size 3758, records in: 100485, records dropped: 11148 output_compression: NoCompression
2020/02/19-11:23:44.865882 7f9599ffb700 (Original Log Time 2020/02/19-11:23:44.865664) EVENT_LOG_v1 {"time_micros": 1582140224865530, "job": 23, "event": "compaction_finished", "compaction_time_micros": 67366, "compaction_time_cpu_micros": 51101, "output_level": 1, "num_output_files": 0, "total_output_size": 0, "num_input_records": 11148, "num_output_records": 0, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [3, 2, 0, 0, 0, 0, 0]}
2020/02/19-11:23:44.865914 7f9599ffb700 [ERROR] [db/db_impl_compaction_flush.cc:2266] Waiting after background compaction error: Corruption: block checksum mismatch: expected 1837325370, got 1292688122  in /home/ubuntu/validator-ledger/rocksdb/039077.sst offset 13991299 size 3758, Accumulated background error counts: 1
2020/02/19-11:23:44.866269 7f9021bed700 [db/db_impl_write.cc:1438] [dead_slots] New memtable created with log file: #39138. Immutable memtables: 0.
2020/02/19-11:23:44.866481 7f9021bed700 [db/db_impl_write.cc:1438] [erasure_meta] New memtable created with log file: #39138. Immutable memtables: 0.
2020/02/19-11:23:44.866655 7f9021bed700 [db/db_impl_write.cc:1438] [orphans] New memtable created with log file: #39138. Immutable memtables: 0.
2020/02/19-11:23:44.866821 7f9021bed700 [db/db_impl_write.cc:1438] [index] New memtable created with log file: #39138. Immutable memtables: 0.
2020/02/19-11:23:44.867014 7f9021bed700 [db/db_impl_write.cc:1438] [transaction_status] New memtable created with log file: #39138. Immutable memtables: 0.

In the log we see corruption errors during a compaction.

RocksDB version:

2020/02/18-16:01:36.479968 7f2b4f185680 RocksDB version: 6.2.4
2020/02/18-16:01:36.480219 7f2b4f185680 Git sha rocksdb_build_git_sha:@76a56d89a7740f8dbb01edabf1ea5abc95a67657@
2020/02/18-16:01:36.480222 7f2b4f185680 Compile date Feb 18 2020
2020/02/18-16:01:36.480224 7f2b4f185680 DB SUMMARY
2020/02/18-16:01:36.480313 7f2b4f185680 CURRENT file:  CURRENT
2020/02/18-16:01:36.480315 7f2b4f185680 IDENTITY file:  IDENTITY
2020/02/18-16:01:36.480331 7f2b4f185680 MANIFEST file:  MANIFEST-000004 size: 550 Bytes

Actual behavior

Corruption messages and failures on insert.

Steps to reproduce the behavior

@ajkr
Contributor

ajkr commented Feb 20, 2020

This is the expected behavior when a RocksDB background operation encounters an error: the DB is brought into read-only mode. In read-only mode, any writes will return the same error that the background operation encountered. In this case it's a checksum error.
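
A minimal sketch of what this looks like from the client side (C++ API; assumes a RocksDB recent enough to expose DB::Resume(), which retries recovery from the background error):

#include <iostream>
#include "rocksdb/db.h"

void PutWithRecovery(rocksdb::DB* db) {
  rocksdb::Status s = db->Put(rocksdb::WriteOptions(), "key", "value");
  if (s.IsCorruption()) {
    // The DB is in read-only mode: every write now returns the status of
    // the failed background operation (here, the compaction checksum error).
    std::cerr << "write rejected: " << s.ToString() << std::endl;
    // Resume() asks RocksDB to retry recovery from the background error;
    // it will fail again if the corruption is permanent.
    rocksdb::Status r = db->Resume();
    std::cerr << "resume: " << r.ToString() << std::endl;
  }
}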

It is always difficult (impossible?) to distinguish silent corruption in the device from RocksDB corruption. Have you checked for firmware updates and the device health (e.g., with smartctl -a /dev/<your device name>)? What is the device type?

Also it'd be interesting to know whether the checksum error is permanent or transient. One way to tell is to restart the DB, wait until compaction happens, and see if it fails again. Or you can use a tool like sst_dump to scan just the affected file without needing to reopen the DB.
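
For reference, a sketch of an in-process analogue of that sst_dump scan, assuming a build recent enough to ship SstFileReader; the file path is the one from the error above:

#include <iostream>
#include "rocksdb/options.h"
#include "rocksdb/sst_file_reader.h"

int main() {
  rocksdb::Options options;
  rocksdb::SstFileReader reader(options);
  // Scans a single SST file directly, without opening the DB.
  rocksdb::Status s =
      reader.Open("/home/ubuntu/validator-ledger/rocksdb/039077.sst");
  if (s.ok()) {
    s = reader.VerifyChecksum();  // reads every block, like --verify_checksum
  }
  std::cout << (s.ok() ? "clean" : s.ToString()) << std::endl;
  return s.ok() ? 0 : 1;
}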

@sakridge
Author

This is the expected behavior when a RocksDB background operation encounters an error: the DB is brought into read-only mode. In read-only mode, any writes will return the same error that the background operation encountered. In this case it's a checksum error.

I see.

It is always difficult (impossible?) to distinguish silent corruption in the device from RocksDB corruption. Have you checked for firmware updates and the device health (e.g., with smartctl -a /dev/<your device name>)? What is the device type?

smartctl:

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-64-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 1TB
Serial Number:                      S3ETNX0J501460T
Firmware Version:                   2B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Read:                    2,791,728 [1.42 TB]
Data Units Written:                 74,341,811 [38.0 TB]
Host Read Commands:                 144,790,589
Host Write Commands:                771,924,291
Controller Busy Time:               2,245
Power Cycles:                       83
Power On Hours:                     6,979
Unsafe Shutdowns:                   19
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               51 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Also it'd be interesting to know whether the checksum error is permanent or transient. One way to tell is to restart the DB, wait until compaction happens, and see if it fails again. Or you can use a tool like sst_dump to scan just the affected file without needing to reopen the DB.

We saw this error twice within a pretty short period after having the DB open for about 24 hours. We have also had reports of it on other hardware, so I would be surprised if it is a hardware issue.

When I run the sst_dump tool, it shows that the file is corrupted, just as the log does:

sakridge@sagan:/home/ubuntu/validator-ledger/rocksdb$ ~/src/rocksdb/sst_dump --file=039077.sst --command=verify --verify_checksum
from [] to []
Process 039077.sst
Sst file format: block-based
039077.sst is corrupted: Corruption: block checksum mismatch: expected 1837325370, got 1292688122 in 039077.sst offset 13991299 size 3758

@ajkr
Contributor

ajkr commented Feb 20, 2020

Thanks for all the info!

We have also had reports of it on other hardware

That is very good to know.

@tomlinton

Experienced the same thing regularly, with two disks set up in software RAID1:

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZ7LN512HMJP-00000
Serial Number:    S2MHNB0J301080
LU WWN Device Id: 5 002538 d41d710a4
Firmware Version: MAV0100Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar 23 10:03:16 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Complete output here. Going to try a different disk to see if that helps.

@DXist

DXist commented Aug 23, 2023

Would it be possible to localize the corrupted key range/SST file through the client API?

In the case of a replicated system, it is possible to reinsert the keys or ingest an SST file from another replica and try to heal programmatically.
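
A hedged sketch of that healing path (C++ API; FetchReplacementFromReplica is a hypothetical application-side helper, while IngestExternalFile is the real RocksDB call):

#include <string>
#include <vector>
#include "rocksdb/db.h"

// Hypothetical helper: fetches a replacement SST covering the damaged key
// range from a healthy replica and returns its local path.
std::string FetchReplacementFromReplica(const std::string& bad_sst);

rocksdb::Status HealFromReplica(rocksdb::DB* db, const std::string& bad_sst) {
  std::string replacement = FetchReplacementFromReplica(bad_sst);
  rocksdb::IngestExternalFileOptions ingest_opts;
  ingest_opts.move_files = true;  // link the file in rather than copying it
  // Atomically adds the external file's keys to the DB.
  return db->IngestExternalFile({replacement}, ingest_opts);
}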

@DXist

DXist commented Aug 23, 2023

here

Do you use direct IO or page cache IO?

An article about fsync failures on several filesystems - https://www.usenix.org/system/files/atc20-rebello.pdf
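
For reference, the two RocksDB options that control this (both default to false, i.e., page-cache IO); a minimal sketch:

#include "rocksdb/options.h"

rocksdb::Options MakeDirectIoOptions() {
  rocksdb::Options options;
  options.use_direct_reads = true;                        // O_DIRECT reads
  options.use_direct_io_for_flush_and_compaction = true;  // O_DIRECT writes
  return options;
}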
