Store size suddenly started increasing due to stale snapshot #9331

Open
hongshaoyang opened this issue Aug 16, 2024 · 2 comments

hongshaoyang commented Aug 16, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

The disk size / store size of TiFlash suddenly began a slow, steady increase.

2. What did you expect to see? (Required)

  • no disk size increase

3. What did you see instead (Required)

The disk size / store size of TiFlash suddenly began a slow, steady increase.

See https://snapshots.raintank.io/dashboard/snapshot/moA6q18u4GrG0y5qLgZiXqjd9Lxw7BXP?orgId=0&viewPanel=53&from=1720289323641&to=1720386710155 for a snapshot of Store size metrics.

[Screenshot: Store size metrics, 2024-08-16 5:24 PM]

4. What is your TiFlash version? (Required)

v7.1.2

hongshaoyang added the type/bug (The issue is confirmed as a bug.) label on Aug 16, 2024

hongshaoyang commented Aug 16, 2024

Aggregation does not support spill because aggregator hash table does not support two level

[2024/07/06 18:55:32.755 +00:00] [WARN] [Aggregator.cpp:346] ["Aggregation does not support spill because aggregator hash table does not support two level"] [source="MPP<gather_id:<gather_id:1, query_ts:1720292129539717116, local_query_id:461048, server_id:511, start_ts:450964260000890920, resource_group: redash_my>,task_id:3>_Aggregation_HashAgg_10"] [thread_id=2321]
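
Some context on this warning (my understanding, hedged): spilling the aggregation hash table to disk is done bucket by bucket, which requires a two-level (bucketed) hash table layout; hash-table variants without a two-level counterpart cannot be spilled, so the aggregator keeps the whole table in memory and only logs this warning. A minimal, self-contained sketch of that decision, with hypothetical names (not the TiFlash source):

```cpp
// Hypothetical model of the spill decision (names are made up, not TiFlash code).
#include <cstdio>

struct HashTableVariant
{
    bool two_level_supported; // can this variant be converted to a bucketed (two-level) layout?
};

static void maybeSpill(const HashTableVariant & variant, bool memory_limit_exceeded)
{
    if (!memory_limit_exceeded)
        return; // nothing to spill

    if (!variant.two_level_supported)
    {
        // Mirrors the warning in the log above: fall back to fully in-memory aggregation.
        std::printf("Aggregation does not support spill because aggregator hash table does not support two level\n");
        return;
    }

    // Otherwise: convert to the two-level layout and write buckets to disk one by one.
}

int main()
{
    maybeSpill(HashTableVariant{false}, /*memory_limit_exceeded=*/true);
    return 0;
}
```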

The "Key is locked" warnings started here:

[2024/07/06 19:00:03.347 +00:00] [WARN] [CoprocessorHandler.cpp:202] ["LockException: region_id=2246306414, message: Key is locked (1 locks in regions {2246306414})"] [source="Coprocessor, is_remote_read: true, start_ts: 450964331776180266, region_info: {2246306414, 8585, 18346}, resource_group: default"] [thread_id=938977]

The "Key is locked" warnings ended here:

[2024/07/06 21:20:34.175 +00:00] [WARN] [CoprocessorHandler.cpp:202] ["LockException: region_id=2246343413, message: Key is locked (1 locks in regions {2246343413})"] [source="Coprocessor, is_remote_read: true, start_ts: 450966541859815581, region_info: {2246343413, 616015, 31527}, resource_group: default"] [thread_id=942206]

The next exceptions are from the MPPTaskManager.cpp module. I think the task was aborted here (a toy sketch of the "skip abort" behaviour follows the logs below).

[2024/07/06 21:20:34.260 +00:00] [WARN] [<unknown>] ["region {2246343413,616015,31527} find error: epoch_not_match {\n}\n"] [source=pingcap.tikv] [thread_id=976852]
[2024/07/06 21:21:15.659 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:7, query_ts:1720300835869717319, local_query_id:201721, server_id:79, start_ts:450966542148173844, resource_group: redash_id>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=940502]
[2024/07/06 21:21:15.659 +00:00] [WARN] [MPPTaskManager.cpp:288] ["<gather_id:7, query_ts:1720300835869717319, local_query_id:201721, server_id:79, start_ts:450966542148173844, resource_group: redash_id> does not found in task manager, skip abort"] [thread_id=940502]
[2024/07/06 21:23:02.847 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:8, query_ts:1720300937773760016, local_query_id:767227, server_id:215, start_ts:450966568873492484, resource_group: redash_id>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=941679]
[2024/07/06 21:23:02.847 +00:00] [WARN] [MPPTaskManager.cpp:288] ["<gather_id:8, query_ts:1720300937773760016, local_query_id:767227, server_id:215, start_ts:450966568873492484, resource_group: redash_id> does not found in task manager, skip abort"] [thread_id=941679]
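
A side note on the "does not found in task manager, skip abort" lines (hypothetical sketch, not the TiFlash code): the cancel request looks the gather up by its id, and if its tasks have already finished and been unregistered there is nothing left to abort:

```cpp
// Toy model of the abort lookup; identifiers and the placeholder id are invented for illustration.
#include <cstdio>
#include <map>
#include <string>

int main()
{
    // gather id -> task state; an empty map models "already unregistered".
    std::map<std::string, std::string> task_manager;

    const std::string gather_id = "<gather_id:7, ...>"; // placeholder id for illustration
    std::printf("Begin to abort gather: %s, reason: Receive cancel request from TiDB\n", gather_id.c_str());

    const auto it = task_manager.find(gather_id);
    if (it == task_manager.end())
        std::printf("%s does not found in task manager, skip abort\n", gather_id.c_str());
    else
        task_manager.erase(it); // abort and unregister the gather's tasks

    return 0;
}
```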

We get the same errors after that, from 2024/07/06 21:23:48.585 to 2024/07/06 21:30:23.676.

DB::Exception: Check added_tokens >= 0 failed

[2024/07/06 21:23:48.585 +00:00] [ERROR] [LocalAdmissionController.cpp:122] ["Code: 49, e.displayText() = DB::Exception: Check added_tokens >= 0 failed, e.what() = DB::Exception, Stack trace:


       0x1b68670	DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+28739184]
                	dbms/src/Common/Exception.h:46
       0x7d02dac	DB::LocalAdmissionController::fetchTokensFromGAC(std::__1::vector<DB::LocalAdmissionController::AcquireTokenInfo, std::__1::allocator<DB::LocalAdmissionController::AcquireTokenInfo> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) [tiflash+131083692]
                	dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:267
       0x7cffde0	DB::LocalAdmissionController::startBackgroudJob() [tiflash+131071456]
                	dbms/src/Flash/ResourceControl/LocalAdmissionController.cpp:107
       0x1c29420	void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, DB::LocalAdmissionController::LocalAdmissionController(pingcap::kv::Cluster*, std::__1::shared_ptr<DB::Etcd::Client>)::'lambda'()> >(void*) [tiflash+29529120]
                	/usr/local/bin/../include/c++/v1/thread:291
  0xffff8f55ca28	start_thread [libc.so.6+535080]
  0xffff8f505b9c	thread_start [libc.so.6+179100]"] [source=LocalAdmissionController] [thread_id=2051]

tidb.ai says: "The provided log snippet shows the error occurring in the LocalAdmissionController.cpp file at line 122. The stack trace indicates that the error is related to the fetchTokensFromGAC function, which is part of the resource control mechanism in TiFlash."
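
On the exception itself (a hedged reading, with hypothetical code below, not TiFlash's LocalAdmissionController): the message looks like a failed sanity check that the number of tokens to add after a GAC (Global Admission Controller) response is non-negative, thrown from the background thread that fetches resource-group tokens (fetchTokensFromGAC in the stack trace). A minimal self-contained illustration of that kind of invariant check:

```cpp
// Self-contained sketch of a non-negativity invariant check; not the TiFlash implementation.
#include <cstdio>
#include <stdexcept>

static void checkAddedTokens(double added_tokens)
{
    // If the computed token delta is negative, the grant is inconsistent and we throw.
    if (!(added_tokens >= 0))
        throw std::runtime_error("Check added_tokens >= 0 failed");
}

int main()
{
    checkAddedTokens(100.0); // normal case: tokens were granted

    try
    {
        checkAddedTokens(-5.0); // hypothetical inconsistent grant triggers the exception
    }
    catch (const std::runtime_error & e)
    {
        std::printf("caught: %s\n", e.what());
    }
    return 0;
}
```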

The next error line:

[2024/07/06 21:30:27.469 +00:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Exchange receiver meet error : Receive cancel request from TiDB, current alive connections: 2"] [source="MPP<gather_id:<gather_id:4, query_ts:1720301147440367135, local_query_id:973089, server_id:1013, start_ts:450966623989268517, resource_group: redash_ph>,task_id:26> ExchangeReceiver_490 async tunnel24+26"] [thread_id=4315]

tidb.ai says: "Check the TiFlash logs for entries related to MPP tasks. Look for logs that mention ExchangeSender and ExchangeReceiver, as these operators are responsible for cross-node data exchange in MPP mode."

Analysing further lines, I think the query was aborted here:

[2024/07/06 21:30:52.773 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=939210]
[2024/07/06 21:30:52.773 +00:00] [WARN] [MPPTaskManager.cpp:277] ["Begin to abort gather: <gather_id:3, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>, abort type: ONCANCELLATION, reason: Receive cancel request from TiDB"] [thread_id=940517]

The ExchangeReceiver received the cancel request and aborted the MPP task here:

[2024/07/06 21:30:52.774 +00:00] [WARN] [ExchangeReceiver.cpp:982] ["connection end. meet error: true, err msg: Receive cancel request from TiDB, current alive connections: 2"] [source="MPP<gather_id:<gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>,task_id:9> ExchangeReceiver_182 local tunnel5+9"] [thread_id=939210]

[2024/07/06 21:30:52.774 +00:00] [WARN] [ExchangeReceiver.cpp:1003] ["Finish receiver channels, meet error: true, error message: Receive cancel request from TiDB"] [source="MPP<gather_id:<gather_id:1, query_ts:1720301442835215512, local_query_id:973163, server_id:1013, start_ts:450966701413761139, resource_group: redash_ph>,task_id:9> ExchangeReceiver_182"] [thread_id=939210]

Then the store size gradually started increasing, due to "Meet a stale snapshot" warnings (see the sketch after the logs below):

[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=941690] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2246343413, 616015, 31528}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.035349065]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939041] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662646, 616015, 31528}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.027829322]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=940475] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2246340642, 615895, 31524}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.022052769]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=940633] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662406, 615895, 31524}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.018181362]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939034] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662374, 615895, 31523}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.014203116]"] [source=__global__.meta] [thread_id=79]
[2024/07/06 21:34:56.750 +00:00] [WARN] [PageDirectory.cpp:2076] ["Meet a stale snapshot [thread id=939227] [tracing id=Coprocessor, is_remote_read: true, start_ts: 450966591483674702, region_info: {2847662617, 616015, 31527}, resource_group: redash_ph table_id=224] [seq=280884012] [alive time(s)=610.011231263]"] [source=__global__.meta] [thread_id=79]
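
Why this makes the store grow (my understanding, hedged): each read snapshot pins the page versions it can see, and page GC cannot reclaim data still referenced by a live snapshot; once a snapshot stays alive past a threshold (the alive time above is ~610 s), PageDirectory reports it as stale. Long-lived or leaked snapshots therefore keep old pages around, and the store size climbs. A self-contained toy model of the staleness check (the threshold is assumed for illustration, not TiFlash's actual value or code):

```cpp
// Toy model of "Meet a stale snapshot": snapshots track their creation time,
// and any snapshot alive longer than a threshold is reported as stale.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct Snapshot
{
    std::chrono::steady_clock::time_point create_time = std::chrono::steady_clock::now();

    double aliveSeconds() const
    {
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - create_time).count();
    }
};

int main()
{
    // Assumed threshold for illustration only.
    constexpr double stale_threshold_sec = 2.0;

    std::vector<Snapshot> live_snapshots(3);
    std::this_thread::sleep_for(std::chrono::seconds(3));

    for (const auto & snap : live_snapshots)
    {
        if (snap.aliveSeconds() > stale_threshold_sec)
            std::printf("Meet a stale snapshot [alive time(s)=%f]\n", snap.aliveSeconds());
    }
    // In TiFlash, each such live snapshot blocks GC of the page versions it can see,
    // which is why long-lived snapshots translate into growing store size.
    return 0;
}
```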

hongshaoyang (Author) commented:

This issue is related to #5084.

hongshaoyang changed the title from "Disk size slowly started increasing" to "Store size suddenly started increasing due to stale snapshot" on Aug 16, 2024