
TiDB memory leak after hitting "fail to get stats version for this histogram" #54022

Closed
Rustin170506 opened this issue Jun 14, 2024 · 6 comments
Labels
affects-7.1, affects-7.5, impact/leak, report/customer, severity/major, sig/planner, type/bug

Comments

@Rustin170506
Member

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Change the async load stats code:
diff --git a/pkg/sessionctx/variable/tidb_vars.go b/pkg/sessionctx/variable/tidb_vars.go
index e10aa9c5bbb9f..39153b4433448 100644
--- a/pkg/sessionctx/variable/tidb_vars.go
+++ b/pkg/sessionctx/variable/tidb_vars.go
@@ -1287,7 +1287,7 @@ const (
 	DefTiDBTableCacheLease                         = 3 // 3s
 	DefTiDBPersistAnalyzeOptions                   = true
 	DefTiDBEnableColumnTracking                    = false
-	DefTiDBStatsLoadSyncWait                       = 100
+	DefTiDBStatsLoadSyncWait                       = 0
 	DefTiDBStatsLoadPseudoTimeout                  = true
 	DefSysdateIsNow                                = false
 	DefTiDBEnableMutationChecker                   = false
diff --git a/pkg/statistics/handle/storage/read.go b/pkg/statistics/handle/storage/read.go
index bacf43abef60f..e1f43e8ecdc65 100644
--- a/pkg/statistics/handle/storage/read.go
+++ b/pkg/statistics/handle/storage/read.go
@@ -539,7 +539,7 @@ func loadNeededColumnHistograms(sctx sessionctx.Context, statsCache util.StatsCa
 	if err != nil {
 		return errors.Trace(err)
 	}
-	if len(rows) == 0 {
+	if true {
 		logutil.BgLogger().Error("fail to get stats version for this histogram", zap.Int64("table_id", col.TableID), zap.Int64("hist_id", col.ID))
 		return errors.Trace(fmt.Errorf("fail to get stats version for this histogram, table_id:%v, hist_id:%v", col.TableID, col.ID))
 	}
@@ -599,7 +599,7 @@ func loadNeededIndexHistograms(sctx sessionctx.Context, statsCache util.StatsCac
 	if err != nil {
 		return errors.Trace(err)
 	}
-	if len(rows) == 0 {
+	if true {
 		logutil.BgLogger().Error("fail to get stats version for this histogram", zap.Int64("table_id", idx.TableID), zap.Int64("hist_id", idx.ID))
 		return errors.Trace(fmt.Errorf("fail to get stats version for this histogram, table_id:%v, hist_id:%v", idx.TableID, idx.ID))
 	}
  2. Create some tables and analyze those tables.
  3. Issue some select queries to TiDB (a hedged repro driver in Go is sketched after this list).
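
A hedged repro driver for steps 2 and 3 (my own sketch, not part of the original report): it assumes a patched TiDB listening on 127.0.0.1:4000 with an empty root password and uses the github.com/go-sql-driver/mysql driver. It creates and analyzes a small table, then issues SELECTs in a loop so that each query triggers the always-failing async histogram load enabled by the patch in step 1.

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Assumed DSN for a local, patched TiDB instance.
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	setup := []string{
		"create table if not exists t (a int, b int, index ia(a))",
		"insert into t values (1, 1), (2, 2)",
		"analyze table t",
	}
	for _, stmt := range setup {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}

	// Each query asks for the index statistics, which queues another async
	// load that fails under the patch, so TiDB's memory keeps growing.
	for i := 0; i < 10000; i++ {
		var cnt int
		if err := db.QueryRow("select count(*) from t where a = 1 and b = 1").Scan(&cnt); err != nil {
			log.Println(err)
		}
	}
}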

2. What did you expect to see? (Required)

The memory usage of TiDB does not increase.

3. What did you see instead (Required)

(Screenshots: TiDB memory usage and heap profiles keep growing.)

4. What is your TiDB version? (Required)

v7.5.1

@Rustin170506 added the type/bug label Jun 14, 2024
@hawkingrei added the sig/planner label Jun 14, 2024
@kennedy8312

/epic leak

@Rustin170506
Member Author

The memory leak issue:

  1. The issue was triggered by a persistent error: after the user adjusted tidb_opt_objective, a bug in the stats code caused the histogram of an index to be loaded asynchronously again and again, and every attempt failed with: [2024/06/12 01:53:27.176 +00:00] [ERROR] [read.go:603] [" fail to get stats version for this histogram"] [table_id=143] [hist_id=1]
  2. When such an error occurs, our util method drops the session ctx instead of putting it back into the pool.
  3. Not putting the session back into the pool is not the worst part. The worst part is that we also record the session internally to optimise GC safe-point advancement, so the discarded session cannot be reclaimed by the Go GC because it is still referenced in memory (see the sketch after this list).
  4. Because the pool is essentially unbounded, new sessions are created all the time, and the internal references to them keep accumulating.
  5. This explains: 1) why memory keeps growing; 2) why, even after we fix the stats failure by dropping stats, the already allocated memory cannot be freed (the internal references block GC); 3) why that portion of memory has not continued to grow since the problem was fixed; 4) why the problem went away after a restart of the other node.
  6. A record of a solid reproduction:
    a. Make load stats report errors directly and turn off sync load.
    b. Create some tables, analyze them, and then use a query statement to trigger the loading of statistics.
    c. The memory leak can be seen in the memory profiles and monitoring.
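
To make the mechanism in points 2-4 concrete, here is a minimal, self-contained Go sketch of the leak pattern. All of the names (sessionPool, tracked, leakyWithSession, and so on) are hypothetical, not TiDB's real API; the sketch only illustrates the shape of the bug: a session taken from an unbounded pool is also registered in an internal map used for GC safe-point tracking, and an early return on error skips both "put it back" and "unregister it", so every failing async load strands one session that the Go GC can never reclaim.

package main

import (
	"errors"
	"fmt"
	"sync"
)

type session struct {
	id  int
	buf []byte // stands in for the memory a real session holds
}

type sessionPool struct {
	mu      sync.Mutex
	nextID  int
	free    []*session
	tracked map[int]*session // internal record used to advance the GC safe point
}

func (p *sessionPool) get() *session {
	p.mu.Lock()
	defer p.mu.Unlock()
	var s *session
	if n := len(p.free); n > 0 {
		s, p.free = p.free[n-1], p.free[:n-1]
	} else {
		p.nextID++
		s = &session{id: p.nextID, buf: make([]byte, 1<<20)}
	}
	p.tracked[s.id] = s // stays referenced here until put() removes it
	return s
}

func (p *sessionPool) put(s *session) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.tracked, s.id)
	p.free = append(p.free, s)
}

// leakyWithSession mirrors the buggy shape: on error it returns before the
// session is put back, so p.tracked keeps the session alive forever.
func leakyWithSession(p *sessionPool, fn func(*session) error) error {
	s := p.get()
	if err := fn(s); err != nil {
		return err // BUG: s is neither put back nor untracked
	}
	p.put(s)
	return nil
}

// fixedWithSession always returns the session, even on the error path.
func fixedWithSession(p *sessionPool, fn func(*session) error) error {
	s := p.get()
	defer p.put(s)
	return fn(s)
}

func main() {
	p := &sessionPool{tracked: map[int]*session{}}
	loadStats := func(*session) error {
		return errors.New("fail to get stats version for this histogram")
	}
	for i := 0; i < 1000; i++ {
		_ = leakyWithSession(p, loadStats) // every failing call strands one session
	}
	fmt.Println("sessions still referenced:", len(p.tracked)) // 1000: none can be GC'd
}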

@Rustin170506
Member Author

How to trigger the bug and the root cause:

  1. Looking at the user's table structure and data volume, together with the AskTUG information, one characteristic stands out: the tables where the user hits the problem are very small and do not even reach the auto-analyze threshold; the tables involved in the oncall have only 2-4 rows.
  2. Observe the parameters the user adjusted that day:
    05/29 17:00 (UTC+9): tidb_auto_analyze_ratio=0.1, tidb_auto_analyze_partition_batch_size=128
    05/30 18:30: tidb_opt_objective='determinate'
  3. The problem logs actually started appearing on the 30th.
  4. Combined with the fact that the failing code path is the one that loads index histograms, we can determine that the query is trying to load the statistics for the indexes of these tables.
  5. Looking at the code again, there is only one place that could cause the index stats to be loaded incorrectly.
  6. The only thing that could go wrong there is the physical ID not being -1.
  7. Looking upward, the only parameter that might make the physical ID not -1 is allowPseudoTblTriggerLoading.
  8. Looking at where this parameter is set, we see a familiar setting: OptObjectiveDeterminate.

Based on the above information, I suspect that after this variable is enabled, some special conditions trigger the loading of the index statistics of the table even though the index actually has no statistics at all, which leads to an infinite loop of failed loads. A hedged sketch of this decision path is shown below.
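
A schematic Go sketch of the trigger path described above (the names indexStats and maybeQueueAsyncLoad are hypothetical, not TiDB's real code): with tidb_opt_objective='determinate', allowPseudoTblTriggerLoading lets a pseudo (never-analyzed) table keep a real physical table ID instead of -1, so an async histogram-load task is queued for an index that has no row in the stats system table, and that load can only fail.

package main

import "fmt"

type indexStats struct {
	tableID int64 // -1 means "pseudo, do not trigger loading"
	histID  int64
	hasHist bool // whether a histogram row exists in the system table
}

// maybeQueueAsyncLoad mimics the decision: a load is queued whenever the
// physical table ID is not -1. Note that hasHist is deliberately ignored
// here; that omission is exactly the problem described above.
func maybeQueueAsyncLoad(idx indexStats, allowPseudoTblTriggerLoading bool) bool {
	tableID := int64(-1)
	if allowPseudoTblTriggerLoading { // set when the objective is "determinate"
		tableID = idx.tableID
	}
	return tableID != -1
}

func main() {
	// A tiny, never-analyzed table: the index has no histogram row at all.
	idx := indexStats{tableID: 143, histID: 1, hasHist: false}

	fmt.Println(maybeQueueAsyncLoad(idx, false)) // false: pseudo table, no load
	fmt.Println(maybeQueueAsyncLoad(idx, true))  // true: load queued, but it can only fail
}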

@Rustin170506
Member Author

Rustin170506 commented Jun 17, 2024

Summary:

  1. We were able to reproduce the problem by following this path. The only thing we are not quite sure about is how the system table loses the hist records for the index: I removed them manually, and kunqin simulated it by dropping stats.
  2. At present, we can determine that tidb_opt_objective='determinate' is a necessary condition for reproduction.
  3. The affected tables are relatively small and have no statistics, so if the problem is found in time, re-analyzing these small tables resolves it.
  4. The way to fix this is to add another check when loading, as kunqin said, so that we avoid loading this kind of index information (a sketch of that check follows this list).
  5. Because the default value of tidb_opt_objective is not 'determinate', the impact of this problem is not very large, and the probability of users encountering it is small.
  6. As for how the hist records got thrown away in the first place, I feel that the cost of tracking that down directly could be very high, so I think we should first fix the unreasonable loading behaviour, add more related logs in the places where this situation can occur, and continue the investigation the next time it reproduces (by "reproduces" I mean the later scenario where the index exists in the cached index info but the system table has no hist record for it, not the leak itself).
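
A hedged Go sketch of one possible shape of the check mentioned in item 4 (loadIndexHistogram and fetchHistRows are hypothetical stand-ins, not TiDB's real functions): when the async loader finds no histogram row for the requested index, treat it as "nothing to load" and drop the task instead of returning an error, so the task is not retried forever and the session goes back to the pool.

package main

import (
	"errors"
	"fmt"
)

var errNoStatsVersion = errors.New("fail to get stats version for this histogram")

// fetchHistRows stands in for reading the histogram row from the stats
// system table; it returns no rows for a never-analyzed index.
func fetchHistRows(tableID, histID int64) []struct{} { return nil }

// loadIndexHistogram: before the fix, zero rows is an error (and the caller
// leaks the session); with the extra check, zero rows simply skips the load.
func loadIndexHistogram(tableID, histID int64, skipMissing bool) error {
	rows := fetchHistRows(tableID, histID)
	if len(rows) == 0 {
		if skipMissing {
			return nil // sketched fix: nothing to load, not an error
		}
		return fmt.Errorf("%w, table_id:%v, hist_id:%v", errNoStatsVersion, tableID, histID)
	}
	// ... decode and cache the histogram here ...
	return nil
}

func main() {
	fmt.Println(loadIndexHistogram(143, 1, false)) // old behaviour: persistent error
	fmt.Println(loadIndexHistogram(143, 1, true))  // sketched fix: nil
}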

@Rustin170506
Member Author

Rustin170506 commented Jun 17, 2024

The way to reproduce it:

create table t(a int, b int, index ia(a));

drop stats t;

insert into t value(1,1), (2,2);

# Wait for a while until you can see the stats for the table.
show stats_meta; 

set tidb_opt_objective='determinate';

explain select * from t where a = 1 and b = 1;

@seiya-annie

/report customer

@ti-chi-bot added the report/customer label Aug 2, 2024