-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: OOM during TPCC with schema changes #69317
Comments
Note to all working on this: now that #69313 has been merged, the test no longer OOMs, but the underlying issue still exists. We should continue our effort to understand the OOMs better. To reproduce the OOMs, apply this diff:
I tried out #69375 to see if storing stats for all the DEALLOCATE statements separately was the problem, but that PR still resulted in an OOM. Next theory: perhaps the issue is with the prepared statement cache having a memory leak. @ajwerner observes that:
|
I believe there is some sort of issue with dangling references to the prepared statements. I added this diff on top of the one that allows the pgx stmt cache:
Then ran
|
We store portals in a map by value: |
I'm currently looking into this. I tried switching |
Yeah, we looked at it more and that doesn't explain it. I also noticed that we're not clearing |
I did find something suspicious, here is the comment on
Note that we currently don't actually delete any portals created for the prepared statement. The postgres docs aren't clear whether we should be deleting the portals or not, and I haven't looked into the source code yet. I also think it's possible that the switch to values instead of pointers introduced a bug. In
We're taking |
FWIW on the binary with the fixes to both of the problems I mentioned above the |
Portals that are created outside of a transaction are supposed to automatically be destroyed after they are executed. Portals created in a transaction are supposed to automatically be destroyed at the end of the transaction. See f612484. one thing I don't see though is that |
There's a
|
@yuzefovich definitely having the refcounts on values is a problem, but it seems to me like it can only result in "premature" final |
Tough luck - most runs with the fixes to both possible issues outlines above still failed. I also included the deletion of portals from |
68831: ui: add date range selector in stmts and txns pages r=xinhaoz a=xinhaoz Resolves #68089, #69474 This commit adds date range state properties and a date range selector component to the DB console's statements and transactions pages. This change is necessary to support surfacing persisted statistics in the DB console. This commit adds date range state properties and a date range selector component to the DB console's statements and transactions pages. This change is necessary to support surfacing persisted statistics in the DB console. The date range query parameters used by the api to fetch data is stored in localSettings on db console and localStorage in cluster-ui. The default date range is set to 1 hour ago, and is used as the value when user's reset the date range. The user is able to select dates as far back as 1 year ago, this does not mean there is data available. Release justification: category 4 low risk, high benefit changes to existing functionality Release note (ui change): New date range selector component to the DB console's statements page with the ability to show historical data. The default date range is set to 1 hour ago, and is used as the value when user's reset the date range. ------------------ ![image](https://user-images.githubusercontent.com/20136951/131007768-9a7bfeb6-a0f0-417a-8160-51f8d636f73b.png) ![image](https://user-images.githubusercontent.com/20136951/131007809-ff98359e-bfcc-4cb1-9b0b-eb34f8d04cd0.png) ![image](https://user-images.githubusercontent.com/20136951/131036176-11b03aa6-16e5-4601-b92e-e13dd0dc03df.png) ![image](https://user-images.githubusercontent.com/20136951/131198830-6b727e68-0983-483e-9385-f23e16cde705.png) Transactions page: ![image](https://user-images.githubusercontent.com/20136951/131222215-bb2b013c-034e-40bf-8d83-7f5d9784065a.png) 69375: tree: redact name in DEALLOCATE in FmtHideConstants r=ajwerner,maryliag a=rafiss This will make it so that all DEALLOCATE statements show up in one row in the SQL Statements UI and in the internal stats data structures. Recently, a workload was updated (accidentally) to start creating more DEALLOCATE statements, and it made the UI unusable because there were so many DEALLOCATE statement keys. The second commit is a rename to stop implying that statements with redacted constants are "anonymized" -- they still have names+identifiers. This also renames the name of a field in a response payload. Created while working on #69317 but doesn't fix it. Release justification: fix a high-priority bugs with a low-risk change. Release note: None 69457: sql, import: add metrics for max_row_size guardrails r=rytaft,yuzefovich a=michae2 **sql: rename max_row_size guardrails to match transaction row limits** Addresses: #67400 Rename sql.mutations.max_row_size.{log|err} to sql.guardrails.max_row_size_{log|err} for consistency with transaction_rows_{read|written}_{log|err} and upcoming metrics. Release justification: Low-risk update to new functionality. Release note (ops change): New variables sql.mutations.max_row_size.{log|err} were renamed to sql.guardrails.max_row_size_{log|err} for consistency with other variables and metrics. **sql/row: remove dependency on sql/execinfra** The next commit needs to make sql/row a dependency of sql/execinfra, so remove two small references to sql/execinfra. (I will squash this into the next commit before merging.) Release justification: Low-risk update to new functionality. Release note: None **sql, import: plumb RowMetrics from ExecutorConfig to row.Helper** Addresses: #67400 Add metrics for sql/row and pass them down from ExecutorConfig and FlowCtx all the way to row.Helper. Like sql.Metrics, there are two copies, one for user queries and one for internal queries. (These were originally part of sql.Metrics, but there are several users of sql/row that do not operate under a sql.Server or connExecutor so we are forced to add these to the ExecutorConfig and FlowCtx instead.) I ran into difficulty passing these through import, as FlowCtx itself is not plumbed through. There are only two metrics at first, corresponding to sql.guardrails.max_row_size_{log|err}. Release justification: Low-risk update to new functionality. Release note (ops): Added four new metrics, sql.guardrails.max_row_size_{log|err}.count{.internal} which are incremented whenever a large row violates sql.guardrails.max_row_size_{log|err}. 69535: server: remove non-deterministic test r=maryliag a=maryliag A new test was added but it would fail depending on the time it was executed. This commit remove this flaky test so in a future PR we can add new set of tests that make more sense to test this functionality. Fixes #69533 Release justification: Category 3 bug fix Release note: None Co-authored-by: Xin Hao Zhang <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Michael Erickson <[email protected]> Co-authored-by: Marylia Gutierrez <[email protected]>
Removing the release-blocker label since this is not a regression AFAICT. |
I think I was too quick to say that #69479 - I'm now seeing the same failures on |
We have marked this issue as stale because it has been inactive for |
Describe the problem
This is meta issue to clearly phrase observations from debugging #69100 and #69086. In both of those we see OOMs. We've tracked this down to a change in prepared statement handling due to #68608. We observe that
pgx
is issuing lots ofDEALLOCATE
commands, presumably because of something involving its LRU. However, I did try increasing the LRU size to no avail.https://github.com/jackc/pgconn/blob/53f5fed36c570f0b5c98d6ec2415658c7b9bd11c/stmtcache/lru.go#L118
The issue is not about the
DEALLOCATE
or cache misses but rather about the OOM that seems to come with it.To Reproduce
Run
schemachange/index/tpcc/w=1000
at 7c36a9d and it seems to happen every time.Additional data / screenshots
One observation is that we have a lot of allocations in the optimizer and the in numeric constants. These numeric constants hold on to strings which are backed by the pgwire buffer reader. That might be part of the problem and might point to why we don't see the data in the heap profile. However, there's still the question of why we're holding on to all of these. Maybe it's related to the statements metrics.
Another point of confusion is the extent to which the schema changes are actually relevant here. Is it the straw that breaks the camels back?
Jira issue: CRDB-9542
The text was updated successfully, but these errors were encountered: