-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] Bitmap Scan crash on fastdebug gcc11 #23943
Labels
2024.1.3_blocker
2024.2.0_blocker
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
kind/failing-test
Tests and testing infra
priority/medium
Medium priority issue
Comments
timothy-e
added
area/ysql
Yugabyte SQL (YSQL)
status/awaiting-triage
Issue awaiting triage
labels
Sep 16, 2024
yugabyte-ci
added
kind/failing-test
Tests and testing infra
priority/medium
Medium priority issue
kind/enhancement
This is an enhancement of an existing feature
2024.1.3_blocker
2024.2.0_blocker
and removed
status/awaiting-triage
Issue awaiting triage
labels
Sep 16, 2024
timothy-e
added a commit
that referenced
this issue
Sep 17, 2024
Summary: Intermittently, in fastdebug compiled with gcc11 (no repro on other build types, or on fastdebug with clang), bitmap scan queries with a Bitmap And crash. The crash is caused by a segmentation fault while removing an item from the bitmap set: ```lang=c++ // for each elem in a, if it's not also in b, delete a's copy auto iterb = b->begin(); for (auto itera = a->begin(); itera != a->end();) { if ((iterb = b->find(*itera)) == b->end()) { FreeSlice(*itera); itera = a->erase(itera); // <----- crashes here inside the erase call. } else { ++itera; } } ``` Moving the `FreeSlice` line to after removing the item from the list allows it to complete successfully, with the same behaviour as before. My guess is that gcc11 tries to invoke the destructor of the slice when removing it from the set, and since the slice is freed but not set to NULL, it points to some garbage value that occasionally results in a crash. Test Plan: ```lang=sj ./yb_build.sh fastdebug --java-test 'org.yb.pgsql.TestPgRegressYbBitmapScans' --gcc11 ``` Or, manually: 1. Build: ``` ./yb_build.sh fastdebug --gcc11 ``` 2. Restart cluster 3. Run ```lang=sql /*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */ EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5; \watch .1 ``` Before this change, it would crash usually on the first or second invocation. With this change, it runs successfully for at least a few minutes. Reviewers: amartsinchyk Reviewed By: amartsinchyk Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D38080
jasonyb
pushed a commit
that referenced
this issue
Sep 19, 2024
Summary: 84f3fab [PLAT-15322] Make sure build files have fresh last_modified date to make sure Play Framework assets caching works as expected c7af74d [PLAT-15288] Use set_dbs endpoint when editing table selection for db scoped DR configs 8faeca6 [PLAT-15300] Update task progress poller logic e4f5943 [doc][ybm] VictoriaMetrics (#23819) 70aa7d7 [doc][ybm] Tablet peer alert (#23942) 2f70696 [doc] Smart driver clarification (#23933) 1525ced [docs] fix for a yb version not rendering (#23944) e9f3ec2 [#23843] YSQL: Fix flaky test testSchemaMismatchRetry in TestPgBatch 89b69cf [#23943]: YSQL: Fix Bitmap Scan crash in fastdebug GCC11 27446e2 [DOC-470] Include SSL Connectivity within the source database tabs. (#23878) def0fac [#23879] docdb: Improve rpc metrics test. 87a936a [PLAT-15353] Consistency checks testing hooks 0c41023 [#23956] YSQL: Fix org.yb.pgsql.TestYsqlMetrics#testMetricRows 388e045 [#21625,#21627] Docdb: Clear stale meta-cache entries at the end of clone Excluded: 5523770 [#23547] YSQL: fix pg_hint_plan crash with pg_hint_plan.enable_hint_table enabled 5951e18 [#23881] docdb: Update the hint to advisory_locks 10b5009 Minorfixes (#23986) 3d33b3e [PLAt-15133][PLAT-15332] Fix the preflight check for disk mount 9d8366b [PLAT-15345] Set 755 permissions on node-agent service file d298d44 [#22135] YSQL: Avoid read restart errors with ANALYZE 240e8f0 [PLAT-15355] Fix Node Addition Precheck logic to work correctly in case provider_id is missing 2a40433 [PLAT-15262]Add more checks for non-namespace scope supported universes f957dda [Docs] Minor fixes to docs pages around transactions (#23926) e4a8548 [PLAT-15326] Proper error handling in DDL atomicity check 10a629e [PLAT-13998][PLAT-15215]Support Image bundle creation and updation in provider requests 5f95ff9 [#23905] DocDB: Persistence for Master side Table/Object locks d4103e8 [#23513] YSQL: Fix broken org.yb.pgsql.TestYsqlMetrics#testExplainMaxMemory unit test Merge: - yb_pg_dbms_alert_session_A.out: - "advisory locks are not yet implemented": YB master 5951e18 changes the hint message for test queries added by YB pg15. Update the hint message. - yb_pg_dbms_alert_session_B.out: (same) - yb_pg_dbms_alert_session_C.out: (same) - yb_pg_dbms_pipe_session_A.out: (same) - yb_pg_dbms_pipe_session_B.out: (same) Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: jason, tfoucher, qhu Reviewed By: qhu Subscribers: qhu Differential Revision: https://phorge.dev.yugabyte.com/D38163
timothy-e
added a commit
that referenced
this issue
Sep 24, 2024
Summary: 89b69cf / D38080 attempted to fix the crash in GCC11, but after an initial successful run on Jenkins I made a small change and tested locally, omitting the `--gcc11` flag that was required to reproduce the crash. Since I had already seen a successful Jenkins run and believed the local results were still good, I didn't look closely at the Jenkins results and missed that it had started failing again on gcc11 fastdebug. The previous hypothesis was that `delete[]`ing the memory caused the `Slice`'s data to point to a garbage value, which could potentially cause errors while erasing the slice from the unordered_set. This proved not to be true, as calling `Slice::Clear` after deleting the memory did not resolve the problem. The problem is actually mentioned by the [[ https://en.cppreference.com/w/cpp/container/unordered_set | C++ standard ]] > Container elements may not be modified (even by non const iterators) since modification could change an element's hash and corrupt the container. Rearranging the instructions so that the modification of the Slice occurs after it is removed from the set solves the problem. This is also the same as the first draft of the previous attempt of a fix, but now the reason is well understood. Jira: DB-12839 Test Plan: Jenkins, Manually: ``` Build: ./yb_build.sh fastdebug --gcc11 Restart cluster Run /*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */ EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5; \watch .1 ``` Reviewers: amartsinchyk Reviewed By: amartsinchyk Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D38318
timothy-e
added a commit
that referenced
this issue
Sep 24, 2024
Summary: Original commit: 76a5c97 / D38318 == Original Summary == 89b69cf / D38080 attempted to fix the crash in GCC11, but after an initial successful run on Jenkins I made a small change and tested locally, omitting the `--gcc11` flag that was required to reproduce the crash. Since I had already seen a successful Jenkins run and believed the local results were still good, I didn't look closely at the Jenkins results and missed that it had started failing again on gcc11 fastdebug. The previous hypothesis was that `delete[]`ing the memory caused the `Slice`'s data to point to a garbage value, which could potentially cause errors while erasing the slice from the unordered_set. This proved not to be true, as calling `Slice::Clear` after deleting the memory did not resolve the problem. The problem is actually mentioned by the [[ https://en.cppreference.com/w/cpp/container/unordered_set | C++ standard ]] > Container elements may not be modified (even by non const iterators) since modification could change an element's hash and corrupt the container. Rearranging the instructions so that the modification of the Slice occurs after it is removed from the set solves the problem. This is also the same as the first draft of the previous attempt of a fix, but now the reason is well understood. Jira: DB-12839 == Differences with the original diff == The original diff (D38318) removed the line `slice.Clear();` from the `FreeSlice` function. That line was added by the previous attempt (D38080) to address the crash. * master and 2024.2 had the `slice.Clear();` line, because 2024.2 was cut after D38080 landed. * 2024.1 did not have the `slice.Clear()` line because D38080 was not backported to 2024.1. This diff does not remove the `slice.Clear()` line from `FreeSlice`, because it is not present in this branch. Test Plan: Jenkins, Manually: ``` Build: ./yb_build.sh fastdebug --gcc11 Restart cluster Run /*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */ EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5; \watch .1 ``` Reviewers: amartsinchyk, smishra Reviewed By: amartsinchyk, smishra Subscribers: smishra, yql Differential Revision: https://phorge.dev.yugabyte.com/D38350
timothy-e
pushed a commit
that referenced
this issue
Sep 24, 2024
Summary: 8916c1d [#21467,#21783, #23667] Docdb: Add sequences support for clone part 1 b4a0b45 [PLAT-14944] Enable db scoped replication if runtime config is enabled b062c44 Update link for crd file (#24083) d3f0da5 [PLAT-14898] Record reason for failing to enable node agent b135dfc update images (#24104) 846e35f [PLAT-14597] Fetch xCluster replication/DR configs with and without extra table info 1bc9e06 [#24091] xClusterDDLRepl: Pass automatic_ddl_mode to pollers afc424d [#23824] YSQL: Fix crash for when a RowComparisonExpression is used on a reordered primary key index 76a5c97 [#23943] YSQL: Fix Bitmap Scan GCC11 crash (follow-up) fd23b15 [#24040] YSQL: Simplify the PatchStatus function 47a4723 [PLAT-15441] Pin golang package version in build.sh to prevent incompatible versions to be installed 1edfa4c [PLAT-12224][PLAT-15238] Add metric for connection pooling Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: jason, tfoucher Subscribers: telgersma Differential Revision: https://phorge.dev.yugabyte.com/D38365
timothy-e
added a commit
that referenced
this issue
Sep 24, 2024
Summary: Original commit: 76a5c97 / D38318 89b69cf / D38080 attempted to fix the crash in GCC11, but after an initial successful run on Jenkins I made a small change and tested locally, omitting the `--gcc11` flag that was required to reproduce the crash. Since I had already seen a successful Jenkins run and believed the local results were still good, I didn't look closely at the Jenkins results and missed that it had started failing again on gcc11 fastdebug. The previous hypothesis was that `delete[]`ing the memory caused the `Slice`'s data to point to a garbage value, which could potentially cause errors while erasing the slice from the unordered_set. This proved not to be true, as calling `Slice::Clear` after deleting the memory did not resolve the problem. The problem is actually mentioned by the [[ https://en.cppreference.com/w/cpp/container/unordered_set | C++ standard ]] > Container elements may not be modified (even by non const iterators) since modification could change an element's hash and corrupt the container. Rearranging the instructions so that the modification of the Slice occurs after it is removed from the set solves the problem. This is also the same as the first draft of the previous attempt of a fix, but now the reason is well understood. Jira: DB-12839 Test Plan: Jenkins, Manually: ``` Build: ./yb_build.sh fastdebug --gcc11 Restart cluster Run /*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */ EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5; \watch .1 ``` Reviewers: amartsinchyk Reviewed By: amartsinchyk Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D38351
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2024.1.3_blocker
2024.2.0_blocker
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
kind/failing-test
Tests and testing infra
priority/medium
Medium priority issue
Jira Link: DB-12839
Description
Issue Type
kind/failing-test
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: