Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Bitmap Scan crash on fastdebug gcc11 #23943

Closed
1 task done
timothy-e opened this issue Sep 16, 2024 · 0 comments
Closed
1 task done

[YSQL] Bitmap Scan crash on fastdebug gcc11 #23943

timothy-e opened this issue Sep 16, 2024 · 0 comments
Assignees
Labels
2024.1.3_blocker 2024.2.0_blocker area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature kind/failing-test Tests and testing infra priority/medium Medium priority issue

Comments

@timothy-e
Copy link
Contributor

timothy-e commented Sep 16, 2024

Jira Link: DB-12839

Description

./yb_build.sh fastdebug --java-test 'org.yb.pgsql.TestPgRegressYbBitmapScans' --gcc11

Issue Type

kind/failing-test

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@timothy-e timothy-e added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Sep 16, 2024
@timothy-e timothy-e self-assigned this Sep 16, 2024
@yugabyte-ci yugabyte-ci added kind/failing-test Tests and testing infra priority/medium Medium priority issue kind/enhancement This is an enhancement of an existing feature 2024.1.3_blocker 2024.2.0_blocker and removed status/awaiting-triage Issue awaiting triage labels Sep 16, 2024
timothy-e added a commit that referenced this issue Sep 17, 2024
Summary:
Intermittently, in fastdebug compiled with gcc11 (no repro on other build types, or on fastdebug with clang), bitmap scan queries with a Bitmap And crash.

The crash is caused by a segmentation fault while removing an item from the bitmap set:
```lang=c++
  // for each elem in a, if it's not also in b, delete a's copy
  auto iterb = b->begin();
  for (auto itera = a->begin(); itera != a->end();) {
    if ((iterb = b->find(*itera)) == b->end()) {
      FreeSlice(*itera);
      itera = a->erase(itera); // <----- crashes here inside the erase call.
    } else {
      ++itera;
    }
  }
```

Moving the `FreeSlice` line to after removing the item from the list allows it to complete successfully, with the same behaviour as before. My guess is that gcc11 tries to invoke the destructor of the slice when removing it from the set, and since the slice is freed but not set to NULL, it points to some garbage value that occasionally results in a crash.

Test Plan:
```lang=sj
./yb_build.sh fastdebug --java-test 'org.yb.pgsql.TestPgRegressYbBitmapScans' --gcc11
```

Or, manually:
1. Build:
```
./yb_build.sh fastdebug --gcc11
```

2. Restart cluster

3. Run
```lang=sql
/*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */
EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5;

\watch .1
```

Before this change, it would crash usually on the first or second invocation. With this change, it runs successfully for at least a few minutes.

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D38080
jasonyb pushed a commit that referenced this issue Sep 19, 2024
Summary:
 84f3fab [PLAT-15322] Make sure build files have fresh last_modified date to make sure Play Framework assets caching works as expected
 c7af74d [PLAT-15288] Use set_dbs endpoint when editing table selection for db scoped DR configs
 8faeca6 [PLAT-15300] Update task progress poller logic
 e4f5943 [doc][ybm] VictoriaMetrics (#23819)
 70aa7d7 [doc][ybm] Tablet peer alert (#23942)
 2f70696 [doc] Smart driver clarification (#23933)
 1525ced [docs] fix for a yb version not rendering (#23944)
 e9f3ec2 [#23843] YSQL: Fix flaky test testSchemaMismatchRetry in TestPgBatch
 89b69cf [#23943]: YSQL: Fix Bitmap Scan crash in fastdebug GCC11
 27446e2 [DOC-470] Include SSL Connectivity within the source database tabs. (#23878)
 def0fac [#23879] docdb: Improve rpc metrics test.
 87a936a [PLAT-15353] Consistency checks testing hooks
 0c41023 [#23956] YSQL: Fix org.yb.pgsql.TestYsqlMetrics#testMetricRows
 388e045 [#21625,#21627] Docdb: Clear stale meta-cache entries at the end of clone
 Excluded: 5523770 [#23547] YSQL: fix pg_hint_plan crash with pg_hint_plan.enable_hint_table enabled
 5951e18 [#23881] docdb: Update the hint to advisory_locks
 10b5009 Minorfixes (#23986)
 3d33b3e [PLAt-15133][PLAT-15332] Fix the preflight check for disk mount
 9d8366b [PLAT-15345] Set  755 permissions on node-agent service file
 d298d44 [#22135] YSQL: Avoid read restart errors with ANALYZE
 240e8f0 [PLAT-15355] Fix Node Addition Precheck logic to work correctly in case provider_id is missing
 2a40433 [PLAT-15262]Add more checks for non-namespace scope supported universes
 f957dda [Docs] Minor fixes to docs pages around transactions (#23926)
 e4a8548 [PLAT-15326] Proper error handling in DDL atomicity check
 10a629e [PLAT-13998][PLAT-15215]Support Image bundle creation and updation in provider requests
 5f95ff9 [#23905] DocDB: Persistence for Master side Table/Object locks
 d4103e8 [#23513]  YSQL: Fix broken org.yb.pgsql.TestYsqlMetrics#testExplainMaxMemory unit test

Merge:
- yb_pg_dbms_alert_session_A.out:
  - "advisory locks are not yet implemented": YB master
    5951e18 changes the hint message
    for test queries added by YB pg15.  Update the hint message.
- yb_pg_dbms_alert_session_B.out: (same)
- yb_pg_dbms_alert_session_C.out: (same)
- yb_pg_dbms_pipe_session_A.out: (same)
- yb_pg_dbms_pipe_session_B.out: (same)

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher, qhu

Reviewed By: qhu

Subscribers: qhu

Differential Revision: https://phorge.dev.yugabyte.com/D38163
timothy-e added a commit that referenced this issue Sep 24, 2024
Summary:
89b69cf / D38080 attempted to fix the crash in GCC11, but after an initial successful run on Jenkins I made a small change and tested locally, omitting the `--gcc11` flag that was required to reproduce the crash. Since I had already seen a successful Jenkins run and believed the local results were still good, I didn't look closely at the Jenkins results and missed that it had started failing again on gcc11 fastdebug.

The previous hypothesis was that `delete[]`ing the memory caused the `Slice`'s data to point to a garbage value, which could potentially cause errors while erasing the slice from the unordered_set. This proved not to be true, as calling `Slice::Clear` after deleting the memory did not resolve the problem.

The problem is actually mentioned by the [[ https://en.cppreference.com/w/cpp/container/unordered_set | C++ standard ]]
> Container elements may not be modified (even by non const iterators) since modification could change an element's hash and corrupt the container.

Rearranging the instructions so that the modification of the Slice occurs after it is removed from the set solves the problem. This is also the same as the first draft of the previous attempt of a fix, but now the reason is well understood.
Jira: DB-12839

Test Plan:
Jenkins,

Manually:
```
Build:
./yb_build.sh fastdebug --gcc11
Restart cluster
Run
/*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */
EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5;

\watch .1
```

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D38318
timothy-e added a commit that referenced this issue Sep 24, 2024
Summary:
Original commit: 76a5c97 / D38318
== Original Summary ==
89b69cf / D38080 attempted to fix the crash in GCC11, but after an initial successful run on Jenkins I made a small change and tested locally, omitting the `--gcc11` flag that was required to reproduce the crash. Since I had already seen a successful Jenkins run and believed the local results were still good, I didn't look closely at the Jenkins results and missed that it had started failing again on gcc11 fastdebug.

The previous hypothesis was that `delete[]`ing the memory caused the `Slice`'s data to point to a garbage value, which could potentially cause errors while erasing the slice from the unordered_set. This proved not to be true, as calling `Slice::Clear` after deleting the memory did not resolve the problem.

The problem is actually mentioned by the [[ https://en.cppreference.com/w/cpp/container/unordered_set | C++ standard ]]
> Container elements may not be modified (even by non const iterators) since modification could change an element's hash and corrupt the container.

Rearranging the instructions so that the modification of the Slice occurs after it is removed from the set solves the problem. This is also the same as the first draft of the previous attempt of a fix, but now the reason is well understood.

Jira: DB-12839

== Differences with the original diff ==
The original diff (D38318) removed the line `slice.Clear();` from the `FreeSlice` function. That line was added by the previous attempt (D38080) to address the crash.
* master and 2024.2 had the `slice.Clear();` line, because 2024.2 was cut after D38080 landed.
* 2024.1 did not have the `slice.Clear()` line because D38080 was not backported to 2024.1.

This diff does not remove the `slice.Clear()` line from `FreeSlice`, because it is not present in this branch.

Test Plan:
Jenkins,

Manually:
```
Build:
./yb_build.sh fastdebug --gcc11
Restart cluster
Run
/*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */
EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5;

\watch .1
```

Reviewers: amartsinchyk, smishra

Reviewed By: amartsinchyk, smishra

Subscribers: smishra, yql

Differential Revision: https://phorge.dev.yugabyte.com/D38350
timothy-e pushed a commit that referenced this issue Sep 24, 2024
Summary:
 8916c1d [#21467,#21783, #23667] Docdb: Add sequences support for clone part 1
 b4a0b45 [PLAT-14944] Enable db scoped replication if runtime config is enabled
 b062c44 Update link for crd file (#24083)
 d3f0da5 [PLAT-14898] Record reason for failing to enable node agent
 b135dfc update images (#24104)
 846e35f [PLAT-14597] Fetch xCluster replication/DR configs with and without extra table info
 1bc9e06 [#24091] xClusterDDLRepl: Pass automatic_ddl_mode to pollers
 afc424d [#23824] YSQL: Fix crash for when a RowComparisonExpression is used on a reordered primary key index
 76a5c97 [#23943] YSQL: Fix Bitmap Scan GCC11 crash (follow-up)
 fd23b15 [#24040] YSQL: Simplify the PatchStatus function
 47a4723 [PLAT-15441] Pin golang package version in build.sh to prevent incompatible versions to be installed
 1edfa4c [PLAT-12224][PLAT-15238] Add metric for connection pooling

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: telgersma

Differential Revision: https://phorge.dev.yugabyte.com/D38365
timothy-e added a commit that referenced this issue Sep 24, 2024
Summary:
Original commit: 76a5c97 / D38318
89b69cf / D38080 attempted to fix the crash in GCC11, but after an initial successful run on Jenkins I made a small change and tested locally, omitting the `--gcc11` flag that was required to reproduce the crash. Since I had already seen a successful Jenkins run and believed the local results were still good, I didn't look closely at the Jenkins results and missed that it had started failing again on gcc11 fastdebug.

The previous hypothesis was that `delete[]`ing the memory caused the `Slice`'s data to point to a garbage value, which could potentially cause errors while erasing the slice from the unordered_set. This proved not to be true, as calling `Slice::Clear` after deleting the memory did not resolve the problem.

The problem is actually mentioned by the [[ https://en.cppreference.com/w/cpp/container/unordered_set | C++ standard ]]
> Container elements may not be modified (even by non const iterators) since modification could change an element's hash and corrupt the container.

Rearranging the instructions so that the modification of the Slice occurs after it is removed from the set solves the problem. This is also the same as the first draft of the previous attempt of a fix, but now the reason is well understood.
Jira: DB-12839

Test Plan:
Jenkins,

Manually:
```
Build:
./yb_build.sh fastdebug --gcc11
Restart cluster
Run
/*+ Set(yb_enable_bitmapscan true) Set(yb_enable_base_scans_cost_model true) BitmapScan(t) */
EXPLAIN (ANALYZE, SUMMARY OFF, COSTS OFF) SELECT * FROM test_and t WHERE a < 5 AND b < 5;

\watch .1
```

Reviewers: amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D38351
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2024.1.3_blocker 2024.2.0_blocker area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature kind/failing-test Tests and testing infra priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

2 participants