workload/ycsb: take advantage of SELECT FOR UPDATE #45755

nvanbenschoten · 2020-03-05T17:24:11Z

Performance numbers for this change and #45701 aren't quite ready to share,
but I'll post here and there when they are.

workload/ycsb: define all fields as NOT NULL

This change marks all columns in the ycsb "usertable" as NOT NULL. Doing
so allows the load generator to take advantage of #44239, which avoids
the KV lookup on the primary column family entirely when querying one of
the other column families. The query plans before and after demonstrate
this:

--- Before
root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key';
    tree    |    field    |               description
------------+-------------+------------------------------------------
            | distributed | false
            | vectorized  | false
  render    |             |
   └── scan |             |
            | table       | usertable@primary
            | spans       | /"key"/0-/"key"/1 /"key"/6/1-/"key"/6/2
            | parallel    |


--- After
root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key';
    tree    |    field    |      description
------------+-------------+------------------------
            | distributed | false
            | vectorized  | false
  render    |             |
   └── scan |             |
            | table       | usertable@primary
            | spans       | /"key"/6/1-/"key"/6/2

This becomes very important when running YCSB with a column family per
field and with implicit SELECT FOR UPDATE (see #45159). Now that (as of
#45701) UPDATE statements acquire upgrade locks during their initial row
fetch, we don't want them acquiring upgrade locks on the primary column
family of the row they are intending to update a single column in. This
re-introduces the contention between writes to different columns in the
same row that column families helped avoid (see #32704). By marking each
column as NOT NULL, we can continue to avoid this contention.

workload/ycsb: use SELECT FOR UPDATE in read-modify-write transactions

50% of workload F is read-modify-write transactions. These transactions
scan a single column from a row using a SELECT statement and then modify
that single column using an UPDATE statement. Like most workloads, the
default request distribution is zipfian, meaning that there are some rows
that are heavily contended.

This is the perfect use case for explicit SELECT FOR UPDATE. Using it for
the "read" portion of the read-modify-write transaction dramatically improves
throughput and eliminates transaction retries. This commit makes that change.

Confusingly, these are not implemented as real transactions in the official
YCSB load generator, but that seems strictly incorrect and was likely
dictated by the lack of support for transactions in NoSQL engines of the
time:
https://github.com/brianfrankcooper/YCSB/blob/cd1589ce6f5abf96e17aa8ab80c78a4348fdf29a/core/src/main/java/site/ycsb/workloads/CoreWorkload.java#L729

Also, strangely Workload F wasn't in the initial YCSB paper:
https://www2.cs.duke.edu/courses/fall13/compsci590.4/838-CloudPapers/ycsb.pdf
But it was in the initial commit of the YCSB load generator:
brianfrankcooper/YCSB@24efaff#diff-d240bdb33fb042b95010733db8ccec63

This might break compatibility with old versions of CRDB, as I think we
only started accepting SELECT FOR UPDATE in 19.2. The commit includes a
flag to avoid the functionality if necessary.

This change marks all columns in the YCSB "usertable" as NOT NULL. Doing so allows the load generator to take advantage of cockroachdb#44239, which avoids the KV lookup on the primary column family entirely when querying one of the other column families. The query plans before and after demonstrate this: ``` --- Before root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key'; tree | field | description ------------+-------------+------------------------------------------ | distributed | false | vectorized | false render | | └── scan | | | table | usertable@primary | spans | /"key"/0-/"key"/1 /"key"/6/1-/"key"/6/2 | parallel | --- After root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key'; tree | field | description ------------+-------------+------------------------ | distributed | false | vectorized | false render | | └── scan | | | table | usertable@primary | spans | /"key"/6/1-/"key"/6/2 ``` This becomes very important when running YCSB with a column family per field and with implicit SELECT FOR UPDATE (see cockroachdb#45159). Now that (as of cockroachdb#45701) UPDATE statements acquire upgrade locks during their initial row fetch, we don't want them acquiring upgrade locks on the primary column family of the row they are intending to update a single column in. This re-introduces the contention between writes to different columns in the same row that column families helped avoid (see cockroachdb#32704). By marking each column as NOT NULL, we can continue to avoid this contention.

cockroach-teamcity · 2020-03-05T17:24:20Z

This change is

nvanbenschoten · 2020-03-05T17:46:38Z

Also, @solongordon thanks again for making this possible with #44239.

ajwerner

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/workload/ycsb/ycsb.go, line 124 at r2 (raw file):

		g.flags.BoolVar(&g.json, `json`, false, `Use JSONB rather than relational data`)
		g.flags.BoolVar(&g.families, `families`, true, `Place each column in its own column family`)
		g.flags.BoolVar(&g.sfu, `select-for-update`, true, `Use SELECT FOR UPDATE syntax in read-modify-write transactions`)

If you're going to set this to true do you need to update the roachtests? Would it be better to set it to false and then have the roachtests inspect the version?

Roachtests for all releases use the master workload.

50% of workload F is read-modify-write transactions. These transactions scan a single column from a row using a SELECT statement and then modify that single column using an UPDATE statement. Like most workloads, the default request distribution is zipfian, meaning that there are some rows that are heavily contended. This is the perfect use case for explicit SELECT FOR UPDATE. Using it for the "read" portion of the read-modify-write transaction dramatically improves throughput and eliminates transaction retries. This commit makes that change. Confusingly, these are not implemented as real transactions in the official YCSB load generator, but that seems strictly incorrect and was likely dictated by the lack of support for transactions in NoSQL engines of the time: https://github.com/brianfrankcooper/YCSB/blob/cd1589ce6f5abf96e17aa8ab80c78a4348fdf29a/core/src/main/java/site/ycsb/workloads/CoreWorkload.java#L729 Also, strangely Workload F wasn't in the initial YCSB paper: https://www2.cs.duke.edu/courses/fall13/compsci590.4/838-CloudPapers/ycsb.pdf But it was in the initial commit of the YCSB load generator: brianfrankcooper/YCSB@24efaff#diff-d240bdb33fb042b95010733db8ccec63 This might break compatibility with old versions of CRDB, as I think we only started accepting SELECT FOR UPDATE in 19.2. The commit includes a flag to avoid the functionality, if necessary.

nvanbenschoten

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

pkg/workload/ycsb/ycsb.go, line 124 at r2 (raw file):

Previously, ajwerner wrote…

If you're going to set this to true do you need to update the roachtests? Would it be better to set it to false and then have the roachtests inspect the version?

Roachtests for all releases use the master workload.

Yeah, I was just going to see which roachtest failed, but that was lazy. Done.

sumeerbhola

Now that (as of
#45701) UPDATE statements acquire upgrade locks during their initial row
fetch, we don't want them acquiring upgrade locks on the primary column
family of the row they are intending to update a single column in. This
re-introduces the contention between writes to different columns

Is the implicit "SELECT FOR UPDATE" decision making know that there are multiple column families and that an update on a nullable column should not by default acquire locks? Or does it require query-level guidance from the user?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

nvanbenschoten · 2020-03-06T17:42:30Z

Is the implicit "SELECT FOR UPDATE" decision making know that there are multiple column families and that an update on a nullable column should not by default acquire locks?

The implicit SELECT FOR UPDATE determination does not take into consideration column families or whether there are column families being queried that are not being updated. I think there's an argument that could be made that it should, though. I initially looked into selectively locking some of the column families being queried and that turned out to be a huge lift. See the comment that contains "So the key here" in #45701 (review) for a bit more detail about this.

Disabling implicit SFU when we're not updating all families is a more tractable approach to avoiding increased contention in cases like these. That said, this really only comes up in these rare cases where column families are used to isolate contention between writes to disjoint sets of families, in which case you do want implicit SFU! So I'm leaning towards thinking that we should just document the use of NOT NULL columns with column families - using those two together is an optimization outside of this case anyway, as it avoids the need to query multiple keys.

ajwerner

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

nvanbenschoten requested review from solongordon and ajwerner March 5, 2020 17:24

nvanbenschoten removed the request for review from solongordon March 5, 2020 20:58

ajwerner reviewed Mar 5, 2020

View reviewed changes

nvanbenschoten force-pushed the nvanbenschoten/ycsbSFU branch from 9d2fccb to db449f3 Compare March 5, 2020 23:58

nvanbenschoten commented Mar 5, 2020

View reviewed changes

sumeerbhola reviewed Mar 6, 2020

View reviewed changes

ajwerner approved these changes Mar 8, 2020

View reviewed changes

nvanbenschoten merged commit 01f131d into cockroachdb:master Mar 9, 2020

nvanbenschoten deleted the nvanbenschoten/ycsbSFU branch March 16, 2020 16:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workload/ycsb: take advantage of SELECT FOR UPDATE #45755

workload/ycsb: take advantage of SELECT FOR UPDATE #45755

nvanbenschoten commented Mar 5, 2020

cockroach-teamcity commented Mar 5, 2020

nvanbenschoten commented Mar 5, 2020

ajwerner left a comment

nvanbenschoten left a comment

sumeerbhola left a comment

nvanbenschoten commented Mar 6, 2020

ajwerner left a comment

workload/ycsb: take advantage of SELECT FOR UPDATE #45755

workload/ycsb: take advantage of SELECT FOR UPDATE #45755

Conversation

nvanbenschoten commented Mar 5, 2020

workload/ycsb: define all fields as NOT NULL

workload/ycsb: use SELECT FOR UPDATE in read-modify-write transactions

cockroach-teamcity commented Mar 5, 2020

nvanbenschoten commented Mar 5, 2020

ajwerner left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Mar 6, 2020

ajwerner left a comment

Choose a reason for hiding this comment