-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
workload/ycsb: take advantage of SELECT FOR UPDATE #45755
workload/ycsb: take advantage of SELECT FOR UPDATE #45755
Conversation
This change marks all columns in the YCSB "usertable" as NOT NULL. Doing so allows the load generator to take advantage of cockroachdb#44239, which avoids the KV lookup on the primary column family entirely when querying one of the other column families. The query plans before and after demonstrate this: ``` --- Before root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key'; tree | field | description ------------+-------------+------------------------------------------ | distributed | false | vectorized | false render | | └── scan | | | table | usertable@primary | spans | /"key"/0-/"key"/1 /"key"/6/1-/"key"/6/2 | parallel | --- After root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key'; tree | field | description ------------+-------------+------------------------ | distributed | false | vectorized | false render | | └── scan | | | table | usertable@primary | spans | /"key"/6/1-/"key"/6/2 ``` This becomes very important when running YCSB with a column family per field and with implicit SELECT FOR UPDATE (see cockroachdb#45159). Now that (as of cockroachdb#45701) UPDATE statements acquire upgrade locks during their initial row fetch, we don't want them acquiring upgrade locks on the primary column family of the row they are intending to update a single column in. This re-introduces the contention between writes to different columns in the same row that column families helped avoid (see cockroachdb#32704). By marking each column as NOT NULL, we can continue to avoid this contention.
Also, @solongordon thanks again for making this possible with #44239. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/workload/ycsb/ycsb.go, line 124 at r2 (raw file):
g.flags.BoolVar(&g.json, `json`, false, `Use JSONB rather than relational data`) g.flags.BoolVar(&g.families, `families`, true, `Place each column in its own column family`) g.flags.BoolVar(&g.sfu, `select-for-update`, true, `Use SELECT FOR UPDATE syntax in read-modify-write transactions`)
If you're going to set this to true do you need to update the roachtests? Would it be better to set it to false and then have the roachtests inspect the version?
Roachtests for all releases use the master workload.
50% of workload F is read-modify-write transactions. These transactions scan a single column from a row using a SELECT statement and then modify that single column using an UPDATE statement. Like most workloads, the default request distribution is zipfian, meaning that there are some rows that are heavily contended. This is the perfect use case for explicit SELECT FOR UPDATE. Using it for the "read" portion of the read-modify-write transaction dramatically improves throughput and eliminates transaction retries. This commit makes that change. Confusingly, these are not implemented as real transactions in the official YCSB load generator, but that seems strictly incorrect and was likely dictated by the lack of support for transactions in NoSQL engines of the time: https://github.com/brianfrankcooper/YCSB/blob/cd1589ce6f5abf96e17aa8ab80c78a4348fdf29a/core/src/main/java/site/ycsb/workloads/CoreWorkload.java#L729 Also, strangely Workload F wasn't in the initial YCSB paper: https://www2.cs.duke.edu/courses/fall13/compsci590.4/838-CloudPapers/ycsb.pdf But it was in the initial commit of the YCSB load generator: brianfrankcooper/YCSB@24efaff#diff-d240bdb33fb042b95010733db8ccec63 This might break compatibility with old versions of CRDB, as I think we only started accepting SELECT FOR UPDATE in 19.2. The commit includes a flag to avoid the functionality, if necessary.
9d2fccb
to
db449f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/workload/ycsb/ycsb.go, line 124 at r2 (raw file):
Previously, ajwerner wrote…
If you're going to set this to true do you need to update the roachtests? Would it be better to set it to false and then have the roachtests inspect the version?
Roachtests for all releases use the master workload.
Yeah, I was just going to see which roachtest failed, but that was lazy. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that (as of
#45701) UPDATE statements acquire upgrade locks during their initial row
fetch, we don't want them acquiring upgrade locks on the primary column
family of the row they are intending to update a single column in. This
re-introduces the contention between writes to different columns
Is the implicit "SELECT FOR UPDATE" decision making know that there are multiple column families and that an update on a nullable column should not by default acquire locks? Or does it require query-level guidance from the user?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
The implicit SELECT FOR UPDATE determination does not take into consideration column families or whether there are column families being queried that are not being updated. I think there's an argument that could be made that it should, though. I initially looked into selectively locking some of the column families being queried and that turned out to be a huge lift. See the comment that contains "So the key here" in #45701 (review) for a bit more detail about this. Disabling implicit SFU when we're not updating all families is a more tractable approach to avoiding increased contention in cases like these. That said, this really only comes up in these rare cases where column families are used to isolate contention between writes to disjoint sets of families, in which case you do want implicit SFU! So I'm leaning towards thinking that we should just document the use of NOT NULL columns with column families - using those two together is an optimization outside of this case anyway, as it avoids the need to query multiple keys. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
Performance numbers for this change and #45701 aren't quite ready to share,
but I'll post here and there when they are.
workload/ycsb: define all fields as NOT NULL
This change marks all columns in the ycsb "usertable" as NOT NULL. Doing
so allows the load generator to take advantage of #44239, which avoids
the KV lookup on the primary column family entirely when querying one of
the other column families. The query plans before and after demonstrate
this:
This becomes very important when running YCSB with a column family per
field and with implicit SELECT FOR UPDATE (see #45159). Now that (as of
#45701) UPDATE statements acquire upgrade locks during their initial row
fetch, we don't want them acquiring upgrade locks on the primary column
family of the row they are intending to update a single column in. This
re-introduces the contention between writes to different columns in the
same row that column families helped avoid (see #32704). By marking each
column as NOT NULL, we can continue to avoid this contention.
workload/ycsb: use SELECT FOR UPDATE in read-modify-write transactions
50% of workload F is read-modify-write transactions. These transactions
scan a single column from a row using a SELECT statement and then modify
that single column using an UPDATE statement. Like most workloads, the
default request distribution is zipfian, meaning that there are some rows
that are heavily contended.
This is the perfect use case for explicit SELECT FOR UPDATE. Using it for
the "read" portion of the read-modify-write transaction dramatically improves
throughput and eliminates transaction retries. This commit makes that change.
Confusingly, these are not implemented as real transactions in the official
YCSB load generator, but that seems strictly incorrect and was likely
dictated by the lack of support for transactions in NoSQL engines of the
time:
https://github.com/brianfrankcooper/YCSB/blob/cd1589ce6f5abf96e17aa8ab80c78a4348fdf29a/core/src/main/java/site/ycsb/workloads/CoreWorkload.java#L729
Also, strangely Workload F wasn't in the initial YCSB paper:
https://www2.cs.duke.edu/courses/fall13/compsci590.4/838-CloudPapers/ycsb.pdf
But it was in the initial commit of the YCSB load generator:
brianfrankcooper/YCSB@24efaff#diff-d240bdb33fb042b95010733db8ccec63
This might break compatibility with old versions of CRDB, as I think we
only started accepting SELECT FOR UPDATE in 19.2. The commit includes a
flag to avoid the functionality if necessary.