Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workload/ycsb: take advantage of SELECT FOR UPDATE #45755

Merged

Conversation

nvanbenschoten
Copy link
Member

Performance numbers for this change and #45701 aren't quite ready to share,
but I'll post here and there when they are.

workload/ycsb: define all fields as NOT NULL

This change marks all columns in the ycsb "usertable" as NOT NULL. Doing
so allows the load generator to take advantage of #44239, which avoids
the KV lookup on the primary column family entirely when querying one of
the other column families. The query plans before and after demonstrate
this:

--- Before
root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key';
    tree    |    field    |               description
------------+-------------+------------------------------------------
            | distributed | false
            | vectorized  | false
  render    |             |
   └── scan |             |
            | table       | usertable@primary
            | spans       | /"key"/0-/"key"/1 /"key"/6/1-/"key"/6/2
            | parallel    |


--- After
root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key';
    tree    |    field    |      description
------------+-------------+------------------------
            | distributed | false
            | vectorized  | false
  render    |             |
   └── scan |             |
            | table       | usertable@primary
            | spans       | /"key"/6/1-/"key"/6/2

This becomes very important when running YCSB with a column family per
field and with implicit SELECT FOR UPDATE (see #45159). Now that (as of
#45701) UPDATE statements acquire upgrade locks during their initial row
fetch, we don't want them acquiring upgrade locks on the primary column
family of the row they are intending to update a single column in. This
re-introduces the contention between writes to different columns in the
same row that column families helped avoid (see #32704). By marking each
column as NOT NULL, we can continue to avoid this contention.

workload/ycsb: use SELECT FOR UPDATE in read-modify-write transactions

50% of workload F is read-modify-write transactions. These transactions
scan a single column from a row using a SELECT statement and then modify
that single column using an UPDATE statement. Like most workloads, the
default request distribution is zipfian, meaning that there are some rows
that are heavily contended.

This is the perfect use case for explicit SELECT FOR UPDATE. Using it for
the "read" portion of the read-modify-write transaction dramatically improves
throughput and eliminates transaction retries. This commit makes that change.

Confusingly, these are not implemented as real transactions in the official
YCSB load generator, but that seems strictly incorrect and was likely
dictated by the lack of support for transactions in NoSQL engines of the
time:
https://github.com/brianfrankcooper/YCSB/blob/cd1589ce6f5abf96e17aa8ab80c78a4348fdf29a/core/src/main/java/site/ycsb/workloads/CoreWorkload.java#L729

Also, strangely Workload F wasn't in the initial YCSB paper:
https://www2.cs.duke.edu/courses/fall13/compsci590.4/838-CloudPapers/ycsb.pdf
But it was in the initial commit of the YCSB load generator:
brianfrankcooper/YCSB@24efaff#diff-d240bdb33fb042b95010733db8ccec63

This might break compatibility with old versions of CRDB, as I think we
only started accepting SELECT FOR UPDATE in 19.2. The commit includes a
flag to avoid the functionality if necessary.

This change marks all columns in the YCSB "usertable" as NOT NULL. Doing
so allows the load generator to take advantage of cockroachdb#44239, which avoids
the KV lookup on the primary column family entirely when querying one of
the other column families. The query plans before and after demonstrate
this:

```
--- Before
root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key';
    tree    |    field    |               description
------------+-------------+------------------------------------------
            | distributed | false
            | vectorized  | false
  render    |             |
   └── scan |             |
            | table       | usertable@primary
            | spans       | /"key"/0-/"key"/1 /"key"/6/1-/"key"/6/2
            | parallel    |

--- After
root@:26257/ycsb> EXPLAIN SELECT field5 FROM usertable WHERE ycsb_key = 'key';
    tree    |    field    |      description
------------+-------------+------------------------
            | distributed | false
            | vectorized  | false
  render    |             |
   └── scan |             |
            | table       | usertable@primary
            | spans       | /"key"/6/1-/"key"/6/2
```

This becomes very important when running YCSB with a column family per
field and with implicit SELECT FOR UPDATE (see cockroachdb#45159). Now that (as
of cockroachdb#45701) UPDATE statements acquire upgrade locks during their initial row
fetch, we don't want them acquiring upgrade locks on the primary column
family of the row they are intending to update a single column in. This
re-introduces the contention between writes to different columns in the
same row that column families helped avoid (see cockroachdb#32704). By marking each
column as NOT NULL, we can continue to avoid this contention.
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten
Copy link
Member Author

Also, @solongordon thanks again for making this possible with #44239.

@nvanbenschoten nvanbenschoten removed the request for review from solongordon March 5, 2020 20:58
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/workload/ycsb/ycsb.go, line 124 at r2 (raw file):

		g.flags.BoolVar(&g.json, `json`, false, `Use JSONB rather than relational data`)
		g.flags.BoolVar(&g.families, `families`, true, `Place each column in its own column family`)
		g.flags.BoolVar(&g.sfu, `select-for-update`, true, `Use SELECT FOR UPDATE syntax in read-modify-write transactions`)

If you're going to set this to true do you need to update the roachtests? Would it be better to set it to false and then have the roachtests inspect the version?

Roachtests for all releases use the master workload.

50% of workload F is read-modify-write transactions. These transactions
scan a single column from a row using a SELECT statement and then modify
that single column using an UPDATE statement. Like most workloads, the
default request distribution is zipfian, meaning that there are some rows
that are heavily contended.

This is the perfect use case for explicit SELECT FOR UPDATE. Using it for
the "read" portion of the read-modify-write transaction dramatically improves
throughput and eliminates transaction retries. This commit makes that change.

Confusingly, these are not implemented as real transactions in the official
YCSB load generator, but that seems strictly incorrect and was likely
dictated by the lack of support for transactions in NoSQL engines of the
time:
  https://github.com/brianfrankcooper/YCSB/blob/cd1589ce6f5abf96e17aa8ab80c78a4348fdf29a/core/src/main/java/site/ycsb/workloads/CoreWorkload.java#L729

Also, strangely Workload F wasn't in the initial YCSB paper:
  https://www2.cs.duke.edu/courses/fall13/compsci590.4/838-CloudPapers/ycsb.pdf
But it was in the initial commit of the YCSB load generator:
  brianfrankcooper/YCSB@24efaff#diff-d240bdb33fb042b95010733db8ccec63

This might break compatibility with old versions of CRDB, as I think we
only started accepting SELECT FOR UPDATE in 19.2. The commit includes a
flag to avoid the functionality, if necessary.
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/workload/ycsb/ycsb.go, line 124 at r2 (raw file):

Previously, ajwerner wrote…

If you're going to set this to true do you need to update the roachtests? Would it be better to set it to false and then have the roachtests inspect the version?

Roachtests for all releases use the master workload.

Yeah, I was just going to see which roachtest failed, but that was lazy. Done.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that (as of
#45701) UPDATE statements acquire upgrade locks during their initial row
fetch, we don't want them acquiring upgrade locks on the primary column
family of the row they are intending to update a single column in. This
re-introduces the contention between writes to different columns

Is the implicit "SELECT FOR UPDATE" decision making know that there are multiple column families and that an update on a nullable column should not by default acquire locks? Or does it require query-level guidance from the user?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

@nvanbenschoten
Copy link
Member Author

Is the implicit "SELECT FOR UPDATE" decision making know that there are multiple column families and that an update on a nullable column should not by default acquire locks?

The implicit SELECT FOR UPDATE determination does not take into consideration column families or whether there are column families being queried that are not being updated. I think there's an argument that could be made that it should, though. I initially looked into selectively locking some of the column families being queried and that turned out to be a huge lift. See the comment that contains "So the key here" in #45701 (review) for a bit more detail about this.

Disabling implicit SFU when we're not updating all families is a more tractable approach to avoiding increased contention in cases like these. That said, this really only comes up in these rare cases where column families are used to isolate contention between writes to disjoint sets of families, in which case you do want implicit SFU! So I'm leaning towards thinking that we should just document the use of NOT NULL columns with column families - using those two together is an optimization outside of this case anyway, as it avoids the need to query multiple keys.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@nvanbenschoten nvanbenschoten merged commit 01f131d into cockroachdb:master Mar 9, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/ycsbSFU branch March 16, 2020 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants