sql: vectorized data encoding package #96321
Conversation
a939285 to a75e7b0 (Compare)
This PR is just for the last commit.
4e8cf32 to 787f997 (Compare)
eb1dcda to 2962e00 (Compare)
Hey @jordanlewis, not looking for a thorough review here, but I thought it would be appropriate for you to give this a once-over as someone with deep knowledge of the encoding logic. It definitely turned out to be a little more complicated than I thought going in! Thanks!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm assuming since this isn't hooked up to the copy yet, the perf numbers aren't included, right?
No, that's the next patch.
pkg/col/coldata/nulls.go
line 188 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: should we just call n.NullAt(i) here? I'd assume that call would be inlined.
👍
pkg/sql/colenc/encode.go
line 90 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this truncation should be called out in a comment to the method.
I decided it would be better to just make this an error.
pkg/sql/colenc/encode.go
line 118 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This only checks that a non-nullable column is present in colMap, but where do we check that each value in that column is non-NULL? Is it done later? Perhaps add a quick comment if so.
Done.
pkg/sql/colenc/encode.go
line 148 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This comment suggests that we should be reusing the slices above, but we do make a fresh allocation on each initBuffers call. Is this expected? My guess is that we only need to allocate some of these if b.rows became larger on this PrepareBatch call.
👍
pkg/sql/colenc/encode.go
line 180 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: getMaxKeyCols and getMaxSuffixColumns seem to return the same values throughout the encoding process (unless there is a schema change). Should we compute it once in MakeEncoder?
I think now that I reuse that space across PrepareBatch calls it's fine the way it is.
pkg/sql/colenc/encode.go
line 219 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
MaybeHasNulls, in theory, might return true if there are actually no NULLs in the vector. Not checking explicitly each relevant element can result in false positives. I'd guess it wouldn't be a noticeable slowdown to iterate over the relevant elements via a NullAt check.
Done.
pkg/sql/colenc/encode.go
line 435 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should we do b.checkMemory() here?
We do it right away when we return from this function.
pkg/sql/colenc/encode.go
line 648 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: rather than allocating new DFloat and DDecimal objects on each isComposite call we could reuse the same ones. Perhaps this should be a method on some helper.
There are no heap allocations here:
~/vec-encoding [copy_demo|✚1…12] ❯❯❯ go test ./pkg/sql/colenc -gcflags="-m" 2>&1 | grep pkg/sql/colenc/encode.go:7
pkg/sql/colenc/encode.go:705:23: inlining call to tree.(*DFloat).IsComposite
pkg/sql/colenc/encode.go:705:23: inlining call to math.Float64bits
and
697ec01 to 89334a4 (Compare)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/row/putter.go
line 47 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should we do a conversion like key.(roachpb.Key) and use the %s format directive to match what we have in row/inserter.go?
So what would we do if it wasn't a key? I'm inclined to leave it unless there's a reason to change it.
pkg/sql/row/putter.go
line 81 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: not sure if it actually matters at all, but should we just inline one roachpb.Value directly into TracePutter and then reuse that value across all calls?
It's stack-allocated, so I think that wouldn't help and might hurt.
pkg/sql/row/putter.go
line 186 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should this be InitPut?
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/encode_test.go
line 150 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: consider adding a few tests for INT2 and INT4 types.
The Rand schema tests covered those cases well.
pkg/sql/colenc/encode_test.go
line 181 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
The EncodedKeyFamily was introduced in #77501 to explicitly treat the index key of the inverted indexes as special (so that we don't attempt to decode it as bytes). It's just a light wrapper around []byte, so I think you could just generate a random DBytes and then wrap its contents in DEncodedKey.
Okay, so do I have it right that we never write these things to a table? Do they exist purely when we read an inverted index value via IndexFetchSpec for execution-time filter/join/aggregation purposes? And should I encode/decode them just like the row engine does for completeness, even though it's effectively dead code? I'm fine with that, just wanted to confirm.
89334a4 to e93b5cd (Compare)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/encode_test.go
line 181 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Okay, so do I have it right that we never write these things to a table? Do they exist purely when we read an inverted index value via IndexFetchSpec for execution-time filter/join/aggregation purposes? And should I encode/decode them just like the row engine does for completeness, even though it's effectively dead code? I'm fine with that, just wanted to confirm.
If I use a BYTES SQL type and pass in a DEncodedKey, my test harness blows up because it's expecting a DBytes. Should I just not test it, and should I remove support for it in encodeKeys?
339b16a to 2af96a8 (Compare)
Wow! Thanks so much for the thorough review @yuzefovich; feeling much better about this code now, and test coverage is up to 88%. Hopefully I responded adequately to all your feedback; let me know if I missed anything!
2af96a8 to 4569657 (Compare)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/key.go
line 87 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think it might be worth extracting the type switch outside of the for loop. This will result in the duplication of the loop three times, but we'd avoid repetitive interface conversions and type switches on each row.
I'm actually leaning towards reverting this whole "push the for loop down" approach; it's so ugly, and we aren't CPU bound (here).
pkg/sql/colenc/key.go
line 202 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
It'd be good to mention how nulls are mutated. nit: missing period.
Done.
pkg/sql/colenc/value.go
line 32 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
We do this interface conversion for each value in the column that we need to encode. Have you considered using coldata.TypedVecs to do the conversion only once per column?
Not really, the experience pushing the loop below the interface calls for keys left a bad taste in my mouth. ;-)
pkg/sql/colenc/value.go
line 54 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I don't think this matters much, but types.EncodedKeyFamily should also go in this case.
Like I decided for keys, that would just be dead code I think.
pkg/sql/colenc/encode_test.go
line 180 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think the support of JSON in decodeTableKeyToCol is dead code - currently it should never execute.
Done.
pkg/sql/colenc/encode_test.go
line 601 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Is it ok that we always sort here and in buildVecKVs? Consider leaving a comment about that.
Yeah we have to sort them both or the comparison algorithm doesn't work. I'll comment.
4569657 to 9743e6a (Compare)
I did another quick pass, and I think it looks good to me. I'll do another more careful pass once other commits merge.
Reviewed 5 of 10 files at r9, 14 of 21 files at r10, 2 of 4 files at r13, 4 of 4 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @jordanlewis)
pkg/sql/colenc/encode.go
line 648 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
There are no heap allocations here:
~/vec-encoding [copy_demo|✚1…12] ❯❯❯ go test ./pkg/sql/colenc -gcflags="-m" 2>&1 | grep pkg/sql/colenc/encode.go:7 pkg/sql/colenc/encode.go:705:23: inlining call to tree.(*DFloat).IsComposite pkg/sql/colenc/encode.go:705:23: inlining call to math.Float64bits
and
Ack, thanks for checking.
pkg/sql/colenc/encode.go
line 54 at r14 (raw file):
// Map of index id to a slice of bools that contain partial index predicates. partialIndexes map[descpb.IndexID][]bool // Column ID that might contains composite encoded values.
nit: s/might contains/might contain/.
pkg/sql/colenc/encode.go
line 94 at r14 (raw file):
// PrepareBatch encodes a subset of rows from the batch to the given row.Putter. // The maximum batch size is 64k (determined by NewMemBatchNoCols).
Hm, I actually forgot about this limit. It was introduced in #35349, and I don't think there is any fundamental reason for it other than being conservative and preventing engineers from shooting themselves in the foot. The only fundamental limitation AFAICT is that the int of the selection vector must be able to address every row in the batch, so assuming a 64-bit architecture, we can go pretty high. So if you think we need to have higher batch sizes for copy, we can do that.
An additional improvement to coldata.NewMemBatchNoCols that we can make is not allocating the selection vector at all. Copy's execution pipeline won't have any filters, so we don't need those selection vectors. Perhaps leave it as a TODO.
pkg/sql/colenc/encode.go
line 156 at r14 (raw file):
} // Amount of space to reserve per
nit: incomplete comment.
pkg/sql/colenc/key.go
line 87 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
I'm actually leaning towards reverting this whole "push the for loop down" approach, its so ugly and we aren't CPU bound (here).
Makes sense. Perhaps we're not CPU bound here in the Copy use case, but if / when we reuse this encoder logic elsewhere, that might not be the case, so I'd be in favor of keeping the current ugly-but-performant code. Up to you though if you want to pull the for loop out.
pkg/sql/colenc/key.go
line 198 at r14 (raw file):
// encodeIndexKey is the vector version of rowenc.EncodeIndexKey. nulls will tell // the caller if theres null values in any of the columns and which rows as
nit: s/theres/there are/.
pkg/sql/row/putter.go
line 47 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
So what would we do if it wasn't a key? I'm inclined to leave it unless there's a reason to change it.
I was thinking that %s with roachpb.Key.String would give us a pretty representation whereas %v might not. When is this TracePutter used? Do we have any tests for it? Perhaps %v on interface{} already gives us the pretty representation.
pkg/sql/colenc/encode_test.go
line 181 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
If I use a BYTES SQL type and pass a DEncodedKey in my test harness blows up because its expecting a DBytes. Should I just not test it, and should I remove support for it in encodeKeys?
Removing the code and any corresponding tests sounds good to me. It should be impossible for outside users to produce a datum of EncodedKeyFamily type, so support for it is not needed for Copy. If in the future we start using the vectorized encoder logic for internally-generated datums, then we might bring it back, but for now removing the support seems reasonable.
9743e6a to 9f04246 (Compare)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/encode.go
line 94 at r14 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, I actually forgot about this limit. It was introduced in #35349, and I don't think there is any fundamental reason for it other than being conservative and preventing engineers from shooting themselves in the foot. The only fundamental limitation AFAICT is that the int of the selection vector must be able to address every row in the batch, so assuming a 64-bit architecture, we can go pretty high. So if you think we need to have higher batch sizes for copy, we can do that. An additional improvement to coldata.NewMemBatchNoCols that we can make is not allocating the selection vector at all. Copy's execution pipeline won't have any filters, so we don't need those selection vectors. Perhaps leave it as a TODO.
I'm not seeing any need to go beyond 64k at this time but good to know! I'll add a TODO where the coldata.Batch is allocated.
pkg/sql/colenc/key.go
line 87 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Makes sense. Perhaps we're not CPU bound here in the Copy use case, but if / when we reuse this encoder logic elsewhere, that might not be the case, so I'd be in favor of keeping the current ugly-but-performant code. Up to you though if you want to pull the
for
loop out.
I'm tempted to leave it as is and entertain the notion of code-generating these switch statements if Go doesn't grow a usable macro system soon.
pkg/sql/row/putter.go
line 47 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I was thinking that
%s
withroachpb.Key.String
would give us pretty representation whereas%v
might not. When is thisTracePutter
used? Do we have any tests for it? Perhaps%v
oninterface{}
already gives us the pretty representation.
Yes, I just put you on the review of the new copy tests that require TracePutter to work right.
pkg/sql/colenc/encode_test.go
line 181 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Removing the code and any corresponding tests sounds good to me. It should be impossible for outside users to produce a datum of EncodedKeyFamily type, so support for it is not needed for Copy. If in the future we start using the vectorized encoder logic for internally-generated datums, then we might bring it back, but for now removing the support seems reasonable.
👍
ede7eab to 7795610 (Compare)
I have some more nits and I think I found a minor bug, but nice work!
Reviewed 20 of 20 files at r15, 17 of 17 files at r16, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @jordanlewis)
pkg/sql/colenc/encode.go
line 62 at r16 (raw file):
// column family. keys []roachpb.Key // Slice of value we can reuse across each call to Prepare and between each
nit: s/value/values/.
pkg/sql/colenc/encode.go
line 73 at r16 (raw file):
// count is just end - start. start, end, count int mutationQuotaCheck func() bool
nit: perhaps it'll be more clear if this function returned an error (with a boolean you need to know whether true indicates "ok" or "quota exceeded").
pkg/sql/colenc/encode.go
line 173 at r16 (raw file):
extraBuf := make([]byte, b.count*b.extraKeysBufSize) // Values are copied so we can re-use.
nit: I'm confused by this comment - what is being reused? We seem to be reallocating both b.values and valBuf. I guess the reuse is in comparison to what is reset in resetBuffers. Perhaps this comment should be moved into resetBuffers or just removed altogether?
pkg/sql/colenc/encode.go
line 188 at r16 (raw file):
} } else { // If we are doing less row shrink slices.
nit: s/less row/less rows/.
pkg/sql/colenc/encode.go
line 209 at r16 (raw file):
} func intMax(a, b int) int {
Doesn't this function currently return minimum?
pkg/sql/colenc/inverted.go
line 62 at r16 (raw file):
var keys [][]byte val := invertedColToDatum(vec, row+b.start) indexGeoConfig := index.GetGeoConfig()
nit: this probably can be extracted outside of the for loop.
pkg/sql/colenc/key.go
line 180 at r16 (raw file):
kys[r] = b } default:
nit: perhaps add an assertion here (behind the buildutil.CrdbTestBuild flag) that the "canonical type family" of typ is typeconv.DatumVecCanonicalTypeFamily.
pkg/sql/row/putter.go
line 77 at r15 (raw file):
func (t *TracePutter) CPutTuplesEmpty(kys []roachpb.Key, values [][]byte) { for i, k := range kys {
nit: this loop is effectively repeated in five places, consider extracting it into a helper.
pkg/sql/row/putter.go
line 83 at r15 (raw file):
var v roachpb.Value v.SetTuple(values[i]) log.VEventfDepth(t.Ctx, 1, 2, "CPut %s -> %v", k, v.PrettyPrint())
nit: PrettyPrint returns a string, so %s seems more natural (here and four other places below) and consistent with CPutValuesEmpty.
pkg/sql/row/putter.go
line 261 at r15 (raw file):
} // KVBatchAdapter implements putter interface and adapts it to kv.Batch API.
nit: s/putter/Putter/.
In order to avoid code duplication, make parts of the row encoding code public; mostly this is the RowHelper. Epic: CRDB-18892 Informs: cockroachdb#91831 Release note: None
colenc is a new package that allows kv.Batch's to be produced to encode tables using coldata.Batch's as the input. Every attempt was made to avoid code duplication and delegate to row encoding code where possible. Epic: CRDB-18892 Informs: cockroachdb#91831 Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @jordanlewis, and @yuzefovich)
pkg/sql/colenc/encode.go
line 73 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps it'll be more clear if this function returned an error (with a boolean you need to know whether true indicates "ok" or "quota exceeded").
Agreed
pkg/sql/colenc/encode.go
line 173 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I'm confused by this comment - what is being reused? We seem to be reallocating both b.values and valBuf. I guess the reuse is in comparison to what is reset in resetBuffers. Perhaps this comment should be moved into resetBuffers or just removed altogether?
Between the PK and each secondary index, values isn't reallocated, and across batches of rows, values is reused if the number of rows doesn't increase. Agreed, this comment isn't helping.
pkg/sql/colenc/encode.go
line 209 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Doesn't this function currently return minimum?
Wow, nice catch, that just eliminated 30% of the memory allocations from BenchmarkTCPHLineItem:
BenchmarkTCPHLineItem-10 1340 889978 ns/op 209379 B/op 9310 allocs/op
vs.
BenchmarkTCPHLineItem-10 1390 856018 ns/op 267776 B/op 6982 allocs/op
pkg/sql/colenc/inverted.go
line 62 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this probably can be extracted outside of the for loop.
👍
pkg/sql/colenc/key.go
line 180 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps add an assertion here (behind the buildutil.CrdbTestBuild flag) that the "canonical type family" of typ is typeconv.DatumVecCanonicalTypeFamily.
👍
pkg/sql/row/putter.go
line 77 at r15 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this loop is effectively repeated in five places, consider extracting it into a helper.
There are 3 "tuples" loops, 2 "bytes" loops, and 1 values loop. I think I'll just leave it.
pkg/sql/row/putter.go
line 83 at r15 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: PrettyPrint returns a string, so %s seems more natural (here and four other places below) and consistent with CPutValuesEmpty.
👍
bors r=yuzefovich
bors retry
Build succeeded: |
colenc is a new package that allows kv.Batch's to be produced to encode
tables using coldata.Batch's as the input. Every attempt was made to
avoid code duplication and delegate to row encoding code where possible.
Epic: CRDB-18892
Informs: #91831
Release note: None