filebeat/input/filestream: Improve fingerprint performance #36850
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
As this is a proposal, I'm marking this PR as a draft. Also, I think because I updated the NOTICE.txt, it automatically tagged
	require.NoError(b, err)

	b.ResetTimer()
Resetting the timer because the operations before this point are not relevant to the actual benchmark, i.e., creation of dummy files, etc.
It matters only for absolute values, not for relative values when you compare results. Did you make sure that this ResetTimer was in the base benchmark too before comparing them?
Did you make sure that this ResetTimer was in the base benchmark too before comparing them?
Yes.
We need to see the contribution of each separate optimisation in this PR.
Looking at your results:
                            │  bench.old  │   bench.new             │
                            │   sec/op    │   sec/op      vs base   │
GetFilesWithFingerprint-10    21.22m ± 2%   20.72m ± 1%   -2.37% (p=0.009 n=6)

                            │  bench.old  │   bench.new             │
                            │    B/op     │    B/op       vs base   │
GetFilesWithFingerprint-10    1.433Mi ± 0%  1.035Mi ± 0%  -27.78% (p=0.002 n=6)

                            │  bench.old  │   bench.new             │
                            │  allocs/op  │  allocs/op    vs base   │
GetFilesWithFingerprint-10    11.09k ± 0%   10.04k ± 0%   -9.48% (p=0.002 n=6)
Looks like the CPU gain is insignificant (-2.37%) and it's more about the memory/allocations. Is it because you pre-allocated the maps with 1024 entries per path, or also because of the new hash function? Pre-allocating maps, unlike slices, should not contribute that much.
@@ -209,7 +209,7 @@ scanner:
 			Op:         loginp.OpCreate,
 			Descriptor: loginp.FileDescriptor{
 				Filename:    filename,
-				Fingerprint: "2edc986847e209b4016e141a6dc8716d3207350f416969382d431539bf292e4a",
+				Fingerprint: "4a5d6b09a9587a1c",
Fingerprint values must remain the same; this is a breaking change and will lead to re-ingesting all files for users who use the fingerprint mode.
That's right, I didn't think of that before.
Perhaps hash algorithms can be introduced as an option in newer versions?
@@ -285,7 +285,7 @@ type fileScanner struct {
 	paths  []string
 	cfg    fileScannerConfig
 	log    *logp.Logger
-	hasher hash.Hash
+	hasher *xxh3.Hasher
Is this the only optimisation that affects performance in this PR? The only way I see this merged is if we add an option to the fingerprint mode for which hash algorithm to use.
The reason SHA256 was used for fingerprinting is to ensure there are no collisions; every collision can lead to possible data loss or data duplication. It looks like xxh3 is not safe to use for this purpose.
Is this the only optimisation that affects performance in this PR? The only way I see this merged if we have an option in the fingerprint mode which hash algorithm to use.
Yes. Changing the hashing algorithm contributes most to the performance gain. And yes, I agree with the last part based on your comment above. Thanks for pointing that out.
The reason why SHA256 was used for fingerprinting is that we ensure there are no collision, every collision will lead to a possible data loss or data duplication. Looks like xxh3 is not safe to use for this purpose.
I saw the collision ratio chart you linked when I was making the change. In the first table, a 256-byte input length is used with 100G (100 billion) hashes, so I assume ~300 collisions is insignificant for such a big number of samples. Also see: https://github.com/Cyan4973/xxHash/blob/44f67696b5a6f3d45363326bab965fe8c358e115/tests/collisions/main.c#L576 (for 24 billion samples, ~18 collisions are expected).
Also from https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison#collision-study:
For example, with 100G 64-bit hashes, the expected average amount of collisions is 312.5, but anything between about 280 and 350 collisions is statistically indistinguishable from expectation, so they are all "good", and can't be ranked solely on this information.
var (
	// mapSz is the capacity for the maps below so that each map should
	// avoid being copied/resized on the fly until a certain limit.
	//
	// The number is based on having 1024 unique files per path. Not a
	// very small or a big number.
	//
	// TODO: Decide a better number?
	mapSz = len(s.paths) * 1024

	fdByName = make(map[string]loginp.FileDescriptor, mapSz)

	// Used to determine if a symlink resolves to an already known target.
	uniqueIDs = make(map[string]string, mapSz)

	// Used to filter out duplicate matches.
	uniqueFiles = make(map[string]struct{}, mapSz)
)
How much does it contribute to the performance gain? Did you benchmark it separately?
How does it compare on 50 files and on 5000 files?
How much does it contribute into the performance gain? Did you benchmark it separately?
Yes, I did. A little gain; not much, but I didn't note down the results. This change can also be dropped and revisited in the future by introducing a better number than len(s.paths) * 1024.
The reason I introduced the change is to save some CPU and allocations by avoiding resizing the maps as they grow toward mapSz.
How does it compare on 50 files and 5000 files?
Didn't compare. Will try.
@@ -406,7 +417,7 @@ type ingestTarget struct {
 	filename         string
 	originalFilename string
 	symlink          bool
-	info             os.FileInfo
+	info             fs.FileInfo
Why change this? It's a type alias anyway: https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/os/types.go;l=21
Yep, I was just being finicky with the types even though it is a type alias (os.FileInfo = fs.FileInfo). It can be avoided. I was reading through the full code and made some changes like these (s/os.FileInfo/fs.FileInfo, comments, etc.); those can be reverted.
@@ -496,7 +508,7 @@ func (s *fileScanner) toFileDescriptor(it *ingestTarget) (fd loginp.FileDescript
 		return fd, fmt.Errorf("failed to read %d bytes from %q to compute fingerprint, read only %d", written, fd.Filename, s.cfg.Fingerprint.Length)
 	}

-	fd.Fingerprint = hex.EncodeToString(s.hasher.Sum(nil))
+	fd.Fingerprint = fmt.Sprintf("%x", s.hasher.Sum64())
What's the performance difference between using fmt.Sprintf("%x") and hex.EncodeToString?
This simple benchmark tells me that fmt.Sprintf("%x") is slower but perhaps more efficient on memory:
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"testing"
)

const size = 64

func BenchmarkMain(b *testing.B) {
	data := make([]byte, size)
	_, _ = rand.Read(data)

	var val1, val2 string

	b.Run("hex", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			val1 = hex.EncodeToString(data)
			if len(val1) == 0 {
				b.FailNow()
			}
		}
	})

	b.Run("sprintf", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			val2 = fmt.Sprintf("%x", data)
			if len(val2) == 0 {
				b.FailNow()
			}
		}
	})

	if val1 != val2 {
		b.FailNow()
	}
}
BenchmarkMain/hex-10 11652643 101.0 ns/op 256 B/op 2 allocs/op
BenchmarkMain/sprintf-10 7365199 162.8 ns/op 152 B/op 2 allocs/op
So the reason for this change is to make it even slightly better. If you look, I used Sum64 instead of Sum, and they return different types: hex.EncodeToString expects a byte slice, but Sum64 returns a uint64, so it was necessary to use fmt.Sprintf("%x").
Both give the same result in the end. The main difference is inside the implementations of Sum and Sum64: in Sum, a new buffer is allocated because nil is passed, and an internal [8]byte is also created along with encoding the value to big-endian, but since everything happens on the same machine I don't think that is required. So the better choice was to use Sum64.
This pull request is now in conflicts. Could you fix it? 🙏
PROPOSAL (Please do not actively review until the proposal is accepted)
I recently stumbled upon a great blog post about the filestream fingerprint mode by @rdner, which caught my attention. After reading it, I noticed a few areas where the performance could be improved. For instance, the blog mentions the use of SHA256, a highly secure hashing algorithm. However, since we don't require such high levels of security for data hashing in this context, it would be better to explore faster hashing algorithms that can still ensure data integrity.
In the beats repository, github.com/cespare/xxhash/v2 (xxhash) is used in multiple places; it is a very fast, cryptographically insecure hash algorithm. But there's something even better we can use here, i.e., xxh3.
See some benchmarks: https://xxhash.com
This is a good blog comparing the different algorithms and advising what could be a better choice.
So, I used the xxh3 package (https://github.com/zeebo/xxh3), which is also used by apache/arrow, grafana, etc. I tested with the defaults on macOS 13.6 (Apple M1 Max) using the already existing benchmark function (I put b.ResetTimer after the creation of files because it is not relevant to the benchmark), and here are the results. It would be better if the benchmarks were taken on a separate machine with less noise, because my machine always has multiple instances of Chrome running and a few other things. I used benchstat to compare the old and new benchmarks (count = 6).
In the benchmark, only 1000 files are considered, but on a production server the number of files could be considerably higher. So the changes in this PR could be very beneficial when the user has opted for fingerprint mode. xxh3 also works well for smaller inputs, and I assume the fingerprint length would mostly be small.
In case we plan to use an algorithm that works even better for very small inputs, this may be beneficial to read.
Some more reading material, if interested:
Proposed commit message
Improved the overall performance of the fingerprint mode by switching the hashing algorithm from SHA256 to XXH3. Also removed an unnecessary check (not required for XXH3 anymore) and updated the tests. Also preallocated the maps (a better number could be decided) and made some minor changes to the code.
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

How to test this PR locally