
filebeat/input/filestream: Improve fingerprint performance #36850

Closed (wanted to merge 3 commits)

Conversation


@shmsr shmsr commented Oct 15, 2023

PROPOSAL (Please do not actively review until the proposal is accepted)

I recently stumbled upon a great blog post about the filestream fingerprint mode by @rdner, which caught my attention. After reading it, I noticed a few areas where the performance could be improved. For instance, the blog mentions the use of SHA256, a highly secure hashing algorithm. However, since we don't require such high levels of security for data hashing in this context, it would be better to explore faster hashing algorithms that can still ensure data integrity.

In the beats repository, github.com/cespare/xxhash/v2 (xxhash) is used in multiple places; it is a very fast non-cryptographic hash algorithm. But there's something even better we can use here, i.e., xxh3.

See some benchmarks: https://xxhash.com

There is also a good blog post comparing the different algorithms and advising on which is the better choice.

So, I used the xxh3 package (https://github.com/zeebo/xxh3), which is also used by apache/arrow, grafana, etc. I tested with the defaults on macOS 13.6 (Apple M1 Max) using the already existing benchmark function (I added b.ResetTimer after the creation of the test files, since the setup is not relevant to the benchmark). Here are the results. (It'd be better if the benchmarks were taken on a separate machine with less noise, because my machine always has multiple instances of Chrome and a few other things running.)

Used benchstat for comparing old and new benchmarks (count = 6)

                           │  bench.old  │             bench.new             │
                           │   sec/op    │   sec/op     vs base              │
GetFilesWithFingerprint-10   21.22m ± 2%   20.72m ± 1%  -2.37% (p=0.009 n=6)
                           │  bench.old   │              bench.new              │
                           │     B/op     │     B/op      vs base               │
GetFilesWithFingerprint-10   1.433Mi ± 0%   1.035Mi ± 0%  -27.78% (p=0.002 n=6)
                           │  bench.old  │             bench.new             │
                           │  allocs/op  │  allocs/op   vs base              │
GetFilesWithFingerprint-10   11.09k ± 0%   10.04k ± 0%  -9.48% (p=0.002 n=6)

In the benchmark, only 1000 files are considered, but on a production server the number of files could be considerably higher. So the changes in this PR could be very beneficial when the user has opted for fingerprint mode. xxh3 also works well for smaller inputs, and I assume the fingerprint length would mostly be small.

In case we plan to use an algorithm that works even better for very small inputs, this may be beneficial to read.

Some more reading material, if interested:

Proposed commit message

Improved the overall performance of the fingerprint mode by switching the hashing algorithm from SHA256 to XXH3. Removed a check that is no longer required for XXH3, updated the tests, preallocated the maps (a better size could be decided later), and made some minor code changes.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Run go tests and benchmarks

@shmsr shmsr requested a review from rdner October 15, 2023 17:07
@shmsr shmsr requested a review from a team as a code owner October 15, 2023 17:07
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 15, 2023
@shmsr shmsr removed the request for review from fearful-symmetry October 15, 2023 17:07

mergify bot commented Oct 15, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @shmsr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the version digit)

@mergify mergify bot assigned shmsr Oct 15, 2023
@shmsr shmsr requested a review from a team as a code owner October 15, 2023 17:16
@shmsr shmsr marked this pull request as draft October 15, 2023 17:27

shmsr commented Oct 15, 2023

As this is a proposal, I am marking this PR as a draft. Also, I think because I updated NOTICE.txt, it automatically tagged elastic/beats-tech-leads as a reviewer.


elasticmachine commented Oct 15, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-10-15T17:36:51.605+0000

  • Duration: 118 min 2 sec

Test stats 🧪

Test Results: 0 failed, 28470 passed, 2013 skipped, 30483 total

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

require.NoError(b, err)

b.ResetTimer()
Member Author:

Resetting the timer, as the operations before this (creation of dummy files, etc.) are not relevant to the actual benchmark.

Member:

It matters only for the absolute values, not for the relative values when you compare results. Did you make sure this ResetTimer was in the base benchmark too before comparing them?

Member Author:

> Did you make sure that this ResetTimer was in the base benchmark too before comparing them?

Yes.

@rdner rdner (Member) left a comment:

We need to see the contribution of each separate optimisation in this PR.

Looking at your results:

                           │  bench.old  │             bench.new             │
                           │   sec/op    │   sec/op     vs base              │
GetFilesWithFingerprint-10   21.22m ± 2%   20.72m ± 1%  -2.37% (p=0.009 n=6)
                           │  bench.old   │              bench.new              │
                           │     B/op     │     B/op      vs base               │
GetFilesWithFingerprint-10   1.433Mi ± 0%   1.035Mi ± 0%  -27.78% (p=0.002 n=6)
                           │  bench.old  │             bench.new             │
                           │  allocs/op  │  allocs/op   vs base              │
GetFilesWithFingerprint-10   11.09k ± 0%   10.04k ± 0%  -9.48% (p=0.002 n=6)

Looks like the CPU gain is insignificant (-2.37%) and it's more about the memory/allocations. Is it because you pre-allocated the maps with 1024 or also because of the new hash function? Pre-allocating maps unlike slices should not contribute that much.

@@ -209,7 +209,7 @@ scanner:
 			Op: loginp.OpCreate,
 			Descriptor: loginp.FileDescriptor{
 				Filename: filename,
-				Fingerprint: "2edc986847e209b4016e141a6dc8716d3207350f416969382d431539bf292e4a",
+				Fingerprint: "4a5d6b09a9587a1c",
Member:

Fingerprint values must remain the same. This is a breaking change and will lead to re-ingesting all files for users who use the fingerprint mode.

@shmsr shmsr (Member Author) commented Oct 16, 2023:

That's right, I didn't think of that before.

Perhaps hash algorithms can be introduced as an option in newer versions?
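A rough sketch of what such a config-driven option could look like. This helper is hypothetical (not part of this PR), and FNV-64a again stands in for xxh3 to keep the example stdlib-only; SHA-256 stays the default so existing fingerprints remain stable.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
	"hash/fnv"
)

// newFingerprintHasher is a hypothetical helper sketching a
// config-selected hash. The empty string keeps today's default
// (SHA-256), so existing fingerprints are unchanged.
func newFingerprintHasher(algo string) (hash.Hash, error) {
	switch algo {
	case "", "sha256":
		return sha256.New(), nil
	case "fnv64a": // e.g. "xxh3" in the real proposal
		return fnv.New64a(), nil
	default:
		return nil, fmt.Errorf("unsupported fingerprint hash: %q", algo)
	}
}

func main() {
	h, err := newFingerprintHasher("")
	if err != nil {
		panic(err)
	}
	h.Write([]byte("file header bytes"))
	fmt.Printf("%x\n", h.Sum(nil)) // 64 hex chars with the SHA-256 default
}
```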

@@ -285,7 +285,7 @@ type fileScanner struct {
 	paths  []string
 	cfg    fileScannerConfig
 	log    *logp.Logger
-	hasher hash.Hash
+	hasher *xxh3.Hasher
Member:

Is this the only optimisation that affects performance in this PR? The only way I see this getting merged is if we add an option to the fingerprint mode for which hash algorithm to use.

The reason SHA256 was used for fingerprinting is to ensure there are no collisions; every collision could lead to data loss or data duplication. It looks like xxh3 is not safe to use for this purpose.

Member Author:

> Is this the only optimisation that affects performance in this PR? The only way I see this merged is if we have an option in the fingerprint mode for which hash algorithm to use.

Yes, changing the hashing algorithm contributes most to the performance gain. And yes, I agree with the last part, based on your comment above. Thanks for pointing that out.

@shmsr shmsr (Member Author) commented Oct 16, 2023:

> The reason why SHA256 was used for fingerprinting is that we ensure there are no collisions; every collision will lead to possible data loss or data duplication. Looks like xxh3 is not safe to use for this purpose.

I saw the collision-ratio chart you linked while I was making the change. In the first table, a 256-byte length is used with 100G (about 100 billion) hashes, so I assume ~300 collisions are insignificant for such a big number of samples. Also see: https://github.com/Cyan4973/xxHash/blob/44f67696b5a6f3d45363326bab965fe8c358e115/tests/collisions/main.c#L576 (for 24 billion samples, ~18 collisions are expected).

Also from: https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison#collision-study

> For example, with 100G 64-bit hashes, the expected average amount of collisions is 312.5, but anything between about 280 and 350 collisions is statistically indistinguishable from expectation, so they are all "good", and can't be ranked solely on this information.

Comment on lines +358 to +375
var (
	// mapSz is the capacity hint for the maps below so that each map
	// avoids being copied/resized on the fly, up to a certain limit.
	//
	// The number is based on assuming 1024 unique files per path; neither
	// very small nor very big.
	//
	// TODO: Decide on a better number?
mapSz = len(s.paths) * 1024

fdByName = make(map[string]loginp.FileDescriptor, mapSz)

	// Used to determine if a symlink resolves to an already known target.
uniqueIDs = make(map[string]string, mapSz)

// Used to filter out duplicate matches.
uniqueFiles = make(map[string]struct{}, mapSz)
)
Member:

How much does it contribute to the performance gain? Did you benchmark it separately?

How does it compare on 50 files and 5000 files?

Member Author:

> How much does it contribute to the performance gain? Did you benchmark it separately?

Yes, I did. A little gain, not much, but I didn't note down the results. This change can also be dropped and revisited in the future with a better number than len(s.paths) * 1024.

See: honeycombio/refinery#749

The reason I introduced the change is to save some CPU and allocations by avoiding map resizing up to mapSz.

Member Author:

> How does it compare on 50 files and 5000 files?

Didn't compare. Will try.

@@ -406,7 +417,7 @@ type ingestTarget struct {
 	filename         string
 	originalFilename string
 	symlink          bool
-	info             os.FileInfo
+	info             fs.FileInfo

Member Author:

Yep, I was just being finicky with the types, although it is a type alias (os.FileInfo = fs.FileInfo), so it can be avoided. I was reading through the full code and made a few changes like these (s/os.FileInfo/fs.FileInfo, comments, etc.); those can be dropped.

@@ -496,7 +508,7 @@ func (s *fileScanner) toFileDescriptor(it *ingestTarget) (fd loginp.FileDescript
 		return fd, fmt.Errorf("failed to read %d bytes from %q to compute fingerprint, read only %d", written, fd.Filename, s.cfg.Fingerprint.Length)
 	}
 
-	fd.Fingerprint = hex.EncodeToString(s.hasher.Sum(nil))
+	fd.Fingerprint = fmt.Sprintf("%x", s.hasher.Sum64())
Member:

What's the performance difference between using fmt.Sprintf("%x") and hex.EncodeToString?

Member:

This simple benchmark tells me that fmt.Sprintf("%x") is slower but more memory-efficient:

package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"testing"
)

const size = 64

func BenchmarkMain(b *testing.B) {
	data := make([]byte, size)
	_, _ = rand.Read(data)

	var val1, val2 string

	b.Run("hex", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			val1 = hex.EncodeToString(data)
			if len(val1) == 0 {
				b.FailNow()
			}
		}
	})

	b.Run("sprintf", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			val2 = fmt.Sprintf("%x", data)
			if len(val2) == 0 {
				b.FailNow()
			}
		}
	})

	if val1 != val2 {
		b.FailNow()
	}
}
BenchmarkMain/hex-10            11652643               101.0 ns/op           256 B/op          2 allocs/op
BenchmarkMain/sprintf-10         7365199               162.8 ns/op           152 B/op          2 allocs/op

Member Author:

The reason for this change is to make it even slightly better. If you look, I used Sum64 instead of Sum, and they return different types.

hex.EncodeToString expects a byte slice, but Sum64 returns a uint64, so it was necessary to use fmt.Sprintf("%x").

Both give the same result in the end. The main difference is inside the implementations of Sum and Sum64: in Sum, a new buffer is allocated (as nil is passed), a [8]byte is created, and the value is encoded to big-endian, but on the same machine I don't think that is required. So the better choice was Sum64.


@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Oct 16, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 16, 2023

mergify bot commented Dec 18, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b improve-filestream-fingerprint upstream/improve-filestream-fingerprint
git merge upstream/main
git push upstream improve-filestream-fingerprint

1 similar comment


mergify bot commented Feb 5, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @shmsr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the version digit)

@shmsr shmsr closed this Sep 1, 2024
Labels
Team:Elastic-Agent Label for the Agent team
4 participants