
filebeat/input/filestream: Improve fingerprint performance #36850

Closed (wanted to merge 3 commits)

Conversation


@shmsr shmsr commented Oct 15, 2023

PROPOSAL (Please do not actively review until the proposal is accepted)

I recently stumbled upon a great blog post about the filestream fingerprint mode by @rdner, which caught my attention. After reading it, I noticed a few areas where the performance could be improved. For instance, the blog mentions the use of SHA256, a highly secure hashing algorithm. However, since we don't require such high levels of security for data hashing in this context, it would be better to explore faster hashing algorithms that can still ensure data integrity.

In the beats repository, github.com/cespare/xxhash/v2 (xxhash) is used in multiple places; it is a very fast non-cryptographic hash algorithm. But there's something even better we can use here, i.e., xxh3.

See some benchmarks: https://xxhash.com

There is also a good blog post comparing the different algorithms and advising on which is the better choice.

So, I used the xxh3 package (https://github.com/zeebo/xxh3), which is also used by apache/arrow, grafana, etc. I tested with the defaults on macOS 13.6 (Apple M1 Max) using the already existing benchmark function (I added b.ResetTimer after the creation of the test files, since the setup is not relevant to the benchmark). Here are the results. (It'd be better if the benchmarks were taken on a separate machine with less noise, because my machine always has multiple instances of Chrome and a few other things running.)

Used benchstat for comparing old and new benchmarks (count = 6)

                           │  bench.old  │             bench.new             │
                           │   sec/op    │   sec/op     vs base              │
GetFilesWithFingerprint-10   21.22m ± 2%   20.72m ± 1%  -2.37% (p=0.009 n=6)
                           │  bench.old   │              bench.new              │
                           │     B/op     │     B/op      vs base               │
GetFilesWithFingerprint-10   1.433Mi ± 0%   1.035Mi ± 0%  -27.78% (p=0.002 n=6)
                           │  bench.old  │             bench.new             │
                           │  allocs/op  │  allocs/op   vs base              │
GetFilesWithFingerprint-10   11.09k ± 0%   10.04k ± 0%  -9.48% (p=0.002 n=6)

In the benchmark, only 1000 files are considered, but on a production server the number of files could be considerably higher. So the changes in this PR could be very beneficial when the user has opted for fingerprint mode. xxh3 also works well for smaller inputs, and I assume the fingerprint length would mostly be small.

In case we plan to use an algorithm that works even better for very small inputs, this may be beneficial to read.

Some more reading material, if interested:

Proposed commit message

Improved the overall performance of the fingerprint mode by switching the hashing algorithm from SHA256 to XXH3. Removed a check that is no longer required for XXH3, updated the tests, preallocated the maps (a better size could be decided later), and made some minor code changes.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Run go tests and benchmarks

@shmsr shmsr requested a review from rdner October 15, 2023 17:07
@shmsr shmsr requested a review from a team as a code owner October 15, 2023 17:07
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 15, 2023
@shmsr shmsr removed the request for review from fearful-symmetry October 15, 2023 17:07

mergify bot commented Oct 15, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @shmsr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the version digit)

@mergify mergify bot assigned shmsr Oct 15, 2023
@shmsr shmsr requested a review from a team as a code owner October 15, 2023 17:16
@shmsr shmsr marked this pull request as draft October 15, 2023 17:27

shmsr commented Oct 15, 2023

As this is a proposal, I am marking this PR as a draft. Also, I think because I updated NOTICE.txt, it automatically tagged elastic/beats-tech-leads as a reviewer.


elasticmachine commented Oct 15, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-10-15T17:36:51.605+0000

  • Duration: 118 min 2 sec

Test stats 🧪

Test Results: 0 failed, 28470 passed, 2013 skipped, 30483 total

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

require.NoError(b, err)

b.ResetTimer()
Member Author:

Resetting the timer, as the operations before this (creation of dummy files, etc.) are not relevant to the actual benchmark.

Member:

It matters only for the absolute values, not for the relative values when you compare results. Did you make sure this ResetTimer was in the base benchmark too before comparing them?

Member Author:

> Did you make sure that this ResetTimer was in the base benchmark too before comparing them?

Yes.

@rdner rdner (Member) left a comment:

We need to see the contribution of each separate optimisation in this PR.

Looking at your results:

                           │  bench.old  │             bench.new             │
                           │   sec/op    │   sec/op     vs base              │
GetFilesWithFingerprint-10   21.22m ± 2%   20.72m ± 1%  -2.37% (p=0.009 n=6)
                           │  bench.old   │              bench.new              │
                           │     B/op     │     B/op      vs base               │
GetFilesWithFingerprint-10   1.433Mi ± 0%   1.035Mi ± 0%  -27.78% (p=0.002 n=6)
                           │  bench.old  │             bench.new             │
                           │  allocs/op  │  allocs/op   vs base              │
GetFilesWithFingerprint-10   11.09k ± 0%   10.04k ± 0%  -9.48% (p=0.002 n=6)

Looks like the CPU gain is insignificant (-2.37%) and it's more about the memory/allocations. Is it because you pre-allocated the maps with 1024 or also because of the new hash function? Pre-allocating maps unlike slices should not contribute that much.

@@ -209,7 +209,7 @@ scanner:
 			Op: loginp.OpCreate,
 			Descriptor: loginp.FileDescriptor{
 				Filename: filename,
-				Fingerprint: "2edc986847e209b4016e141a6dc8716d3207350f416969382d431539bf292e4a",
+				Fingerprint: "4a5d6b09a9587a1c",
Member:

Fingerprint values must remain the same. This is a breaking change and will lead to re-ingesting all files for users who use the fingerprint mode.

@shmsr shmsr (Member Author) commented Oct 16, 2023:

That's right, I didn't think of that before.

Perhaps hash algorithms can be introduced as an option in newer versions?
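A rough sketch of what such a config-driven option could look like. This helper is hypothetical (not part of this PR), and FNV-64a again stands in for xxh3 to keep the example stdlib-only; SHA-256 stays the default so existing fingerprints remain stable.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
	"hash/fnv"
)

// newFingerprintHasher is a hypothetical helper sketching a
// config-selected hash. The empty string keeps today's default
// (SHA-256), so existing fingerprints are unchanged.
func newFingerprintHasher(algo string) (hash.Hash, error) {
	switch algo {
	case "", "sha256":
		return sha256.New(), nil
	case "fnv64a": // e.g. "xxh3" in the real proposal
		return fnv.New64a(), nil
	default:
		return nil, fmt.Errorf("unsupported fingerprint hash: %q", algo)
	}
}

func main() {
	h, err := newFingerprintHasher("")
	if err != nil {
		panic(err)
	}
	h.Write([]byte("file header bytes"))
	fmt.Printf("%x\n", h.Sum(nil)) // 64 hex chars with the SHA-256 default
}
```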

@@ -285,7 +285,7 @@ type fileScanner struct {
 	paths  []string
 	cfg    fileScannerConfig
 	log    *logp.Logger
-	hasher hash.Hash
+	hasher *xxh3.Hasher
Member:

Is this the only optimisation that affects performance in this PR? The only way I see this getting merged is if we add an option to the fingerprint mode for which hash algorithm to use.

The reason SHA256 was used for fingerprinting is to ensure there are no collisions; every collision could lead to data loss or data duplication. It looks like xxh3 is not safe to use for this purpose.

Member Author:

> Is this the only optimisation that affects performance in this PR? The only way I see this merged is if we have an option in the fingerprint mode for which hash algorithm to use.

Yes, changing the hashing algorithm contributes most to the performance gain. And yes, I agree with the last part, based on your comment above. Thanks for pointing that out.

@shmsr shmsr (Member Author) commented Oct 16, 2023:

> The reason why SHA256 was used for fingerprinting is that we ensure there are no collisions; every collision will lead to possible data loss or data duplication. Looks like xxh3 is not safe to use for this purpose.

I saw the collision-ratio chart you linked while I was making the change. In the first table, a 256-byte length is used with 100G (about 100 billion) hashes, so I assume ~300 collisions are insignificant for such a big number of samples. Also see: https://github.com/Cyan4973/xxHash/blob/44f67696b5a6f3d45363326bab965fe8c358e115/tests/collisions/main.c#L576 (for 24 billion samples, ~18 collisions are expected).

Also from: https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison#collision-study

> For example, with 100G 64-bit hashes, the expected average amount of collisions is 312.5, but anything between about 280 and 350 collisions is statistically indistinguishable from expectation, so they are all "good", and can't be ranked solely on this information.

Comment on lines +358 to +375
var (
	// mapSz is the capacity hint for the maps below so that each map
	// avoids being copied/resized on the fly, up to a certain limit.
	//
	// The number is based on assuming 1024 unique files per path; neither
	// very small nor very big.
	//
	// TODO: Decide on a better number?
mapSz = len(s.paths) * 1024

fdByName = make(map[string]loginp.FileDescriptor, mapSz)

	// Used to determine if a symlink resolves to an already known target.
uniqueIDs = make(map[string]string, mapSz)

// Used to filter out duplicate matches.
uniqueFiles = make(map[string]struct{}, mapSz)
)
Member:

How much does it contribute to the performance gain? Did you benchmark it separately?

How does it compare on 50 files and 5000 files?

Member Author:

> How much does it contribute to the performance gain? Did you benchmark it separately?

Yes, I did. A little gain, not much, but I didn't note down the results. This change can also be dropped and revisited in the future with a better number than len(s.paths) * 1024.

See: honeycombio/refinery#749

The reason I introduced the change is to save some CPU and allocations by avoiding map resizing up to mapSz.

Member Author:

> How does it compare on 50 files and 5000 files?

Didn't compare. Will try.

@@ -406,7 +417,7 @@ type ingestTarget struct {
 	filename         string
 	originalFilename string
 	symlink          bool
-	info             os.FileInfo
+	info             fs.FileInfo

Member Author:

Yep, I was just being finicky with the types, although it is a type alias (os.FileInfo = fs.FileInfo), so it can be avoided. I was reading through the full code and made a few changes like these (s/os.FileInfo/fs.FileInfo, comments, etc.); those can be dropped.

@@ -496,7 +508,7 @@ func (s *fileScanner) toFileDescriptor(it *ingestTarget) (fd loginp.FileDescript
 		return fd, fmt.Errorf("failed to read %d bytes from %q to compute fingerprint, read only %d", written, fd.Filename, s.cfg.Fingerprint.Length)
 	}
 
-	fd.Fingerprint = hex.EncodeToString(s.hasher.Sum(nil))
+	fd.Fingerprint = fmt.Sprintf("%x", s.hasher.Sum64())
Member:

What's the performance difference between using fmt.Sprintf("%x") and hex.EncodeToString?

Member:

This simple benchmark tells me that fmt.Sprintf("%x") is slower but more memory-efficient:

package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"testing"
)

const size = 64

func BenchmarkMain(b *testing.B) {
	data := make([]byte, size)
	_, _ = rand.Read(data)

	var val1, val2 string

	b.Run("hex", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			val1 = hex.EncodeToString(data)
			if len(val1) == 0 {
				b.FailNow()
			}
		}
	})

	b.Run("sprintf", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			val2 = fmt.Sprintf("%x", data)
			if len(val2) == 0 {
				b.FailNow()
			}
		}
	})

	if val1 != val2 {
		b.FailNow()
	}
}
BenchmarkMain/hex-10            11652643               101.0 ns/op           256 B/op          2 allocs/op
BenchmarkMain/sprintf-10         7365199               162.8 ns/op           152 B/op          2 allocs/op

Member Author:

The reason for this change is to make it even slightly better. If you look, I used Sum64 instead of Sum, and they return different types.

hex.EncodeToString expects a byte slice, but Sum64 returns a uint64, so it was necessary to use fmt.Sprintf("%x").

Both give the same result in the end. The main difference is inside the implementations of Sum and Sum64: in Sum, a new buffer is allocated (as nil is passed), a [8]byte is created, and the value is encoded to big-endian, but on the same machine I don't think that is required. So the better choice was Sum64.


@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Oct 16, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 16, 2023

mergify bot commented Dec 18, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b improve-filestream-fingerprint upstream/improve-filestream-fingerprint
git merge upstream/main
git push upstream improve-filestream-fingerprint

1 similar comment


mergify bot commented Feb 5, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @shmsr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the version digit)

@shmsr shmsr closed this Sep 1, 2024
Labels
Team:Elastic-Agent Label for the Agent team
4 participants