Improve startup time #1408

Closed
sharkdp opened this issue Oct 23, 2023 · 1 comment · Fixed by #1422
Comments

sharkdp (Owner) commented Oct 23, 2023

fd's startup time is quite slow. On my 12-core system, it takes ~20 ms to "search" an empty folder. That is fast enough to go unnoticed by humans, but it looks bad in benchmarks that compare fd with other tools on small folders [1]. It is also a real problem for use cases where fd is called repeatedly from a script.
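
For reference, a measurement like this can be reproduced with something along these lines (empty-dir is just a placeholder for an empty directory):

```sh
mkdir -p empty-dir
hyperfine --warmup 5 --shell=none 'fd . empty-dir'
```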

Some of that overhead is caused by the spawning of threads, a problem already tracked in #1203. But I think there is more that can be done. Instead of using my usual go-to performance tool (perf), let's look at the magic-trace output of an fd call in an empty folder [2]. If anyone is interested, I've attached the trace to this post; go to https://magic-trace.org/ to load it in their viewer.
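
If you want to reproduce the trace, magic-trace can record a whole run roughly like this (this assumes an Intel CPU with Intel PT support, and the flag spelling is from memory; check magic-trace help run for the exact invocation):

```sh
# Record the entire execution (not just a trigger window) of fd in an empty directory:
magic-trace run -full-execution ./target/release/fd -- . empty-dir
```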

The full trace looks like this:

[magic-trace screenshot: full trace]

The first 2.2 ms are typical process startup things (before main). I don't think there is any room for optimization here (?)

[magic-trace screenshot: process startup, ~2.2 ms before main]

The next ~2 ms are more interesting:

[magic-trace screenshot: initialization steps before the scan]

Some notable steps (even where the time is insignificant):

  • parsing the command-line arguments (724 µs)
  • an "is_existing_directory" check on the search path (28 µs)
  • parsing the (empty) search pattern regex (32 µs)
  • the isatty check (5 µs)
  • LsColors::from_env (579 µs)
  • num_cpus::linux::get_num_cpus (441 µs)
  • RegexBuilder::build (171 µs)

Some of this surprised me; I didn't expect the get_num_cpus call to take that long. There might be some room for improvement by doing things in parallel (e.g. LsColors::from_env), but only if the thread overhead is not too high.
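
A minimal sketch of that idea (hypothetical, not fd's code; it assumes the two initialization steps are independent): overlap LsColors::from_env with regex compilation on a helper thread, and only join when both results are needed.

```rust
use lscolors::LsColors;
use regex::RegexBuilder;
use std::thread;

fn init(pattern: &str) -> (Option<LsColors>, regex::Regex) {
    // Parse LS_COLORS on a helper thread while the main thread compiles the regex.
    let ls_colors = thread::spawn(|| LsColors::from_env());

    let re = RegexBuilder::new(pattern)
        .build()
        .expect("invalid search pattern");

    // This only pays off if spawning and joining the helper thread is cheaper
    // than the ~579 µs that LsColors::from_env takes by itself.
    (ls_colors.join().expect("LsColors thread panicked"), re)
}
```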

Then we start the actual scan, which takes the majority of the time:

[magic-trace screenshot: scan phase]

Here, I'm not so sure how to interpret the trace, as things are actually happening on multiple threads. But we can (presumably) see some of the thread spawning/joining time here (~ 5 ms):

[magic-trace screenshot: thread spawning/joining]

and some gitignore matcher logic going on here (370 µs total):

[magic-trace screenshot: gitignore matcher calls]
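
For context, fd's traversal is built on the ignore crate, and the matcher calls in the trace come from its gitignore handling. A minimal parallel walk with that crate looks roughly like this (illustrative configuration, not fd's exact one):

```rust
use ignore::{WalkBuilder, WalkState};

fn main() {
    WalkBuilder::new(".")
        .git_ignore(true) // the gitignore matcher logic seen in the trace
        .threads(12)
        .build_parallel()
        .run(|| {
            Box::new(|entry| {
                if let Ok(entry) = entry {
                    println!("{}", entry.path().display());
                }
                WalkState::Continue
            })
        });
}
```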

Most of the time is actually unaccounted for in the trace, because I can only see:

[magic-trace screenshot: unattributed time]

We can see a bit more when switching off LTO:

[magic-trace screenshot: scan phase with LTO disabled]

Apparently, 11 ms are spent in crossbeam_channel::channel::bounded's from_iter method (probably the receive call?), even though we don't have any work to do. On a -j1 run, this part only takes 1 ms.
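
To check that channel construction itself is the cost, a micro-benchmark along these lines can be used (hypothetical; the capacity and message size below are illustrative, not fd's actual values):

```rust
use crossbeam_channel::{bounded, unbounded};
use std::time::Instant;

fn main() {
    // bounded() pre-allocates one slot per message up front...
    let start = Instant::now();
    let (_tx, _rx) = bounded::<[u8; 512]>(0x4000);
    println!("bounded:   {:?}", start.elapsed());

    // ...while unbounded() grows its buffer lazily as messages arrive.
    let start = Instant::now();
    let (_tx, _rx) = unbounded::<[u8; 512]>();
    println!("unbounded: {:?}", start.elapsed());
}
```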

Footnotes

  [1] Those "small" folders can actually be pretty large. It takes hundreds of thousands of files before we make up for the startup "penalty".

  [2] I only recently discovered magic-trace and have used it successfully to benchmark (and then optimize) the startup time of other programs.

tavianator added a commit to tavianator/fd that referenced this issue Oct 30, 2023
We originally switched to bounded channels for backpressure to fix sharkdp#918.
However, bounded channels have a significant initialization overhead as
they pre-allocate a fixed-size buffer for the messages.

This implementation uses a different backpressure strategy: each thread
gets a limited-size pool of WorkerResults.  When the size limit is hit,
the sender thread has to wait for the receiver thread to handle a result
from that pool and recycle it.

Inspired by [snmalloc], results are recycled by sending the boxed result
over a channel back to the thread that allocated it.  By allocating and
freeing each WorkerResult from the same thread, allocator contention is
reduced dramatically.  And since we now pass results by pointer instead
of by value, message passing overhead is reduced as well.

Fixes sharkdp#1408.

[snmalloc]: https://github.com/microsoft/snmalloc
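
As a minimal sketch of that strategy (hypothetical types and pool size; the real implementation landed in #1422): each worker owns a small budget of boxed results, and the receiver returns every box to the worker that allocated it over a per-worker recycling channel, so exhausting the budget is what blocks the sender.

```rust
use crossbeam_channel::{unbounded, Receiver, Sender};
use std::path::PathBuf;

struct WorkerResult {
    path: PathBuf,
}

const POOL_SIZE: usize = 0x100; // illustrative, not fd's real value

// A worker allocates at most POOL_SIZE boxes; once the budget is spent it
// blocks on its recycling channel, which is the backpressure mechanism.
fn worker(
    results_tx: Sender<(Box<WorkerResult>, Sender<Box<WorkerResult>>)>,
    paths: impl Iterator<Item = PathBuf>,
) {
    let (recycle_tx, recycle_rx) = unbounded::<Box<WorkerResult>>();
    let mut budget = POOL_SIZE;

    for path in paths {
        let result = if budget > 0 {
            budget -= 1;
            Box::new(WorkerResult { path })
        } else {
            // Blocks until the receiver has handled one of our earlier results.
            let mut recycled = recycle_rx.recv().expect("receiver hung up");
            recycled.path = path;
            recycled
        };
        // Pass the result by pointer, along with the way back to this worker.
        results_tx
            .send((result, recycle_tx.clone()))
            .expect("receiver hung up");
    }
}

fn receiver(results_rx: Receiver<(Box<WorkerResult>, Sender<Box<WorkerResult>>)>) {
    for (result, recycle_tx) in results_rx {
        println!("{}", result.path.display());
        // Return the box to the thread that allocated it, keeping allocation
        // and deallocation on the same thread to reduce allocator contention.
        let _ = recycle_tx.send(result);
    }
}
```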

tmccombs (Collaborator) commented Nov 4, 2023

> 11 ms are spent in crossbeam_channel::channel::bounded's from_iter method?

That might be where crossbeam_channel initializes the memory for the channel, in the buffer allocation that happens when the bounded channel is constructed.
