
<format>: Compile-time string size estimation #2437

Closed

Conversation

AdamBucior (Contributor)

Improves string size estimation in <format> by using _Basic_format_string's parsing machinery. It takes into account things like argument use count and the specified width and precision. It also allows formatted_size to skip the formatting step entirely for simple cases like concatenating a bunch of strings.

This change might be ABI breaking, so hopefully it can land before <format>'s ABI is frozen.

@AdamBucior AdamBucior requested a review from a team as a code owner December 19, 2021 19:26
@CaseyCarter CaseyCarter added the performance Must go faster label Dec 29, 2021
@StephanTLavavej StephanTLavavej added the format C++20/23 format label Jan 12, 2022
@barcharcraz (Member) left a comment

So, aside from possible bugs with user-defined formatters that don't call base::parse(), this looks functionally correct. However, it's quite a lot of code. We are willing to have gnarly code in service of runtime performance, but benchmarks would be nice. STL points out that benchmarks would also indicate how much effort we should spend on this, for example on doing exact estimations for usages of precision with numeric arguments.

I'm curious how much faster this is than just doing the clang fallback path 100% of the time and dropping the exact estimation. It introduces some more capacity checks from string::insert, but it could still save allocations, which are probably far more expensive than the capacity checks. It might even be faster than no estimation to use formatted_size sometimes and get an exact width estimation to reduce allocations (even though that means doing the formatting work twice).

Also, tests for user-defined formatters in general would be good. I think it's reasonable to just not support user-defined formatters at all, using a mechanism like _From_primary (used by allocator) to detect whether derived classes are ours or the user's.

@@ -2761,27 +2794,89 @@ consteval typename _ParseContext::iterator _Compile_time_parse_format_specs(_Par
using _FormattedType = conditional_t<is_same_v<_FormattedTypeMapping, typename basic_format_arg<_Context>::handle>,
_Ty, _FormattedTypeMapping>;
formatter<_FormattedType, _CharT> _Formatter{};
return _Formatter.parse(_Pc);
auto _Iter = _Formatter.parse(_Pc);
if constexpr (_Derived_from_formatter_base<formatter<_FormattedType, _CharT>>) {
Member:

User-defined formatters that derive from built-in ones don't necessarily populate their format specs; they are not obligated to call their base parse method (though such a formatter would be somewhat odd).

Contributor Author:

If a user-defined formatter never calls parse, then _Specs will be in its default-constructed state, which is what this function returns for formatters that do not derive from _Formatter_base anyway. If the user derives from a standard formatter and calls parse, then _Specs can help with estimation. This is useful for users' enum formatters, which often derive from the basic_string_view formatter to get width and precision handled automatically.

Member:

What if they call parse but don't actually output the implied number of characters in format?

Member:

After realizing my mistake with the flow through _On_format_specs, I think things are OK even if they lie.

stl/inc/format (outdated):
_Is_estimation_exact = false;
} else if (_Specs._Dynamic_precision_index >= 0) {
// if precision is dynamic we can't really predict so let's estimate it to 32
_Estimated_size += 32;
Member:

This seems open to tweaking against benchmarks. It could be reasonable to calculate the width in this case too.

Contributor Author:

I also think so, but see #1803 (comment).

stl/inc/format (outdated):
} else {
// if the length of the string is known we will add it to estimation
++_Arg_use_count[_Id];
if (_Specs._Precision >= 0 || _Specs._Width > 0 || _Specs._Dynamic_precision_index >= 0
Member:

We can lift this up to apply to all strings, right? That would simplify the branches above.

Contributor Author:

If we decide to always strlen null-terminated strings, then yes.

stl/inc/format (outdated):
constexpr const _CharT* _On_format_specs(const size_t _Id, const _CharT* _First, const _CharT*) {
_Parse_context.advance_to(_Parse_context.begin() + (_First - _Parse_context.begin()._Unwrapped()));
if (_Id < _Num_args) {
auto _Iter = _Parse_funcs[_Id](_Parse_context); // TRANSITION, VSO-1451773 (workaround: named variable)
auto [_Iter, _Specs] = _Parse_funcs[_Id](_Parse_context);
Member:

So if a user-defined formatter does not derive from _Formatter_base, or does but never calls _Formatter_base::parse, then _Specs is default-initialized and the width is zero.

Thus the code below will lie about the estimation being exact.

Contributor Author:

For non-standard string types _Is_estimation_exact is always false. Users can't specialize formatter for standard string types.

Member:

Yeah, I think you're right and I misread the code; the fact that the dynamic indices are initialized to -1 makes the code set things to -1.

Member:

Nope, I'm an idiot: the values don't actually matter, as it's never a built-in string type.

return _Arg_use_count[_Id] * _Arg.size();
} else if constexpr (_Is_nullterminated_string<_ArgTy>) {
// don't bother with calculating the length if we don't need to
return _Arg_use_count[_Id] > 0 ? _Arg_use_count[_Id] * char_traits<_CharT>::length(_Arg) : 0;
Member:

This might be slower than just doing the formatting and counting, but maybe it's still worth it.

// don't bother with calculating the length if we don't need to
return _Arg_use_count[_Id] > 0 ? _Arg_use_count[_Id] * char_traits<_CharT>::length(_Arg) : 0;
} else {
return 0;
Member:

Why are we estimating zero for non-strings?

Contributor Author:

For non-strings, the estimation has already been done earlier:

STL/stl/inc/format, lines 2876 to 2877 at commit 3d74470:

// for all other arguments use the largest of precision, width and 8
_Estimated_size += (_STD max)((_STD max)(_Specs._Precision, _Specs._Width), 8);

Member:

OK, it's still somewhat confusing to split the estimation up like this, but I can see why it's done.

@@ -3074,22 +3209,74 @@ _NODISCARD wstring vformat(const locale& _Loc, const wstring_view _Fmt, const wf

template <class... _Types>
_NODISCARD string format(const _Fmt_string<_Types...> _Fmt, _Types&&... _Args) {
return _STD vformat(_Fmt._Str, _STD make_format_args(_Args...));
const size_t _Estimated_size = _Fmt._Estimate_required_capacity(_Args...);
Member:

This would be better with a helper _Vformat that takes the estimated capacity.

Member:

Once that's done, maybe this stuff should go through _Fmt_iterator_buffer.

Contributor Author:

I can't understand what you mean. How exactly should this be restructured?

@miscco (Contributor) commented Feb 2, 2022

Thanks, that is much better

@barcharcraz (Member)

We talked about this PR in our weekly meeting, and I think the next steps are to write a small benchmark and to test out a simpler version with only the fallback path. We consider this too complex to merge without a benchmark showing a very clear improvement.

@AdamBucior (Contributor Author)

With this simple benchmark:

#include <string>
#include <format>
#include <chrono>
#include <iostream>

using namespace std;
using namespace chrono;

constexpr size_t Iterations = 10000000;

int main() {
    steady_clock::time_point start, end;
    start = steady_clock::now();

    for (size_t i = 0; i < Iterations; ++i) {
        (void) format("{} {} {} {} {} {}", "let's", "concatenate", "a", "bunch", "of", "strings");
    }

    end = steady_clock::now();

    cout << end - start << endl;

    start = steady_clock::now();

    for (size_t i = 0; i < Iterations; ++i) {
        (void) format("{0} {0} {0}", "repeat this sentence three times");
    }

    end = steady_clock::now();

    cout << end - start << endl;

    system("pause");
}

I got these results (ran 4 times each):

With PR:

4473119900ns
5105820100ns

4532696100ns
5093983400ns

4463521400ns
5120882300ns

4421421200ns
5108412600ns

Average:
4472689650ns
5107274600ns

Without PR:

4750032000ns
5193818700ns

4790922400ns
5249655700ns

5018952300ns
5281830700ns

4815681900ns
5280669900ns

Average:
4843897150ns ~8% more time than average with PR
5251493750ns ~3% more time than average with PR

So with simple strings there is some improvement. Obviously the results will differ depending on the arguments used (of which there is a wide variety). I don't think there is really a case in which this PR would perform worse.

@barcharcraz (Member)

What do you get running that with clang-cl instead? (to do just the fallback path)

@AdamBucior (Contributor Author)

With clang-cl:

With PR:

3826073500ns
4233170700ns

3630934800ns
4309987800ns

3822146900ns
4210166300ns

3612824000ns
4434587900ns

Average:

3722994800ns
4296978175ns

Without PR:

3760814900ns
4333771500ns

3692610400ns
4379408800ns

3719316200ns
4294400100ns

3744752400ns
4294775400ns

Average:

3729373475ns ~0.1% more time than average with PR (no real difference)
4325588950ns ~0.6% more time than average with PR (no real difference)

Unsurprisingly, no real difference. The fallback is basically the same thing that we are already doing.

@AlexGuteniev AlexGuteniev mentioned this pull request Feb 9, 2022
@StephanTLavavej (Member)

@barcharcraz says that there should be no inherent ABI impact to these changes - at most, we may need to rename an _Ugly internal type to _Ugly2 (as we've done repeatedly in shared_ptr's control blocks), which prevents any mix-and-match problems. Charlie would also prefer not to backport this to VS 2019 16.11.x, so we should ensure that before-and-after code can be mixed harmlessly, even if these changes get into VS 2022 17.2 Preview 3 (before the 17.2 General Availability ABI lockdown).

Charlie plans to investigate this with more benchmarking.

@StephanTLavavej StephanTLavavej removed their assignment Feb 16, 2022
@barcharcraz (Member) commented Apr 1, 2022

I benchmarked this using gbench (it did around 3,000,000 iterations for each test) and found an extremely small perf difference even with the full caching (I verified I was using the correct version by adding a define in <format>). It seems like this PR does make things a few nanoseconds faster, but not always.

In general, doing three benchmark runs on a benchmark this fast won't give reliable results, and that is true even if the CPU is literally executing nothing else (i.e., the benchmark is running in kernel mode, all alone). Here are my results:

Has cache: 0
2022-03-31T17:16:03-07:00
Running C:\Users\chbarto\source\repos\scratch\format_caching_bench\build\bench.exe
Run on (12 X 3192 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
-------------------------------------------------------
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
bench_concat        225 ns          225 ns      2986667
bench_insert        349 ns          353 ns      2036364
Has cache: 1
2022-03-31T17:16:50-07:00
Running C:\Users\chbarto\source\repos\scratch\format_caching_bench\build\bench.exe
Run on (12 X 3192 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
-------------------------------------------------------
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
bench_concat        227 ns          225 ns      2986667
bench_insert        328 ns          330 ns      2133333

code:

#include <benchmark/benchmark.h>
#include <stdio.h>
#include <format>
#ifndef _STL_FORMAT_HAS_CACHE
#define _STL_FORMAT_HAS_CACHE 0
#endif
using benchmark::State;
static void bench_concat(State &s)
{
    for (auto _ : s)
    {
        (void)std::format("{} {} {} {} {} {}", "let's", "concatenate", "a", "bunch", "of", "strings");
    }
}

static void bench_insert(State &s)
{
    for (auto _ : s)
    {
        (void)std::format("{0} {0} {0}", "repeat this sentence three times");
    }
}
BENCHMARK(bench_concat);
BENCHMARK(bench_insert);
int main(int argc, char **argv)
{
    printf("Has cache: %d\n", _STL_FORMAT_HAS_CACHE);
    ::benchmark::Initialize(&argc, argv);
    if (::benchmark::ReportUnrecognizedArguments(argc, argv))
        return 1;
    ::benchmark::RunSpecifiedBenchmarks();
    ::benchmark::Shutdown();
    return 0;
}

@barcharcraz (Member)

And here's with just the fallback caching:

Has cache: 2
2022-03-31T17:26:47-07:00
Running C:\Users\chbarto\source\repos\scratch\format_caching_bench\build\bench.exe
Run on (12 X 3192 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
-------------------------------------------------------
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
bench_concat        243 ns          246 ns      2800000
bench_insert        328 ns          329 ns      1947826

@barcharcraz (Member)

Oh, you did run this 10 million times; my bad.

@StephanTLavavej (Member)

Thanks @barcharcraz for the additional benchmark results.

@AdamBucior, thanks for investigating this potential optimization. After looking at the perf numbers from your benchmark and Charlie's, we talked about this at the weekly maintainer meeting and have decided that the complexity of this optimization outweighs the minimal potential for performance improvement here.

Labels
format C++20/23 format performance Must go faster

6 participants