<fstream>: Performance issue when reading a binary file using std::basic_ifstream<unsigned char> #2109

jamesmagnus · 2021-08-10T12:00:14Z

Describe the bug
Performance issue.
When reading a binary file using unsigned data type (ie std::basic_ifstream<unsigned char>) is order of magnitude slower than reading it using signed data type (ie std::basic_ifstream<char>).
It seems you are doing character by character locale conversion, altough the file is open as binary and not text.

Command-line test case

Z:\beaus\Desktop\MSVC>type Source.cpp
#include <vector>
#include <iostream>
#include <fstream>
#include <chrono>
#include <filesystem>

int main()
{
        const std::filesystem::path filePath{ "test.data" }; // 128 MB binary file
        const size_t fileSize{ std::filesystem::file_size(filePath) };

        // Read the whole file using char data type
        {
                // Open file in binary mode
                std::basic_ifstream<char> fileStream{ filePath, std::fstream::binary };
                std::vector<char> data;
                data.resize(fileSize);

                const auto start{ std::chrono::system_clock::now() };
                fileStream.read(data.data(), fileSize);
                const auto end{ std::chrono::system_clock::now() };

                std::cout << "Duration read signed: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
        }

        // Read the whole file using unsigned char data type
        {
                // Open file in binary mode
                std::basic_ifstream<unsigned char> fileStream{ filePath, std::fstream::binary };
                std::vector<unsigned char> data;
                data.resize(fileSize);

                const auto start{ std::chrono::system_clock::now() };
                fileStream.read(data.data(), fileSize);
                const auto end{ std::chrono::system_clock::now() };

                std::cout << "Duration read unsigned: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
        }

        return 0;
}
Z:\beaus\Desktop\MSVC>cl /EHsc /W4 /WX /std:c++17 .\Source.cpp
Compilateur d'optimisation Microsoft (R) C/C++ version 19.29.30040 pour x64
Copyright (C) Microsoft Corporation. Tous droits réservés.

Source.cpp
Microsoft (R) Incremental Linker Version 14.29.30040.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:Source.exe
Source.obj

Z:\beaus\Desktop\MSVC>Source.exe
Duration read signed: 79 ms
Duration read unsigned: 12838 ms

Expected behavior
Read a binary file using char or unsigned char should take the same time. (no character conversion for binary data)

STL version

Option 1: Visual Studio version
- Displayed in Help > About Microsoft Visual Studio
```
Microsoft Visual Studio Community 2019
Version 16.10.4
```

Additional context
Use a randomly generated test.data binary file to reproduce the issue with my code. Mine is 128MB.

The text was updated successfully, but these errors were encountered:

fsb4000 · 2021-08-10T12:30:42Z

Duplicate: #817

See the comment for more details: #817 (comment)

use .rdbuf()->pubsetbuf()

https://en.cppreference.com/w/cpp/io/basic_streambuf/pubsetbuf

fsb4000 · 2021-08-10T12:49:24Z

#include <vector>
#include <iostream>
#include <fstream>
#include <chrono>
#include <filesystem>

int main()
{
        const std::filesystem::path filePath{ "test.data" }; // 128 MB binary file
        const size_t fileSize{ std::filesystem::file_size(filePath) };

        // Read the whole file using char data type
        {
                // Open file in binary mode
                std::basic_ifstream<char> fileStream{ filePath, std::fstream::binary };
                std::vector<char> data;
                data.resize(fileSize);

                const auto start{ std::chrono::system_clock::now() };
                fileStream.read(data.data(), fileSize);
                const auto end{ std::chrono::system_clock::now() };

                std::cout << "Duration read signed: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
        }

        // Read the whole file using unsigned char data type
        {
                // Open file in binary mode
                std::basic_ifstream<unsigned char> fileStream{ filePath, std::fstream::binary };
                std::vector<unsigned char> data;
                data.resize(fileSize);

                const auto start{ std::chrono::system_clock::now() };
                fileStream.read(data.data(), fileSize);
                const auto end{ std::chrono::system_clock::now() };

                std::cout << "Duration read unsigned: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
        }

        // Read the whole file using unsigned char data type with buffering
        {
                // Open file in binary mode
                std::basic_ifstream<unsigned char> fileStream{ filePath, std::fstream::binary };
                unsigned char buf[1024*10 + 1];
                fileStream.rdbuf()->pubsetbuf(buf, sizeof buf);
                std::vector<unsigned char> data;
                data.resize(fileSize);

                const auto start{ std::chrono::system_clock::now() };
                fileStream.read(data.data(), fileSize);
                const auto end{ std::chrono::system_clock::now() };

                std::cout << "Duration read unsigned with buffering: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms" << std::endl;
        }

        return 0;
}

Duration read signed: 135 ms
Duration read unsigned: 7279 ms
Duration read unsigned with buffering: 92 ms

jamesmagnus · 2021-08-10T12:51:07Z

Mmmh Ok I see. But why there is no internal buffer for all 1 byte data types ?

fsb4000 · 2021-08-10T13:54:33Z

@jamesmagnus As far as I understand we don't use buffering for cout/cin and ofstream/ifstream.
So pubsetbuf with user provided buffer is the fastest method.
But we use some optimizations for char only because codecvt will be always noconv and we rely on fread/fwrite buffering:

STL/stl/inc/xlocale

Lines 717 to 720 in bfa2cb2

    
           bool __CLR_OR_THIS_CALL do_always_noconv() const noexcept override {
 
               // return true if conversions never change input (from codecvt)
 
               return is_same_v<_Byte, _Elem>;
 
           }

STL/stl/inc/fstream

Lines 609 to 636 in 280347a

    
           virtual streamsize __CLR_OR_THIS_CALL xsputn(const _Elem* _Ptr, streamsize _Count) override {
 
               // put _Count characters to stream
 
               if constexpr (sizeof(_Elem) == 1) {
 
                   if (_Pcvt) { // if we need a nontrivial codecvt transform, do the default expensive thing
 
                       return _Mysb::xsputn(_Ptr, _Count);
 
                   }
 
                   const streamsize _Start_count = _Count;
 
                   streamsize _Size              = _Mysb::_Pnavail();
 
                   if (0 < _Count && 0 < _Size) { // copy to write buffer
 
                       if (_Count < _Size) {
 
                           _Size = _Count;
 
                       }
 
                       _Traits::copy(_Mysb::pptr(), _Ptr, static_cast<size_t>(_Size));
 
                       _Ptr += _Size;
 
                       _Count -= _Size;
 
                       _Mysb::pbump(static_cast<int>(_Size));
 
                   }
 
                   if (0 < _Count && _Myfile) { // open C stream, attempt write
 
                       _Count -= _CSTD fwrite(_Ptr, sizeof(_Elem), static_cast<size_t>(_Count), _Myfile);
 
                   }
 
                   return _Start_count - _Count;
 
               } else { // non-chars always get element-by-element processing
 
                   return _Mysb::xsputn(_Ptr, _Count);
 
               }

STL/stl/inc/fstream

Lines 561 to 607 in 280347a

    
           virtual streamsize __CLR_OR_THIS_CALL xsgetn(_Elem* _Ptr, streamsize _Count) override {
 
               // get _Count characters from stream
 
               if constexpr (sizeof(_Elem) == 1) {
 
                   if (_Count <= 0) {
 
                       return 0;
 
                   }
 
                   if (_Pcvt) { // if we need a nontrivial codecvt transform, do the default expensive thing
 
                       return _Mysb::xsgetn(_Ptr, _Count);
 
                   }
 
                   // assuming this is OK because _Ptr + _Count must be valid
 
                   auto _Count_s           = static_cast<size_t>(_Count);
 
                   const auto _Start_count = _Count;
 
                   const auto _Available   = static_cast<size_t>(_Mysb::_Gnavail());
 
                   if (0 < _Available) { // copy from get area
 
                       const auto _Read_size = (_STD min)(_Count_s, _Available);
 
                       _Traits::copy(_Ptr, _Mysb::gptr(), _Read_size);
 
                       _Ptr += _Read_size;
 
                       _Count_s -= _Read_size;
 
                       _Mysb::gbump(static_cast<int>(_Read_size));
 
                   }
 
                   if (_Myfile) { // open C stream, attempt read
 
                       _Reset_back(); // revert from _Mychar buffer
 
                       // process in 4k - 1 chunks to avoid tripping over fread's clobber-the-end behavior when
 
                       // doing \r\n -> \n translation
 
                       constexpr size_t _Read_size = 4095; // _INTERNAL_BUFSIZ - 1
 
                       while (_Read_size < _Count_s) {
 
                           const auto _Actual_read = _CSTD fread(_Ptr, sizeof(_Elem), _Read_size, _Myfile);
 
                           _Ptr += _Actual_read;
 
                           _Count_s -= _Actual_read;
 
                           if (_Actual_read != _Read_size) {
 
                               return static_cast<streamsize>(_Start_count - _Count_s);
 
                           }
 
                       }
 
                       if (0 < _Count_s) {
 
                           _Count_s -= _CSTD fread(_Ptr, sizeof(_Elem), _Count_s, _Myfile);
 
                       }
 
                   }
 
                   return static_cast<streamsize>(_Start_count - _Count_s);
 
               } else { // non-chars always get element-by-element processing
 
                   return _Mysb::xsgetn(_Ptr, _Count);
 
               }
 
           }

If @StephanTLavavej has time, he can answer in a more detailed way...

fsb4000 · 2021-08-10T14:14:10Z

By the way LLVM libc++ do buffering in the basic_filebuf constructor: https://github.com/llvm/llvm-project/blob/main/libcxx/include/fstream#L314

Maybe we could do similar...

fsb4000 · 2021-08-10T16:00:08Z

@jamesmagnus one more thing.

https://en.cppreference.com/w/cpp/io/basic_streambuf/pubsetbuf and https://stackoverflow.com/a/40317135/4544798 and other internet resources tell that you need to do (it is true for gnu libstdc++):

pubsetbuf
open

But this is not true for MS STL. With MS STL you need to do

open (or constructor with filename)
pubsetbuf

So cross platform C++ might be tricky :(

https://i.imgur.com/rasLdzk.png

AdamBucior · 2021-08-10T17:02:09Z

Couldn't we make codecvt::do_always_noconv return true when _Byte and _Elem are same size integral types (the same criteria copy uses to decide whether to use memmove)? It seems that do_in and do_out just static_cast everything so no real conversion happens in that case.

fsb4000 · 2021-08-10T17:12:47Z

Probably, I offered return true for codecvt<unsigned char, char>::do_always_noconv , codecvt<char8_t, char>::do_always_noconv and codecvt<std::byte, char>::do_always_noconv but your idea is just better because more general.

StephanTLavavej · 2021-08-11T21:57:41Z

@CaseyCarter notes that basic_ifstream<unsigned char> is a quasi-non-Standard specialization, and is unsure what the motivation is for using this type.

jessey-git · 2021-09-30T19:07:01Z

@CaseyCarter notes that basic_ifstream<unsigned char> is a quasi-non-Standard specialization, and is unsure what the motivation is for using this type.

As somewhat alluded to in #817, it's kind of about trying to have sane concepts in your api's and types. Surely, if you open a stream in binary mode you'd really like to do so while using the std::byte type right? Right?! Or similar with uint8_t's or whatever might fit within your given domain while still conveying the concept of "this is an opaque blob of data" not a character string.

And right now you can't do it without resorting to extra, ugly, API calls to set the internal buffers. Maybe at least fix it for std::byte?

In my case, had I not realized that my unit tests were taking 30 to 50x longer to complete (a few seconds longer than previous) I would have subjected dozens (dozens!) of people who use my lib to very bad performance on Windows. And Windows still needs all the help it can get due to slight performance losses everywhere else (ABI nonsense).

CaseyCarter · 2021-10-06T21:40:11Z

I think what could happen here is that someone audits the code to remove the impediment that prevents the optimization being used for all char-like types of size 1, instead of only for char. If WG21 later decides that codecvt<char, std::byte, blah blah> needs to mean something else, then we can make that change at that time, potentially invalidating the optimization.

We don't plan to make this change right now, but would be happy to review a pull request doing so.

StephanTLavavej added the performance Must go faster label Aug 11, 2021

fsb4000 mentioned this issue May 25, 2022

codecvt: no conversion for all char-like types of size 1. #2739

Merged

StephanTLavavej closed this as completed in #2739 Jun 2, 2022

StephanTLavavej added the fixed Something works now, yay! label Jun 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

<fstream>: Performance issue when reading a binary file using std::basic_ifstream<unsigned char> #2109

<fstream>: Performance issue when reading a binary file using std::basic_ifstream<unsigned char> #2109

jamesmagnus commented Aug 10, 2021

fsb4000 commented Aug 10, 2021 •

edited

Loading

fsb4000 commented Aug 10, 2021 •

edited

Loading

jamesmagnus commented Aug 10, 2021

fsb4000 commented Aug 10, 2021 •

edited

Loading

fsb4000 commented Aug 10, 2021

fsb4000 commented Aug 10, 2021 •

edited

Loading

AdamBucior commented Aug 10, 2021

fsb4000 commented Aug 10, 2021

StephanTLavavej commented Aug 11, 2021

jessey-git commented Sep 30, 2021

CaseyCarter commented Oct 6, 2021

<fstream>: Performance issue when reading a binary file using std::basic_ifstream<unsigned char> #2109

<fstream>: Performance issue when reading a binary file using std::basic_ifstream<unsigned char> #2109

Comments

jamesmagnus commented Aug 10, 2021

fsb4000 commented Aug 10, 2021 • edited Loading

fsb4000 commented Aug 10, 2021 • edited Loading

jamesmagnus commented Aug 10, 2021

fsb4000 commented Aug 10, 2021 • edited Loading

fsb4000 commented Aug 10, 2021

fsb4000 commented Aug 10, 2021 • edited Loading

AdamBucior commented Aug 10, 2021

fsb4000 commented Aug 10, 2021

StephanTLavavej commented Aug 11, 2021

jessey-git commented Sep 30, 2021

CaseyCarter commented Oct 6, 2021

fsb4000 commented Aug 10, 2021 •

edited

Loading

fsb4000 commented Aug 10, 2021 •

edited

Loading

fsb4000 commented Aug 10, 2021 •

edited

Loading

fsb4000 commented Aug 10, 2021 •

edited

Loading