[fix] Locale independent model load/save #3267

Closed
AlbertoEAF opened this issue Jul 30, 2020 · 27 comments

@AlbertoEAF
Contributor

AlbertoEAF commented Jul 30, 2020

Hello,

Issue

With the revert of its fix PR, this issue is once again a concern in Java: #2890.

At the moment LightGBM uses strtod() and C++ streams to read/write the floating-point numbers in the model file.
The issue is that Java overrides the standard "C" locale at process initialization (see #2891). Current evidence shows this happens for Java providers that use the C API through SWIG (I suspect MMLSpark's LightGBM might be affected by this too), and even for users of the LightGBM Python API whose code is wrapped by JEP.
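To make the failure mode concrete, here is a minimal standalone sketch (not LightGBM code; "de_DE.UTF-8" is just an example of a decimal-comma locale and must be installed for the demo to reproduce) showing how a changed process locale affects both strtod() and snprintf():

    #include <clocale>
    #include <cstdio>
    #include <cstdlib>

    int main() {
      // Default "C" locale: '.' is the decimal separator.
      std::printf("%f\n", std::strtod("3.14", nullptr));    // prints 3.140000

      // A Java/JEP host may have switched the process locale, e.g. to German.
      if (std::setlocale(LC_NUMERIC, "de_DE.UTF-8") != nullptr) {
        // strtod now stops at the '.', so only "3" is parsed.
        std::printf("%f\n", std::strtod("3.14", nullptr));  // prints 3,000000
        // snprintf now emits ',', which a "C"-locale reader cannot parse back.
        char buf[32];
        std::snprintf(buf, sizeof(buf), "%.17g", 3.14);
        std::printf("%s\n", buf);                           // e.g. 3,1400000000000001
      }
      return 0;
    }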

For production scenarios, ensuring the correctness of scores is of paramount importance, and right now there are no such guarantees for LightGBM providers that use Java (see the issue above).

Request for Change

After assessing alternatives, and seeing that the last fix was reverted, the robust solution is to change the model read/write of floating-point numbers to use process-locale-independent methods.

Implementation proposal

Replace floating-point reads/writes in model save/load with calls to C++ streams (we already use them anyway for some parameters). We imbue such streams with the "C" locale, forcing the same locale for reads and writes. This fixes our bug.
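A minimal sketch of the idea (hypothetical helper names, not the actual LightGBM functions): imbuing a stream with std::locale::classic() pins its numeric formatting to the "C" locale, regardless of what the host process set via setlocale():

    #include <locale>
    #include <sstream>
    #include <string>

    // Hypothetical helpers for illustration only.
    double CLocaleStrToDouble(const std::string& str) {
      std::stringstream ss;
      ss.imbue(std::locale::classic());  // always parse with the "C" locale
      ss << str;
      double value = 0.0;
      ss >> value;
      return value;
    }

    std::string CLocaleDoubleToStr(double value, int precision) {
      std::stringstream ss;
      ss.imbue(std::locale::classic());  // always format with the "C" locale
      ss.precision(precision);
      ss << value;
      return ss.str();
    }

Reusing a single stream (clearing its contents and state between conversions) avoids paying the stream construction cost on every number, which is the "reusing streams" variant mentioned in the benchmark below.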

It might be slightly slower to read/write the model, but it will be correct every time, independently of the process locale settings.
According to the benchmark https://tinodidriksen.com/2011/05/cpp-convert-string-to-double-speed/ it would result in a ~3x slower floating-point conversion (reusing streams). I doubt it's actually 3x slower (see tricks to make it faster: http://cplusplus.bordoon.com/speeding_up_string_conversions.html). Besides, we're already using streams for some numbers, so the model read/write slowdown wouldn't reach that value.

Is that an issue?

Can I get your thoughts on this? Good, bad? Secondary effects I might be missing? Would such PR be welcome? @guolinke @StrikerRUS @imatiach-msft et al.

Thank you!

Not enough? Future - Alternative libraries

In the future, if needed, we can upgrade to different libraries like Boost.Spirit or David M. Gay's library, which is more accurate than glibc (the current implementation). If we do so, we may see rounding differences (which in principle are small).

Side note on floating-point conversion precision and David Gay's library

LightGBM compiled with GCC uses glibc for these conversions, and according to these links:

Converting decimal strings to doubles correctly and efficiently is a surprisingly complex task; fortunately, David M. Gay did this for us when he wrote this paper and released this code over 20 years ago. (He maintains this reference copy of strtod() to this day.) His code appears in many places, including in the Python, PHP, and Java programming languages, and in the Firefox, Chrome, and Safari Web browsers.

Many programming environments implement their string to double conversions with David Gay’s strtod(); glibc, the GNU C Library, does not.

David Gay's implementation https://www.ampl.com/netlib/fp/dtoa.c of string-to-double and double-to-string conversion is even more accurate than what we're using now. It is a single file that can be compiled and configured through preprocessor flags, in our case to ignore locale processing altogether (resulting in the "C" locale behaviour for both reads and writes).

@StrikerRUS
Collaborator

Speaking about float2str conversion in saving models, XGBoost recently adopted Ryū algorithm (dmlc/xgboost#5772) and numpy uses Dragon4 algorithm (numpy/numpy#9941).

@AlbertoEAF
Contributor Author

AlbertoEAF commented Aug 17, 2020

Before I commit to working on this, can someone help direct me to the files where the model load/save code is found?

Searching for the code... (read only if at a loss)

The fix we had posted in the past https://github.com/microsoft/LightGBM/pull/2891/files required changes in: LGBM_BoosterCreateFromModelfile, LGBM_BoosterSaveModel, LGBM_BoosterSaveModelToString and LGBM_BoosterDumpModel in the C API.

This new implementation will need to change the core code, which means fixes must be done in the functions called therein, namely Booster::Booster, Booster::SaveModelToFile, Booster::SaveModelToString and Booster::DumpModel.

Reading model files seems to be implemented in:

boosting_->LoadModelFromString(model_str, len), which leads me to GBDT::LoadModelFromString in gbdt_model_text.cpp.
In turn, that uses both ArrayToString and ArrayToStringFast (I don't know why we have and use both to save the model - legacy?).
For writing, Common::DoubleToStr is used, which relies on snprintf (which is not locale-independent).

Conclusion and questions

I got lost in the virtual function pointers, though; I was expecting to see DART and the others implement their own model read/write.

Can I thus assume model text file read/writes are implemented in gbdt_model_text.cpp's:

[screenshot: the load/save methods in gbdt_model_text.cpp]
methods alone?

  1. If I change that file and the functions called from it, like Common::DoubleToStr, should everything work, or should I look for additional code in other places?
  2. What about DART and the other boosting modes - do all of them also use gbdt_model_text.cpp to implement the model file read/writes?

+1 ping for @guolinke as you wrote most of that code

Thank you all :)

@StrikerRUS
Collaborator

Gentle ping @guolinke for the two questions above about the code location.

@guolinke
Collaborator

guolinke commented Sep 9, 2020

sorry for missing this thread...
@AlbertoEAF

  1. I remember ArrayToStringFast is only for integers. So for model_to_string, changing DoubleToStr should be enough. As for string_to_model, there are several functions, like the two below (see also the sketch after this list):

    template<typename T>
    struct __StringToTHelper<T, true> {
      T operator()(const std::string& str) const {
        return static_cast<T>(std::stod(str));
      }
    };

    and

    template<typename T>
    struct __StringToTHelperFast<T, true> {
      const char* operator()(const char* p, T* out) const {
        double tmp = 0.0f;
        auto ret = Atof(p, &tmp);
        *out = static_cast<T>(tmp);
        return ret;
      }
    };

  2. Yeah, all models share the same model input/output code.
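For context on where a fix would plug in: std::stod is specified to behave like strtod, so it is affected by the process locale just like the write path. A rough sketch (reusing the hypothetical CLocaleStrToDouble helper from the proposal above; this is not the final implementation) of how the specialization could delegate to a locale-independent parser:

    #include <string>

    // Sketch only: replace the locale-dependent std::stod call with a
    // locale-independent conversion (an imbued stream here, or a dedicated
    // parsing library as discussed later in this thread).
    template <typename T, bool is_float>
    struct __StringToTHelper;

    template <typename T>
    struct __StringToTHelper<T, true> {
      T operator()(const std::string& str) const {
        return static_cast<T>(CLocaleStrToDouble(str));  // instead of std::stod(str)
      }
    };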

@AlbertoEAF
Contributor Author

Thanks so much! I will start working on it ;)

@AlbertoEAF
Contributor Author

AlbertoEAF commented Sep 11, 2020

By the way @StrikerRUS , regarding such libraries, I think I'll go with Ryū; I've been reading up a bit on it. Both came after David Gay's implementations, and Ryū came after Dragon4. Dragon4 has a fallback algorithm for some numbers which is slower.
I also saw even faster implementations, but I don't want to get too close to the bleeding edge, as this is supposed to be stable.

I was thinking of using this C++ implementation of Ryū, https://github.com/expnkx/fast_io, as it provides a single header to include, but it requires C++20 while LightGBM targets C++11.

This leaves only the original implementation with a C interface: https://github.com/ulfjack/ryu. I hope it's easy - and possible - to add to our build chain; I still have to figure that out.

@guolinke
Collaborator

@AlbertoEAF BTW, is it possible to have a two-stage solution? Like having a script to localize/de-localize the model?
LightGBM itself could always read/write in the current format.
But when users want to read the content of the model file, we could localize it.

@AlbertoEAF
Contributor Author

AlbertoEAF commented Sep 11, 2020

Hello @guolinke , I'm not sure what you mean. Why would we want to transform the model content to different locales?
The model should behave like a "binary blob" which LightGBM can read and write correctly independently of the machine settings. It's cleaner to have it stored and loaded always in the same format; we don't need models in different locales. Do you agree? :)

Also, I believe it would be even trickier, as to read the model correctly you would have to know precisely the locale with which it was written, which can be hard to determine in some corner cases.

Finally, as LightGBM itself does not allow "changing locale" during its execution, we would have to create an external tool, and that tool would need to parse the model file structure to read it and then rewrite it with exactly the same structure - lots of code duplication and complexity.

@guolinke
Collaborator

@AlbertoEAF I misunderstood the locale problem.
You are right, the model should be locale-independent.

@AlbertoEAF
Contributor Author

AlbertoEAF commented Sep 11, 2020

Hello, I have 2 requests for input @StrikerRUS @guolinke:

  1. Maybe there were changes in the meantime, because in the v3.0.0 code both ArrayToStringFast and ArrayToString can operate on doubles: ArrayToString only handles doubles, while ArrayToStringFast is a template that processes arrays of any type. I believe there's no reason to keep both anymore.

  2. Second point: instead of using Ryū, I was thinking of using fmt https://fmt.dev/latest/index.html, since its integration is much easier and it is still ~10x faster than the standard library (in its current state). Besides being the foundation of C++20's new std::format library, it is also very well maintained, uses only C++11 features, and has explicit support for gcc 4.8 (https://fmt.dev/latest/index.html). What do you think? It also seems much easier to use than Ryū, and in this thread they say it's not much different in terms of speed: Optimize floating-point formatting fmtlib/fmt#1426. Update: I forgot about the reading part. I found scnlib on GitHub, which has a similar API, but it's not stable.
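For a feel of what this would look like (a sketch assuming fmt is vendored, e.g. as a submodule; the exact output in the comment is illustrative): fmt formats doubles locale-independently and, by default, with the shortest representation that round-trips:

    #include <fmt/format.h>

    #include <string>

    int main() {
      const double threshold = 0.0000123456789;
      // Default: shortest round-trip representation, always with '.'.
      std::string shortest = fmt::format("{}", threshold);   // e.g. "1.23456789e-05"
      // Fixed maximum precision, comparable to snprintf's "%.17g".
      std::string full = fmt::format("{:.17g}", threshold);
      fmt::print("{}\n{}\n", shortest, full);
      return 0;
    }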

Thanks!

@AlbertoEAF
Contributor Author

Wow, I replaced the ArrayToString calls for threshold, leaf_value and leaf_weight with ArrayToStringFast in src/io/tree.cpp's Tree::ToString() and got surprising results. To test, I read a reference model file and wrote it back (left=v3.0.0 code, right=modified code):
[screenshot: diff between the two output model files]

Whilst ArrayToStringFast produces numbers with less precision, which saves space, the tree sizes also changed. Why is that?!

@StrikerRUS
Collaborator

@AlbertoEAF I'm totally OK with using the fmt library! It looks quite popular based on the number of GitHub stars, and we can use it as a git submodule (like Boost's compute).

@StrikerRUS
Collaborator

Hah, just noticed that the treelite project is using the fmt library. And treelite in turn is used in rapidsai's project for tree models:
https://github.com/dmlc/treelite/blob/1c495c52a6dff8413f1b66f2f0ecc4e390473c5e/cmake/ExternalLibs.cmake#L10-L14
https://github.com/rapidsai/cuml

@AlbertoEAF
Contributor Author

Yeah @StrikerRUS , even integer-to-string conversions might benefit from it, not only floating-point conversions with the shortest representation. LightGBM already has a custom Atoi function built to improve performance, I believe, but it seems not even that will beat this.

@AlbertoEAF
Contributor Author

AlbertoEAF commented Sep 16, 2020

@guolinke @StrikerRUS I've hit an issue with the LightGBM model (the same as the one above with the image) which I don't understand and really need your help with. Any ideas about what might be causing this are greatly appreciated!

Issue

When I switch the call that turns floating-point parameters into strings during the model write, I sometimes get numbers with a redundant trailing 0 in the scientific notation, which shouldn't matter - I just posted a bug at fmtlib: fmtlib/fmt#1873. See it to understand what I mean.

The big issue, though, is that for each of the two trees where one parameter changed, that tree's tree_size in the model grew by +1 as well! Any ideas?!

Diffs in the output model files produced by LightGBM. Each image shows the line differences between the "unpatched"/normal LightGBM code (in red) and the patched code (in green), followed by the differing items from those lines, also in red/green:
[screenshots: line diffs between the unpatched and patched model files]

In this case the model written with the patched code has two trees where one parameter gained a trailing 0, which results in these tree_size changes:
[screenshot: tree_sizes diff]

and that's what I don't understand. Why do the tree sizes change because of the written floating-point parameters?

Source code references

If you want to take a look at the difference in the code that led to this issue reported here and in fmtlib/fmt#1873, the code change is this one: feedzai@da3d8f2.

The images above were generated with the two commits:

PRs in progress (still in very early development):

Thank you! Any input is appreciated!

@StrikerRUS
Collaborator

@guolinke The issue described above could be the root cause of our previous issue where maintainers of the Julia package had to remove the tree_sizes field to load a model created on another machine: #3137.

@AlbertoEAF
Contributor Author

To further confirm the origin of the "bug", I added code to strip trailing 0's from the decimal digits of numbers, and it already works properly - i.e., there are no differences in the output models! (This is the temporary hack I proposed in fmtlib/fmt#1873.)

This commit feedzai@c34c43d adds that code (which is not a permanent solution, as it's a hack that reduces speed and is not 'pretty').
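For reference, the hack boils down to something like the following (a rough sketch of the idea, not the exact code in that commit): after formatting a number, drop redundant trailing zeros from its fractional part so the output matches what snprintf used to produce.

    #include <cstddef>
    #include <string>

    // Sketch: "1.2300e-05" -> "1.23e-05", "5.10" -> "5.1", "2.000" -> "2".
    std::string StripTrailingZeros(const std::string& s) {
      const std::size_t dot = s.find('.');
      if (dot == std::string::npos) return s;          // no fractional part
      std::size_t exp = s.find_first_of("eE", dot);    // mantissa ends at the exponent
      if (exp == std::string::npos) exp = s.size();
      std::size_t last = exp;                          // one past the last kept character
      while (last > dot + 1 && s[last - 1] == '0') --last;
      if (last == dot + 1) --last;                     // drop a now-dangling '.'
      return s.substr(0, last) + s.substr(exp);
    }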

@AlbertoEAF
Contributor Author

AlbertoEAF commented Sep 17, 2020

I was already able to replace all model write calls except DoubleToStr, again due to divergences between fmt and snprintf, this time related to the output format with fixed precision (%.17g). However, there's a hack for this too, until they fix it in the library: truncating the output string to the maximum number of significant digits: fmtlib/fmt#1875

@guolinke
Collaborator

@AlbertoEAF the tree_size is the string size of the tree model content. We use it for multi-threaded loading of tree models.
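In other words, tree_sizes records the character length of each tree's text block, so the loader can compute every tree's start offset up front and hand the blocks to different threads. A hedged sketch of the idea (hypothetical helper, not the actual GBDT::LoadModelFromString code):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Sketch: split the model text into per-tree substrings using tree_sizes,
    // so each block can be parsed by a separate thread.
    std::vector<std::string> SplitTreeBlocks(const std::string& model_str,
                                             std::size_t trees_begin,
                                             const std::vector<std::size_t>& tree_sizes) {
      std::vector<std::string> blocks;
      std::size_t offset = trees_begin;
      for (const std::size_t size : tree_sizes) {
        blocks.push_back(model_str.substr(offset, size));
        offset += size;  // a one-character change in any tree shifts every later offset,
                         // hence tree_sizes must match the written text exactly
      }
      return blocks;
    }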

@AlbertoEAF
Contributor Author

Ah! So it's not a bug! I get it now; it shouldn't be a problem then, thanks!

@AlbertoEAF
Contributor Author

AlbertoEAF commented Sep 23, 2020

Hello @guolinke @StrikerRUS I have good news, I've completed the code changes and it's passing all the tests!

Instead of relying on Ryū, I actually used fmt to write values to strings, as it provides similar performance and its API is much more in line with the rest of the LightGBM codebase. For string->double parsing I am using this library: https://github.com/lemire/fast_double_parser, which provides maximal floating-point string parsing resolution and matches the standard library's parsing output bit by bit.

Hence, transforming data to strings and string arrays should now be faster, and parsing strings to doubles with StringToArray should now be around 10x faster, according to their benchmarks.
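For reference, the read side boils down to calls of roughly this shape (a sketch following fast_double_parser's documented API, not the exact LightGBM change):

    #include <cstdio>

    #include "fast_double_parser.h"  // https://github.com/lemire/fast_double_parser

    int main() {
      const char* text = "1.2345678901234567e-05";
      double value = 0.0;
      // Returns a pointer just past the parsed number, or nullptr on failure;
      // the result is the correctly rounded double, independent of the process locale.
      const char* end = fast_double_parser::parse_number(text, &value);
      if (end == nullptr) {
        std::fprintf(stderr, "not a valid number\n");
        return 1;
      }
      std::printf("%.17g\n", value);
      return 0;
    }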

I'd like to streamline the code further by merging Common's __StringToTHelperFast and __StringToTHelper; however, Common::Atof seems to diverge from the standard library's stringstreams and from fast_double_parser in resolution. Those two are equivalent to each other and provide the maximal parsing resolution.
Common::Atof's string->double parsing diverges around the last 2 decimal places. See some examples:

[screenshot: Common::Atof vs stringstream/fast_double_parser parsing examples]

And it seems to have less resolution:
[screenshot: resolution comparison]

As such, I had to leave __StringToTHelperFast and __StringToTHelper separate to maintain the current LightGBM behaviour without affecting the read/write floating-point resolution.

@StrikerRUS
Collaborator

Common::Atof string->double parsing diverges around the 2 last decimal places. See some examples:

Looks like the difference is in the 18th+ digit in all examples. And according to IEEE 754, a double carries only 15-17 meaningful significant digits. So, I believe it is OK to have inconsistent digits after the 17th place.
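A quick standalone check of that (not project code): a double guarantees 15 exact decimal digits, and 17 digits are always enough to round-trip one through text, so anything printed beyond the 17th significant digit is a conversion artifact:

    #include <cstdio>
    #include <limits>

    int main() {
      // digits10 == 15: digits always preserved; max_digits10 == 17: digits
      // needed to round-trip any double through a decimal string.
      std::printf("digits10=%d max_digits10=%d\n",
                  std::numeric_limits<double>::digits10,
                  std::numeric_limits<double>::max_digits10);
      const double x = 0.1234567890123456789;  // more digits than a double can hold
      std::printf("%.17g\n", x);  // the meaningful digits
      std::printf("%.25g\n", x);  // digits past the 17th are artifacts
      return 0;
    }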

@AlbertoEAF
Contributor Author

Hmm, that's interesting; in that case we could replace Atof and further simplify the codebase :)

Thanks Nikita! I'll run some tests tomorrow and check that the scores don't change.

@guolinke
Collaborator

I remember that Atof is used for the precision-insensitive conversions (it is faster), like split_gain.
If the new solution is faster and more precise, we can switch to it.

@StrikerRUS
Collaborator

Fixed via #3405.
Thanks a lot @AlbertoEAF !

@AlbertoEAF
Contributor Author

Thanks a lot for all the help @StrikerRUS @jameslamb ;)

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023