Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precise text file parsing #4081

Merged
merged 18 commits into from
May 7, 2021
Merged

Conversation

cyfdecyf
Copy link
Contributor

When comparing prediction result using command line version and Python API, I noticed some prediction values differs starting at the 5th non-zero digits. I suspect the difference is caused by differnet float number parsing algorithms used in LightGBM and pandas. (For reference, I used pandas.read_csv(fname, float_precision="round_trip") to load csv file in my Python code.)

So I added precise text file parsing with fast_double_parser and the result confirms my guess.

This patch also contains a simple benchmark which shows Common::Atof is much faster than using fast_double_parser.

@cyfdecyf
Copy link
Contributor Author

@jameslamb The benchmark result doesn't show fast_double_parser being faster than Comon::Atof. So this PR is more useful for people who wants to verify problems caused by precision lost of when parsing floating number.

Copy link
Collaborator

@shiyu1994 shiyu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution. Just two comments that address my concerns.

@@ -330,6 +331,26 @@ inline static const char* Atof(const char* p, double* out) {
return p;
}

// Use fast_double_parse and strtod (if parse failed) to parse double.
inline static const char* AtofPrecise(const char* p, double* out) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean we can replace

T operator()(const std::string& str) const {
double tmp;
// Fast (common) path: For numeric inputs in RFC 7159 format:
const bool fast_parse_succeeded = fast_double_parser::parse_number(str.c_str(), &tmp);
// Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc.
if (!fast_parse_succeeded) {
std::string strlower(str);
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });
if (strlower == std::string("inf"))
tmp = std::numeric_limits<double>::infinity();
else if (strlower == std::string("-inf"))
tmp = -std::numeric_limits<double>::infinity();
else if (strlower == std::string("nan"))
tmp = std::numeric_limits<double>::quiet_NaN();
else if (strlower == std::string("-nan"))
tmp = -std::numeric_limits<double>::quiet_NaN();
else
Log::Fatal("Failed to parse double: %s", str.c_str());
}
return static_cast<T>(tmp);
}

which was added in #3942 with a call to this AtofPrecise? Do they behave exactly the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert on floating number. So I write a test program https://gist.github.com/cyfdecyf/63f4e7339bbe5a5a23474fda66375742

The only difference is that strtod would not return negative NaN. So the Atof function in this gist handles this special case.

Please help take a look at the gist and check if there's any problem. I'll update Common::Atof and replace the change added in #3492.

Refer to: 1 2

Copy link
Contributor Author

@cyfdecyf cyfdecyf Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The special handling for -NaN in gist (revision 5) the has one problem though: it's incorrect to handle input like " -nan" (note beginning space). But I'm wondering if there's need for the special handling of this case? From this mailling list thread, it seems like different C library have different treatment for parsing "-nan".

I suggest just leave Common::AtofPrecise as is and don't add special handling for "-nan" and the like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code introduced in #3942 is replaced by AtofPrecise in commit 498090d.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! LGTM.

@@ -0,0 +1,18 @@
cmake_minimum_required(VERSION 3.0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether it is appropriate to add these benchmark scripts to the master branch. Why do you think they are necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary to merge into the master branch. I'm just curious about the actual performance of Common::Atof and fast_double_parser.

And maybe this benchmark can be helpful for people who want to improve the performance of text parser.

I can remove this commit if you decide to not include it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark commit is now reverted. It's better to create microbenchmarks for this.

CMakeLists.txt Outdated
@@ -1,3 +1,4 @@
OPTION(USE_PRECISE_TEXT_PARSER "Use precise double parser for text input file" OFF)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need new compilation option only for one function? Why not simply use AtofPrecise instead of Atof by default?
Having a lot of functions doing the same work is very confusing, greatly increases maintenance burden and hurts overall development process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If AtofPrecise is faster (or not much slower) than Atof, I'd like to use it by default. But it's actually much slower in my simple benchmark
    • When I see fast_double_parser mentioned in the commit log, I thought it's been used for text file parsing too. But it's actually not which confused me at first. I guess someone might have done the performance test and thus not using it for text parsing
    • With this performance difference, I'd choose precise version only when precision is required
  2. AtofPrecise does not behave exactly the same with Atof. For some of my test models, prediction results for csv input can differ starting from the 5th non-zero digit
    • As comment in utils/common.h says both StringToHelperFast and StringToHelper are kept to maintain bit-for-bit legacy LightGBM behavior for precision, I guess you'd prefer to keep the old behavior by default
    • I noticed this when working on [python-package] Create Dataset from multiple data files  #4089 and some followups. Result verification shows this difference which makes me upset about the correctness

I've also been biten by ClickHouse's choice of non-precise float parsing by default. They choose to sacrifice 1 or 2 bits of precision to keep good float parsing performance ClickHouse/ClickHouse#1665 (3 years ago, not sure whether still true now). For me, I choose to change the float parsing function to precise and the compile the package myself.

Copy link
Collaborator

@StrikerRUS StrikerRUS Mar 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see now that the main reason is that AtofPrecise is much slower than current solution.
But I'm still strongly against adding new compilation option for this because of maintenance burden and not many users will compile the library on their own to get precise file parsing.
However, I believe that new config param will be a good workaround for this situation. Just like recently added (see #3494 and #3578) deterministic param. Users don't have to re-compile the library every time they are switching "performance/accuracy" scenarios but do it in the runtime.
Is it possible to add new param for precise parsing? WDYT? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I'll add a config parameter and remove the compile time option to do this.

BTW, deterministic parameter is very useful for verifying result when change code in LightGBM.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have the config parameter for precise float parsing, shall we remove this compile option?

shiyu1994
shiyu1994 previously approved these changes Apr 14, 2021
@@ -330,6 +331,26 @@ inline static const char* Atof(const char* p, double* out) {
return p;
}

// Use fast_double_parse and strtod (if parse failed) to parse double.
inline static const char* AtofPrecise(const char* p, double* out) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! LGTM.

@cyfdecyf
Copy link
Contributor Author

I've been busy with other features these days. I'll finish adding new option to enable precise float parsing soon.

@shiyu1994 shiyu1994 self-requested a review April 14, 2021 05:56
@shiyu1994
Copy link
Collaborator

Sorry, I didn't noticed that @StrikerRUS 's comment hasn't been addressed before approving this.

@shiyu1994 shiyu1994 dismissed their stale review April 14, 2021 05:57

The compilation option should be changed into a config option.

@cyfdecyf
Copy link
Contributor Author

I rebased this PR to latest master and made a force push.

The latest commit adds new option precise_float_parser for dataset parameters.

Copy link
Collaborator

@shiyu1994 shiyu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost done. Just a question about keeping the compilation option for precise float parsing.

CMakeLists.txt Outdated
@@ -1,3 +1,4 @@
OPTION(USE_PRECISE_TEXT_PARSER "Use precise double parser for text input file" OFF)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have the config parameter for precise float parsing, shall we remove this compile option?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except one nit below to keep all params descriptions in consistent style.

As this parser is not used for model files but only for datasets, I believe there will be no any inconsistency issues with default parser, right?
#3463 (comment)

if (tmp_str == std::string("na") || tmp_str == std::string("nan") ||
tmp_str == std::string("null")) {
*out = NAN;
} else if (tmp_str == std::string("inf") || tmp_str == std::string("infinity")) {
*out = sign * 1e308;
} else {
Log::Fatal("Unknown token %s in data file", tmp_str.c_str());
}

docs/Parameters.rst Outdated Show resolved Hide resolved
include/LightGBM/config.h Outdated Show resolved Hide resolved
@cyfdecyf
Copy link
Contributor Author

cyfdecyf commented Apr 26, 2021

@StrikerRUS Thank you for taking time to review this PR. The added corner test cases indeed found one problem which is not setting errno to 0 before calling strtod.

Regarding your question about loading model files, AtofPrecise is actually used in some places. For example leaf_value_ parsing will call AtofPrecise.

The latest commit in master branch does not handle na, infinity input either. I replaced those code with AtofPrecise and the behavior should be the same on NaN and Inf parsing.

if (!fast_parse_succeeded) {
std::string strlower(str);
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });
if (strlower == std::string("inf"))
tmp = std::numeric_limits<double>::infinity();
else if (strlower == std::string("-inf"))
tmp = -std::numeric_limits<double>::infinity();
else if (strlower == std::string("nan"))
tmp = std::numeric_limits<double>::quiet_NaN();
else if (strlower == std::string("-nan"))
tmp = -std::numeric_limits<double>::quiet_NaN();
else
Log::Fatal("Failed to parse double: %s", str.c_str());
}

@cyfdecyf
Copy link
Contributor Author

BTW, why does model loading uses both precise and non-precise version of floating point number parsing function?

For example:

leaf_value_ = CommonC::StringToArray<double>(key_vals["leaf_value"], num_leaves_);

left_child_ = CommonC::StringToArrayFast<int>(key_vals["left_child"], num_leaves_ - 1);

Is this for keeping backward compatibility?

@shiyu1994
Copy link
Collaborator

shiyu1994 commented Apr 28, 2021

The second one is not doing floating point number parsing. It just parses an integer. Using StringToArray or StringToArrayFast should make no difference.

@StrikerRUS
Copy link
Collaborator

@shiyu1994

Using StringToArray or StringToArrayFast should make no difference.

I thought that StringToArrayFast is used in places where some information loss in rare cases is acceptable for the aim of speedup. For example,

LightGBM/src/io/tree.cpp

Lines 751 to 761 in d517ba1

if (key_vals.count("internal_weight")) {
internal_weight_ = CommonC::StringToArrayFast<double>(key_vals["internal_weight"], num_leaves_ - 1);
} else {
internal_weight_.resize(num_leaves_ - 1);
}
if (key_vals.count("leaf_weight")) {
leaf_weight_ = CommonC::StringToArray<double>(key_vals["leaf_weight"], num_leaves_);
} else {
leaf_weight_.resize(num_leaves_);
}

, https://github.com/microsoft/LightGBM/pull/3938/files.

Was I wrong?

@shiyu1994
Copy link
Collaborator

@StrikerRUS Sorry, I did not made it clear. I mean because here the value is an integer, so both methods won't cause information loss.

BTW, can we merge this PR?

@StrikerRUS
Copy link
Collaborator

@shiyu1994

I mean because here the value is an integer, so both methods won't cause information loss.

Thanks, got it! But for the linked case with the double type in both cases, is my intuition correct?

BTW, can we merge this PR?

I don't have any objections. I think we can merge if you don't have any comments for code changed after your previous review.

@AlbertoEAF
Copy link
Contributor

AlbertoEAF commented Apr 30, 2021

Thanks, got it! But for the linked case with the double type in both cases, is my intuition correct?

Exactly @StrikerRUS, and we should be very careful if we were to switch methods when parsing doubles or we can end up with subtly different model scores just by upgrading LightGBM. Hence the decision at the time when the big model read/write was done, to keep the old bit-for-bit behaviour, independently of the speed of the parsing when switching to the fast_double_parser.

@cyfdecyf
Copy link
Contributor Author

cyfdecyf commented May 7, 2021

@StrikerRUS Sorry, I did not made it clear. I mean because here the value is an integer, so both methods won't cause information loss.

I linked to the wrong example in the first place. In fact I was planing to include an example like @StrikerRUS has showed.

@shiyu1994
Copy link
Collaborator

@StrikerRUS Yes, that's correct. @AlbertoEAF thanks for your explanation. @cyfdecyf That's OK. We can merge this PR now. Thanks for your contribution.

@shiyu1994 shiyu1994 merged commit f831808 into microsoft:master May 7, 2021
@cyfdecyf cyfdecyf deleted the precise-text-parse branch May 10, 2021 01:21
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants