Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing of non-finite values #3942

Merged
merged 20 commits into from
Mar 16, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 13 additions & 5 deletions include/LightGBM/utils/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -1082,12 +1082,20 @@ struct __StringToTHelper<T, true> {
// Fast (common) path: For numeric inputs in RFC 7159 format:
const bool fast_parse_succeeded = fast_double_parser::parse_number(str.c_str(), &tmp);

// Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc. Fallback to standard library:
// Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc.
if (!fast_parse_succeeded) {
std::stringstream ss;
Common::C_stringstream(ss);
ss << str;
ss >> tmp;
Comment on lines -1087 to -1090
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this code on the else branch instead of raising a fatal error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that code silently fails, which is completely unacceptable.

In reality we are only parsing strings generated by LightGBM itself when the model was written out to file, so we should ensure there are robust round-trip tests which include models that contain inf and nan values. There are no other possible non-finite values defined for IEEE 754 floating point numbers, and if at some point in the future the standard changed, this would be quickly picked up by the round-trip tests and easily addressed.

I am shocked that such tests don't already exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant doing number parsing after the nan and inf checks. Your logic makes sense regarding IEEE-754, although I wouldn't recommend dropping that parsing at the end without adding said tests first, not 100.0% sure fast_double_parser parses all numbers.

std::string strlower(str);
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });
Copy link
Contributor

@AlbertoEAF AlbertoEAF Feb 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great clean code @mjmckp ;)

Instead of allocating a string, and since you already have a lambda, what about defining a case-insensitive comparison lambda and use std::equal to check the "inf" and "nan" values below?

Although this is the rare branch there might be longer strings than inf or nan which might be parsed here and might slow down our parsing without need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this hardly seems worth it, this branch is rarely invoked, meanwhile a colossal amount of strings are being allocated in splitting and parsing the input file, so these few extra allocations are a drop in the ocean.

I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Agreed. Will you add such tests?
I actually run such tests but on an external lgbm provider and didn't have nan nor inf on my model, but would prefer to see them in lgbm's CI so any breakage is detected immediately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could assist in adding the tests, any idea where these should go and how best to implement them?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjmckp #3555 has been just merged. We will really appreciate new GTest-compatible tests in this or in a follow-up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, would you mind pointing me towards a similar kind of test that I can use as a starting point please?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we don't have any tests yet. This is something that we should concentrate on in the near future. For now, I think you can take a look at tests from @AlbertoEAF in #3997.
https://github.com/microsoft/LightGBM/pull/3997/files#diff-c363eba6eda99d9e560f8341a1fc8fe02e885d2256db2482e1c543430a25666d

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjmckp I've merged this PR with the aim to not delay the upcoming release. Please feel free to add tests in a new PR. We'll be very grateful! And thanks a lot for the bug fix!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StrikerRUS Thanks a lot, I'll get up to speed on how the new tests work and add some tests for this in a new PR soon.

if (strlower == std::string("inf"))
tmp = std::numeric_limits<double>::infinity();
else if (strlower == std::string("-inf"))
tmp = -std::numeric_limits<double>::infinity();
else if (strlower == std::string("nan"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Missing -nan handling.

Probably best to halve the string comparisons by parsing first the "-" sign.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks, I'll add -nan. I don't think it's worth obfuscating this rarely executed branch with optimisations until profiling shows it is a bottleneck.

tmp = std::numeric_limits<double>::quiet_NaN();
else if (strlower == std::string("-nan"))
tmp = -std::numeric_limits<double>::quiet_NaN();
else
Log::Fatal("Failed to parse double: %s", str.c_str());
}

return static_cast<T>(tmp);
Expand Down