Precise text file parsing #4081

cyfdecyf · 2021-03-18T12:47:14Z

When comparing prediction results from the command-line version and the Python API, I noticed that some prediction values differ starting at the 5th non-zero digit. I suspect the difference is caused by the different floating-point parsing algorithms used in LightGBM and pandas. (For reference, I used pandas.read_csv(fname, float_precision="round_trip") to load the CSV file in my Python code.)

So I added precise text file parsing with fast_double_parser and the result confirms my guess.
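A minimal sketch of how two parsers can disagree in the low-order bits. `NaiveAtof`, `PreciseAtof` and `Compare` are hypothetical names written for this illustration; `NaiveAtof` is a simplified digit-accumulating parser and is not LightGBM's `Common::Atof`. The reference is the C library's `strtod`, which is assumed here to round correctly (glibc's does).

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical simplified parser, for illustration only. It handles unsigned
// "digits[.digits]" input by accumulating digits in floating point. The
// division and the final addition may each round, so the result can be off
// by a few units in the last place.
inline double NaiveAtof(const char* p) {
  assert(p != nullptr);
  double value = 0.0;
  for (; *p >= '0' && *p <= '9'; ++p) {
    value = value * 10.0 + (*p - '0');
  }
  if (*p == '.') {
    ++p;
    double frac = 0.0;
    double scale = 1.0;
    for (; *p >= '0' && *p <= '9'; ++p) {
      frac = frac * 10.0 + (*p - '0');
      scale *= 10.0;
    }
    value += frac / scale;
  }
  return value;
}

// Reference parse through the C library.
inline double PreciseAtof(const char* p) {
  assert(p != nullptr);
  return std::strtod(p, nullptr);
}

// Prints both results with 17 significant digits, enough to expose a
// difference in the last bits of a double.
inline void Compare(const char* text) {
  const double a = NaiveAtof(text);
  const double b = PreciseAtof(text);
  std::printf("%-24s naive=%.17g precise=%.17g %s\n", text, a, b,
              a == b ? "same" : "DIFFERENT");
}
```

Calling `Compare` on the values of a real CSV column shows which inputs are affected; whether a given token differs depends on its digits, so short values often parse identically while long fractional values are more likely to diverge.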

This patch also contains a simple benchmark which shows Common::Atof is much faster than using fast_double_parser.

cyfdecyf · 2021-03-18T23:09:18Z

@jameslamb The benchmark result doesn't show fast_double_parser being faster than Common::Atof. So this PR is mainly useful for people who want to verify problems caused by precision loss when parsing floating-point numbers.

shiyu1994

Thanks for your contribution. Just two comments that address my concerns.

shiyu1994 · 2021-03-24T08:37:27Z

include/LightGBM/utils/common.h

@@ -330,6 +331,26 @@ inline static const char* Atof(const char* p, double* out) {
  return p;
 }

+// Use fast_double_parse and strtod (if parse failed) to parse double.
+inline static const char* AtofPrecise(const char* p, double* out) {


Does that mean we can replace

LightGBM/include/LightGBM/utils/common.h

Lines 1079 to 1102 in 1d2f3e1

    T operator()(const std::string& str) const {
      double tmp;

      // Fast (common) path: For numeric inputs in RFC 7159 format:
      const bool fast_parse_succeeded = fast_double_parser::parse_number(str.c_str(), &tmp);

      // Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc.
      if (!fast_parse_succeeded) {
        std::string strlower(str);
        std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });
        if (strlower == std::string("inf"))
          tmp = std::numeric_limits<double>::infinity();
        else if (strlower == std::string("-inf"))
          tmp = -std::numeric_limits<double>::infinity();
        else if (strlower == std::string("nan"))
          tmp = std::numeric_limits<double>::quiet_NaN();
        else if (strlower == std::string("-nan"))
          tmp = -std::numeric_limits<double>::quiet_NaN();
        else
          Log::Fatal("Failed to parse double: %s", str.c_str());
      }

      return static_cast<T>(tmp);
    }

which was added in #3942 with a call to this AtofPrecise? Do they behave exactly the same?

I'm not an expert on floating-point numbers, so I wrote a test program: https://gist.github.com/cyfdecyf/63f4e7339bbe5a5a23474fda66375742

The only difference is that strtod would not return negative NaN. So the Atof function in this gist handles this special case.

Please take a look at the gist and check whether there's any problem. I'll update Common::Atof and replace the change added in #3942.

Refer to: 1 2

The special handling for -NaN in the gist (revision 5) has one problem though: it handles input like " -nan" (note the leading space) incorrectly. But I'm wondering whether this special case needs handling at all. From this mailing list thread, it seems that different C libraries treat parsing "-nan" differently.

I suggest leaving Common::AtofPrecise as is, without adding special handling for "-nan" and the like.

Code introduced in #3942 is replaced by AtofPrecise in commit 498090d.

Great! LGTM.

shiyu1994 · 2021-03-24T08:42:55Z

tests/benchmark/parser/CMakeLists.txt

@@ -0,0 +1,18 @@
+cmake_minimum_required(VERSION 3.0)


I'm not sure whether it is appropriate to add these benchmark scripts to the master branch. Why do you think they are necessary?

This is not necessary to merge into the master branch. I'm just curious about the actual performance of Common::Atof and fast_double_parser.

And maybe this benchmark can be helpful for people who want to improve the performance of text parser.

I can remove this commit if you decide to not include it.

The benchmark commit is now reverted. It's better to create microbenchmarks for this.

StrikerRUS · 2021-03-26T13:23:33Z

CMakeLists.txt

@@ -1,3 +1,4 @@
+OPTION(USE_PRECISE_TEXT_PARSER "Use precise double parser for text input file" OFF)


Why do we need a new compilation option for just one function? Why not simply use AtofPrecise instead of Atof by default?
Having a lot of functions doing the same work is very confusing, greatly increases the maintenance burden, and hurts the overall development process.

If AtofPrecise were faster (or not much slower) than Atof, I'd like to use it by default. But it's actually much slower in my simple benchmark.

When I saw fast_double_parser mentioned in the commit log, I thought it was used for text file parsing too. It actually isn't, which confused me at first. I guess someone ran a performance test and therefore decided not to use it for text parsing.

With this performance difference, I'd choose the precise version only when precision is required.

AtofPrecise does not behave exactly the same as Atof. For some of my test models, prediction results for CSV input can differ starting from the 5th non-zero digit.

As the comment in utils/common.h says, both StringToHelperFast and StringToHelper are kept to maintain bit-for-bit legacy LightGBM behavior for precision, so I guess you'd prefer to keep the old behavior by default.

I noticed this when working on "[python-package] Create Dataset from multiple data files" (#4089) and some follow-ups. Result verification showed this difference, which made me worry about correctness.

I've also been bitten by ClickHouse's choice of non-precise float parsing by default. They chose to sacrifice 1 or 2 bits of precision to keep good float parsing performance: ClickHouse/ClickHouse#1665 (3 years ago, not sure whether this is still true). In that case, I changed the float parsing function to the precise one and compiled the package myself.

OK, I see now that the main reason is that AtofPrecise is much slower than the current solution.
But I'm still strongly against adding a new compilation option for this, because of the maintenance burden and because not many users will compile the library on their own to get precise file parsing.
However, I believe that a new config param would be a good workaround for this situation, just like the recently added deterministic param (see #3494 and #3578). Users wouldn't have to re-compile the library every time they switch between "performance" and "accuracy" scenarios; they could do it at runtime.
Is it possible to add a new param for precise parsing? WDYT? Thanks!

Thanks for the feedback. I'll add a config parameter and remove the compile time option to do this.

BTW, the deterministic parameter is very useful for verifying results when changing code in LightGBM.

Now that we have the config parameter for precise float parsing, shall we remove this compile option?

shiyu1994 · 2021-04-14T05:24:06Z

include/LightGBM/utils/common.h

@@ -330,6 +331,26 @@ inline static const char* Atof(const char* p, double* out) {
  return p;
 }

+// Use fast_double_parse and strtod (if parse failed) to parse double.
+inline static const char* AtofPrecise(const char* p, double* out) {


Great! LGTM.

cyfdecyf · 2021-04-14T05:52:40Z

I've been busy with other features these days. I'll finish adding new option to enable precise float parsing soon.

shiyu1994 · 2021-04-14T05:57:06Z

Sorry, I didn't notice that @StrikerRUS's comment hadn't been addressed before approving this.

The compilation option should be changed into a config option.

cyfdecyf · 2021-04-15T02:12:47Z

I rebased this PR onto the latest master and force-pushed.

The latest commit adds a new option, precise_float_parser, to the dataset parameters.
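A rough sketch of what a runtime switch can look like compared with a compile-time `#ifdef`. Only the field name `precise_float_parser` comes from this thread; `ParserConfig`, `AtofFunc`, `FastAtof`, `PreciseAtof` and `SelectAtof` are hypothetical names for illustration and are not claimed to match the code in this PR.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for the library config; only the field name
// precise_float_parser is taken from the discussion.
struct ParserConfig {
  bool precise_float_parser = false;
};

// Shared signature: read a double at p, store it in *out, and return a
// pointer just past the consumed characters.
using AtofFunc = const char* (*)(const char* p, double* out);

// Hypothetical fast parser: unsigned "digits[.digits]" only.
inline const char* FastAtof(const char* p, double* out) {
  assert(p != nullptr && out != nullptr);
  double value = 0.0;
  for (; *p >= '0' && *p <= '9'; ++p) {
    value = value * 10.0 + (*p - '0');
  }
  if (*p == '.') {
    ++p;
    double frac = 0.0;
    double scale = 1.0;
    for (; *p >= '0' && *p <= '9'; ++p) {
      frac = frac * 10.0 + (*p - '0');
      scale *= 10.0;
    }
    value += frac / scale;
  }
  *out = value;
  return p;
}

// Hypothetical precise parser backed by the C library.
inline const char* PreciseAtof(const char* p, double* out) {
  assert(p != nullptr && out != nullptr);
  char* end = nullptr;
  *out = std::strtod(p, &end);
  return end;
}

// Chosen once per dataset load, so the per-token cost is a single indirect
// call and no recompilation is needed to change behavior.
inline AtofFunc SelectAtof(const ParserConfig& config) {
  return config.precise_float_parser ? &PreciseAtof : &FastAtof;
}
```

A caller would resolve the function once, for example `AtofFunc atof = SelectAtof(config);`, and then use `atof(cursor, &value)` inside the parsing loop.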

shiyu1994

Almost done. Just a question about keeping the compilation option for precise float parsing.

shiyu1994 · 2021-04-15T08:40:00Z

CMakeLists.txt

@@ -1,3 +1,4 @@
+OPTION(USE_PRECISE_TEXT_PARSER "Use precise double parser for text input file" OFF)


Now that we have the config parameter for precise float parsing, shall we remove this compile option?

StrikerRUS

LGTM except for one nit below, to keep all parameter descriptions in a consistent style.

As this parser is not used for model files but only for datasets, I believe there won't be any inconsistency issues with the default parser, right?
#3463 (comment)

LightGBM/include/LightGBM/utils/common.h

Lines 314 to 321 in d517ba1

    
    if (tmp_str == std::string("na") || tmp_str == std::string("nan") ||
        tmp_str == std::string("null")) {
      *out = NAN;
    } else if (tmp_str == std::string("inf") || tmp_str == std::string("infinity")) {
      *out = sign * 1e308;
    } else {
      Log::Fatal("Unknown token %s in data file", tmp_str.c_str());
    }

docs/Parameters.rst

include/LightGBM/config.h

cyfdecyf · 2021-04-26T23:20:44Z

@StrikerRUS Thank you for taking the time to review this PR. The added corner test cases did find one problem: errno was not set to 0 before calling strtod.
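A minimal sketch of that errno pitfall, assuming a hypothetical helper named `ParseDoubleChecked` that reports failure by returning `nullptr`. It is not the `AtofPrecise` from this PR; it only shows why `errno` has to be cleared before the call.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical helper, for illustration only. On success it stores the value
// in *out and returns a pointer just past the parsed text; on failure it
// returns nullptr and leaves *out untouched.
inline const char* ParseDoubleChecked(const char* p, double* out) {
  assert(p != nullptr && out != nullptr);
  char* end = nullptr;
  // strtod sets errno on a range error but never clears it, so a stale
  // ERANGE left by earlier code would be mistaken for a failure of this call.
  errno = 0;
  const double value = std::strtod(p, &end);
  if (end == p) {
    return nullptr;  // no characters consumed: not a number
  }
  if (errno == ERANGE) {
    return nullptr;  // range error reported by the C library
  }
  *out = value;
  return end;
}
```

Without the `errno = 0;` line, a valid token parsed right after an out-of-range one would be rejected, which is the kind of bug a corner-case test can surface.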

Regarding your question about loading model files, AtofPrecise is actually used in some places. For example, leaf_value_ parsing will call AtofPrecise.

The latest commit in the master branch does not handle na or infinity input either. I replaced that code with AtofPrecise, and the behavior should be the same for NaN and Inf parsing.

LightGBM/include/LightGBM/utils/common.h

Lines 1086 to 1099 in 5014f19

    
    if (!fast_parse_succeeded) {
      std::string strlower(str);
      std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });
      if (strlower == std::string("inf"))
        tmp = std::numeric_limits<double>::infinity();
      else if (strlower == std::string("-inf"))
        tmp = -std::numeric_limits<double>::infinity();
      else if (strlower == std::string("nan"))
        tmp = std::numeric_limits<double>::quiet_NaN();
      else if (strlower == std::string("-nan"))
        tmp = -std::numeric_limits<double>::quiet_NaN();
      else
        Log::Fatal("Failed to parse double: %s", str.c_str());
    }

cyfdecyf · 2021-04-26T23:28:55Z

BTW, why does model loading use both the precise and non-precise versions of the floating-point parsing function?

For example:

LightGBM/src/io/tree.cpp

Line 686 in 5014f19

    
           leaf_value_ = CommonC::StringToArray<double>(key_vals["leaf_value"], num_leaves_);

LightGBM/src/io/tree.cpp

Line 710 in 5014f19

    
           left_child_ = CommonC::StringToArrayFast<int>(key_vals["left_child"], num_leaves_ - 1);

Is this for keeping backward compatibility?

shiyu1994 · 2021-04-28T06:45:22Z

The second one is not doing floating point number parsing. It just parses an integer. Using StringToArray or StringToArrayFast should make no difference.
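A small sketch of why integer tokens are not affected, under the assumption that the fast path accumulates digits in a double. `AccumulateDigits` and `AgreesWithStrtod` are hypothetical names for illustration and do not describe how `StringToArrayFast<int>` is actually implemented.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical digit-accumulating parser for unsigned integer tokens, kept in
// a double the way a simple float parser would do it.
inline double AccumulateDigits(const char* p) {
  assert(p != nullptr);
  double value = 0.0;
  for (; *p >= '0' && *p <= '9'; ++p) {
    // Exact while the value stays below 2^53: every intermediate result is an
    // integer that a double represents without rounding.
    value = value * 10.0 + (*p - '0');
  }
  return value;
}

// True when the accumulating parser and strtod return the same value.
inline bool AgreesWithStrtod(const char* text) {
  return AccumulateDigits(text) == std::strtod(text, nullptr);
}
```

Rounding only enters once a fractional part or an exponent requires a division or scaling step, which is why the choice of parser matters for double fields but not for integer ones such as left_child.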

StrikerRUS · 2021-04-28T13:56:03Z

@shiyu1994

Using StringToArray or StringToArrayFast should make no difference.

I thought that StringToArrayFast is used in places where some information loss in rare cases is acceptable for the aim of speedup. For example,

LightGBM/src/io/tree.cpp

Lines 751 to 761 in d517ba1

    
    if (key_vals.count("internal_weight")) {
      internal_weight_ = CommonC::StringToArrayFast<double>(key_vals["internal_weight"], num_leaves_ - 1);
    } else {
      internal_weight_.resize(num_leaves_ - 1);
    }
    if (key_vals.count("leaf_weight")) {
      leaf_weight_ = CommonC::StringToArray<double>(key_vals["leaf_weight"], num_leaves_);
    } else {
      leaf_weight_.resize(num_leaves_);
    }

See also https://github.com/microsoft/LightGBM/pull/3938/files.

Was I wrong?

shiyu1994 · 2021-04-29T02:07:40Z

@StrikerRUS Sorry, I did not make it clear. I mean that because the value here is an integer, neither method will cause information loss.

BTW, can we merge this PR?

StrikerRUS · 2021-04-29T22:18:42Z

@shiyu1994

I mean that because the value here is an integer, neither method will cause information loss.

Thanks, got it! But for the linked case with the double type in both cases, is my intuition correct?

BTW, can we merge this PR?

I don't have any objections. I think we can merge if you don't have any comments for code changed after your previous review.

AlbertoEAF · 2021-04-30T13:51:16Z

Thanks, got it! But for the linked case with the double type in both cases, is my intuition correct?

Exactly, @StrikerRUS. We should be very careful about switching methods when parsing doubles, or we can end up with subtly different model scores just by upgrading LightGBM. Hence the decision, made when the big model read/write change was done, to keep the old bit-for-bit behaviour regardless of parsing speed when switching to fast_double_parser.

cyfdecyf · 2021-05-07T01:58:31Z

@StrikerRUS Sorry, I did not make it clear. I mean that because the value here is an integer, neither method will cause information loss.

I linked to the wrong example in the first place. In fact, I was planning to include an example like the one @StrikerRUS showed.

shiyu1994 · 2021-05-07T03:00:25Z

@StrikerRUS Yes, that's correct. @AlbertoEAF thanks for your explanation. @cyfdecyf That's OK. We can merge this PR now. Thanks for your contribution.

github-actions · 2023-08-23T22:38:48Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

cyfdecyf requested review from btrotta, chivee, guolinke, henry0312, huanzhang12, jameslamb, Laurae2, shiyu1994, StrikerRUS and wxchan as code owners March 18, 2021 12:47

jameslamb added awaiting review efficiency labels Mar 18, 2021

shiyu1994 reviewed Mar 24, 2021

StrikerRUS reviewed Mar 26, 2021

StrikerRUS added feature in progress and removed awaiting review efficiency labels Mar 26, 2021

cyfdecyf requested review from shiyu1994 and StrikerRUS March 30, 2021 00:22

shiyu1994 previously approved these changes Apr 14, 2021

shiyu1994 self-requested a review April 14, 2021 05:56

cyfdecyf force-pushed the precise-text-parse branch from 498090d to 9db490a Compare April 15, 2021 02:08

cyfdecyf force-pushed the precise-text-parse branch from 9db490a to 724872b Compare April 15, 2021 06:21

shiyu1994 reviewed Apr 15, 2021

cyfdecyf added 13 commits April 26, 2021 17:15

Fix typo in open result error message.

0453f8e

Revert "Fix lint complaint."

9416169

This reverts commit 92ab0b6.

Revert "Add benchmark for CSVParser with Atof and AtofPrecise."

b1d2fd4

This reverts commit 4f8639a.

Use AtofPrecise in Common::__StringToTHelper.

494cdc3

[option] precise_float_parser: precise float number parsing for text …

e1b0e92

…input.

Remove USE_PRECISE_TEXT_PARSER compile option.

bec4479

test: add test for Common::AtofPrecise.

cc71045

test: remove ChunkedArrayTest with 0 length.

9d6fb02

This triggers Log::Fatal which aborts the test program.

fix lint, add copyright.

d20a65c

Revert "test: remove ChunkedArrayTest with 0 length."

1fb062d

This reverts commit 346c76a.

Use LightGBM::Common::Sign

95d0874

save precise_float_parser in model file.

9c30899

Fix error checking in AtofPrecise. Add more test cases.

4a658e0

cyfdecyf force-pushed the precise-text-parse branch from a60d94f to 4a658e0 Compare April 26, 2021 09:16

Remove test case that can't pass under macOS.

f789cfe

StrikerRUS approved these changes Apr 26, 2021

Apply suggestions from code review

131f8ae

Co-authored-by: Nikita Titov <[email protected]>

shiyu1994 merged commit f831808 into microsoft:master May 7, 2021

cyfdecyf deleted the precise-text-parse branch May 10, 2021 01:21

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
