
Non-reproducible XGBoost ranking #497

Open
pltb opened this issue Jul 5, 2024 · 13 comments

Comments

@pltb

pltb commented Jul 5, 2024

We've noticed some unexpected XGBoost model behavior when comparing the scoring in the ES+LTR runtime with what the original inference setup (Python + pandas + XGBoost) gave us.

The likely cause of the discrepancies is this: since every feature the model receives is a function score, the model gets float32-typed features in the ES LTR runtime (i.e. roughly 7 significant decimal digits), but our model's split thresholds are numbers with 9 or more decimal digits.
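For illustration, here's the kind of precision loss we mean (the value below is made up, not an actual feature from our model):

```python
import numpy as np

# A feature value computed in double precision (made-up number for illustration)
feature = 0.123456789012345
# What the ES LTR runtime effectively hands to the model: a 32-bit float
feature_f32 = np.float32(feature)

print(f"{feature:.17g}")             # the original double
print(f"{float(feature_f32):.17g}")  # agrees only to ~7 significant digits
```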

We think it is important to mention this explicitly in the plugin's documentation, because it is not obvious and may lead to unpredictable results.

For more context please see the following:

@pltb pltb changed the title from "Put a warning in the documentation about features always being 32-bit floats" to "Put a warning in the documentation about features always being rounded to 32-bit float" on Jul 5, 2024
@renekrie
Member

renekrie commented Jul 5, 2024

On which ES / Java version did you observe this behaviour? Just to make sure you're not on < Java 17 and observing the behaviour described here in the context of strictfp (which isn't used in Lucene/ES/LTR and might therefore cause floating-point differences between JDK instances): https://en.wikipedia.org/wiki/Strictfp

@pltb
Author

pltb commented Jul 8, 2024

Hi @renekrie, thanks for your reply =)
We're on ES 8.8.2 and JVM 20.0.1.

@pltb
Author

pltb commented Jul 12, 2024

I just noticed that the XGBoost example in the plugin's docs uses the get_dump method; however, the XGBoost docs state that it's primarily meant for visualization or interpretation (and that XGBoost won't even be able to load the model back from this dump):

I have a hunch that this get_dump method may be dumping the numbers as higher-precision decimals instead of float32 values... (which, given the LTR plugin's truncation to float32 during parsing, may produce an essentially different model when the plugin loads it).
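A rough way to compare the two export paths side by side (a throwaway model on synthetic data, just to see how the thresholds get serialized):

```python
import json
import numpy as np
import xgboost as xgb

# Train a tiny throwaway model purely to compare the two export formats
X = np.random.rand(200, 3).astype(np.float32)
y = np.random.rand(200)
booster = xgb.train({"max_depth": 2}, xgb.DMatrix(X, label=y), num_boost_round=1)

# get_dump(): the format the plugin docs currently show; split_condition values
# come out as decimal strings
print(booster.get_dump(dump_format="json")[0])

# save_model(): the format the XGBoost docs recommend for later inference;
# the thresholds end up in the trees' split_conditions arrays
booster.save_model("model.json")
with open("model.json") as f:
    print(json.dumps(json.load(f), indent=2)[:1000])
```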

@pltb
Author

pltb commented Jul 12, 2024

I'm wondering if it's possible for the LTR plugin to load the model from the output of this xgboost method: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.save_model

@pltb
Author

pltb commented Jul 12, 2024

Okay, I've confirmed my theory =) Here's a complete summary of the issue as we currently observe it:

The model training code (as well as the plugin's docs) uses the get_dump XGBoost method to save the trained model; however, the XGBoost docs state that it's primarily meant for visualization or interpretation (and XGBoost won't even be able to load the model back from this dump). The docs say that save_model should be used instead when dumping models for later inference.

I've checked how the model's split thresholds (the split_condition(s) field) are represented in the resulting files in both cases; here's an example:

  • save_model produces 7.82353E-2 (the actual threshold from the model)
  • get_dump produces 0.0782352963 (the "human-readable" number with extra digits)

As you can see, the number from the get_dump method has more significant decimal digits than the one from save_model, and the same holds for the rest of the model. This kind of difference may lead to an essentially different model being loaded by the LTR plugin (even though it parses the numbers as 32-bit floats), and this may well be the reason we were seeing different scores. Those extra digits at the end simply shouldn't be there; the original model doesn't have them (which is logical, because XGBoost operates on float32).
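A quick way to check whether the two spellings still map to the same 32-bit value (parsing with numpy here, which isn't exactly what the plugin does, but good enough for a sanity check):

```python
import numpy as np

# The two representations of the same threshold quoted above
from_save_model = np.float32(7.82353e-2)
from_get_dump = np.float32(0.0782352963)

# If these two differ, the LTR plugin ends up with a split point that is not
# the one XGBoost actually trained with
print(from_save_model, from_get_dump, from_save_model == from_get_dump)
```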

If we want the scores to be reproducible, we should ensure that the model we load into ES does not contain the extra digits we see in get_dump's output.

The problem now is that the output of the "proper" save_model method can't be read by the LTR plugin.

@pltb pltb changed the title from "Put a warning in the documentation about features always being rounded to 32-bit float" to "Non-reproducible XGBoost ranking" on Jul 12, 2024
@pltb
Author

pltb commented Jul 23, 2024

Hey @renekrie, I've made a draft of an alternative parser that would accept the output of save_model().

#500

I hope I will be able to put more time into this soon.

@pltb
Author

pltb commented Jul 25, 2024

We ran tests with our own model and the implementation I posted above; it appears to fix the score mismatches we've been seeing.

cc @wrigleyDan

@renekrie
Member

Thank you for your work and insights @pltb.

To be honest, I'm not aware of the format differences between get_dump and save_model apart from the float-related issue you described. Could the code in #500 also read a model that was produced using get_dump? If that were the case, all we'd need would be a minimal patch to the existing parser, which would make it easy for people to use either get_dump or save_model as they like (or as they are used to).

@pltb
Author

pltb commented Jul 30, 2024

Thank you for your response @renekrie!

Could the code in #500 also read a model that was produced using get_dump? If that were the case, all we'd need would be a minimal patch to the existing parser, which would make it easy for people to use either get_dump or save_model as they like (or as they are used to).

Unfortunately it cannot, because the structure of the JSON is entirely different (it's flat; you can see the JSON schema here).
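Roughly (field names abridged, illustrative values; see the linked schema for the authoritative structure), the two shapes look like this:

```python
# get_dump(dump_format="json"): one nested node object per tree
get_dump_tree = {
    "nodeid": 0, "split": "f0", "split_condition": 0.0782352963,
    "yes": 1, "no": 2,
    "children": [{"nodeid": 1, "leaf": 0.1}, {"nodeid": 2, "leaf": -0.1}],
}

# save_model(): flat, parallel arrays per tree (abridged)
save_model_tree = {
    "split_indices": [0, 0, 0],
    "split_conditions": [7.82353e-2, 0.0, 0.0],
    "left_children": [1, -1, -1],
    "right_children": [2, -1, -1],
}
```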

I wish we could somehow reuse/adapt the original parser, but unfortunately it seems that introducing a different parser (and exposing it via a separate content type in the API) is the easier way to address the issue.

@pltb
Author

pltb commented Jul 30, 2024

@renekrie I've cleaned the code up a bit; now it should be clearer why the parser can't simply be reused.
I think the result I've arrived at would benefit from more tests and from someone with more expertise than me taking a look, but it already looks more or less fine to me.

@renekrie
Member

Thank you @pltb! I will have a look and then we'll need to figure out how we deal with it. I think save_model should be the default, and we'd need a plan for how to introduce that as a breaking change.

@pltb
Author

pltb commented Jul 30, 2024

@renekrie thank you!

@pltb
Author

pltb commented Aug 1, 2024

then we'll need to figure out how we deal with it. I think save_model should be the default, and we'd need a plan for how to introduce that as a breaking change

@renekrie let me know if you'll need my help with that 👍
