
Error loading model file produced on different version/platform #3137

Closed
yalwan-iqvia opened this issue Jun 1, 2020 · 18 comments

@yalwan-iqvia

yalwan-iqvia commented Jun 1, 2020

Loading a LightGBM model produced with Python/LightGBM 2.2.3 on Linux into LightGBM 2.3.1 via the Julia FFI on Windows fails.

Specifically, this error is encountered while trying to load a file for use in the test suite of the Julia wrapper.

How are you using LightGBM?

Julia FFI wrapper to the C library

LightGBM component:

Environment info

Operating System: Windows

CPU/GPU model: CPU

LightGBM version or commit hash: 2.3.1

Error message and / or logs

https://github.com/IQVIA-ML/LightGBM.jl/pull/52/checks?check_run_id=727774501#step:7:110
Pastebin: https://pastebin.com/H2cucgHA

Reproducible example(s)

Please use the model from here: https://github.com/IQVIA-ML/LightGBM.jl/blob/132d9eaebb6fba44f1cbc377ab0a00d4ac0d3244/test/ffi/data/gain_test_booster
Pastebin: https://pastebin.com/CQsDdR0P

This model was produced using Python LightGBM 2.2.3 on Linux.

Steps to reproduce

  1. Load the model on Windows using LightGBM 2.3.1
@StrikerRUS
Collaborator

LightGBM 2.3.1 uses version 3 of the model file, while 2.2.3 produces version 2. However, if I'm not mistaken, the new version of LightGBM is able to load old models: #2269 (comment).

@yalwan-iqvia Do you manually modify the text model before loading it back?
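One quick check when debugging a mismatch like this is to read the format version straight out of the model file's header. A minimal sketch, assuming the `version=` header line of LightGBM's text model format (`model_file_version` is a hypothetical helper, not a LightGBM API):

```python
# Sketch: read the format version from a LightGBM text model's header.
# Assumes the file contains a "version=" line near the top, as LightGBM's
# text format does (v2 for 2.2.x, v3 for 2.3.x and later).
def model_file_version(path, header_lines=20):
    with open(path, encoding="utf-8") as f:
        for _, line in zip(range(header_lines), f):
            if line.startswith("version="):
                return line.strip().split("=", 1)[1]
    return None

# Example with a synthetic header:
with open("model.txt", "w", encoding="utf-8") as f:
    f.write("tree\nversion=v3\nnum_class=1\n")
print(model_file_version("model.txt"))  # -> v3
```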

@yalwan-iqvia
Author

yalwan-iqvia commented Jun 1, 2020

I thought this might be related, so I loaded and re-saved the model (on Linux) to get a V3 model file. The same error occurred.

Then I tried truncating the number of trees in the boosted model (because I wondered whether that would be related), and the same error occurred.

Loaded and re-saved model: https://github.com/IQVIA-ML/LightGBM.jl/blob/afc7cc18e9a69b4a47ed52902cf50960ec2c8719/test/ffi/data/gain_test_booster
Error: https://github.com/IQVIA-ML/LightGBM.jl/runs/727839963#step:7:110

Truncated-boosters model: https://github.com/IQVIA-ML/LightGBM.jl/blob/d98c963b3e99357e8c014c543d12808b3de60b25/test/ffi/data/gain_test_booster
Error: https://github.com/IQVIA-ML/LightGBM.jl/runs/727873809#step:7:110

On each occasion you can see the error differs. I wonder whether it has something to do with the length of the model data, but that would be weird.

Happy to run whatever experiments would help troubleshoot, but you can see from those status checks that it's working well on Linux/Mac. The binary is the one obtained from https://github.com/microsoft/LightGBM/releases/download/v2.3.1/lib_lightgbm.dll, in case that is relevant.

@yalwan-iqvia
Author

I can add that testing this locally on a Windows machine worked OK, so it seems to be an issue with the status checks system (Docker image?), but I don't know what the difference is.

@yalwan-iqvia
Author

Each time I have truncated the model (to a point before the "met ..., expected Tree" failure), it produces a new error earlier in the model parsing; see for example:

https://github.com/IQVIA-ML/LightGBM.jl/blob/98800ff14f77c7ad198c4ab7a342845ba7acfa16/test/ffi/data/gain_test_booster

https://github.com/IQVIA-ML/LightGBM.jl/pull/52/checks?check_run_id=730609619#step:7:111

I've also tried reading the model into memory first and then using LGBM_BoosterLoadModelFromString, and it still fails.

@StrikerRUS
Collaborator

StrikerRUS commented Jun 3, 2020

@yalwan-iqvia I can confirm that the error happens on my local Windows machine. However, if I remove the tree_sizes= line, the model from your original comment can be loaded successfully.

import requests

import lightgbm as lgb  # version 2.3.1

url = r"https://raw.githubusercontent.com/IQVIA-ML/LightGBM.jl/132d9eaebb6fba44f1cbc377ab0a00d4ac0d3244/test/ffi/data/gain_test_booster"

# Download the original model file
r = requests.get(url)
with open('model.txt', 'w', encoding='utf-8') as file:
    file.write(r.text)

# Rewrite the file without the tree_sizes= line
with open('model.txt', encoding='utf-8') as file:
    model_text = file.readlines()
with open('model.txt', 'w', encoding='utf-8') as file:
    for line in model_text:
        if not line.startswith("tree_sizes="):
            file.write(line)

# The stripped model loads and predicts successfully
model = lgb.Booster(model_file="model.txt")
model.predict([[1, 2, 3, 4, 5]])

>>> array([0.37117142])

I guess that the original model file was modified and tree_sizes are incorrect now.
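The same workaround can be done on raw bytes, which sidesteps any newline translation and leaves the rest of the file byte-identical. A minimal sketch (`strip_tree_sizes` is a hypothetical helper name, not a LightGBM API):

```python
# Hypothetical helper: drop the tree_sizes= line from a model file in
# binary mode, so Python cannot translate "\n" and the remaining bytes
# are copied unchanged.
def strip_tree_sizes(src, dst):
    with open(src, "rb") as f:
        lines = f.readlines()  # binary readlines keeps the b"\n" endings
    with open(dst, "wb") as f:
        for line in lines:
            if not line.startswith(b"tree_sizes="):
                f.write(line)

# Usage sketch with a synthetic model file:
with open("model_in.txt", "wb") as f:
    f.write(b"tree\nversion=v3\ntree_sizes=123 456\nTree=0\n")
strip_tree_sizes("model_in.txt", "model_out.txt")
```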

@yalwan-iqvia
Author

@StrikerRUS I only ever produced these files by calling booster.save_model (with the num_iterations arg to truncate) or LGBM_BoosterSaveModel. As you can see from the tests on that branch, Unix and Mac pass, and it definitely worked for me locally too, so I still think there might be some failure in the loading logic which is only triggered in a platform-dependent way. If the model loads correctly without that tree_sizes field at all, maybe files shouldn't write it (especially if it's apparently incorrect sometimes?)

The tip worked (thank you!) and allowed our development branch to pass on the CI server, so I'm happy to accept it as a workaround, but perhaps the underlying issue still needs to be considered by the LightGBM team -- I leave that decision to you guys.

@StrikerRUS
Collaborator

@guolinke Is it possible that calculated tree sizes on one platform are incorrect on another?

@guolinke
Collaborator

guolinke commented Jun 6, 2020

I think they should be identical across platforms.
The only possible difference is the newline symbol, but we force \n as the newline on all platforms.
@yalwan-iqvia did you manually save/re-save the model file? And can you check the newline symbols?
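Checking the newline symbols, as suggested, can be done by scanning the raw bytes, since opening the file in text mode would silently normalize them. A minimal sketch (`count_newlines` is a hypothetical helper):

```python
# Sketch: count CRLF vs bare-LF line endings by inspecting raw bytes.
def count_newlines(path):
    data = open(path, "rb").read()
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # "\n" not preceded by "\r"
    return crlf, lf_only

# Example: a file with one Windows and one Unix newline
with open("check.txt", "wb") as f:
    f.write(b"a\r\nb\n")
print(count_newlines("check.txt"))  # -> (1, 1)
```

A model file saved by LightGBM itself should report zero CRLF endings on any platform.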

@guolinke
Collaborator

guolinke commented Jun 6, 2020

@StrikerRUS for your example (#3137 (comment)), I think it may break the newline symbols across platforms.

@StrikerRUS
Collaborator

@guolinke The readlines() method leaves the newline character \n at the end of each string. Or do you mean that there can be an issue while writing the lines back, because Python automatically translates \n to the platform-specific newline character?

@guolinke
Collaborator

guolinke commented Jun 6, 2020

@StrikerRUS I am not a Python expert, but I guess it could be.

@guolinke
Collaborator

@StrikerRUS can you confirm whether the \n is the problem or not?

@StrikerRUS
Collaborator

@guolinke I'll try to get a Linux machine in the next few days and reproduce the issue. But TBH, I don't think that \n is the root cause.

@yalwan-iqvia
Author

> I think they should be identical in different platforms.
> The only possible difference is the new line symbol. But we force to use \n for newline for all platforms.
> @yalwan-iqvia did you manually save/re-save the model file? And can you check the new-line symbols?

@guolinke sorry I took this long to reply.

I can say that I tried various things, including using the Julia wrapper to the C API to re-save the model, and it did not change the result. I initially produced the model using Python only for convenience, and I tried a lot of things to make the issue go away (including saving with truncation by passing a lower num_iterations than was used for training, not a manual truncation) -- but @StrikerRUS's suggestion to remove the tree_sizes field was the only thing which worked on that system.

@StrikerRUS
Collaborator

StrikerRUS commented Jul 11, 2020

Finally got access to a Linux machine. I can confirm that the model https://raw.githubusercontent.com/IQVIA-ML/LightGBM.jl/132d9eaebb6fba44f1cbc377ab0a00d4ac0d3244/test/ffi/data/gain_test_booster can be loaded back successfully with both the 2.2.3 and 2.3.1 versions on a Linux machine.

However, I cannot reproduce the issue with random data and model. I trained and saved a model on Linux with 2.2.3 version and was able to load it on Windows with 2.3.1 version successfully.

from sklearn.datasets import load_digits

import lightgbm as lgb

# Two-class digits subset, keeping 4 features to match the original model
X, y = load_digits(n_class=2, return_X_y=True)
X = X[:, :4]
est = lgb.LGBMClassifier().fit(X, y)
est.booster_.save_model('model-linux.txt')

IDK, maybe that original model suffers from some edge case of tree size calculation that differs between Linux and Windows. Or maybe the file is just corrupted somehow.

@guolinke
Collaborator

guolinke commented Aug 5, 2020

@yalwan-iqvia
Does the problem still exist? From our tests, we think the model file should work fine across platforms.
If your model is re-saved by other processes, such as the Julia wrapper you mentioned, you had better remove the tree_sizes field.

@guolinke guolinke closed this as completed Aug 5, 2020
@yalwan-iqvia
Author

> If it loads correctly without that tree_sizes field at all maybe files shouldn't write them (especially if they're apparently incorrect sometimes?)

My take is that if this field is apparently incorrect on occasion, and the workaround is to remove the tree_sizes field, then perhaps the saving code should not write it at all, which might be an acceptable fix for the issue.

I don't anticipate regularly needing to produce a file on one system and consume it on another, so this doesn't affect me personally; I just needed tests to pass. But it does look like a potential problem for users who do cross-platform production/consumption.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023