
Error loading model file produced on different version/platform #3137

Closed
yalwan-iqvia opened this issue Jun 1, 2020 · 18 comments

@yalwan-iqvia

yalwan-iqvia commented Jun 1, 2020

Loading a LightGBM model produced with Python/LightGBM 2.2.3 on Linux into LightGBM 2.3.1 via the Julia FFI on Windows fails.

Specifically, this error is encountered while trying to load a file for use in the test suite of the Julia wrapper.

How are you using LightGBM?

Julia FFI wrapper to the C library

LightGBM component:

Environment info

Operating System: Windows

CPU/GPU model: CPU

LightGBM version or commit hash: 2.3.1

Error message and / or logs

https://github.com/IQVIA-ML/LightGBM.jl/pull/52/checks?check_run_id=727774501#step:7:110
Pastebin: https://pastebin.com/H2cucgHA

Reproducible example(s)

Please use the model from here: https://github.com/IQVIA-ML/LightGBM.jl/blob/132d9eaebb6fba44f1cbc377ab0a00d4ac0d3244/test/ffi/data/gain_test_booster
Pastebin: https://pastebin.com/CQsDdR0P

This model was produced using Python LightGBM 2.2.3 on Linux.

Steps to reproduce

  1. Load the model on Windows using LightGBM 2.3.1
@StrikerRUS
Collaborator

LightGBM 2.3.1 uses version 3 of the model file, while 2.2.3 produces version 2. However, if I'm not mistaken, the new version of LightGBM is able to load old models: #2269 (comment).

@yalwan-iqvia Do you manually modify the text model before loading it back?
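One quick check when debugging a mismatch like this is to read the format version straight out of the model file's header. A minimal sketch, assuming the `version=` header line of LightGBM's text model format (`model_file_version` is a hypothetical helper, not a LightGBM API):

```python
# Sketch: read the format version from a LightGBM text model's header.
# Assumes the file contains a "version=" line near the top, as LightGBM's
# text format does (v2 for 2.2.x, v3 for 2.3.x and later).
def model_file_version(path, header_lines=20):
    with open(path, encoding="utf-8") as f:
        for _, line in zip(range(header_lines), f):
            if line.startswith("version="):
                return line.strip().split("=", 1)[1]
    return None

# Example with a synthetic header:
with open("model.txt", "w", encoding="utf-8") as f:
    f.write("tree\nversion=v3\nnum_class=1\n")
print(model_file_version("model.txt"))  # -> v3
```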

@yalwan-iqvia
Author

yalwan-iqvia commented Jun 1, 2020

I thought this might be related, so I loaded and re-saved the model (on Linux) to get a V3 model file. The same error occurred.

Then I tried truncating the number of trees in the boosted model (because I wondered whether that would be related), and the same error occurred.

Loaded and re-saved model: https://github.com/IQVIA-ML/LightGBM.jl/blob/afc7cc18e9a69b4a47ed52902cf50960ec2c8719/test/ffi/data/gain_test_booster
Error: https://github.com/IQVIA-ML/LightGBM.jl/runs/727839963#step:7:110

Truncated-boosters model: https://github.com/IQVIA-ML/LightGBM.jl/blob/d98c963b3e99357e8c014c543d12808b3de60b25/test/ffi/data/gain_test_booster
Error: https://github.com/IQVIA-ML/LightGBM.jl/runs/727873809#step:7:110

On each occasion you can see the error differs. I wonder whether it has something to do with the length of the model data, but that would be weird.

Happy to run whatever experiments would help troubleshoot, but you can see from those status checks that it's working well on Linux/Mac. The binary is the one obtained from https://github.com/microsoft/LightGBM/releases/download/v2.3.1/lib_lightgbm.dll, in case that is relevant.

@yalwan-iqvia
Author

I can add that testing this locally on a Windows machine worked OK, so it seems to be an issue with the status checks system (Docker image?), but I don't know what the difference is.

@yalwan-iqvia
Author

Each time I have truncated the model (to a point before the "met ..., expected Tree" failure), it produces a new error earlier in the model parsing; see for example:

https://github.com/IQVIA-ML/LightGBM.jl/blob/98800ff14f77c7ad198c4ab7a342845ba7acfa16/test/ffi/data/gain_test_booster

https://github.com/IQVIA-ML/LightGBM.jl/pull/52/checks?check_run_id=730609619#step:7:111

I've also tried reading the model into memory first and then using LGBM_BoosterLoadModelFromString, and it still fails.

@StrikerRUS
Collaborator

StrikerRUS commented Jun 3, 2020

@yalwan-iqvia I can confirm that the error happens on my local Windows machine. However, if I remove the tree_sizes= line, the model from your original comment can be loaded successfully.

import requests

import lightgbm as lgb  # version 2.3.1

url = r"https://raw.githubusercontent.com/IQVIA-ML/LightGBM.jl/132d9eaebb6fba44f1cbc377ab0a00d4ac0d3244/test/ffi/data/gain_test_booster"

# Download the original model file
r = requests.get(url)
with open('model.txt', 'w', encoding='utf-8') as file:
    file.write(r.text)

# Rewrite the file without the tree_sizes= line
with open('model.txt', encoding='utf-8') as file:
    model_text = file.readlines()
with open('model.txt', 'w', encoding='utf-8') as file:
    for line in model_text:
        if not line.startswith("tree_sizes="):
            file.write(line)

# The stripped model loads and predicts successfully
model = lgb.Booster(model_file="model.txt")
model.predict([[1, 2, 3, 4, 5]])

>>> array([0.37117142])

I guess that the original model file was modified and tree_sizes are incorrect now.
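The same workaround can be done on raw bytes, which sidesteps any newline translation and leaves the rest of the file byte-identical. A minimal sketch (`strip_tree_sizes` is a hypothetical helper name, not a LightGBM API):

```python
# Hypothetical helper: drop the tree_sizes= line from a model file in
# binary mode, so Python cannot translate "\n" and the remaining bytes
# are copied unchanged.
def strip_tree_sizes(src, dst):
    with open(src, "rb") as f:
        lines = f.readlines()  # binary readlines keeps the b"\n" endings
    with open(dst, "wb") as f:
        for line in lines:
            if not line.startswith(b"tree_sizes="):
                f.write(line)

# Usage sketch with a synthetic model file:
with open("model_in.txt", "wb") as f:
    f.write(b"tree\nversion=v3\ntree_sizes=123 456\nTree=0\n")
strip_tree_sizes("model_in.txt", "model_out.txt")
```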

@yalwan-iqvia
Author

@StrikerRUS I only ever produced these files by calling booster.save_model (with the num_iterations arg to truncate) or LGBM_BoosterSaveModel. As you can see from the tests on that branch, Unix and Mac pass, and it definitely worked for me locally too, so I still think there might be some failure in the loading logic which is only triggered in a platform-dependent way. If the model loads correctly without that tree_sizes field at all, maybe files shouldn't write it (especially if it's apparently incorrect sometimes?)

The tip worked (thank you!) and allowed our development branch to pass on the CI server, so I'm happy to accept it as a workaround, but perhaps the underlying issue still needs to be considered by the LightGBM team -- I leave that decision to you guys.

@StrikerRUS
Collaborator

@guolinke Is it possible that calculated tree sizes on one platform are incorrect on another?

@guolinke
Collaborator

guolinke commented Jun 6, 2020

I think they should be identical across platforms.
The only possible difference is the newline symbol, but we force \n as the newline on all platforms.
@yalwan-iqvia did you manually save/re-save the model file? And can you check the newline symbols?
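Checking the newline symbols, as suggested, can be done by scanning the raw bytes, since opening the file in text mode would silently normalize them. A minimal sketch (`count_newlines` is a hypothetical helper):

```python
# Sketch: count CRLF vs bare-LF line endings by inspecting raw bytes.
def count_newlines(path):
    data = open(path, "rb").read()
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # "\n" not preceded by "\r"
    return crlf, lf_only

# Example: a file with one Windows and one Unix newline
with open("check.txt", "wb") as f:
    f.write(b"a\r\nb\n")
print(count_newlines("check.txt"))  # -> (1, 1)
```

A model file saved by LightGBM itself should report zero CRLF endings on any platform.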

@guolinke
Collaborator

guolinke commented Jun 6, 2020

@StrikerRUS for your example (#3137 (comment)), I think it may break the newline symbols across platforms.

@StrikerRUS
Collaborator

@guolinke The readlines() method leaves the newline character \n at the end of each string. Or do you mean that there can be an issue while writing the lines back, because Python automatically translates \n to the platform-specific newline character?

@guolinke
Collaborator

guolinke commented Jun 6, 2020

@StrikerRUS I am not a Python expert, but I guess it could be.

@guolinke
Collaborator

@StrikerRUS can you confirm whether the \n is the problem or not?

@StrikerRUS
Collaborator

@guolinke I'll try to get a Linux machine in the next few days and reproduce the issue. But TBH, I don't think that \n is the root cause.

@yalwan-iqvia
Author

> I think they should be identical in different platforms.
> The only possible difference is the new line symbol. But we force to use \n for newline for all platforms.
> @yalwan-iqvia did you manually save/re-save the model file? And can you check the new-line symbols?

@guolinke sorry I took this long to reply.

I can say that I tried various things, including using the Julia wrapper to the C API to re-save the model, and it did not change the result. I initially produced the model using Python only for convenience, and I tried a lot of things to make the issue go away (including saving with truncation by passing a lower num_iterations than was used for training, not a manual truncation) -- but @StrikerRUS's suggestion to remove the tree_sizes field was the only thing which worked on that system.

@StrikerRUS
Collaborator

StrikerRUS commented Jul 11, 2020

Finally got access to a Linux machine. I can confirm that the model https://raw.githubusercontent.com/IQVIA-ML/LightGBM.jl/132d9eaebb6fba44f1cbc377ab0a00d4ac0d3244/test/ffi/data/gain_test_booster can be loaded back successfully with both the 2.2.3 and 2.3.1 versions on a Linux machine.

However, I cannot reproduce the issue with random data and model. I trained and saved a model on Linux with 2.2.3 version and was able to load it on Windows with 2.3.1 version successfully.

from sklearn.datasets import load_digits

import lightgbm as lgb

# Two-class digits subset, keeping 4 features to match the original model
X, y = load_digits(n_class=2, return_X_y=True)
X = X[:, :4]
est = lgb.LGBMClassifier().fit(X, y)
est.booster_.save_model('model-linux.txt')

IDK, maybe that original model suffers from some edge case of tree size calculation that differs between Linux and Windows. Or maybe the file is just corrupted somehow.

@guolinke
Collaborator

guolinke commented Aug 5, 2020

@yalwan-iqvia
Does the problem still exist? From our tests, we think the model file should work fine across platforms.
If your model is re-saved by other processes, such as the Julia wrapper you mentioned, you had better remove the tree_sizes field.

@guolinke guolinke closed this as completed Aug 5, 2020
@yalwan-iqvia
Author

> If it loads correctly without that tree_sizes field at all maybe files shouldn't write them (especially if they're apparently incorrect sometimes?)

My take is that if this field is apparently incorrect on occasion, and the workaround is to remove the tree_sizes field, then perhaps the saving code should not write it at all, which might be an acceptable fix for the issue.

I don't anticipate regularly needing to produce a file on one system and consume it on another, so this doesn't affect me personally; I just needed tests to pass. But it does look like a potential problem for users who do cross-platform production/consumption.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023