Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add customized parser support #4782

Merged
merged 3 commits into from
Nov 16, 2021

Conversation

chjinche
Copy link
Contributor

@chjinche chjinche commented Nov 8, 2021

Motivation

Currently, lightGBM can only support three text data formats (tsv, csv, libsvm), and data type check process is kind of coarse. For example, a tsv may be judged as libsvm if there is string feature containing colon. Besides, lightGBM does not support feature transformation, users need to do separate transformation themselves, which may be a bit inconvenient.

Proposal

In this PR, we'd like to add customized parser to unlock free data preprocessing scenario. Users could use lightgbm as dependency, and follow parser interface to design their own parsers or could simply try freeform parser.
Compared to separate transformation. this way has several pros:

  • More efficient. Data preprocessing can go with parsing each line, and take advantage of multi-processing designed by lightgbm naturally. No need to store whole transformed data in file/memory.
  • More convenient for development and iteration. Built-in transformation could keep offline and online consistent well.

Changes

The main code changes are:

  • Add parser_config_file to config.
  • Add parser factory and reflector to support customized parser.
  • Save/load parser config content along with model for automatic data processing in prediction.

src/boosting/gbdt.h Outdated Show resolved Hide resolved
include/LightGBM/dataset.h Outdated Show resolved Hide resolved
include/LightGBM/dataset.h Outdated Show resolved Hide resolved
include/LightGBM/dataset.h Outdated Show resolved Hide resolved
@shiyu1994
Copy link
Collaborator

@jameslamb @StrikerRUS @guolinke Could you please have a look at this PR if you have time. Thanks.

@StrikerRUS
Copy link
Collaborator

@chjinche From the first glance this PR looks really good and doesn't break backward compatibility. Thanks a lot for finding a way to support feature transformations in a non-invasive for core LightGBM API way! I'll provide more thoughtful review when have time.

@chjinche
Copy link
Contributor Author

@StrikerRUS Much appreciated for reviewing and glad that you like the implementation 😍
As the feature is required urgently by Microsoft Internal products, and we have already spent more time on making the integration change 😅, it will be very grateful if you could provide feedbacks before next week. Your views are important to us, we're very happy to work on the feature with all of you! 😄

@jameslamb
Copy link
Collaborator

jameslamb commented Nov 12, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1451685924

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-917bd78bb1fe442a8c8d245586e29b13
Reports also have been sent to LightGBM public e-mail: https://yopmail.com?lightgbm_rhub_checks
Status: failure ❌.

@jameslamb
Copy link
Collaborator

jameslamb commented Nov 12, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1451686218

Status: success ✔️.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like merging this PR as-is would break the ability to use the LightGBM R package on Solaris, which would result in that package being rejected and removed from CRAN.

Here is the relevant error from https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-917bd78bb1fe442a8c8d245586e29b13

#> In file included from ./include/LightGBM/config.h:16:0,
1782#> from ./include/LightGBM/boosting.h:8,
1783#> from boosting/boosting.cpp:5:
1784#> ./include/LightGBM/utils/common.h:207:82: error: ‘INT_MAX’ was not declared in this scope
1785#> inline static std::string LoadStringFromFile(const char* filename, int row_num = INT_MAX) {
1786#> ^
1787#> gmake: *** [/opt/csw/lib/R/etc/Makeconf:177: boosting/boosting.o] Error 1
1788#> ERROR: compilation failed for package ‘lightgbm’
1789#> * removing ‘/export/home/XjiOipx/R/lightgbm’
1790#> Warning message:
1791#> In i.p(...) :
1792#> installation of package ‘/export/home/XjiOipx/Rtemp/RtmpoRqZQc/file6c8076a9dbf/lightgbm_3.3.1.99.tar.gz’ had non-zero exit status

I think this is might be a compiler-specific problem, and maybe explicitly including <climits> would fix it? Based on this Stack Overflow answer which seems related.

Could you try that and re-run the Solaris CI job? (link to instructions)

Sorry for the inconvenience. I hope the fix is that simple 😬

@jameslamb
Copy link
Collaborator

Other than the issue with the Solaris job, I agree with @StrikerRUS ... this looks like a good approach for the integration! Thanks very very much for your efforts there.

I'm definitely interested in trying this out once the relevant changes have been merged in LightGBM and the lightgbm-transform docs have been updated. We had good success at a previous company I worked at bundling feature engineering and inference code together. I think it's a really powerful idea.

@shiyu1994
Copy link
Collaborator

We can also use std::numeric_limits<int>::max() as an alternative to INT_MAX. Since it is already used in other place of the code base, see

std::numeric_limits<int>::max() // make sure the lambda is almost never called

so it should be safe to use.

Inclusion of #include <limits> is required for this.

@jameslamb
Copy link
Collaborator

We can also use std::numeric_limits::max()

Oh nice! I like that even better, and since it's used in other places in the code I think it's likely to work on Solaris.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this PR!
Please check some my suggestions and questions below:

README.md Outdated Show resolved Hide resolved
include/LightGBM/config.h Outdated Show resolved Hide resolved
docs/Parameters.rst Outdated Show resolved Hide resolved
include/LightGBM/utils/common.h Outdated Show resolved Hide resolved
src/boosting/gbdt.cpp Show resolved Hide resolved
src/io/dataset_loader.cpp Outdated Show resolved Hide resolved
tests/python_package_test/test_utilities.py Outdated Show resolved Hide resolved
tests/python_package_test/test_utilities.py Outdated Show resolved Hide resolved
tests/python_package_test/test_utilities.py Outdated Show resolved Hide resolved
tests/python_package_test/test_utilities.py Outdated Show resolved Hide resolved
@chjinche
Copy link
Contributor Author

chjinche commented Nov 15, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1461290348

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-af4c61e8f58a4461865e362f5f8289d7
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-23621e6c0c7e4bf2bdafe6bc0c89cd3d
Reports also have been sent to LightGBM public e-mail: https://yopmail.com?lightgbm_rhub_checks
Status: success ✔️.

@chjinche
Copy link
Contributor Author

chjinche commented Nov 15, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1461291238

Status: success ✔️.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me now that valgrind and solaris tests are passing, thanks! Please wait for a review from @shiyu1994 before merging, my review shouldn't be enough for a merge on this one.

By the way, we use squash merging in this project, so it's not necessary to rebase, squash commits, and force-push. In fact, as a reviewer I personally would prefer that you didn't do squashing within a pull request, so it's possible to see the changes in each subsequent commit pushed in response to reviewers' comments.

@jameslamb jameslamb dismissed their stale review November 15, 2021 15:09

changes have been addressed, but wanted to only leave a "comment" follow-up review so my review doesn't count towards a merge

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Nov 15, 2021

I would kindly ask next time address review comments in a separate commits and don't perform force pushes in a such kind of big PRs. Separate commits speed up next round of reviews by allowing to check that only requested changes were done and nothing else were touched by a bad merge commit for example. Without separate commits one should review entire PR from scratch again.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️
LGTM, except two very minor suggestions below! Thank you very much for for this enhancement!

I'm leaving only "comment" because I believe that only review from cpp maintainer should count towards a merge.

include/LightGBM/utils/text_reader.h Outdated Show resolved Hide resolved
docs/Parameters.rst Outdated Show resolved Hide resolved
include/LightGBM/config.h Outdated Show resolved Hide resolved
Copy link
Collaborator

@shiyu1994 shiyu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes LGTM!

@shiyu1994
Copy link
Collaborator

shiyu1994 commented Nov 16, 2021

@chjinche Thanks for this great work! And thank @StrikerRUS @jameslamb for your review. BTW, @StrikerRUS, could you please mark the request changes as resolved? It seems that the requested changes have been addressed, but they still block the merging. Thanks for your detailed review.

@chjinche
Copy link
Contributor Author

chjinche commented Nov 16, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1465159443

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-6d170faaa30041cdbbe614bea360064b
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.3.1.99.tar.gz-5d4b9f70c579475c97bd096dadbd8a2b
Reports also have been sent to LightGBM public e-mail: https://yopmail.com?lightgbm_rhub_checks
Status: success ✔️.

@chjinche
Copy link
Contributor Author

chjinche commented Nov 16, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1465159857

Status: success ✔️.

@shiyu1994
Copy link
Collaborator

shiyu1994 commented Nov 16, 2021

I'm going to merge this given that all the checks have been passed. The latest suggestions of @StrikerRUS have been addressed by @chjinche. Thank you.

@shiyu1994 shiyu1994 merged commit b0137de into microsoft:master Nov 16, 2021
@StrikerRUS
Copy link
Collaborator

It seems that the requested changes have been addressed, but they still block the merging.

Sorry about that!

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Add FreeForm2 Into LightGBM
4 participants