From 11dc37fd6a0cdecb0f1194e55e75cd10985c3819 Mon Sep 17 00:00:00 2001 From: akshitadixit Date: Tue, 2 Mar 2021 12:36:50 +0530 Subject: [PATCH 1/7] [docs]Add alt text on images --- docs/Features.rst | 2 ++ docs/GPU-Performance.rst | 1 + docs/GPU-Windows.rst | 19 +++++++++++++++++++ 3 files changed, 22 insertions(+) diff --git a/docs/Features.rst b/docs/Features.rst index 6566eb628af2..6c7b07a8c813 100644 --- a/docs/Features.rst +++ b/docs/Features.rst @@ -45,6 +45,7 @@ Most decision tree learning algorithms grow trees by level (depth)-wise, like th .. image:: ./_static/images/level-wise.png :align: center + :alt: A diagram depicting level wise tree growth in which the best possible node is split one level down. The strategy results in a symmetric tree, where every node in a level has child nodes resulting in an additional layer of depth. LightGBM grows trees leaf-wise (best-first)\ `[7] <#references>`__. It will choose the leaf with max delta loss to grow. Holding ``#leaf`` fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms. @@ -53,6 +54,7 @@ Leaf-wise may cause over-fitting when ``#data`` is small, so LightGBM includes t .. image:: ./_static/images/leaf-wise.png :align: center + :alt: A diagram depicting leaf wise tree growth in which only the node with the highest loss change is split and not bother with the rest of the nodes in the same level. This results in an asymmetrical tree where subsequent splitting is happenning only on one side of the tree. Optimal Split for Categorical Features ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/GPU-Performance.rst b/docs/GPU-Performance.rst index 107e6d33b51b..06931e271933 100644 --- a/docs/GPU-Performance.rst +++ b/docs/GPU-Performance.rst @@ -163,6 +163,7 @@ We record the wall clock time after 500 iterations, as shown in the figure below .. image:: ./_static/images/gpu-performance-comparison.png :align: center :target: ./_static/images/gpu-performance-comparison.png + :alt: A G. P. U. performance chart which is a record of the wall clock time after 500 iterations for Higgs, epsilon, Bosch, Microsoft L. T. R, Expo and Yahoo L. T. R. and bin size of 63 performs comparitively better. When using a GPU, it is advisable to use a bin size of 63 rather than 255, because it can speed up training significantly without noticeably affecting accuracy. On CPU, using a smaller bin size only marginally improves performance, sometimes even slows down training, diff --git a/docs/GPU-Windows.rst b/docs/GPU-Windows.rst index 65adb169fe45..2ee1832e21d7 100644 --- a/docs/GPU-Windows.rst +++ b/docs/GPU-Windows.rst @@ -46,18 +46,21 @@ To modify PATH, just follow the pictures after going to the ``Control Panel``: .. image:: ./_static/images/screenshot-system.png :align: center :target: ./_static/images/screenshot-system.png + :alt: A screenshot of the System option under System and Security of the Control Panel Then, go to ``Advanced`` > ``Environment Variables...``: .. image:: ./_static/images/screenshot-advanced-system-settings.png :align: center :target: ./_static/images/screenshot-advanced-system-settings.png + :alt: A screenshot of the System Properties window Under ``System variables``, the variable ``Path``: .. 
image:: ./_static/images/screenshot-environment-variables.png :align: center :target: ./_static/images/screenshot-environment-variables.png + :alt: A screenshot of the Environment variables window with variable path selected under the system variables -------------- @@ -105,6 +108,7 @@ You may choose a version other than the most recent one if you need a previous M .. image:: ./_static/images/screenshot-mingw-installation.png :align: center :target: ./_static/images/screenshot-mingw-installation.png + :alt: A screenshot of the Min G. W. installation setup settings window Then, add to your PATH the following (to adjust to your MinGW version): @@ -123,6 +127,7 @@ You can check which MinGW version you are using by running the following in a co .. image:: ./_static/images/screenshot-r-mingw-used.png :align: center :target: ./_static/images/screenshot-r-mingw-used.png + :alt: A screenshot of the administrator command prompt where G. C. C. version is being checked To check whether you need 32-bit or 64-bit MinGW for R, install LightGBM as usual and check for the following: @@ -220,6 +225,7 @@ This is what you should (approximately) get at the end of Boost compilation: .. image:: ./_static/images/screenshot-boost-compiled.png :align: center :target: ./_static/images/screenshot-boost-compiled.png + :alt: A screenshot of the command prompt that ends with text that reads - updated 14621 targets If you are getting an error: @@ -243,6 +249,7 @@ Installing Git for Windows is straightforward, use the following `link`_. .. image:: ./_static/images/screenshot-git-for-windows.png :align: center :target: ./_static/images/screenshot-git-for-windows.png + :alt: A screenshot of the website to download git that shows various versions of git compatible with 32 bit and 64 bit Windows separately Now, we can fetch LightGBM repository for GitHub. Run Git Bash and the following command: @@ -270,6 +277,7 @@ Installing CMake requires one download first and then a lot of configuration for .. image:: ./_static/images/screenshot-downloading-cmake.png :align: center :target: ./_static/images/screenshot-downloading-cmake.png + :alt: A screenshot of the binary distributions of c-make for downloading on Windows 64 bit - Download `CMake`_ (3.8 or higher) @@ -288,16 +296,19 @@ Installing CMake requires one download first and then a lot of configuration for .. image:: ./_static/images/screenshot-create-directory.png :align: center :target: ./_static/images/screenshot-create-directory.png + :alt: A screenshot with a pop-up window that reads - Build directory does not exist, should I recreate it? .. image:: ./_static/images/screenshot-mingw-makefiles-to-use.png :align: center :target: ./_static/images/screenshot-mingw-makefiles-to-use.png + :alt: A screenshot that asks to sepcify the generator for the project which should be selected as Min G W makefiles and selected as the use default native compilers option - Lookup for ``USE_GPU`` and check the checkbox .. image:: ./_static/images/screenshot-use-gpu.png :align: center :target: ./_static/images/screenshot-use-gpu.png + :alt: A screenshot of the C Make window where the checkbox with the test Use G P U is checked. - Click ``Configure`` @@ -306,6 +317,7 @@ Installing CMake requires one download first and then a lot of configuration for .. 
image:: ./_static/images/screenshot-configured-lightgbm.png :align: center :target: ./_static/images/screenshot-configured-lightgbm.png + :alt: A screenshot of the C Make window after clicking on the configure button :: @@ -366,6 +378,7 @@ You can do everything in the Git Bash console you left open: .. image:: ./_static/images/screenshot-lightgbm-with-gpu-support-compiled.png :align: center :target: ./_static/images/screenshot-lightgbm-with-gpu-support-compiled.png + :alt: A screenshot of the gitbash window with Light G B M successfully installed If everything was done correctly, you now compiled CLI LightGBM with GPU support! @@ -382,6 +395,7 @@ You can now test LightGBM directly in CLI in a **command prompt** (not Git Bash) .. image:: ./_static/images/screenshot-lightgbm-in-cli-with-gpu.png :align: center :target: ./_static/images/screenshot-lightgbm-in-cli-with-gpu.png + :alt: A screenshot of the command prompt where a binary classification model is being trained using Light G B M Congratulations for reaching this stage! @@ -397,6 +411,7 @@ Now that you compiled LightGBM, you try it... and you always see a segmentation .. image:: ./_static/images/screenshot-segmentation-fault.png :align: center :target: ./_static/images/screenshot-segmentation-fault.png + :alt: A screenshot of the command prompt where a segmentation fault has occured while using Light B G M Please check if you are using the right device (``Using GPU device: ...``). You can find a list of your OpenCL devices using `GPUCapsViewer`_, and make sure you are using a discrete (AMD/NVIDIA) GPU if you have both integrated (Intel) and discrete GPUs installed. Also, try to set ``gpu_device_id = 0`` and ``gpu_platform_id = 0`` or ``gpu_device_id = -1`` and ``gpu_platform_id = -1`` to use the first platform and device or the default platform and device. @@ -411,6 +426,7 @@ You will have to redo the compilation steps for LightGBM to add debugging mode. .. image:: ./_static/images/screenshot-files-to-remove.png :align: center :target: ./_static/images/screenshot-files-to-remove.png + :alt: A screenshot of the Light G B M folder with 1 folder and 3 files selected to be removed Once you removed the file, go into CMake, and follow the usual steps. Before clicking "Generate", click on "Add Entry": @@ -418,12 +434,14 @@ Before clicking "Generate", click on "Add Entry": .. image:: ./_static/images/screenshot-added-manual-entry-in-cmake.png :align: center :target: ./_static/images/screenshot-added-manual-entry-in-cmake.png + :alt: A screenshot of the Cache Entry popup where the name is set to C Make_Build_Type in all caps, the type is set to STRING in all caps and the value is set to Debug In addition, click on Configure and Generate: .. image:: ./_static/images/screenshot-configured-and-generated-cmake.png :align: center :target: ./_static/images/screenshot-configured-and-generated-cmake.png + :alt: A screenshot of the C Make window after clicking on configure and generate And then, follow the regular LightGBM CLI installation from there. @@ -437,6 +455,7 @@ open a command prompt and run the following: .. image:: ./_static/images/screenshot-debug-run.png :align: center :target: ./_static/images/screenshot-debug-run.png + :alt: A screenshot of the command prompt after the command above is run Type ``run`` and press the Enter key. 
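For reference, every change in this first patch follows the same Sphinx pattern: an ``:alt:`` option is added to an existing ``image`` directive so that screen readers (and browsers that fail to load the image) get a plain-language description of the screenshot or diagram. A minimal sketch of that pattern is shown below; the file name and alt text are illustrative only, not taken from the docs.

.. code:: rst

    .. image:: ./_static/images/example-screenshot.png
       :align: center
       :target: ./_static/images/example-screenshot.png
       :alt: A short description of what the screenshot or diagram shows
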
From 03ff667cf5a26f251fef799cf03dd423f7f89f0b Mon Sep 17 00:00:00 2001 From: Akshita Dixit <56997545+akshitadixit@users.noreply.github.com> Date: Tue, 23 Mar 2021 13:12:04 +0530 Subject: [PATCH 2/7] Update docs/GPU-Windows.rst Co-authored-by: James Lamb --- docs/GPU-Windows.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/GPU-Windows.rst b/docs/GPU-Windows.rst index 2ee1832e21d7..2edd999209be 100644 --- a/docs/GPU-Windows.rst +++ b/docs/GPU-Windows.rst @@ -108,7 +108,7 @@ You may choose a version other than the most recent one if you need a previous M .. image:: ./_static/images/screenshot-mingw-installation.png :align: center :target: ./_static/images/screenshot-mingw-installation.png - :alt: A screenshot of the Min G. W. installation setup settings window + :alt: A screenshot of the Min G W installation setup settings window Then, add to your PATH the following (to adjust to your MinGW version): From 0fc0f7b07b9f13d0ce4694e53e73872377a1ce59 Mon Sep 17 00:00:00 2001 From: Akshita Dixit <56997545+akshitadixit@users.noreply.github.com> Date: Tue, 23 Mar 2021 13:32:29 +0530 Subject: [PATCH 3/7] Update docs/GPU-Windows.rst Co-authored-by: James Lamb --- docs/GPU-Windows.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/GPU-Windows.rst b/docs/GPU-Windows.rst index 2edd999209be..401be35997a1 100644 --- a/docs/GPU-Windows.rst +++ b/docs/GPU-Windows.rst @@ -411,7 +411,7 @@ Now that you compiled LightGBM, you try it... and you always see a segmentation .. image:: ./_static/images/screenshot-segmentation-fault.png :align: center :target: ./_static/images/screenshot-segmentation-fault.png - :alt: A screenshot of the command prompt where a segmentation fault has occured while using Light B G M + :alt: A screenshot of the command prompt where a segmentation fault has occurred while using Light G B M Please check if you are using the right device (``Using GPU device: ...``). You can find a list of your OpenCL devices using `GPUCapsViewer`_, and make sure you are using a discrete (AMD/NVIDIA) GPU if you have both integrated (Intel) and discrete GPUs installed. Also, try to set ``gpu_device_id = 0`` and ``gpu_platform_id = 0`` or ``gpu_device_id = -1`` and ``gpu_platform_id = -1`` to use the first platform and device or the default platform and device. From 721d7facb735455d131d8702bf8f386859a90b33 Mon Sep 17 00:00:00 2001 From: Akshita Dixit <56997545+akshitadixit@users.noreply.github.com> Date: Tue, 23 Mar 2021 13:33:43 +0530 Subject: [PATCH 4/7] Apply suggestions from code review Co-authored-by: James Lamb --- docs/GPU-Performance.rst | 2 +- docs/GPU-Windows.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/GPU-Performance.rst b/docs/GPU-Performance.rst index 06931e271933..186c5b7c4b18 100644 --- a/docs/GPU-Performance.rst +++ b/docs/GPU-Performance.rst @@ -163,7 +163,7 @@ We record the wall clock time after 500 iterations, as shown in the figure below .. image:: ./_static/images/gpu-performance-comparison.png :align: center :target: ./_static/images/gpu-performance-comparison.png - :alt: A G. P. U. performance chart which is a record of the wall clock time after 500 iterations for Higgs, epsilon, Bosch, Microsoft L. T. R, Expo and Yahoo L. T. R. and bin size of 63 performs comparitively better. + :alt: A performance chart which is a record of the wall clock time after 500 iterations on G P U for Higgs, epsilon, Bosch, Microsoft L T R, Expo and Yahoo L T R and bin size of 63 performs comparitively better. 
When using a GPU, it is advisable to use a bin size of 63 rather than 255, because it can speed up training significantly without noticeably affecting accuracy. On CPU, using a smaller bin size only marginally improves performance, sometimes even slows down training, diff --git a/docs/GPU-Windows.rst b/docs/GPU-Windows.rst index 401be35997a1..84f347d86454 100644 --- a/docs/GPU-Windows.rst +++ b/docs/GPU-Windows.rst @@ -127,7 +127,7 @@ You can check which MinGW version you are using by running the following in a co .. image:: ./_static/images/screenshot-r-mingw-used.png :align: center :target: ./_static/images/screenshot-r-mingw-used.png - :alt: A screenshot of the administrator command prompt where G. C. C. version is being checked + :alt: A screenshot of the administrator command prompt where G C C version is being checked To check whether you need 32-bit or 64-bit MinGW for R, install LightGBM as usual and check for the following: From 8be88d0fadf1d6ca6462c2b66b944f88802fff9b Mon Sep 17 00:00:00 2001 From: Akshita Dixit <56997545+akshitadixit@users.noreply.github.com> Date: Tue, 23 Mar 2021 13:34:27 +0530 Subject: [PATCH 5/7] Apply suggestions from code review Co-authored-by: James Lamb --- docs/GPU-Windows.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/GPU-Windows.rst b/docs/GPU-Windows.rst index 84f347d86454..58ea43b08ab4 100644 --- a/docs/GPU-Windows.rst +++ b/docs/GPU-Windows.rst @@ -277,7 +277,7 @@ Installing CMake requires one download first and then a lot of configuration for .. image:: ./_static/images/screenshot-downloading-cmake.png :align: center :target: ./_static/images/screenshot-downloading-cmake.png - :alt: A screenshot of the binary distributions of c-make for downloading on Windows 64 bit + :alt: A screenshot of the binary distributions of C Make for downloading on 64 bit Windows - Download `CMake`_ (3.8 or higher) @@ -378,7 +378,7 @@ You can do everything in the Git Bash console you left open: .. image:: ./_static/images/screenshot-lightgbm-with-gpu-support-compiled.png :align: center :target: ./_static/images/screenshot-lightgbm-with-gpu-support-compiled.png - :alt: A screenshot of the gitbash window with Light G B M successfully installed + :alt: A screenshot of the git bash window with Light G B M successfully installed If everything was done correctly, you now compiled CLI LightGBM with GPU support! From ce1be2857a90030b8bfec44cbc79fb4f85be03bf Mon Sep 17 00:00:00 2001 From: Akshita Dixit <56997545+akshitadixit@users.noreply.github.com> Date: Tue, 23 Mar 2021 13:38:39 +0530 Subject: [PATCH 6/7] Merge main branch commit updates (#1) * [docs] Add alt text to image in Parameters-Tuning.rst (#4035) * [docs] Add alt text to image in Parameters-Tuning.rst Add alt text to Leaf-wise growth image, as part of #4028 * Update docs/Parameters-Tuning.rst Co-authored-by: James Lamb Co-authored-by: James Lamb * [ci] [R-package] upgrade to R 4.0.4 in CI (#4042) * [docs] update description of deterministic parameter (#4027) * update description of deterministic parameter to require using with force_row_wise or force_col_wise * Update include/LightGBM/config.h Co-authored-by: Nikita Titov * update docs Co-authored-by: Nikita Titov * [dask] Include support for init_score (#3950) * include support for init_score * use dataframe from init_score and test difference with and without init_score in local model * revert refactoring * initial docs. 
test between distributed models with and without init_score * remove ranker from tests * test value for root node and change docs * comma * re-include parametrize * fix incorrect merge * use single init_score and the booster_ attribute * use np.float64 instead of float * [ci] ignore untitle Jupyter notebooks in .gitignore (#4047) * [ci] prevent getting incompatible dask and distributed versions (#4054) * [ci] prevent getting incompatible dask and distributed versions * Update .ci/test.sh Co-authored-by: Nikita Titov * empty commit Co-authored-by: Nikita Titov * [ci] fix R CMD CHECK note about example timings (fixes #4049) (#4055) * [ci] fix R CMD CHECK note about example timings (fixes #4049) * Apply suggestions from code review Co-authored-by: Nikita Titov * empty commit Co-authored-by: Nikita Titov * [ci] add CMake + R 3.6 test back (fixes #3469) (#4053) * [ci] add CMake + R 3.6 test back (fixes #3469) * Apply suggestions from code review Co-authored-by: Nikita Titov * Update .ci/test_r_package_windows.ps1 * -Wait and remove rtools40 * empty commit Co-authored-by: Nikita Titov * [dask] include multiclass-classification task in tests (#4048) * include multiclass-classification task and task_to_model_factory dicts * define centers coordinates. flatten init_scores within each partition for multiclass-classification * include issue comment and fix linting error * Update index.rst (#4029) Add alt text to logo image Co-authored-by: James Lamb * [dask] raise more informative error for duplicates in 'machines' (fixes #4057) (#4059) * [dask] raise more informative error for duplicates in 'machines' * uncomment * avoid test failure * Revert "avoid test failure" This reverts commit 9442bdf00f193a19a923dc0deb46b7822cb6f601. * [dask] add tutorial documentation (fixes #3814, fixes #3838) (#4030) * [dask] add tutorial documentation (fixes #3814, fixes #3838) * add notes on saving the model * quick start examples * add examples * fix timeouts in examples * remove notebook * fill out prediction section * table of contents * add line back * linting * isort * Apply suggestions from code review Co-authored-by: Nikita Titov * Apply suggestions from code review Co-authored-by: Nikita Titov * move examples under python-guide * remove unused pickle import Co-authored-by: Nikita Titov * set 'pending' commit status for R Solaris optional workflow (#4061) * [docs] add Yu Shi to repo maintainers (#4060) * Update FAQ.rst * Update CODEOWNERS * set is_linear_ to false when it is absent from the model file (fix #3778) (#4056) * Add CMake option to enable sanitizers and build gtest (#3555) * Add CMake option to enable sanitizer * Set up gtest * Address reviewer's feedback * Address reviewer's feedback * Update CMakeLists.txt Co-authored-by: Nikita Titov Co-authored-by: Nikita Titov * added type hint (#4070) * [ci] run Dask examples on CI (#4064) * Update Parallel-Learning-Guide.rst * Update test.sh * fix path * address review comments * [python-package] add type hints on Booster.set_network() (#4068) * [python-package] add type hints on Booster.set_network() * change behavior * [python-package] Some mypy fixes (#3916) * Some mypy fixes * address James' comments * Re-introduce pass in empty classes * Update compat.py Remove extra lines * [dask] [ci] fix flaky network-setup test (#4071) * [tests][dask] simplify code in Dask tests (#4075) * simplify Dask tests code * enable CI * disable CI * Revert "[ci] prevent getting incompatible dask and distributed versions (#4054)" (#4076) This reverts commit 
4e9c976867e1493b881b32d0e94ccf5c915fa31f. * Fix parsing of non-finite values (#3942) * Fix index out-of-range exception generated by BaggingHelper on small datasets. Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero. * Update goss.hpp * Update goss.hpp * Add API method LGBM_BoosterPredictForMats which runs prediction on a data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array) * Fix incorrect upstream merge * Add link to LightGBM.NET * Fix indenting to 2 spaces * Dummy edit to trigger CI * Dummy edit to trigger CI * remove duplicate functions from merge * Fix parsing of non-finite values. Current implementation silently returns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is usually rarely invoked. * Dummy commit to trigger CI * Also handle -nan in double parsing method * Update include/LightGBM/utils/common.h Remove trailing whitespace to pass linting tests Co-authored-by: Nikita Titov Co-authored-by: matthew-peacock Co-authored-by: Guolin Ke Co-authored-by: Nikita Titov * [dask] remove unused imports from typing (#4079) * Range check for DCG position discount lookup (#4069) * Add check to prevent out of index lookup in the position discount table. Add debug logging to report number of queries found in the data. * Change debug logging location so that we can print the data file name as well. * Revert "Change debug logging location so that we can print the data file name as well." This reverts commit 3981b34bd6e0530f89c4733e78e6b6603bf50d48. * Add data file name to debug logging. * Move log line to a place where it is output even when query IDs are read from a separate file. * Also add the out-of-range check to rank metrics. * Perform check after number of queries is initialized. * Update * [ci] upgrade R CI scripts to work on Ubuntu 20.04 (#4084) * [ci] install additional LaTeX packages in R CI jobs * update autoconf version * bump upper limit on package size to 100 * [SWIG] Add streaming data support + cpp tests (#3997) * [feature] Add ChunkedArray to SWIG * Add ChunkedArray * Add ChunkedArray_API_extensions.i * Add SWIG class wrappers * Address some review comments * Fix linting issues * Move test to tests/test_ChunkedArray_manually.cpp * Add test note * Move ChunkedArray to include/LightGBM/utils/ * Declare more explicit types of ChunkedArray in the SWIG API. 
* Port ChunkedArray tests to googletest * Please C++ linter * Address StrikerRUS' review comments * Update SWIG doc & disable ChunkedArray * Use CHECK_EQ instead of assert * Change include order (linting) * Rename ChunkedArray -> chunked_array files * Change header guards * Address last comments from StrikerRUS * store all CMake files in one place (#4087) * v3.2.0 release (#3872) * Update VERSION.txt * update appveyor.yml and configure * fix Appveyor builds Co-authored-by: James Lamb Co-authored-by: Nikita Titov Co-authored-by: StrikerRUS * [ci] Bump version for development (#4094) * Update .appveyor.yml * Update cran-comments.md * Update VERSION.txt * update configure Co-authored-by: James Lamb * [ci] fix flaky Azure Pipelines jobs (#4095) * Update test.sh * Update setup.sh * Update .vsts-ci.yml * Update test.sh * Update setup.sh * Update .vsts-ci.yml * Update setup.sh * Update setup.sh Co-authored-by: Subham Agrawal <34346812+subhamagrawal7@users.noreply.github.com> Co-authored-by: James Lamb Co-authored-by: shiyu1994 Co-authored-by: Nikita Titov Co-authored-by: jmoralez Co-authored-by: marcelonieva7 <72712805+marcelonieva7@users.noreply.github.com> Co-authored-by: Philip Hyunsu Cho Co-authored-by: Deddy Jobson Co-authored-by: Alberto Ferreira Co-authored-by: mjmckp Co-authored-by: matthew-peacock Co-authored-by: Guolin Ke Co-authored-by: ashok-ponnuswami-msft <57648631+ashok-ponnuswami-msft@users.noreply.github.com> Co-authored-by: StrikerRUS --- .appveyor.yml | 2 +- .ci/setup.sh | 4 + .ci/test.sh | 7 +- .ci/test_r_package.sh | 13 +- .ci/test_r_package_windows.ps1 | 32 +- .github/CODEOWNERS | 32 +- .github/workflows/r_package.yml | 6 + .github/workflows/r_solaris.yml | 1 + .gitignore | 2 + CMakeLists.txt | 40 +- R-package/AUTOCONF_UBUNTU_VERSION | 2 +- R-package/README.md | 8 +- R-package/configure | 18 +- R-package/cran-comments.md | 6 + R-package/recreate-configure.sh | 2 +- VERSION.txt | 2 +- build-cran-package.sh | 1 - build_r.R | 8 + .../IntegratedOpenCL.cmake | 0 cmake/Sanitizer.cmake | 61 ++ cmake/modules/FindASan.cmake | 9 + cmake/modules/FindLSan.cmake | 9 + .../cmake => cmake}/modules/FindLibR.cmake | 0 cmake/modules/FindTSan.cmake | 9 + cmake/modules/FindUBSan.cmake | 9 + docs/FAQ.rst | 1 + docs/Parallel-Learning-Guide.rst | 294 +++++++++ docs/Parameters-Tuning.rst | 1 + docs/Parameters.rst | 2 + docs/_static/images/dask-concat.svg | 1 + docs/_static/images/dask-initial-setup.svg | 1 + docs/index.rst | 1 + examples/python-guide/README.md | 1 + examples/python-guide/dask/README.md | 25 + .../dask/binary-classification.py | 30 + .../dask/multiclass-classification.py | 30 + examples/python-guide/dask/prediction.py | 48 ++ examples/python-guide/dask/ranking.py | 62 ++ examples/python-guide/dask/regression.py | 30 + include/LightGBM/config.h | 1 + include/LightGBM/metric.h | 8 + include/LightGBM/utils/chunked_array.hpp | 260 ++++++++ include/LightGBM/utils/common.h | 18 +- python-package/MANIFEST.in | 2 +- python-package/README.rst | 4 + python-package/lightgbm/basic.py | 13 +- python-package/lightgbm/callback.py | 2 +- python-package/lightgbm/compat.py | 12 +- python-package/lightgbm/dask.py | 49 +- python-package/lightgbm/libpath.py | 3 +- python-package/lightgbm/sklearn.py | 3 +- python-package/setup.py | 6 +- src/io/metadata.cpp | 4 + src/io/tree.cpp | 2 + src/metric/dcg_calculator.cpp | 13 + src/metric/rank_metric.hpp | 3 +- src/objective/rank_objective.hpp | 1 + swig/ChunkedArray_API_extensions.i | 23 + swig/StringArray.hpp | 56 +- swig/lightgbmlib.i | 1 + 
tests/cpp_test/test_chunked_array.cpp | 262 ++++++++ tests/cpp_test/test_main.cpp | 11 + tests/python_package_test/test_dask.py | 590 +++++++++--------- 63 files changed, 1749 insertions(+), 408 deletions(-) rename CMakeIntegratedOpenCL.cmake => cmake/IntegratedOpenCL.cmake (100%) create mode 100644 cmake/Sanitizer.cmake create mode 100644 cmake/modules/FindASan.cmake create mode 100644 cmake/modules/FindLSan.cmake rename {R-package/src/cmake => cmake}/modules/FindLibR.cmake (100%) create mode 100644 cmake/modules/FindTSan.cmake create mode 100644 cmake/modules/FindUBSan.cmake create mode 100644 docs/_static/images/dask-concat.svg create mode 100644 docs/_static/images/dask-initial-setup.svg create mode 100644 examples/python-guide/dask/README.md create mode 100644 examples/python-guide/dask/binary-classification.py create mode 100644 examples/python-guide/dask/multiclass-classification.py create mode 100644 examples/python-guide/dask/prediction.py create mode 100644 examples/python-guide/dask/ranking.py create mode 100644 examples/python-guide/dask/regression.py create mode 100644 include/LightGBM/utils/chunked_array.hpp create mode 100644 swig/ChunkedArray_API_extensions.i create mode 100644 tests/cpp_test/test_chunked_array.cpp create mode 100644 tests/cpp_test/test_main.cpp diff --git a/.appveyor.yml b/.appveyor.yml index 96087355b933..30f2941b9257 100644 --- a/.appveyor.yml +++ b/.appveyor.yml @@ -1,4 +1,4 @@ -version: 3.1.1.99.{build} +version: 3.2.0.99.{build} image: Visual Studio 2015 platform: x64 diff --git a/.ci/setup.sh b/.ci/setup.sh index fd6bda6fb746..f013a9b551ce 100755 --- a/.ci/setup.sh +++ b/.ci/setup.sh @@ -44,6 +44,8 @@ else # Linux libicu66 \ libssl1.1 \ libunwind8 \ + libxau-dev \ + libxrender1 \ locales \ netcat \ unzip \ @@ -81,6 +83,8 @@ else # Linux apt-get update apt-get install --no-install-recommends -y \ curl \ + libxau-dev \ + libxrender1 \ lsb-release \ software-properties-common if [[ $COMPILER == "clang" ]]; then diff --git a/.ci/test.sh b/.ci/test.sh index 8b1bfd13ebe6..66745a0b6de5 100755 --- a/.ci/test.sh +++ b/.ci/test.sh @@ -64,7 +64,7 @@ if [[ $TASK == "lint" ]]; then echo "Linting R code" Rscript ${BUILD_DIRECTORY}/.ci/lint_r_code.R ${BUILD_DIRECTORY} || exit -1 echo "Linting C++ code" - cpplint --filter=-build/c++11,-build/include_subdir,-build/header_guard,-whitespace/line_length --recursive ./src ./include ./R-package || exit -1 + cpplint --filter=-build/c++11,-build/include_subdir,-build/header_guard,-whitespace/line_length --recursive ./src ./include ./R-package ./swig || exit -1 exit 0 fi @@ -103,8 +103,7 @@ conda install -q -y -n $CONDA_ENV cloudpickle dask distributed joblib matplotlib conda install -q -y \ -n $CONDA_ENV \ -c conda-forge \ - python-graphviz \ - xorg-libxau + python-graphviz if [[ $OS_NAME == "macos" ]] && [[ $COMPILER == "clang" ]]; then # fix "OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized." 
(OpenMP library conflict due to conda's MKL) @@ -218,7 +217,7 @@ import matplotlib\ matplotlib.use\(\"Agg\"\)\ ' plot_example.py # prevent interactive window mode sed -i'.bak' 's/graph.render(view=True)/graph.render(view=False)/' plot_example.py - for f in *.py; do python $f || exit -1; done # run all examples + for f in *.py **/*.py; do python $f || exit -1; done # run all examples cd $BUILD_DIRECTORY/examples/python-guide/notebooks conda install -q -y -n $CONDA_ENV ipywidgets notebook jupyter nbconvert --ExecutePreprocessor.timeout=180 --to notebook --execute --inplace *.ipynb || exit -1 # run all notebooks diff --git a/.ci/test_r_package.sh b/.ci/test_r_package.sh index 43f58c80c66d..d8932ee6a834 100755 --- a/.ci/test_r_package.sh +++ b/.ci/test_r_package.sh @@ -18,7 +18,13 @@ export _R_CHECK_CRAN_INCOMING_REMOTE_=0 # CRAN ignores the "installed size is too large" NOTE, # so our CI can too. Setting to a large value here just # to catch extreme problems -export _R_CHECK_PKG_SIZES_THRESHOLD_=60 +export _R_CHECK_PKG_SIZES_THRESHOLD_=100 + +# don't fail builds for long-running examples unless they're very long. +# See https://github.com/microsoft/LightGBM/issues/4049#issuecomment-793412254. +if [[ $R_BUILD_TYPE != "cran" ]]; then + export _R_CHECK_EXAMPLE_TIMING_THRESHOLD_=30 +fi # Get details needed for installing R components R_MAJOR_VERSION=( ${R_VERSION//./ } ) @@ -27,8 +33,8 @@ if [[ "${R_MAJOR_VERSION}" == "3" ]]; then export R_LINUX_VERSION="3.6.3-1bionic" export R_APT_REPO="bionic-cran35/" elif [[ "${R_MAJOR_VERSION}" == "4" ]]; then - export R_MAC_VERSION=4.0.3 - export R_LINUX_VERSION="4.0.3-1.1804.0" + export R_MAC_VERSION=4.0.4 + export R_LINUX_VERSION="4.0.4-1.1804.0" export R_APT_REPO="bionic-cran40/" else echo "Unrecognized R version: ${R_VERSION}" @@ -53,6 +59,7 @@ if [[ $OS_NAME == "linux" ]]; then devscripts \ r-base-dev=${R_LINUX_VERSION} \ texinfo \ + texlive-latex-extra \ texlive-latex-recommended \ texlive-fonts-recommended \ texlive-fonts-extra \ diff --git a/.ci/test_r_package_windows.ps1 b/.ci/test_r_package_windows.ps1 index e49a638b10b7..ac9410a823c7 100644 --- a/.ci/test_r_package_windows.ps1 +++ b/.ci/test_r_package_windows.ps1 @@ -28,6 +28,26 @@ function Run-R-Code-Redirect-Stderr { Rscript --vanilla -e $decorated_code } +# Remove all items matching some pattern from PATH environment variable +function Remove-From-Path { + param( + [string]$pattern_to_remove + ) + $env:PATH = ($env:PATH.Split(';') | Where-Object { $_ -notmatch "$pattern_to_remove" }) -join ';' +} + +# remove some details that exist in the GitHub Actions images which might +# cause conflicts with R and other components installed by this script +$env:RTOOLS40_HOME = "" +Remove-From-Path ".*chocolatey.*" +Remove-From-Path ".*Chocolatey.*" +Remove-From-Path ".*Git.*mingw64.*" +Remove-From-Path ".*msys64.*" +Remove-From-Path ".*rtools40.*" +Remove-From-Path ".*Strawberry.*" + +Remove-Item C:\rtools40 -Force -Recurse -ErrorAction Ignore + # Get details needed for installing R components # # NOTES: @@ -46,7 +66,7 @@ if ($env:R_MAJOR_VERSION -eq "3") { $env:RTOOLS_BIN = "$RTOOLS_INSTALL_PATH\usr\bin" $env:RTOOLS_MINGW_BIN = "$RTOOLS_INSTALL_PATH\mingw64\bin" $env:RTOOLS_EXE_FILE = "rtools40-x86_64.exe" - $env:R_WINDOWS_VERSION = "4.0.3" + $env:R_WINDOWS_VERSION = "4.0.4" } else { Write-Output "[ERROR] Unrecognized R version: $env:R_VERSION" Check-Output $false @@ -70,7 +90,13 @@ $env:_R_CHECK_CRAN_INCOMING_REMOTE_ = 0 # CRAN ignores the "installed size is too large" NOTE, # so our CI can too. 
Setting to a large value here just # to catch extreme problems -$env:_R_CHECK_PKG_SIZES_THRESHOLD_ = 60 +$env:_R_CHECK_PKG_SIZES_THRESHOLD_ = 100 + +# don't fail builds for long-running examples unless they're very long. +# See https://github.com/microsoft/LightGBM/issues/4049#issuecomment-793412254. +if ($env:R_BUILD_TYPE -ne "cran") { + $env:_R_CHECK_EXAMPLE_TIMING_THRESHOLD_ = 30 +} if (($env:COMPILER -eq "MINGW") -and ($env:R_BUILD_TYPE -eq "cmake")) { $env:CXX = "$env:RTOOLS_MINGW_BIN/g++.exe" @@ -92,7 +118,7 @@ Start-Process -FilePath R-win.exe -NoNewWindow -Wait -ArgumentList "/VERYSILENT Write-Output "Done installing R" Write-Output "Installing Rtools" -./Rtools.exe /VERYSILENT /SUPPRESSMSGBOXES /DIR=$RTOOLS_INSTALL_PATH ; Check-Output $? +Start-Process -FilePath Rtools.exe -NoNewWindow -Wait -ArgumentList "/VERYSILENT /SUPPRESSMSGBOXES /DIR=$RTOOLS_INSTALL_PATH" ; Check-Output $? Write-Output "Done installing Rtools" Write-Output "Installing dependencies" diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 7812ccdc3bff..4ee3bf477167 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -11,24 +11,24 @@ # other catch-alls that will get matched if specific rules below are not matched *.R @Laurae2 @jameslamb -*.py @StrikerRUS @chivee @wxchan @henry0312 -*.cpp @guolinke @chivee @btrotta -*.h @guolinke @chivee @btrotta +*.py @StrikerRUS @chivee @wxchan @henry0312 @shiyu1994 +*.cpp @guolinke @chivee @btrotta @shiyu1994 +*.h @guolinke @chivee @btrotta @shiyu1994 # main C++ code -include/ @guolinke @chivee @btrotta -src/ @guolinke @chivee @btrotta -CMakeLists.txt @guolinke @chivee @Laurae2 @jameslamb @wxchan @henry0312 @StrikerRUS @huanzhang12 @btrotta -tests/c_api_test/ @guolinke @chivee @btrotta -tests/cpp_test/ @guolinke @chivee @btrotta -tests/data/ @guolinke @chivee @btrotta -windows/ @guolinke @chivee @btrotta @StrikerRUS +include/ @guolinke @chivee @btrotta @shiyu1994 +src/ @guolinke @chivee @btrotta @shiyu1994 +CMakeLists.txt @guolinke @chivee @Laurae2 @jameslamb @wxchan @henry0312 @StrikerRUS @huanzhang12 @btrotta @shiyu1994 +tests/c_api_test/ @guolinke @chivee @btrotta @shiyu1994 +tests/cpp_test/ @guolinke @chivee @btrotta @shiyu1994 +tests/data/ @guolinke @chivee @btrotta @shiyu1994 +windows/ @guolinke @chivee @btrotta @StrikerRUS @shiyu1994 # R code R-package/ @Laurae2 @jameslamb # Python code -python-package/ @StrikerRUS @chivee @wxchan @henry0312 +python-package/ @StrikerRUS @chivee @wxchan @henry0312 @shiyu1994 # Dask integration python-package/lightgbm/dask.py @jameslamb @@ -46,15 +46,15 @@ examples/ @StrikerRUS @jameslamb @guolinke # docker setup docker/ @StrikerRUS @jameslamb -docker/dockerfile-cli @guolinke @chivee +docker/dockerfile-cli @guolinke @chivee @shiyu1994 docker/gpu/ @huanzhang12 -docker/dockerfile-python @StrikerRUS @chivee @wxchan @henry0312 +docker/dockerfile-python @StrikerRUS @chivee @wxchan @henry0312 @shiyu1994 docker/dockerfile-r @Laurae2 @jameslamb # GPU code docs/GPU-*.rst @huanzhang12 -src/treelearner/gpu_tree_learner.cpp @huanzhang12 @guolinke @chivee -src/treelearner/tree_learner.cpp @huanzhang12 @guolinke @chivee +src/treelearner/gpu_tree_learner.cpp @huanzhang12 @guolinke @chivee @shiyu1994 +src/treelearner/tree_learner.cpp @huanzhang12 @guolinke @chivee @shiyu1994 # JAVA code -swig/ @imatiach-msft +swig/ @guolinke @chivee @shiyu1994 diff --git a/.github/workflows/r_package.yml b/.github/workflows/r_package.yml index 54b2798fa40a..d8015f72b09d 100644 --- a/.github/workflows/r_package.yml +++ b/.github/workflows/r_package.yml @@ 
-60,6 +60,12 @@ jobs: compiler: clang r_version: 4.0 build_type: cmake + - os: windows-latest + task: r-package + compiler: MINGW + toolchain: MINGW + r_version: 3.6 + build_type: cmake - os: windows-latest task: r-package compiler: MINGW diff --git a/.github/workflows/r_solaris.yml b/.github/workflows/r_solaris.yml index b11c0182312f..b5d33d2ec1bd 100644 --- a/.github/workflows/r_solaris.yml +++ b/.github/workflows/r_solaris.yml @@ -31,6 +31,7 @@ jobs: - name: Send init status if: ${{ always() }} run: | + $GITHUB_WORKSPACE/.ci/set_commit_status.sh "${{ github.workflow }}" "pending" "${{ github.event.client_payload.pr_sha }}" $GITHUB_WORKSPACE/.ci/append_comment.sh \ "${{ github.event.client_payload.comment_number }}" \ "Workflow **${{ github.workflow }}** has been triggered! 🚀\r\n${GITHUB_SERVER_URL}/microsoft/LightGBM/actions/runs/${GITHUB_RUN_ID}" diff --git a/.gitignore b/.gitignore index 959462864825..66b8a9b4acff 100644 --- a/.gitignore +++ b/.gitignore @@ -269,6 +269,7 @@ _Pvt_Extensions *.app /windows/LightGBM.VC.db lightgbm +/testlightgbm # Created by https://www.gitignore.io/api/python @@ -354,6 +355,7 @@ target/ # Jupyter Notebook .ipynb_checkpoints +Untitled*.ipynb # pyenv .python-version diff --git a/CMakeLists.txt b/CMakeLists.txt index 29a786d3a506..b845f36244d7 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -6,6 +6,12 @@ OPTION(USE_HDFS "Enable HDFS support (EXPERIMENTAL)" OFF) OPTION(USE_TIMETAG "Set to ON to output time costs" OFF) OPTION(USE_CUDA "Enable CUDA-accelerated training (EXPERIMENTAL)" OFF) OPTION(USE_DEBUG "Set to ON for Debug mode" OFF) +OPTION(USE_SANITIZER "Use santizer flags" OFF) +SET(SANITIZER_PATH "" CACHE STRING "Path to sanitizer libs") +SET(ENABLED_SANITIZERS "address" "leak" "undefined" CACHE STRING + "Semicolon separated list of sanitizer names. E.g 'address;leak'. 
Supported sanitizers are +address, leak, undefined and thread.") +OPTION(BUILD_CPP_TEST "Build C++ tests with Google Test" OFF) OPTION(BUILD_STATIC_LIB "Build static library" OFF) OPTION(__BUILD_FOR_R "Set to ON if building lib_lightgbm for use with the R package" OFF) OPTION(__INTEGRATE_OPENCL "Set to ON if building LightGBM with the OpenCL ICD Loader and its dependencies included" OFF) @@ -26,6 +32,14 @@ endif() PROJECT(lightgbm LANGUAGES C CXX) +list(APPEND CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake/modules") + +#-- Sanitizer +if (USE_SANITIZER) + include(cmake/Sanitizer.cmake) + enable_sanitizers("${ENABLED_SANITIZERS}") +endif (USE_SANITIZER) + if(__INTEGRATE_OPENCL) set(__INTEGRATE_OPENCL ON CACHE BOOL "" FORCE) set(USE_GPU OFF CACHE BOOL "" FORCE) @@ -86,7 +100,6 @@ include_directories(${EIGEN_DIR}) ADD_DEFINITIONS(-DEIGEN_MPL2_ONLY) if(__BUILD_FOR_R) - list(APPEND CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake/modules") find_package(LibR REQUIRED) message(STATUS "LIBR_EXECUTABLE: ${LIBR_EXECUTABLE}") message(STATUS "LIBR_INCLUDE_DIRS: ${LIBR_INCLUDE_DIRS}") @@ -143,7 +156,7 @@ endif(USE_GPU) if(__INTEGRATE_OPENCL) if(WIN32) - include(CMakeIntegratedOpenCL.cmake) + include(cmake/IntegratedOpenCL.cmake) ADD_DEFINITIONS(-DUSE_GPU) else() message(FATAL_ERROR "Integrated OpenCL build is available only for Windows") @@ -404,10 +417,10 @@ if(USE_GPU) endif(USE_GPU) if(__INTEGRATE_OPENCL) - # targets OpenCL and Boost are added in CMakeIntegratedOpenCL.cmake + # targets OpenCL and Boost are added in IntegratedOpenCL.cmake add_dependencies(lightgbm OpenCL Boost) add_dependencies(_lightgbm OpenCL Boost) - # variables INTEGRATED_OPENCL_* are set in CMakeIntegratedOpenCL.cmake + # variables INTEGRATED_OPENCL_* are set in IntegratedOpenCL.cmake target_include_directories(lightgbm PRIVATE ${INTEGRATED_OPENCL_INCLUDES}) target_include_directories(_lightgbm PRIVATE ${INTEGRATED_OPENCL_INCLUDES}) target_compile_definitions(lightgbm PRIVATE ${INTEGRATED_OPENCL_DEFINITIONS}) @@ -451,6 +464,25 @@ if(__BUILD_FOR_R) endif(MSVC) endif(__BUILD_FOR_R) +#-- Google C++ tests +if(BUILD_CPP_TEST) + find_package(GTest CONFIG) + if(NOT GTEST_FOUND) + message(STATUS "Did not find Google Test in the system root. Fetching Google Test now...") + include(FetchContent) + FetchContent_Declare( + googletest + GIT_REPOSITORY https://github.com/google/googletest.git + GIT_TAG release-1.10.0 + ) + FetchContent_MakeAvailable(googletest) + add_library(GTest::GTest ALIAS gtest) + endif() + file(GLOB CPP_TEST_SOURCES tests/cpp_test/*.cpp) + add_executable(testlightgbm ${CPP_TEST_SOURCES} ${SOURCES}) + target_link_libraries(testlightgbm PRIVATE GTest::GTest) +endif() + install(TARGETS lightgbm _lightgbm RUNTIME DESTINATION ${CMAKE_INSTALL_PREFIX}/bin LIBRARY DESTINATION ${CMAKE_INSTALL_PREFIX}/lib diff --git a/R-package/AUTOCONF_UBUNTU_VERSION b/R-package/AUTOCONF_UBUNTU_VERSION index 212138979d46..2eee9f218a39 100644 --- a/R-package/AUTOCONF_UBUNTU_VERSION +++ b/R-package/AUTOCONF_UBUNTU_VERSION @@ -1 +1 @@ -2.69-11 +2.69-11.1 diff --git a/R-package/README.md b/R-package/README.md index cc41cd08878e..3dbe706a44ac 100644 --- a/R-package/README.md +++ b/R-package/README.md @@ -291,20 +291,20 @@ This section briefly explains the key files for building a CRAN package. To upda At build time, `configure` will be run and used to create a file `Makevars`, using `Makevars.in` as a template. 1. Edit `configure.ac`. -2. Create `configure` with `autoconf`. Do not edit it by hand. This file must be generated on Ubuntu 18.04. +2. 
Create `configure` with `autoconf`. Do not edit it by hand. This file must be generated on Ubuntu 20.04. - If you have an Ubuntu 18.04 environment available, run the provided script from the root of the `LightGBM` repository. + If you have an Ubuntu 20.04 environment available, run the provided script from the root of the `LightGBM` repository. ```shell ./R-package/recreate-configure.sh ``` - If you do not have easy access to an Ubuntu 18.04 environment, the `configure` script can be generated using Docker by running the code below from the root of this repo. + If you do not have easy access to an Ubuntu 20.04 environment, the `configure` script can be generated using Docker by running the code below from the root of this repo. ```shell docker run \ -v $(pwd):/opt/LightGBM \ - -t ubuntu:18.04 \ + -t ubuntu:20.04 \ /bin/bash -c "cd /opt/LightGBM && ./R-package/recreate-configure.sh" ``` diff --git a/R-package/configure b/R-package/configure index f12a9a6387b8..d88339576316 100755 --- a/R-package/configure +++ b/R-package/configure @@ -1,6 +1,6 @@ #! /bin/sh # Guess values for system-dependent variables and create Makefiles. -# Generated by GNU Autoconf 2.69 for lightgbm 3.1.1.99. +# Generated by GNU Autoconf 2.69 for lightgbm 3.2.0.99. # # # Copyright (C) 1992-1996, 1998-2012 Free Software Foundation, Inc. @@ -576,8 +576,8 @@ MAKEFLAGS= # Identity of this package. PACKAGE_NAME='lightgbm' PACKAGE_TARNAME='lightgbm' -PACKAGE_VERSION='3.1.1.99' -PACKAGE_STRING='lightgbm 3.1.1.99' +PACKAGE_VERSION='3.2.0.99' +PACKAGE_STRING='lightgbm 3.2.0.99' PACKAGE_BUGREPORT='' PACKAGE_URL='' @@ -1182,7 +1182,7 @@ if test "$ac_init_help" = "long"; then # Omit some internal or obsolete options to make the list less imposing. # This message is too long to be a string in the A/UX 3.1 sh. cat <<_ACEOF -\`configure' configures lightgbm 3.1.1.99 to adapt to many kinds of systems. +\`configure' configures lightgbm 3.2.0.99 to adapt to many kinds of systems. Usage: $0 [OPTION]... [VAR=VALUE]... @@ -1244,7 +1244,7 @@ fi if test -n "$ac_init_help"; then case $ac_init_help in - short | recursive ) echo "Configuration of lightgbm 3.1.1.99:";; + short | recursive ) echo "Configuration of lightgbm 3.2.0.99:";; esac cat <<\_ACEOF @@ -1311,7 +1311,7 @@ fi test -n "$ac_init_help" && exit $ac_status if $ac_init_version; then cat <<\_ACEOF -lightgbm configure 3.1.1.99 +lightgbm configure 3.2.0.99 generated by GNU Autoconf 2.69 Copyright (C) 2012 Free Software Foundation, Inc. @@ -1328,7 +1328,7 @@ cat >config.log <<_ACEOF This file contains any messages produced by compilers while running configure, to aid debugging if configure makes a mistake. -It was created by lightgbm $as_me 3.1.1.99, which was +It was created by lightgbm $as_me 3.2.0.99, which was generated by GNU Autoconf 2.69. Invocation command line was $ $0 $@ @@ -2395,7 +2395,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1 # report actual input values of CONFIG_FILES etc. instead of their # values after options handling. ac_log=" -This file was extended by lightgbm $as_me 3.1.1.99, which was +This file was extended by lightgbm $as_me 3.2.0.99, which was generated by GNU Autoconf 2.69. 
Invocation command line was CONFIG_FILES = $CONFIG_FILES @@ -2448,7 +2448,7 @@ _ACEOF cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1 ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`" ac_cs_version="\\ -lightgbm config.status 3.1.1.99 +lightgbm config.status 3.2.0.99 configured by $0, generated by GNU Autoconf 2.69, with options \\"\$ac_cs_config\\" diff --git a/R-package/cran-comments.md b/R-package/cran-comments.md index 9f29ecf9dcdf..bd713391b054 100644 --- a/R-package/cran-comments.md +++ b/R-package/cran-comments.md @@ -1,5 +1,11 @@ # CRAN Submission History +## v3.2.0 - Submission 1 - (TBD) + +### CRAN response + +### Maintainer Notes + ## v3.1.1 - Submission 1 - (December 7, 2020) ### CRAN response diff --git a/R-package/recreate-configure.sh b/R-package/recreate-configure.sh index 12d6e12d2903..2df5ffa64f6f 100755 --- a/R-package/recreate-configure.sh +++ b/R-package/recreate-configure.sh @@ -1,7 +1,7 @@ #!/bin/bash # recreates 'configure' from 'configure.ac' -# this script should run on Ubuntu 18.04 +# this script should run on Ubuntu 20.04 AUTOCONF_VERSION=$(cat R-package/AUTOCONF_UBUNTU_VERSION) # R packages cannot have versions like 3.0.0rc1, but diff --git a/VERSION.txt b/VERSION.txt index 1dd983a8d060..6d4caf4466e3 100644 --- a/VERSION.txt +++ b/VERSION.txt @@ -1 +1 @@ -3.1.1.99 +3.2.0.99 diff --git a/build-cran-package.sh b/build-cran-package.sh index 72e61f7c1c9f..318b96fde22e 100755 --- a/build-cran-package.sh +++ b/build-cran-package.sh @@ -63,7 +63,6 @@ cd ${TEMP_R_DIR} # Remove files not needed for CRAN echo "Removing files not needed for CRAN" rm src/install.libs.R - rm -r src/cmake/ rm -r inst/ rm -r pkgdown/ rm cran-comments.md diff --git a/build_r.R b/build_r.R index d96447c2e604..a4eeee228c6b 100644 --- a/build_r.R +++ b/build_r.R @@ -316,6 +316,14 @@ for (submodule in list.dirs( } # copy files into the place CMake expects +CMAKE_MODULES_R_DIR <- file.path(TEMP_SOURCE_DIR, "cmake", "modules") +dir.create(CMAKE_MODULES_R_DIR, recursive = TRUE) +result <- file.copy( + from = file.path("cmake", "modules", "FindLibR.cmake") + , to = sprintf("%s/", CMAKE_MODULES_R_DIR) + , overwrite = TRUE +) +.handle_result(result) for (src_file in c("lightgbm_R.cpp", "lightgbm_R.h", "R_object_helper.h")) { result <- file.copy( from = file.path(TEMP_SOURCE_DIR, src_file) diff --git a/CMakeIntegratedOpenCL.cmake b/cmake/IntegratedOpenCL.cmake similarity index 100% rename from CMakeIntegratedOpenCL.cmake rename to cmake/IntegratedOpenCL.cmake diff --git a/cmake/Sanitizer.cmake b/cmake/Sanitizer.cmake new file mode 100644 index 000000000000..9e1a94bc1569 --- /dev/null +++ b/cmake/Sanitizer.cmake @@ -0,0 +1,61 @@ +# Set appropriate compiler and linker flags for sanitizers. 
+# +# Usage of this module: +# enable_sanitizers("address;leak") + +# Add flags +macro(enable_sanitizer sanitizer) + if(${sanitizer} MATCHES "address") + find_package(ASan REQUIRED) + set(SAN_COMPILE_FLAGS "${SAN_COMPILE_FLAGS} -fsanitize=address") + link_libraries(${ASan_LIBRARY}) + + elseif(${sanitizer} MATCHES "thread") + find_package(TSan REQUIRED) + set(SAN_COMPILE_FLAGS "${SAN_COMPILE_FLAGS} -fsanitize=thread") + link_libraries(${TSan_LIBRARY}) + + elseif(${sanitizer} MATCHES "leak") + find_package(LSan REQUIRED) + set(SAN_COMPILE_FLAGS "${SAN_COMPILE_FLAGS} -fsanitize=leak") + link_libraries(${LSan_LIBRARY}) + + elseif(${sanitizer} MATCHES "undefined") + find_package(UBSan REQUIRED) + set(SAN_COMPILE_FLAGS "${SAN_COMPILE_FLAGS} -fsanitize=undefined -fno-sanitize-recover=undefined") + link_libraries(${UBSan_LIBRARY}) + + else() + message(FATAL_ERROR "Santizer ${sanitizer} not supported.") + endif() +endmacro() + +macro(enable_sanitizers SANITIZERS) + # Check sanitizers compatibility. + foreach ( _san ${SANITIZERS} ) + string(TOLOWER ${_san} _san) + if (_san MATCHES "thread") + if (${_use_other_sanitizers}) + message(FATAL_ERROR + "thread sanitizer is not compatible with ${_san} sanitizer.") + endif() + set(_use_thread_sanitizer 1) + else () + if (${_use_thread_sanitizer}) + message(FATAL_ERROR + "${_san} sanitizer is not compatible with thread sanitizer.") + endif() + set(_use_other_sanitizers 1) + endif() + endforeach() + + message(STATUS "Sanitizers: ${SANITIZERS}") + + foreach( _san ${SANITIZERS} ) + string(TOLOWER ${_san} _san) + enable_sanitizer(${_san}) + endforeach() + message(STATUS "Sanitizers compile flags: ${SAN_COMPILE_FLAGS}") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${SAN_COMPILE_FLAGS}") + set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${SAN_COMPILE_FLAGS}") +endmacro() diff --git a/cmake/modules/FindASan.cmake b/cmake/modules/FindASan.cmake new file mode 100644 index 000000000000..660dfa3a15b2 --- /dev/null +++ b/cmake/modules/FindASan.cmake @@ -0,0 +1,9 @@ +set(ASan_LIB_NAME ASan) + +find_library(ASan_LIBRARY + NAMES libasan.so libasan.so.5 libasan.so.4 libasan.so.3 libasan.so.2 libasan.so.1 libasan.so.0 libasan.so.0.0.0 + PATHS ${SANITIZER_PATH} /usr/lib64 /usr/lib /usr/local/lib64 /usr/local/lib ${CMAKE_PREFIX_PATH}/lib) + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args(ASan DEFAULT_MSG + ASan_LIBRARY) diff --git a/cmake/modules/FindLSan.cmake b/cmake/modules/FindLSan.cmake new file mode 100644 index 000000000000..1b69fb7aa74a --- /dev/null +++ b/cmake/modules/FindLSan.cmake @@ -0,0 +1,9 @@ +set(LSan_LIB_NAME lsan) + +find_library(LSan_LIBRARY + NAMES liblsan.so liblsan.so.0 liblsan.so.0.0.0 + PATHS ${SANITIZER_PATH} /usr/lib64 /usr/lib /usr/local/lib64 /usr/local/lib ${CMAKE_PREFIX_PATH}/lib) + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args(LSan DEFAULT_MSG + LSan_LIBRARY) diff --git a/R-package/src/cmake/modules/FindLibR.cmake b/cmake/modules/FindLibR.cmake similarity index 100% rename from R-package/src/cmake/modules/FindLibR.cmake rename to cmake/modules/FindLibR.cmake diff --git a/cmake/modules/FindTSan.cmake b/cmake/modules/FindTSan.cmake new file mode 100644 index 000000000000..0fd26ace0781 --- /dev/null +++ b/cmake/modules/FindTSan.cmake @@ -0,0 +1,9 @@ +set(TSan_LIB_NAME tsan) + +find_library(TSan_LIBRARY + NAMES libtsan.so libtsan.so.0 libtsan.so.0.0.0 + PATHS ${SANITIZER_PATH} /usr/lib64 /usr/lib /usr/local/lib64 /usr/local/lib ${CMAKE_PREFIX_PATH}/lib) + +include(FindPackageHandleStandardArgs) 
+find_package_handle_standard_args(TSan DEFAULT_MSG + TSan_LIBRARY) diff --git a/cmake/modules/FindUBSan.cmake b/cmake/modules/FindUBSan.cmake new file mode 100644 index 000000000000..400007a86ae2 --- /dev/null +++ b/cmake/modules/FindUBSan.cmake @@ -0,0 +1,9 @@ +set(UBSan_LIB_NAME UBSan) + +find_library(UBSan_LIBRARY + NAMES libubsan.so libubsan.so.1 libubsan.so.0 libubsan.so.0.0.0 + PATHS ${SANITIZER_PATH} /usr/lib64 /usr/lib /usr/local/lib64 /usr/local/lib ${CMAKE_PREFIX_PATH}/lib) + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args(UBSan DEFAULT_MSG + UBSan_LIBRARY) diff --git a/docs/FAQ.rst b/docs/FAQ.rst index e90776347b01..9b3cd7e45797 100644 --- a/docs/FAQ.rst +++ b/docs/FAQ.rst @@ -22,6 +22,7 @@ You may also ping a member of the core team according to the relevant area of ex - `@guolinke `__ **Guolin Ke** (C++ code / R-package / Python-package) - `@chivee `__ **Qiwei Ye** (C++ code / Python-package) +- `@shiyu1994 `__ **Yu Shi** (C++ code / Python-package) - `@btrotta `__ **Belinda Trotta** (C++ code) - `@Laurae2 `__ **Damien Soukhavong** (R-package) - `@jameslamb `__ **James Lamb** (R-package / Dask-package) diff --git a/docs/Parallel-Learning-Guide.rst b/docs/Parallel-Learning-Guide.rst index 550d8b1dfea2..7bcb3fdfa865 100644 --- a/docs/Parallel-Learning-Guide.rst +++ b/docs/Parallel-Learning-Guide.rst @@ -62,6 +62,288 @@ Dask LightGBM's Python package supports distributed learning via `Dask`_. This integration is maintained by LightGBM's maintainers. +.. warning:: + + Dask integration is only tested on Linux. + +Dask Examples +''''''''''''' + +For sample code using ``lightgbm.dask``, see `these Dask examples`_. + +Training with Dask +'''''''''''''''''' + +This section contains detailed information on performing LightGBM distributed training using Dask. + +Configuring the Dask Cluster +**************************** + +**Allocating Threads** + +When setting up a Dask cluster for training, give each Dask worker process at least two threads. If you do not do this, training might be substantially slower because communication work and training work will block each other. + +If you do not have other significant processes competing with Dask for resources, just accept the default ``nthreads`` from your chosen ``dask.distributed`` cluster. + +.. code:: python + + from distributed import Client, LocalCluster + + cluster = LocalCluster(n_workers=3) + client = Client(cluster) + +**Managing Memory** + +Use the Dask diagnostic dashboard or your preferred monitoring tool to monitor Dask workers' memory consumption during training. As described in `the Dask worker documentation`_, Dask workers will automatically start spilling data to disk if memory consumption gets too high. This can substantially slow down computations, since disk I/O is usually much slower than reading the same data from memory. + + `At 60% of memory load, [Dask will] spill least recently used data to disk` + +To reduce the risk of hitting memory limits, consider restarting each worker process before running any data loading or training code. + +.. code:: python + + client.restart() + +Setting Up Training Data +************************* + +The estimators in ``lightgbm.dask`` expect that matrix-like or array-like data are provided in Dask DataFrame, Dask Array, or (in some cases) Dask Series format. See `the Dask DataFrame documentation`_ and `the Dask Array documentation`_ for more information on how to create such data structures. + +.. 
image:: ./_static/images/dask-initial-setup.svg + :align: center + :width: 600px + :alt: On the left, rectangles showing a 5 by 5 grid for a local dataset. On the right, two circles representing Dask workers, one with a 3 by 5 grid and one with a 2 by 5 grid. + :target: ./_static/images/dask-initial-setup.svg + +While setting up for training, ``lightgbm`` will concatenate all of the partitions on a worker into a single dataset. Distributed training then proceeds with one LightGBM worker process per Dask worker. + +.. image:: ./_static/images/dask-concat.svg + :align: center + :width: 600px + :alt: A section labeled "before" showing two grids and a section labeled "after" showing a single grid that looks like the two from "before" stacked one on top of the other. + :target: ./_static/images/dask-concat.svg + +When setting up data partitioning for LightGBM training with Dask, try to follow these suggestions: + +* ensure that each worker in the cluster has some of the training data +* try to give each worker roughly the same amount of data, especially if your dataset is small +* if you plan to train multiple models (for example, to tune hyperparameters) on the same data, use ``client.persist()`` before training to materialize the data one time + +Using a Specific Dask Client +**************************** + +In most situations, you should not need to tell ``lightgbm.dask`` to use a specific Dask client. By default, the client returned by ``distributed.default_client()`` will be used. + +However, you might want to explicitly control the Dask client used by LightGBM if you have multiple active clients in the same session. This is useful in more complex workflows like running multiple training jobs on different Dask clusters. + +LightGBM's Dask estimators support setting an attribute ``client`` to control the client that is used. + +.. code:: python + + import lightgbm as lgb + from distributed import Client, LocalCluster + + cluster = LocalCluster() + client = Client(cluster) + + # option 1: keyword argument in constructor + dask_model = lgb.DaskLGBMClassifier(client=client) + + # option 2: set_params() after construction + dask_model = lgb.DaskLGBMClassifier() + dask_model.set_params(client=client) + +Using Specific Ports +******************** + +At the beginning of training, ``lightgbm.dask`` sets up a LightGBM network where each Dask worker runs one long-running task that acts as a LightGBM worker. During training, LightGBM workers communicate with each other over TCP sockets. By default, random open ports are used when creating these sockets. + +If the communication between Dask workers in the cluster used for training is restricted by firewall rules, you must tell LightGBM exactly what ports to use. + +**Option 1: provide a specific list of addresses and ports** + +LightGBM supports a parameter ``machines``, a comma-delimited string where each entry refers to one worker (host name or IP) and a port that that worker will accept connections on. If you provide this parameter to the estimators in ``lightgbm.dask``, LightGBM will not search randomly for ports. + +For example, consider the case where you are running one Dask worker process on each of the following IP addresses: + +:: + + 10.0.1.0 + 10.0.2.0 + 10.0.3.0 + +You could edit your firewall rules to allow traffic on one additional port on each of these hosts, then provide ``machines`` directly. + +.. 
code:: python + + import lightgbm as lgb + + machines = "10.0.1.0:12401,10.0.2.0:12402,10.0.3.0:15000" + dask_model = lgb.DaskLGBMRegressor(machines=machines) + +If you are running multiple Dask worker processes on a single physical host in the cluster, be sure that there are multiple entries for that IP address, with different ports. For example, if you were running a cluster with ``nprocs=2`` (2 Dask worker processes per machine), you might open two additional ports on each of these hosts, then provide ``machines`` as follows. + +.. code:: python + + import lightgbm as lgb + + machines = ",".join([ + "10.0.1.0:16000", + "10.0.1.0:16001", + "10.0.2.0:16000", + "10.0.2.0:16001", + ]) + dask_model = lgb.DaskLGBMRegressor(machines=machines) + +.. warning:: + + Providing ``machines`` gives you complete control over the networking details of training, but it also makes the training process fragile. Training will fail if you use ``machines`` and any of the following are true: + + * any of the ports mentioned in ``machines`` are not open when training begins + * some partitions of the training data are held by machines that are not present in ``machines`` + * some machines mentioned in ``machines`` do not hold any of the training data + +**Option 2: specify one port to use on every worker** + +If you are only running one Dask worker process on each host, and if you can reliably identify a port that is open on every host, using ``machines`` is unnecessarily complicated. If ``local_listen_port`` is given and ``machines`` is not, LightGBM will not search for ports randomly, but it will limit the list of addresses in the LightGBM network to those Dask workers that have a piece of the training data. + +For example, consider the case where you are running one Dask worker process on each of the following IP addresses: + +:: + + 10.0.1.0 + 10.0.2.0 + 10.0.3.0 + +You could edit your firewall rules to allow communication between any of the workers over one port, then provide that port via parameter ``local_listen_port``. + +.. code:: python + + import lightgbm as lgb + + dask_model = lgb.DaskLGBMRegressor(local_listen_port=12400) + +.. warning:: + + Providing ``local_listen_port`` is slightly less fragile than ``machines`` because LightGBM will automatically figure out which workers have pieces of the training data. However, using this method, training can fail if any of the following are true: + + * the port ``local_listen_port`` is not open on any of the worker hosts + * any machine has multiple Dask worker processes running on it + +Prediction with Dask +'''''''''''''''''''' + +The estimators from ``lightgbm.dask`` can be used to create predictions based on data stored in Dask collections. In that interface, ``.predict()`` expects a Dask Array or Dask DataFrame, and returns a Dask Array of predictions. + +See `the Dask prediction example`_ for some sample code that shows how to perform Dask-based prediction. + +For model evaluation, consider using `the metrics functions from dask-ml`_. Those functions are intended to provide the same API as equivalent functions in ``sklearn.metrics``, but they use distributed computation powered by Dask to compute metrics without all of the input data ever needing to be on a single machine. + +Saving Dask Models +'''''''''''''''''' + +After training with Dask, you have several options for saving a fitted model. + +**Option 1: pickle the Dask estimator** + +LightGBM's Dask estimators can be pickled directly with ``cloudpickle``, ``joblib``, or ``pickle``. + +..
code:: python + + import dask.array as da + import pickle + import lightgbm as lgb + from distributed import Client, LocalCluster + + cluster = LocalCluster(n_workers=2) + client = Client(cluster) + + X = da.random.random((1000, 10), (500, 10)) + y = da.random.random((1000,), (500,)) + + dask_model = lgb.DaskLGBMRegressor() + dask_model.fit(X, y) + + with open("dask-model.pkl", "wb") as f: + pickle.dump(dask_model, f) + +A model saved this way can then later be loaded with whichever serialization library you used to save it. + +.. code:: python + + import pickle + with open("dask-model.pkl", "rb") as f: + dask_model = pickle.load(f) + +.. note:: + + If you explicitly set a Dask client (see `Using a Specific Dask Client <#using-a-specific-dask-client>`__), it will not be saved when pickling the estimator. When loading a Dask estimator from disk, if you need to use a specific client you can add it after loading with ``dask_model.set_params(client=client)``. + +**Option 2: pickle the sklearn estimator** + +The estimators available from ``lightgbm.dask`` can be converted to an instance of the equivalent class from ``lightgbm.sklearn``. Choosing this option allows you to use Dask for training but avoid depending on any Dask libraries at scoring time. + +.. code:: python + + import dask.array as da + import joblib + import lightgbm as lgb + from distributed import Client, LocalCluster + + cluster = LocalCluster(n_workers=2) + client = Client(cluster) + + X = da.random.random((1000, 10), (500, 10)) + y = da.random.random((1000,), (500,)) + + dask_model = lgb.DaskLGBMRegressor() + dask_model.fit(X, y) + + # convert to sklearn equivalent + sklearn_model = dask_model.to_local() + + print(type(sklearn_model)) + #> lightgbm.sklearn.LGBMRegressor + + joblib.dump(sklearn_model, "sklearn-model.joblib") + +A model saved this way can then later be loaded with whichever serialization library you used to save it. + +.. code:: python + + import joblib + + sklearn_model = joblib.load("sklearn-model.joblib") + +**Option 3: save the LightGBM Booster** + +The lowest-level model object in LightGBM is the ``lightgbm.Booster``. After training, you can extract a Booster from the Dask estimator. + +.. code:: python + + import dask.array as da + import lightgbm as lgb + from distributed import Client, LocalCluster + + cluster = LocalCluster(n_workers=2) + client = Client(cluster) + + X = da.random.random((1000, 10), (500, 10)) + y = da.random.random((1000,), (500,)) + + dask_model = lgb.DaskLGBMRegressor() + dask_model.fit(X, y) + + # get underlying Booster object + bst = dask_model.booster_ + +From this point forward, you can use any of the following methods to save the Booster: + +* serialize with ``cloudpickle``, ``joblib``, or ``pickle`` +* ``bst.dump_model()``: dump the model to a dictionary which could be written out as JSON +* ``bst.model_to_string()``: dump the model to a string in memory +* ``bst.save_model()``: write the output of ``bst.model_to_string()`` to a text file + Kubeflow ^^^^^^^^ @@ -175,8 +457,20 @@ Example .. _this MMLSpark example: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/LightGBM%20-%20Quantile%20Regression%20for%20Drug%20Discovery.ipynb +.. _the Dask Array documentation: https://docs.dask.org/en/latest/array.html + +.. _the Dask DataFrame documentation: https://docs.dask.org/en/latest/dataframe.html + +.. _the Dask prediction example: https://github.com/microsoft/lightgbm/tree/master/examples/python-guide/dask/prediction.py + +..
_the Dask worker documentation: https://distributed.dask.org/en/latest/worker.html#memory-management + +.. _the metrics functions from dask-ml: https://ml.dask.org/modules/api.html#dask-ml-metrics-metrics + .. _the MMLSpark Documentation: https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md +.. _these Dask examples: https://github.com/microsoft/lightgbm/tree/master/examples/python-guide/dask + .. _Kubeflow Fairing: https://www.kubeflow.org/docs/components/fairing/fairing-overview .. _These examples: https://github.com/kubeflow/fairing/tree/master/examples/lightgbm diff --git a/docs/Parameters-Tuning.rst b/docs/Parameters-Tuning.rst index db333318920d..0171f456c967 100644 --- a/docs/Parameters-Tuning.rst +++ b/docs/Parameters-Tuning.rst @@ -70,6 +70,7 @@ LightGBM adds nodes to trees based on the gain from adding that node, regardless .. image:: ./_static/images/leaf-wise.png :align: center + :alt: Three consecutive images of decision trees, where each shows the tree with an additional two leaf nodes added. Shows that leaf-wise growth can result in trees that have some branches which are longer than others. Because of this growth strategy, it isn't straightforward to use ``max_depth`` alone to limit the complexity of trees. The ``num_leaves`` parameter sets the maximum number of nodes per tree. Decrease ``num_leaves`` to reduce training time. diff --git a/docs/Parameters.rst b/docs/Parameters.rst index 41d6ef6cc62e..258a63608d49 100644 --- a/docs/Parameters.rst +++ b/docs/Parameters.rst @@ -229,6 +229,8 @@ Core Parameters - **Note**: setting this to ``true`` may slow down the training + - **Note**: to avoid potential instability due to numerical issues, please set ``force_col_wise=true`` or ``force_row_wise=true`` when setting ``deterministic=true`` + Learning Control Parameters --------------------------- diff --git a/docs/_static/images/dask-concat.svg b/docs/_static/images/dask-concat.svg new file mode 100644 index 000000000000..a230535d50c2 --- /dev/null +++ b/docs/_static/images/dask-concat.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/_static/images/dask-initial-setup.svg b/docs/_static/images/dask-initial-setup.svg new file mode 100644 index 000000000000..5ffe85b87397 --- /dev/null +++ b/docs/_static/images/dask-initial-setup.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index 96b0b92be8f5..54ca853e5539 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -6,6 +6,7 @@ .. image:: ./logo/LightGBM_logo_black_text.svg :align: center :width: 600 + :alt: Light Gradient Boosting Machine logo. | diff --git a/examples/python-guide/README.md b/examples/python-guide/README.md index aba3c9f51d7a..08ded17ab559 100644 --- a/examples/python-guide/README.md +++ b/examples/python-guide/README.md @@ -19,6 +19,7 @@ python simple_example.py Examples include: +- [`dask/`](./dask): examples using Dask for distributed training - [simple_example.py](https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py) - Construct Dataset - Basic train and predict diff --git a/examples/python-guide/dask/README.md b/examples/python-guide/dask/README.md new file mode 100644 index 000000000000..c0c2639b7d36 --- /dev/null +++ b/examples/python-guide/dask/README.md @@ -0,0 +1,25 @@ +Dask Examples +============= + +This directory contains examples of machine learning workflows with LightGBM and [Dask](https://dask.org/). 
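To give a sense of the common pattern before opening the individual scripts, here is a condensed sketch of what each training example does: start a local Dask cluster, turn the training data into Dask collections, and fit one of the `lightgbm.dask` estimators. It is adapted from [regression.py](./regression.py) in this directory and is only meant as orientation, not a replacement for the full examples.

```python
import dask.array as da
from distributed import Client, LocalCluster
from sklearn.datasets import make_regression

import lightgbm as lgb

# start a two-worker cluster on the local machine and connect to it
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# create a small regression dataset and partition it into Dask Arrays
X, y = make_regression(n_samples=1000, n_features=50)
dX = da.from_array(X, chunks=(100, 50))
dy = da.from_array(y, chunks=(100,))

# train a distributed LightGBM model on the Dask collections
dask_model = lgb.DaskLGBMRegressor(n_estimators=10)
dask_model.fit(dX, dy)
```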
+ +Before running this code, see [the installation instructions for the Dask-package](https://github.com/microsoft/LightGBM/tree/master/python-package#install-dask-package). + +After installing the package and its dependencies, any of the examples here can be run with a command like this: + +```shell +python binary-classification.py +``` + +The examples listed below contain minimal code showing how to train LightGBM models using Dask. + +**Training** + +* [binary-classification.py](./binary-classification.py) +* [multiclass-classification.py](./multiclass-classification.py) +* [ranking.py](./ranking.py) +* [regression.py](./regression.py) + +**Prediction** + +* [prediction.py](./prediction.py) diff --git a/examples/python-guide/dask/binary-classification.py b/examples/python-guide/dask/binary-classification.py new file mode 100644 index 000000000000..4de9245d4472 --- /dev/null +++ b/examples/python-guide/dask/binary-classification.py @@ -0,0 +1,30 @@ +import dask.array as da +from distributed import Client, LocalCluster +from sklearn.datasets import make_blobs + +import lightgbm as lgb + +if __name__ == "__main__": + print("loading data") + + X, y = make_blobs(n_samples=1000, n_features=50, centers=2) + + print("initializing a Dask cluster") + + cluster = LocalCluster() + client = Client(cluster) + + print("created a Dask LocalCluster") + + print("distributing training data on the Dask cluster") + + dX = da.from_array(X, chunks=(100, 50)) + dy = da.from_array(y, chunks=(100,)) + + print("beginning training") + + dask_model = lgb.DaskLGBMClassifier(n_estimators=10) + dask_model.fit(dX, dy) + assert dask_model.fitted_ + + print("done training") diff --git a/examples/python-guide/dask/multiclass-classification.py b/examples/python-guide/dask/multiclass-classification.py new file mode 100644 index 000000000000..bcda9589ab84 --- /dev/null +++ b/examples/python-guide/dask/multiclass-classification.py @@ -0,0 +1,30 @@ +import dask.array as da +from distributed import Client, LocalCluster +from sklearn.datasets import make_blobs + +import lightgbm as lgb + +if __name__ == "__main__": + print("loading data") + + X, y = make_blobs(n_samples=1000, n_features=50, centers=3) + + print("initializing a Dask cluster") + + cluster = LocalCluster(n_workers=2) + client = Client(cluster) + + print("created a Dask LocalCluster") + + print("distributing training data on the Dask cluster") + + dX = da.from_array(X, chunks=(100, 50)) + dy = da.from_array(y, chunks=(100,)) + + print("beginning training") + + dask_model = lgb.DaskLGBMClassifier(n_estimators=10) + dask_model.fit(dX, dy) + assert dask_model.fitted_ + + print("done training") diff --git a/examples/python-guide/dask/prediction.py b/examples/python-guide/dask/prediction.py new file mode 100644 index 000000000000..a4cb5cd8592e --- /dev/null +++ b/examples/python-guide/dask/prediction.py @@ -0,0 +1,48 @@ +import dask.array as da +from distributed import Client, LocalCluster +from sklearn.datasets import make_regression +from sklearn.metrics import mean_squared_error + +import lightgbm as lgb + +if __name__ == "__main__": + print("loading data") + + X, y = make_regression(n_samples=1000, n_features=50) + + print("initializing a Dask cluster") + + cluster = LocalCluster(n_workers=2) + client = Client(cluster) + + print("created a Dask LocalCluster") + + print("distributing training data on the Dask cluster") + + dX = da.from_array(X, chunks=(100, 50)) + dy = da.from_array(y, chunks=(100,)) + + print("beginning training") + + dask_model = 
lgb.DaskLGBMRegressor(n_estimators=10) + dask_model.fit(dX, dy) + assert dask_model.fitted_ + + print("done training") + + print("predicting on the training data") + + preds = dask_model.predict(dX) + + # the code below uses sklearn.metrics, but this requires pulling all of the + # predictions and target values back from workers to the client + # + # for larger datasets, consider the metrics from dask-ml instead + # https://ml.dask.org/modules/api.html#dask-ml-metrics-metrics + print("computing MSE") + + preds_local = preds.compute() + actuals_local = dy.compute() + mse = mean_squared_error(actuals_local, preds_local) + + print(f"MSE: {mse}") diff --git a/examples/python-guide/dask/ranking.py b/examples/python-guide/dask/ranking.py new file mode 100644 index 000000000000..5693ed9a5b67 --- /dev/null +++ b/examples/python-guide/dask/ranking.py @@ -0,0 +1,62 @@ +import os + +import dask.array as da +import numpy as np +from distributed import Client, LocalCluster +from sklearn.datasets import load_svmlight_file + +import lightgbm as lgb + +if __name__ == "__main__": + print("loading data") + + X, y = load_svmlight_file(os.path.join(os.path.dirname(os.path.realpath(__file__)), + '../../lambdarank/rank.train')) + group = np.loadtxt(os.path.join(os.path.dirname(os.path.realpath(__file__)), + '../../lambdarank/rank.train.query')) + + print("initializing a Dask cluster") + + cluster = LocalCluster(n_workers=2) + client = Client(cluster) + + print("created a Dask LocalCluster") + + print("distributing training data on the Dask cluster") + + # split training data into two partitions + rows_in_part1 = int(np.sum(group[:100])) + rows_in_part2 = X.shape[0] - rows_in_part1 + num_features = X.shape[1] + + # make this array dense because we're splitting across + # a sparse boundary to partition the data + X = X.todense() + + dX = da.from_array( + x=X, + chunks=[ + (rows_in_part1, rows_in_part2), + (num_features,) + ] + ) + dy = da.from_array( + x=y, + chunks=[ + (rows_in_part1, rows_in_part2), + ] + ) + dg = da.from_array( + x=group, + chunks=[ + (100, group.size - 100) + ] + ) + + print("beginning training") + + dask_model = lgb.DaskLGBMRanker(n_estimators=10) + dask_model.fit(dX, dy, group=dg) + assert dask_model.fitted_ + + print("done training") diff --git a/examples/python-guide/dask/regression.py b/examples/python-guide/dask/regression.py new file mode 100644 index 000000000000..4d15547ff501 --- /dev/null +++ b/examples/python-guide/dask/regression.py @@ -0,0 +1,30 @@ +import dask.array as da +from distributed import Client, LocalCluster +from sklearn.datasets import make_regression + +import lightgbm as lgb + +if __name__ == "__main__": + print("loading data") + + X, y = make_regression(n_samples=1000, n_features=50) + + print("initializing a Dask cluster") + + cluster = LocalCluster(n_workers=2) + client = Client(cluster) + + print("created a Dask LocalCluster") + + print("distributing training data on the Dask cluster") + + dX = da.from_array(X, chunks=(100, 50)) + dy = da.from_array(y, chunks=(100,)) + + print("beginning training") + + dask_model = lgb.DaskLGBMRegressor(n_estimators=10) + dask_model.fit(dX, dy) + assert dask_model.fitted_ + + print("done training") diff --git a/include/LightGBM/config.h b/include/LightGBM/config.h index 559b52bd6559..66fe7319141d 100644 --- a/include/LightGBM/config.h +++ b/include/LightGBM/config.h @@ -236,6 +236,7 @@ struct Config { // desc = when you use the different seeds, different LightGBM versions, the binaries compiled by different compilers, or in 
different systems, the results are expected to be different // desc = you can `raise issues `__ in LightGBM GitHub repo when you meet the unstable results // desc = **Note**: setting this to ``true`` may slow down the training + // desc = **Note**: to avoid potential instability due to numerical issues, please set ``force_col_wise=true`` or ``force_row_wise=true`` when setting ``deterministic=true`` bool deterministic = false; #pragma endregion diff --git a/include/LightGBM/metric.h b/include/LightGBM/metric.h index 61d9fc99ea80..9d505d2768d1 100644 --- a/include/LightGBM/metric.h +++ b/include/LightGBM/metric.h @@ -103,6 +103,14 @@ class DCGCalculator { static double CalMaxDCGAtK(data_size_t k, const label_t* label, data_size_t num_data); + + /*! + * \brief Check the metadata for NDCG and lambdarank + * \param metadata Metadata + * \param num_queries Number of queries + */ + static void CheckMetadata(const Metadata& metadata, data_size_t num_queries); + /*! * \brief Check the label range for NDCG and lambdarank * \param label Pointer of label diff --git a/include/LightGBM/utils/chunked_array.hpp b/include/LightGBM/utils/chunked_array.hpp new file mode 100644 index 000000000000..6160dafa07af --- /dev/null +++ b/include/LightGBM/utils/chunked_array.hpp @@ -0,0 +1,260 @@ +/*! + * Copyright (c) 2021 Microsoft Corporation. All rights reserved. + * Licensed under the MIT License. See LICENSE file in the project root for license information. + * + * Author: Alberto Ferreira + */ +#ifndef LIGHTGBM_UTILS_CHUNKED_ARRAY_HPP_ +#define LIGHTGBM_UTILS_CHUNKED_ARRAY_HPP_ + +#include + +#include + +#include +#include +#include + + +namespace LightGBM { + +/** + * Container that manages a dynamic array of fixed-length chunks. + * + * The class also takes care of allocation & release of the underlying + * memory. It can be used with either a high or low-level API. + * + * The high-level API allocates chunks as needed, manages addresses automatically and keeps + * track of number of inserted elements, but is not thread-safe (this is ok as usually input is a streaming iterator). + * For parallel input sources the low-level API must be used. + * + * Note: When using this for `LGBM_DatasetCreateFromMats` use a + * chunk_size multiple of #num_cols for your dataset, so each chunk + * contains "complete" instances. + * + * === High-level insert API intro === + * + * The easiest way to use is: + * 0. ChunkedArray(chunk_size) # Choose appropriate size + * 1. add(value) # as many times as you want (will generate chunks as needed) + * 2. data() or void_data() # retrieves a T** or void** pointer (useful for `LGBM_DatasetCreateFromMats`). + * + * Useful query methods (all O(1)): + * - get_add_count() # total count of added elements. + * - get_chunks_count() # how many chunks are currently allocated. + * - get_current_chunk_added_count() # for the last add() chunk, how many items there are. + * - get_chunk_size() # get constant chunk_size from constructor call. + * + * With those you can generate int32_t sizes[]. Last chunk can be smaller than chunk_size, so, for any i: + * - sizes[i +class ChunkedArray { + public: + explicit ChunkedArray(size_t chunk_size) + : _chunk_size(chunk_size), _last_chunk_idx(0), _last_idx_in_last_chunk(0) { + if (chunk_size == 0) { + Log::Fatal("ChunkedArray chunk size must be larger than 0!"); + } + new_chunk(); + } + + ~ChunkedArray() { + release(); + } + + /** + * Adds a value to the chunks sequentially. + * If the last chunk is full it creates a new one and appends to it. 
+ * + * @param value value to insert. + */ + void add(T value) { + if (!within_bounds(_last_chunk_idx, _last_idx_in_last_chunk)) { + new_chunk(); + ++_last_chunk_idx; + _last_idx_in_last_chunk = 0; + } + + CHECK_EQ(setitem(_last_chunk_idx, _last_idx_in_last_chunk, value), 0); + ++_last_idx_in_last_chunk; + } + + /** + * @return Number of add() calls. + */ + size_t get_add_count() const { + return _last_chunk_idx * _chunk_size + _last_idx_in_last_chunk; + } + + /** + * @return Number of allocated chunks. + */ + size_t get_chunks_count() const { + return _chunks.size(); + } + + /** + * @return Number of elemends add()'ed in the last chunk. + */ + size_t get_last_chunk_add_count() const { + return _last_idx_in_last_chunk; + } + + /** + * Getter for the chunk size set at the constructor. + * + * @return Return the size of chunks. + */ + size_t get_chunk_size() const { + return _chunk_size; + } + + /** + * Returns the pointer to the raw chunks data. + * + * @return T** pointer to raw data. + */ + T **data() noexcept { + return _chunks.data(); + } + + /** + * Returns the pointer to the raw chunks data, but cast to void**. + * This is so ``LGBM_DatasetCreateFromMats`` accepts it. + * + * @return void** pointer to raw data. + */ + void **data_as_void() noexcept { + return reinterpret_cast(_chunks.data()); + } + + /** + * Coalesces (copies chunked data) to a contiguous array of the same type. + * It assumes that ``other`` has enough space to receive that data. + * + * @param other array with elements T of size >= this->get_add_count(). + * @param all_valid_addresses + * If true exports values from all valid addresses independently of add() count. + * Otherwise, exports only up to `get_add_count()` addresses. + */ + void coalesce_to(T *other, bool all_valid_addresses = false) const { + const size_t full_chunks = this->get_chunks_count() - 1; + + // Copy full chunks: + size_t i = 0; + for (size_t chunk = 0; chunk < full_chunks; ++chunk) { + T* chunk_ptr = _chunks[chunk]; + for (size_t in_chunk_idx = 0; in_chunk_idx < _chunk_size; ++in_chunk_idx) { + other[i++] = chunk_ptr[in_chunk_idx]; + } + } + // Copy filled values from last chunk only: + const size_t last_chunk_elems_to_copy = all_valid_addresses ? _chunk_size : this->get_last_chunk_add_count(); + T* chunk_ptr = _chunks[full_chunks]; + for (size_t in_chunk_idx = 0; in_chunk_idx < last_chunk_elems_to_copy; ++in_chunk_idx) { + other[i++] = chunk_ptr[in_chunk_idx]; + } + } + + /** + * Return value from array of chunks. + * + * @param chunk_index index of the chunk + * @param index_within_chunk index within chunk + * @param on_fail_value sentinel value. If out of bounds returns that value. + * + * @return pointer or nullptr if index is out of bounds. + */ + T getitem(size_t chunk_index, size_t index_within_chunk, T on_fail_value) const noexcept { + if (within_bounds(chunk_index, index_within_chunk)) + return _chunks[chunk_index][index_within_chunk]; + else + return on_fail_value; + } + + /** + * Sets the value at a specific address in one of the chunks. + * + * @param chunk_index index of the chunk + * @param index_within_chunk index within chunk + * @param value value to store + * + * @return 0 = success, -1 = out of bounds access. + */ + int setitem(size_t chunk_index, size_t index_within_chunk, T value) noexcept { + if (within_bounds(chunk_index, index_within_chunk)) { + _chunks[chunk_index][index_within_chunk] = value; + return 0; + } else { + return -1; + } + } + + /** + * To reset storage call this. 
+ * Will release existing resources and prepare for reuse. + */ + void clear() noexcept { + release(); + new_chunk(); + } + + /** + * Deletes all the allocated chunks. + * Do not use container after this! See ``clear()`` instead. + */ + void release() noexcept { + std::for_each(_chunks.begin(), _chunks.end(), [](T* c) { delete[] c; }); + _chunks.clear(); + _chunks.shrink_to_fit(); + _last_chunk_idx = 0; + _last_idx_in_last_chunk = 0; + } + + /** + * As the array is dynamic, checks whether a given address is currently within bounds. + * + * @param chunk_index index of the chunk + * @param index_within_chunk index within that chunk + * @return true if that chunk is already allocated and index_within_chunk < chunk size. + */ + inline bool within_bounds(size_t chunk_index, size_t index_within_chunk) const { + return (chunk_index < _chunks.size()) && (index_within_chunk < _chunk_size); + } + + /** + * Adds a new chunk to the array of chunks. Not thread-safe. + */ + void new_chunk() { + _chunks.push_back(new (std::nothrow) T[_chunk_size]); + + // Check memory allocation success: + if (!_chunks[_chunks.size()-1]) { + release(); + Log::Fatal("Memory exhausted! Cannot allocate new ChunkedArray chunk."); + } + } + + private: + const size_t _chunk_size; + std::vector _chunks; + + // For the add() interface & some of the get_*() queries: + size_t _last_chunk_idx; // { // Fast (common) path: For numeric inputs in RFC 7159 format: const bool fast_parse_succeeded = fast_double_parser::parse_number(str.c_str(), &tmp); - // Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc. Fallback to standard library: + // Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc. if (!fast_parse_succeeded) { - std::stringstream ss; - Common::C_stringstream(ss); - ss << str; - ss >> tmp; + std::string strlower(str); + std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast(::tolower(c)); }); + if (strlower == std::string("inf")) + tmp = std::numeric_limits::infinity(); + else if (strlower == std::string("-inf")) + tmp = -std::numeric_limits::infinity(); + else if (strlower == std::string("nan")) + tmp = std::numeric_limits::quiet_NaN(); + else if (strlower == std::string("-nan")) + tmp = -std::numeric_limits::quiet_NaN(); + else + Log::Fatal("Failed to parse double: %s", str.c_str()); } return static_cast(tmp); diff --git a/python-package/MANIFEST.in b/python-package/MANIFEST.in index 7fad0fa42e82..7973e2f49f13 100644 --- a/python-package/MANIFEST.in +++ b/python-package/MANIFEST.in @@ -3,7 +3,7 @@ include LICENSE include *.rst *.txt recursive-include lightgbm VERSION.txt *.py *.so include compile/CMakeLists.txt -include compile/CMakeIntegratedOpenCL.cmake +include compile/cmake/IntegratedOpenCL.cmake recursive-include compile *.so recursive-include compile/Release *.dll include compile/external_libs/compute/CMakeLists.txt diff --git a/python-package/README.rst b/python-package/README.rst index 7f4d2d35726b..c6da99229b0e 100644 --- a/python-package/README.rst +++ b/python-package/README.rst @@ -204,6 +204,10 @@ You can use ``python setup.py bdist_wheel`` instead of ``python setup.py install Install Dask-package '''''''''''''''''''' +.. warning:: + + Dask-package is only tested on Linux. + To install all additional dependencies required for Dask-package, you can append ``[dask]`` to LightGBM package name: .. 
code:: sh diff --git a/python-package/lightgbm/basic.py b/python-package/lightgbm/basic.py index 0e3219208a1a..0bdf5057b833 100644 --- a/python-package/lightgbm/basic.py +++ b/python-package/lightgbm/basic.py @@ -9,7 +9,7 @@ from functools import wraps from logging import Logger from tempfile import NamedTemporaryFile -from typing import Any, Dict +from typing import Any, Dict, List, Set, Union import numpy as np import scipy.sparse @@ -2336,8 +2336,13 @@ def _free_buffer(self): self.__is_predicted_cur_iter = [] return self - def set_network(self, machines, local_listen_port=12400, - listen_time_out=120, num_machines=1): + def set_network( + self, + machines: Union[List[str], Set[str], str], + local_listen_port: int = 12400, + listen_time_out: int = 120, + num_machines: int = 1 + ) -> "Booster": """Set the network configuration. Parameters @@ -2356,6 +2361,8 @@ def set_network(self, machines, local_listen_port=12400, self : Booster Booster with set network. """ + if isinstance(machines, (list, set)): + machines = ','.join(machines) _safe_call(_LIB.LGBM_NetworkInit(c_str(machines), ctypes.c_int(local_listen_port), ctypes.c_int(listen_time_out), diff --git a/python-package/lightgbm/callback.py b/python-package/lightgbm/callback.py index c2db7a3cf991..b5f2438545a8 100644 --- a/python-package/lightgbm/callback.py +++ b/python-package/lightgbm/callback.py @@ -26,7 +26,7 @@ def __init__(self, best_iteration, best_score): # Callback environment used by callbacks CallbackEnv = collections.namedtuple( - "LightGBMCallbackEnv", + "CallbackEnv", ["model", "params", "iteration", diff --git a/python-package/lightgbm/compat.py b/python-package/lightgbm/compat.py index fbd5dd4e3c04..2b8781ff2b88 100644 --- a/python-package/lightgbm/compat.py +++ b/python-package/lightgbm/compat.py @@ -11,12 +11,12 @@ except ImportError: PANDAS_INSTALLED = False - class pd_Series: + class pd_Series: # type: ignore """Dummy class for pandas.Series.""" pass - class pd_DataFrame: + class pd_DataFrame: # type: ignore """Dummy class for pandas.DataFrame.""" pass @@ -49,7 +49,7 @@ class pd_DataFrame: except ImportError: DATATABLE_INSTALLED = False - class dt_DataTable: + class dt_DataTable: # type: ignore """Dummy class for datatable.DataTable.""" pass @@ -123,17 +123,17 @@ def _check_sample_weight(sample_weight, X, dtype=None): default_client = None wait = None - class dask_Array: + class dask_Array: # type: ignore """Dummy class for dask.array.Array.""" pass - class dask_DataFrame: + class dask_DataFrame: # type: ignore """Dummy class for dask.dataframe.DataFrame.""" pass - class dask_Series: + class dask_Series: # type: ignore """Dummy class for dask.dataframe.Series.""" pass diff --git a/python-package/lightgbm/dask.py b/python-package/lightgbm/dask.py index ee1daf7f6510..3510dea7fe92 100644 --- a/python-package/lightgbm/dask.py +++ b/python-package/lightgbm/dask.py @@ -9,7 +9,7 @@ import socket from collections import defaultdict from copy import deepcopy -from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Type, Union +from typing import Any, Callable, Dict, List, Optional, Type, Union from urllib.parse import urlparse import numpy as np @@ -105,12 +105,17 @@ def _train_part( else: group = None + if 'init_score' in list_of_parts[0]: + init_score = _concat([x['init_score'] for x in list_of_parts]) + else: + init_score = None + try: model = model_factory(**params) if is_ranker: - model.fit(data, label, sample_weight=weight, group=group, **kwargs) + model.fit(data, label, sample_weight=weight, 
init_score=init_score, group=group, **kwargs) else: - model.fit(data, label, sample_weight=weight, **kwargs) + model.fit(data, label, sample_weight=weight, init_score=init_score, **kwargs) finally: _safe_call(_LIB.LGBM_NetworkFree()) @@ -148,6 +153,10 @@ def _machines_to_worker_map(machines: str, worker_addresses: List[str]) -> Dict[ Dictionary where keys are work addresses in the form expected by Dask and values are a port for LightGBM to use. """ machine_addresses = machines.split(",") + + if len(set(machine_addresses)) != len(machine_addresses): + raise ValueError(f"Found duplicates in 'machines' ({machines}). Each entry in 'machines' must be a unique IP-port combination.") + machine_to_port = defaultdict(set) for address in machine_addresses: host, port = address.split(":") @@ -168,6 +177,7 @@ def _train( params: Dict[str, Any], model_factory: Type[LGBMModel], sample_weight: Optional[_DaskCollection] = None, + init_score: Optional[_DaskCollection] = None, group: Optional[_DaskCollection] = None, **kwargs: Any ) -> LGBMModel: @@ -187,6 +197,8 @@ def _train( Class of the local underlying model. sample_weight : Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None) Weights of training data. + init_score : Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None) + Init score of training data. group : Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None) Group/query data. Only used in the learning-to-rank task. @@ -289,6 +301,11 @@ def _train( for i in range(n_parts): parts[i]['group'] = group_parts[i] + if init_score is not None: + init_score_parts = _split_to_parts(data=init_score, is_matrix=False) + for i in range(n_parts): + parts[i]['init_score'] = init_score_parts[i] + # Start computation in the background parts = list(map(delayed, parts)) parts = client.compute(parts) @@ -540,6 +557,7 @@ def _lgb_dask_fit( X: _DaskMatrixLike, y: _DaskCollection, sample_weight: Optional[_DaskCollection] = None, + init_score: Optional[_DaskCollection] = None, group: Optional[_DaskCollection] = None, **kwargs: Any ) -> "_DaskLGBMModel": @@ -556,6 +574,7 @@ def _lgb_dask_fit( params=params, model_factory=model_factory, sample_weight=sample_weight, + init_score=init_score, group=group, **kwargs ) @@ -657,6 +676,7 @@ def fit( X: _DaskMatrixLike, y: _DaskCollection, sample_weight: Optional[_DaskCollection] = None, + init_score: Optional[_DaskCollection] = None, **kwargs: Any ) -> "DaskLGBMClassifier": """Docstring is inherited from the lightgbm.LGBMClassifier.fit.""" @@ -665,6 +685,7 @@ def fit( X=X, y=y, sample_weight=sample_weight, + init_score=init_score, **kwargs ) @@ -672,11 +693,12 @@ def fit( X_shape="Dask Array or Dask DataFrame of shape = [n_samples, n_features]", y_shape="Dask Array, Dask DataFrame or Dask Series of shape = [n_samples]", sample_weight_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)", + init_score_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)", group_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)" ) - # DaskLGBMClassifier does not support init_score, evaluation data, or early stopping - _base_doc = (_base_doc[:_base_doc.find('init_score :')] + # DaskLGBMClassifier does not support evaluation data, or early stopping + _base_doc = (_base_doc[:_base_doc.find('group :')] + 
_base_doc[_base_doc.find('verbose :'):]) # DaskLGBMClassifier support for callbacks and init_model is not tested @@ -808,6 +830,7 @@ def fit( X: _DaskMatrixLike, y: _DaskCollection, sample_weight: Optional[_DaskCollection] = None, + init_score: Optional[_DaskCollection] = None, **kwargs: Any ) -> "DaskLGBMRegressor": """Docstring is inherited from the lightgbm.LGBMRegressor.fit.""" @@ -816,6 +839,7 @@ def fit( X=X, y=y, sample_weight=sample_weight, + init_score=init_score, **kwargs ) @@ -823,11 +847,12 @@ def fit( X_shape="Dask Array or Dask DataFrame of shape = [n_samples, n_features]", y_shape="Dask Array, Dask DataFrame or Dask Series of shape = [n_samples]", sample_weight_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)", + init_score_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)", group_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)" ) - # DaskLGBMRegressor does not support init_score, evaluation data, or early stopping - _base_doc = (_base_doc[:_base_doc.find('init_score :')] + # DaskLGBMRegressor does not support evaluation data, or early stopping + _base_doc = (_base_doc[:_base_doc.find('group :')] + _base_doc[_base_doc.find('verbose :'):]) # DaskLGBMRegressor support for callbacks and init_model is not tested @@ -945,14 +970,12 @@ def fit( **kwargs: Any ) -> "DaskLGBMRanker": """Docstring is inherited from the lightgbm.LGBMRanker.fit.""" - if init_score is not None: - raise RuntimeError('init_score is not currently supported in lightgbm.dask') - return self._lgb_dask_fit( model_factory=LGBMRanker, X=X, y=y, sample_weight=sample_weight, + init_score=init_score, group=group, **kwargs ) @@ -961,13 +984,11 @@ def fit( X_shape="Dask Array or Dask DataFrame of shape = [n_samples, n_features]", y_shape="Dask Array, Dask DataFrame or Dask Series of shape = [n_samples]", sample_weight_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)", + init_score_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)", group_shape="Dask Array, Dask DataFrame, Dask Series of shape = [n_samples] or None, optional (default=None)" ) - # DaskLGBMRanker does not support init_score, evaluation data, or early stopping - _base_doc = (_base_doc[:_base_doc.find('init_score :')] - + _base_doc[_base_doc.find('init_score :'):]) - + # DaskLGBMRanker does not support evaluation data, or early stopping _base_doc = (_base_doc[:_base_doc.find('eval_set :')] + _base_doc[_base_doc.find('verbose :'):]) diff --git a/python-package/lightgbm/libpath.py b/python-package/lightgbm/libpath.py index 6b51039f373e..f1eab8e38756 100644 --- a/python-package/lightgbm/libpath.py +++ b/python-package/lightgbm/libpath.py @@ -2,9 +2,10 @@ """Find the path to LightGBM dynamic library files.""" import os from platform import system +from typing import List -def find_lib_path(): +def find_lib_path() -> List[str]: """Find the path to LightGBM library files. Returns diff --git a/python-package/lightgbm/sklearn.py b/python-package/lightgbm/sklearn.py index d6b882114c6a..3c6fbe772863 100644 --- a/python-package/lightgbm/sklearn.py +++ b/python-package/lightgbm/sklearn.py @@ -189,7 +189,7 @@ def __call__(self, preds, dataset): The target values (class labels in classification, real numbers in regression). sample_weight : {sample_weight_shape} Weights of training data. 
- init_score : array-like of shape = [n_samples] or None, optional (default=None) + init_score : {init_score_shape} Init score of training data. group : {group_shape} Group/query data. @@ -706,6 +706,7 @@ def _get_meta_data(collection, name, i): X_shape="array-like or sparse matrix of shape = [n_samples, n_features]", y_shape="array-like of shape = [n_samples]", sample_weight_shape="array-like of shape = [n_samples] or None, optional (default=None)", + init_score_shape="array-like of shape = [n_samples] or None, optional (default=None)", group_shape="array-like or None, optional (default=None)" ) + "\n\n" + _lgbmmodel_doc_custom_eval_note diff --git a/python-package/setup.py b/python-package/setup.py index de62a89ef55b..2a4c99588d42 100644 --- a/python-package/setup.py +++ b/python-package/setup.py @@ -83,8 +83,10 @@ def copy_files_helper(folder_name): os.path.join(CURRENT_DIR, "compile", "CMakeLists.txt"), verbose=0) if integrated_opencl: - copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "CMakeIntegratedOpenCL.cmake"), - os.path.join(CURRENT_DIR, "compile", "CMakeIntegratedOpenCL.cmake"), + if not os.path.exists(os.path.join(CURRENT_DIR, "compile", "cmake")): + os.makedirs(os.path.join(CURRENT_DIR, "compile", "cmake")) + copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "cmake", "IntegratedOpenCL.cmake"), + os.path.join(CURRENT_DIR, "compile", "cmake", "IntegratedOpenCL.cmake"), verbose=0) diff --git a/src/io/metadata.cpp b/src/io/metadata.cpp index 63a1690906a2..49fc834b87df 100644 --- a/src/io/metadata.cpp +++ b/src/io/metadata.cpp @@ -277,6 +277,10 @@ void Metadata::CheckOrPartition(data_size_t num_all_data, const std::vector 0) { + Log::Debug("Number of queries in %s: %i. Average number of rows per query: %f.", + data_filename_.c_str(), static_cast(num_queries_), static_cast(num_data_) / num_queries_); + } } void Metadata::SetInitScore(const double* init_score, data_size_t len) { diff --git a/src/io/tree.cpp b/src/io/tree.cpp index 50c1434019d4..67e02af20cd8 100644 --- a/src/io/tree.cpp +++ b/src/io/tree.cpp @@ -698,6 +698,8 @@ Tree::Tree(const char* str, size_t* used_len) { int is_linear_int; Common::Atoi(key_vals["is_linear"].c_str(), &is_linear_int); is_linear_ = static_cast(is_linear_int); + } else { + is_linear_ = false; } if ((num_leaves_ <= 1) && !is_linear_) { diff --git a/src/metric/dcg_calculator.cpp b/src/metric/dcg_calculator.cpp index 58843d89f9e1..1d648bfafd40 100644 --- a/src/metric/dcg_calculator.cpp +++ b/src/metric/dcg_calculator.cpp @@ -152,6 +152,19 @@ void DCGCalculator::CalDCG(const std::vector& ks, const label_t* la } } +void DCGCalculator::CheckMetadata(const Metadata& metadata, data_size_t num_queries) { + const data_size_t* query_boundaries = metadata.query_boundaries(); + if (num_queries > 0 && query_boundaries != nullptr) { + for (data_size_t i = 0; i < num_queries; i++) { + data_size_t num_rows = query_boundaries[i + 1] - query_boundaries[i]; + if (num_rows > kMaxPosition) { + Log::Fatal("Number of rows %i exceeds upper limit of %i for a query", static_cast(num_rows), static_cast(kMaxPosition)); + } + } + } +} + + void DCGCalculator::CheckLabel(const label_t* label, data_size_t num_data) { for (data_size_t i = 0; i < num_data; ++i) { label_t delta = std::fabs(label[i] - static_cast(label[i])); diff --git a/src/metric/rank_metric.hpp b/src/metric/rank_metric.hpp index 3b3afb547eb9..58804f415278 100644 --- a/src/metric/rank_metric.hpp +++ b/src/metric/rank_metric.hpp @@ -37,13 +37,14 @@ class NDCGMetric:public Metric { num_data_ = num_data; // get 
label label_ = metadata.label(); + num_queries_ = metadata.num_queries(); + DCGCalculator::CheckMetadata(metadata, num_queries_); DCGCalculator::CheckLabel(label_, num_data_); // get query boundaries query_boundaries_ = metadata.query_boundaries(); if (query_boundaries_ == nullptr) { Log::Fatal("The NDCG metric requires query information"); } - num_queries_ = metadata.num_queries(); // get query weights query_weights_ = metadata.query_weights(); if (query_weights_ == nullptr) { diff --git a/src/objective/rank_objective.hpp b/src/objective/rank_objective.hpp index a720a69a3148..9bd7b7d99cf6 100644 --- a/src/objective/rank_objective.hpp +++ b/src/objective/rank_objective.hpp @@ -120,6 +120,7 @@ class LambdarankNDCG : public RankingObjective { void Init(const Metadata& metadata, data_size_t num_data) override { RankingObjective::Init(metadata, num_data); + DCGCalculator::CheckMetadata(metadata, num_queries_); DCGCalculator::CheckLabel(label_, num_data_); inverse_max_dcgs_.resize(num_queries_); #pragma omp parallel for schedule(static) diff --git a/swig/ChunkedArray_API_extensions.i b/swig/ChunkedArray_API_extensions.i new file mode 100644 index 000000000000..4161169ac6a7 --- /dev/null +++ b/swig/ChunkedArray_API_extensions.i @@ -0,0 +1,23 @@ +/** + * Wrap chunked_array.hpp class for SWIG usage. + * + * Author: Alberto Ferreira + */ + +%{ +#include "../include/LightGBM/utils/chunked_array.hpp" +%} + +%include "../include/LightGBM/utils/chunked_array.hpp" + +using LightGBM::ChunkedArray; + +%template(int32ChunkedArray) ChunkedArray; +/* Unfortunately, for the time being, + * SWIG has issues generating the overloads to coalesce_to() + * for larger integral types + * so we won't support that for now: + */ +//%template(int64ChunkedArray) ChunkedArray; +%template(floatChunkedArray) ChunkedArray; +%template(doubleChunkedArray) ChunkedArray; diff --git a/swig/StringArray.hpp b/swig/StringArray.hpp index 397f2c46c8be..c579870e7b8a 100644 --- a/swig/StringArray.hpp +++ b/swig/StringArray.hpp @@ -1,13 +1,16 @@ /*! * Copyright (c) 2020 Microsoft Corporation. All rights reserved. * Licensed under the MIT License. See LICENSE file in the project root for license information. + * + * Author: Alberto Ferreira */ -#ifndef __STRING_ARRAY_H__ -#define __STRING_ARRAY_H__ +#ifndef LIGHTGBM_SWIG_STRING_ARRAY_H_ +#define LIGHTGBM_SWIG_STRING_ARRAY_H_ +#include #include +#include #include -#include /** * Container that manages an array of fixed-length strings. @@ -22,18 +25,15 @@ * The class also takes care of allocation of the underlying * char* memory. */ -class StringArray -{ - public: +class StringArray { + public: StringArray(size_t num_elements, size_t string_size) : _string_size(string_size), - _array(num_elements + 1, nullptr) - { + _array(num_elements + 1, nullptr) { _allocate_strings(num_elements, string_size); } - ~StringArray() - { + ~StringArray() { _release_strings(); } @@ -43,8 +43,7 @@ class StringArray * * @return char** pointer to raw data (null-terminated). */ - char **data() noexcept - { + char **data() noexcept { return _array.data(); } @@ -56,8 +55,7 @@ class StringArray * @param index Index of the element to retrieve. * @return pointer or nullptr if index is out of bounds. */ - char *getitem(size_t index) noexcept - { + char *getitem(size_t index) noexcept { if (_in_bounds(index)) return _array[index]; else @@ -77,11 +75,9 @@ class StringArray * into the target string (_string_size), it errors out * and returns -1. 
*/ - int setitem(size_t index, std::string content) noexcept - { - if (_in_bounds(index) && content.size() < _string_size) - { - std::strcpy(_array[index], content.c_str()); + int setitem(size_t index, const std::string &content) noexcept { + if (_in_bounds(index) && content.size() < _string_size) { + std::strcpy(_array[index], content.c_str()); // NOLINT return 0; } else { return -1; @@ -91,13 +87,11 @@ class StringArray /** * @return number of stored strings. */ - size_t get_num_elements() noexcept - { + size_t get_num_elements() noexcept { return _array.size() - 1; } - private: - + private: /** * Returns true if and only if within bounds. * Notice that it excludes the last element of _array (NULL). @@ -105,8 +99,7 @@ class StringArray * @param index index of the element * @return bool true if within bounds */ - bool _in_bounds(size_t index) noexcept - { + bool _in_bounds(size_t index) noexcept { return index < get_num_elements(); } @@ -120,15 +113,13 @@ class StringArray * @param num_elements Number of strings to store in the array. * @param string_size The size of each string in the array. */ - void _allocate_strings(size_t num_elements, size_t string_size) - { - for (size_t i = 0; i < num_elements; ++i) - { + void _allocate_strings(size_t num_elements, size_t string_size) { + for (size_t i = 0; i < num_elements; ++i) { // Leave space for \0 terminator: _array[i] = new (std::nothrow) char[string_size + 1]; // Check memory allocation: - if (! _array[i]) { + if (!_array[i]) { _release_strings(); throw std::bad_alloc(); } @@ -138,8 +129,7 @@ class StringArray /** * Deletes the allocated strings. */ - void _release_strings() noexcept - { + void _release_strings() noexcept { std::for_each(_array.begin(), _array.end(), [](char* c) { delete[] c; }); } @@ -147,4 +137,4 @@ class StringArray std::vector _array; }; -#endif // __STRING_ARRAY_H__ +#endif // LIGHTGBM_SWIG_STRING_ARRAY_H_ diff --git a/swig/lightgbmlib.i b/swig/lightgbmlib.i index 057d5c5b3a3f..67937c43ba69 100644 --- a/swig/lightgbmlib.i +++ b/swig/lightgbmlib.i @@ -282,3 +282,4 @@ %include "pointer_manipulation.i" %include "StringArray_API_extensions.i" +%include "ChunkedArray_API_extensions.i" diff --git a/tests/cpp_test/test_chunked_array.cpp b/tests/cpp_test/test_chunked_array.cpp new file mode 100644 index 000000000000..e7d15556643e --- /dev/null +++ b/tests/cpp_test/test_chunked_array.cpp @@ -0,0 +1,262 @@ +/*! + * Copyright (c) 2021 Microsoft Corporation. All rights reserved. + * Licensed under the MIT License. See LICENSE file in the project root for license information. + * + * Author: Alberto Ferreira + */ +#include +#include "../include/LightGBM/utils/chunked_array.hpp" + +using LightGBM::ChunkedArray; + +/*! + Helper util to compare two vectors. + + Don't compare floating point vectors this way! +*/ +template +testing::AssertionResult are_vectors_equal(const std::vector &a, const std::vector &b) { + if (a.size() != b.size()) { + return testing::AssertionFailure() + << "Vectors differ in size: " + << a.size() << " != " << b.size(); + } + + for (size_t i = 0; i < a.size(); ++i) { + if (a[i] != b[i]) { + return testing::AssertionFailure() + << "Vectors differ at least at position " << i << ": " + << a[i] << " != " << b[i]; + } + } + + return testing::AssertionSuccess(); +} + + +class ChunkedArrayTest : public testing::Test { + protected: + + void SetUp() override { + + } + + void add_items_to_array(const std::vector &vec, ChunkedArray &ca) { + for (auto v: vec) { + ca.add(v); + } + } + + /*! 
+ Ensures that if coalesce_to() is called upon the ChunkedArray, + it would yield the same contents as vec + */ + testing::AssertionResult coalesced_output_equals_vec(const ChunkedArray &ca, const std::vector &vec, + const bool all_addresses=false) { + std::vector out(vec.size()); + ca.coalesce_to(out.data(), all_addresses); + return are_vectors_equal(out, vec); + } + + // Constants + const std::vector REF_VEC = {1, 5, 2, 4, 9, 8, 7}; + const size_t CHUNK_SIZE = 3; + const size_t OUT_OF_BOUNDS_OFFSET = 4; + + ChunkedArray ca_ = ChunkedArray(CHUNK_SIZE); // ca(0), std::runtime_error); +} + +/*! get_chunk_size() should return the size used in the constructor */ +TEST_F(ChunkedArrayTest, constructorWithChunkSize) { + for (size_t chunk_size = 1; chunk_size < 10; ++chunk_size) { + ChunkedArray ca(chunk_size); + ASSERT_EQ(ca.get_chunk_size(), chunk_size); + } +} + +/*! + get_chunk_size() should return the size used in the constructor + independently of array manipulations. +*/ +TEST_F(ChunkedArrayTest, getChunkSizeIsConstant) { + for (size_t i = 0; i < 3 * CHUNK_SIZE; ++i) { + ASSERT_EQ(ca_.get_chunk_size(), CHUNK_SIZE); + ca_.add(0); + } +} + + +/*! + get_add_count() should return the number of add calls, + independently of the number of chunks used. +*/ +TEST_F(ChunkedArrayTest, getChunksCount) { + ASSERT_EQ(ca_.get_chunks_count(), 1); // ChunkedArray always starts with 1 chunk. + + for (size_t i = 0; i < 3 * CHUNK_SIZE; ++i) { + ca_.add(0); + int expected_chunks = int(i/CHUNK_SIZE) + 1; + ASSERT_EQ(ca_.get_chunks_count(), expected_chunks) << "with " << i << " add() call(s) " + << "and CHUNK_SIZE==" << CHUNK_SIZE << "."; + } +} + +/*! + get_add_count() should return the number of add calls, + independently of the number of chunks used. +*/ +TEST_F(ChunkedArrayTest, getAddCount) { + for (size_t i = 0; i < 3 * CHUNK_SIZE; ++i) { + ASSERT_EQ(ca_.get_add_count(), i); + ca_.add(0); + } +} + +/*! + Ensure coalesce_to() works and dumps all the inserted data correctly. + + If the ChunkedArray is created from a sequence of add() calls, coalescing to + an output array after multiple add operations should yield the same + exact data at both input and output. +*/ +TEST_F(ChunkedArrayTest, coalesceTo) { + std::vector out(REF_VEC.size()); + add_items_to_array(REF_VEC, ca_); + + ca_.coalesce_to(out.data()); + + ASSERT_TRUE(are_vectors_equal(REF_VEC, out)); +} + +/*! + After clear the ChunkedArray() should still be usable. +*/ +TEST_F(ChunkedArrayTest, clear) { + const std::vector ref_vec2 = {1, 2, 5, -1}; + add_items_to_array(REF_VEC, ca_); + // Start with some content: + ASSERT_TRUE(coalesced_output_equals_vec(ca_, REF_VEC)); + + // Clear & re-use: + ca_.clear(); + add_items_to_array(ref_vec2, ca_); + + // Output should match new content: + ASSERT_TRUE(coalesced_output_equals_vec(ca_, ref_vec2)); +} + +/*! + Ensure ChunkedArray is safe against double-frees. +*/ +TEST_F(ChunkedArrayTest, doubleFreeSafe) { + ca_.release(); // Cannot be used any longer from now on. + ca_.release(); // Ensure we don't segfault. + + SUCCEED(); +} + +/*! + Ensure size computations in the getters are correct. 
+*/ +TEST_F(ChunkedArrayTest, totalArraySizeMatchesLastChunkAddCount) { + add_items_to_array(REF_VEC, ca_); + + const size_t first_chunks_add_count = (ca_.get_chunks_count() - 1) * ca_.get_chunk_size(); + const size_t last_chunk_add_count = ca_.get_last_chunk_add_count(); + + EXPECT_EQ(first_chunks_add_count, int(REF_VEC.size()/CHUNK_SIZE) * CHUNK_SIZE); + EXPECT_EQ(last_chunk_add_count, REF_VEC.size() % CHUNK_SIZE); + EXPECT_EQ(first_chunks_add_count + last_chunk_add_count, ca_.get_add_count()); +} + +/*! + Assert all values are correct and at the expected addresses throughout the + several chunks. + + This uses getitem() to reach each individual address of any of the chunks. + + A sentinel value of -1 is used to check for invalid addresses. + This would occur if there was an improper data layout with the chunks. +*/ +TEST_F(ChunkedArrayTest, dataLayoutTestThroughGetitem) { + add_items_to_array(REF_VEC, ca_); + + for (size_t i = 0, chunk = 0, in_chunk_idx = 0; i < REF_VEC.size(); ++i) { + int value = ca_.getitem(chunk, in_chunk_idx, -1); // -1 works as sentinel value (bad layout found) + + EXPECT_EQ(value, REF_VEC[i]) << " for address (chunk,in_chunk_idx) = (" << chunk << "," << in_chunk_idx << ")"; + + if (++in_chunk_idx == ca_.get_chunk_size()) { + in_chunk_idx = 0; + ++chunk; + } + } +} + +/*! + Perform an array of setitem & getitem at valid and invalid addresses. + We use several random addresses and trials to avoid writing much code. + + By testing a random number of addresses many more times than the size of the test space + we are almost guaranteed to cover all possible search addresses. + + We also gradually add more chunks to the ChunkedArray and re-run more trials + to ensure the valid/invalid addresses are updated. + + With each valid update we add to a "memory" vector the history of all the insertions. + This is used at the end to ensure all values were stored properly, including after + value overrides. +*/ +TEST_F(ChunkedArrayTest, testDataLayoutWithAdvancedInsertionAPI) { + const size_t MAX_CHUNKS_SEARCH = 5; + const size_t MAX_IN_CHUNK_SEARCH_IDX = 2 * CHUNK_SIZE; + // Number of trials for each new ChunkedArray configuration. Pass 100 times over the search space: + const size_t N_TRIALS = MAX_CHUNKS_SEARCH * MAX_IN_CHUNK_SEARCH_IDX * 100; + std::vector overriden_trials_values(MAX_CHUNKS_SEARCH * CHUNK_SIZE); + std::vector overriden_trials_mask(MAX_CHUNKS_SEARCH * CHUNK_SIZE, false); + + // Each outer loop iteration changes the test by adding +1 chunk. We start with 1 chunk only: + for (size_t chunks = 1; chunks < MAX_CHUNKS_SEARCH; ++chunks) { + EXPECT_EQ(ca_.get_chunks_count(), chunks); + + // Sweep valid and invalid addresses with a ChunkedArray with `chunks` chunks: + for (size_t trial = 0; trial < N_TRIALS; ++trial) { + // Compute a new trial address & value & if it is a valid address: + const size_t trial_chunk = std::rand() % MAX_CHUNKS_SEARCH; + const size_t trial_in_chunk_idx = std::rand() % MAX_IN_CHUNK_SEARCH_IDX; + const int trial_value = std::rand() % 99999; + const bool valid_address = (trial_chunk < chunks) & (trial_in_chunk_idx < CHUNK_SIZE); + + // Insert item. If at a valid address, 0 is returned, otherwise, -1 is returned: + EXPECT_EQ(ca_.setitem(trial_chunk, trial_in_chunk_idx, trial_value), + valid_address ? 
0 : -1); + // If at valid address, check that the stored value is correct & remember it for the future: + if (valid_address) { + // Check the just-stored value with getitem(): + EXPECT_EQ(ca_.getitem(trial_chunk, trial_in_chunk_idx, -1), trial_value); // -1 is the sentinel value. + + // Also store the just-stored value for future tracking: + overriden_trials_values[trial_chunk*CHUNK_SIZE + trial_in_chunk_idx] = trial_value; + overriden_trials_mask[trial_chunk*CHUNK_SIZE + trial_in_chunk_idx] = true; + } + } + + ca_.new_chunk(); // Just finished a round of trials. Now add a new chunk. Valid addresses will be expanded. + } + + // Final check: ensure even with overrides, all valid insertions store the latest value at that address: + std::vector coalesced_out(MAX_CHUNKS_SEARCH * CHUNK_SIZE, -1); + ca_.coalesce_to(coalesced_out.data(), true); // Export all valid addresses. + for (size_t i = 0; i < overriden_trials_mask.size(); ++i) { + if (overriden_trials_mask[i]) { + EXPECT_EQ(ca_.getitem(i/CHUNK_SIZE, i % CHUNK_SIZE, -1), overriden_trials_values[i]); + EXPECT_EQ(coalesced_out[i], overriden_trials_values[i]); + } + } +} diff --git a/tests/cpp_test/test_main.cpp b/tests/cpp_test/test_main.cpp new file mode 100644 index 000000000000..e84c8142b52c --- /dev/null +++ b/tests/cpp_test/test_main.cpp @@ -0,0 +1,11 @@ +/*! + * Copyright (c) 2021 Microsoft Corporation. All rights reserved. + * Licensed under the MIT License. See LICENSE file in the project root for license information. + */ +#include + +int main(int argc, char** argv) { + testing::InitGoogleTest(&argc, argv); + testing::FLAGS_gtest_death_test_style = "threadsafe"; + return RUN_ALL_TESTS(); +} diff --git a/tests/python_package_test/test_dask.py b/tests/python_package_test/test_dask.py index 5f7784190e4b..103a5450291e 100644 --- a/tests/python_package_test/test_dask.py +++ b/tests/python_package_test/test_dask.py @@ -3,6 +3,7 @@ import inspect import pickle +import random import socket from itertools import groupby from os import getenv @@ -42,10 +43,21 @@ # see https://distributed.dask.org/en/latest/api.html#distributed.Client.close CLIENT_CLOSE_TIMEOUT = 120 -tasks = ['classification', 'regression', 'ranking'] +tasks = ['binary-classification', 'multiclass-classification', 'regression', 'ranking'] data_output = ['array', 'scipy_csr_matrix', 'dataframe', 'dataframe-with-categorical'] -data_centers = [[[-4, -4], [4, 4]], [[-4, -4], [4, 4], [-4, 4]]] group_sizes = [5, 5, 5, 10, 10, 10, 20, 20, 20, 50, 50] +task_to_dask_factory = { + 'regression': lgb.DaskLGBMRegressor, + 'binary-classification': lgb.DaskLGBMClassifier, + 'multiclass-classification': lgb.DaskLGBMClassifier, + 'ranking': lgb.DaskLGBMRanker +} +task_to_local_factory = { + 'regression': lgb.LGBMRegressor, + 'binary-classification': lgb.LGBMClassifier, + 'multiclass-classification': lgb.LGBMClassifier, + 'ranking': lgb.LGBMRanker +} pytestmark = [ pytest.mark.skipif(getenv('TASK', '') == 'mpi', reason='Fails to run with MPI interface'), @@ -119,11 +131,24 @@ def _create_ranking_data(n_samples=100, output='array', chunk_size=50, **kwargs) return X, y, w, g_rle, dX, dy, dw, dg -def _create_data(objective, n_samples=100, centers=2, output='array', chunk_size=50): - if objective == 'classification': +def _create_data(objective, n_samples=100, output='array', chunk_size=50, **kwargs): + if objective.endswith('classification'): + if objective == 'binary-classification': + centers = [[-4, -4], [4, 4]] + elif objective == 'multiclass-classification': + centers = [[-4, -4], [4, 4], 
[-4, 4]] + else: + raise ValueError(f"Unknown classification task '{objective}'") X, y = make_blobs(n_samples=n_samples, centers=centers, random_state=42) elif objective == 'regression': X, y = make_regression(n_samples=n_samples, random_state=42) + elif objective == 'ranking': + return _create_ranking_data( + n_samples=n_samples, + output=output, + chunk_size=chunk_size, + **kwargs + ) else: raise ValueError("Unknown objective '%s'" % objective) rnd = np.random.RandomState(42) @@ -165,7 +190,7 @@ def _create_data(objective, n_samples=100, centers=2, output='array', chunk_size else: raise ValueError("Unknown output type '%s'" % output) - return X, y, weights, dX, dy, dw + return X, y, weights, None, dX, dy, dw, None def _r2_score(dy_true, dy_pred): @@ -205,12 +230,11 @@ def _unpickle(filepath, serializer): @pytest.mark.parametrize('output', data_output) -@pytest.mark.parametrize('centers', data_centers) -def test_classifier(output, centers, client): - X, y, w, dX, dy, dw = _create_data( - objective='classification', - output=output, - centers=centers +@pytest.mark.parametrize('task', ['binary-classification', 'multiclass-classification']) +def test_classifier(output, task, client): + X, y, w, _, dX, dy, dw, _ = _create_data( + objective=task, + output=output ) params = { @@ -272,12 +296,11 @@ def test_classifier(output, centers, client): @pytest.mark.parametrize('output', data_output) -@pytest.mark.parametrize('centers', data_centers) -def test_classifier_pred_contrib(output, centers, client): - X, y, w, dX, dy, dw = _create_data( - objective='classification', - output=output, - centers=centers +@pytest.mark.parametrize('task', ['binary-classification', 'multiclass-classification']) +def test_classifier_pred_contrib(output, task, client): + X, y, w, _, dX, dy, dw, _ = _create_data( + objective=task, + output=output ) params = { @@ -353,7 +376,7 @@ def test_find_random_open_port(client): def test_training_does_not_fail_on_port_conflicts(client): - _, _, _, dX, dy, dw = _create_data('classification', output='array') + _, _, _, _, dX, dy, dw, _ = _create_data('binary-classification', output='array') lightgbm_default_port = 12400 with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: @@ -377,7 +400,7 @@ def test_training_does_not_fail_on_port_conflicts(client): @pytest.mark.parametrize('output', data_output) def test_regressor(output, client): - X, y, w, dX, dy, dw = _create_data( + X, y, w, _, dX, dy, dw, _ = _create_data( objective='regression', output=output ) @@ -452,7 +475,7 @@ def test_regressor(output, client): @pytest.mark.parametrize('output', data_output) def test_regressor_pred_contrib(output, client): - X, y, w, dX, dy, dw = _create_data( + X, y, w, _, dX, dy, dw, _ = _create_data( objective='regression', output=output ) @@ -502,7 +525,7 @@ def test_regressor_pred_contrib(output, client): @pytest.mark.parametrize('output', data_output) @pytest.mark.parametrize('alpha', [.1, .5, .9]) def test_regressor_quantile(output, client, alpha): - X, y, w, dX, dy, dw = _create_data( + X, y, w, _, dX, dy, dw, _ = _create_data( objective='regression', output=output ) @@ -551,18 +574,19 @@ def test_regressor_quantile(output, client, alpha): @pytest.mark.parametrize('output', ['array', 'dataframe', 'dataframe-with-categorical']) @pytest.mark.parametrize('group', [None, group_sizes]) def test_ranker(output, client, group): - if output == 'dataframe-with-categorical': - X, y, w, g, dX, dy, dw, dg = _create_ranking_data( + X, y, w, g, dX, dy, dw, dg = _create_data( + objective='ranking', 
output=output, group=group, n_features=1, n_informative=1 ) else: - X, y, w, g, dX, dy, dw, dg = _create_ranking_data( + X, y, w, g, dX, dy, dw, dg = _create_data( + objective='ranking', output=output, - group=group, + group=group ) # rebalance small dask.Array dataset for better performance. @@ -634,22 +658,12 @@ def test_ranker(output, client, group): @pytest.mark.parametrize('task', tasks) def test_training_works_if_client_not_provided_or_set_after_construction(task, client): - if task == 'ranking': - _, _, _, _, dX, dy, _, dg = _create_ranking_data( - output='array', - group=None - ) - model_factory = lgb.DaskLGBMRanker - else: - _, _, _, dX, dy, _ = _create_data( - objective=task, - output='array', - ) - dg = None - if task == 'classification': - model_factory = lgb.DaskLGBMClassifier - elif task == 'regression': - model_factory = lgb.DaskLGBMRegressor + _, _, _, _, dX, dy, _, dg = _create_data( + objective=task, + output='array', + group=None + ) + model_factory = task_to_dask_factory[task] params = { "time_out": 5, @@ -711,187 +725,166 @@ def test_training_works_if_client_not_provided_or_set_after_construction(task, c @pytest.mark.parametrize('set_client', [True, False]) def test_model_and_local_version_are_picklable_whether_or_not_client_set_explicitly(serializer, task, set_client, tmp_path): - with LocalCluster(n_workers=2, threads_per_worker=1) as cluster1: - with Client(cluster1) as client1: + with LocalCluster(n_workers=2, threads_per_worker=1) as cluster1, Client(cluster1) as client1: + # data on cluster1 + X_1, _, _, _, dX_1, dy_1, _, dg_1 = _create_data( + objective=task, + output='array', + group=None + ) + + with LocalCluster(n_workers=2, threads_per_worker=1) as cluster2, Client(cluster2) as client2: + # create identical data on cluster2 + X_2, _, _, _, dX_2, dy_2, _, dg_2 = _create_data( + objective=task, + output='array', + group=None + ) - # data on cluster1 - if task == 'ranking': - X_1, _, _, _, dX_1, dy_1, _, dg_1 = _create_ranking_data( - output='array', - group=None - ) + model_factory = task_to_dask_factory[task] + + params = { + "time_out": 5, + "n_estimators": 1, + "num_leaves": 2 + } + + # at this point, the result of default_client() is client2 since it was the most recently + # created. So setting client to client1 here to test that you can select a non-default client + assert default_client() == client2 + if set_client: + params.update({"client": client1}) + + # unfitted model should survive pickling round trip, and pickling + # shouldn't have side effects on the model object + dask_model = model_factory(**params) + local_model = dask_model.to_local() + if set_client: + assert dask_model.client == client1 else: - X_1, _, _, dX_1, dy_1, _ = _create_data( - objective=task, - output='array', - ) - dg_1 = None - - with LocalCluster(n_workers=2, threads_per_worker=1) as cluster2: - with Client(cluster2) as client2: - - # create identical data on cluster2 - if task == 'ranking': - X_2, _, _, _, dX_2, dy_2, _, dg_2 = _create_ranking_data( - output='array', - group=None - ) - else: - X_2, _, _, dX_2, dy_2, _ = _create_data( - objective=task, - output='array', - ) - dg_2 = None - - if task == 'ranking': - model_factory = lgb.DaskLGBMRanker - elif task == 'classification': - model_factory = lgb.DaskLGBMClassifier - elif task == 'regression': - model_factory = lgb.DaskLGBMRegressor - - params = { - "time_out": 5, - "n_estimators": 1, - "num_leaves": 2 - } - - # at this point, the result of default_client() is client2 since it was the most recently - # created. 
So setting client to client1 here to test that you can select a non-default client - assert default_client() == client2 - if set_client: - params.update({"client": client1}) - - # unfitted model should survive pickling round trip, and pickling - # shouldn't have side effects on the model object - dask_model = model_factory(**params) - local_model = dask_model.to_local() - if set_client: - assert dask_model.client == client1 - else: - assert dask_model.client is None - - with pytest.raises(lgb.compat.LGBMNotFittedError, match='Cannot access property client_ before calling fit'): - dask_model.client_ - - assert "client" not in local_model.get_params() - assert getattr(local_model, "client", None) is None - - tmp_file = str(tmp_path / "model-1.pkl") - _pickle( - obj=dask_model, - filepath=tmp_file, - serializer=serializer - ) - model_from_disk = _unpickle( - filepath=tmp_file, - serializer=serializer - ) - - local_tmp_file = str(tmp_path / "local-model-1.pkl") - _pickle( - obj=local_model, - filepath=local_tmp_file, - serializer=serializer - ) - local_model_from_disk = _unpickle( - filepath=local_tmp_file, - serializer=serializer - ) - - assert model_from_disk.client is None - - if set_client: - assert dask_model.client == client1 - else: - assert dask_model.client is None - - with pytest.raises(lgb.compat.LGBMNotFittedError, match='Cannot access property client_ before calling fit'): - dask_model.client_ - - # client will always be None after unpickling - if set_client: - from_disk_params = model_from_disk.get_params() - from_disk_params.pop("client", None) - dask_params = dask_model.get_params() - dask_params.pop("client", None) - assert from_disk_params == dask_params - else: - assert model_from_disk.get_params() == dask_model.get_params() - assert local_model_from_disk.get_params() == local_model.get_params() - - # fitted model should survive pickling round trip, and pickling - # shouldn't have side effects on the model object - if set_client: - dask_model.fit(dX_1, dy_1, group=dg_1) - else: - dask_model.fit(dX_2, dy_2, group=dg_2) - local_model = dask_model.to_local() - - assert "client" not in local_model.get_params() - with pytest.raises(AttributeError): - local_model.client - local_model.client_ - - tmp_file2 = str(tmp_path / "model-2.pkl") - _pickle( - obj=dask_model, - filepath=tmp_file2, - serializer=serializer - ) - fitted_model_from_disk = _unpickle( - filepath=tmp_file2, - serializer=serializer - ) - - local_tmp_file2 = str(tmp_path / "local-model-2.pkl") - _pickle( - obj=local_model, - filepath=local_tmp_file2, - serializer=serializer - ) - local_fitted_model_from_disk = _unpickle( - filepath=local_tmp_file2, - serializer=serializer - ) - - if set_client: - assert dask_model.client == client1 - assert dask_model.client_ == client1 - else: - assert dask_model.client is None - assert dask_model.client_ == default_client() - assert dask_model.client_ == client2 - - assert isinstance(fitted_model_from_disk, model_factory) - assert fitted_model_from_disk.client is None - assert fitted_model_from_disk.client_ == default_client() - assert fitted_model_from_disk.client_ == client2 - - # client will always be None after unpickling - if set_client: - from_disk_params = fitted_model_from_disk.get_params() - from_disk_params.pop("client", None) - dask_params = dask_model.get_params() - dask_params.pop("client", None) - assert from_disk_params == dask_params - else: - assert fitted_model_from_disk.get_params() == dask_model.get_params() - assert local_fitted_model_from_disk.get_params() == 
local_model.get_params() - - if set_client: - preds_orig = dask_model.predict(dX_1).compute() - preds_loaded_model = fitted_model_from_disk.predict(dX_1).compute() - preds_orig_local = local_model.predict(X_1) - preds_loaded_model_local = local_fitted_model_from_disk.predict(X_1) - else: - preds_orig = dask_model.predict(dX_2).compute() - preds_loaded_model = fitted_model_from_disk.predict(dX_2).compute() - preds_orig_local = local_model.predict(X_2) - preds_loaded_model_local = local_fitted_model_from_disk.predict(X_2) - - assert_eq(preds_orig, preds_loaded_model) - assert_eq(preds_orig_local, preds_loaded_model_local) + assert dask_model.client is None + + with pytest.raises(lgb.compat.LGBMNotFittedError, match='Cannot access property client_ before calling fit'): + dask_model.client_ + + assert "client" not in local_model.get_params() + assert getattr(local_model, "client", None) is None + + tmp_file = str(tmp_path / "model-1.pkl") + _pickle( + obj=dask_model, + filepath=tmp_file, + serializer=serializer + ) + model_from_disk = _unpickle( + filepath=tmp_file, + serializer=serializer + ) + + local_tmp_file = str(tmp_path / "local-model-1.pkl") + _pickle( + obj=local_model, + filepath=local_tmp_file, + serializer=serializer + ) + local_model_from_disk = _unpickle( + filepath=local_tmp_file, + serializer=serializer + ) + + assert model_from_disk.client is None + + if set_client: + assert dask_model.client == client1 + else: + assert dask_model.client is None + + with pytest.raises(lgb.compat.LGBMNotFittedError, match='Cannot access property client_ before calling fit'): + dask_model.client_ + + # client will always be None after unpickling + if set_client: + from_disk_params = model_from_disk.get_params() + from_disk_params.pop("client", None) + dask_params = dask_model.get_params() + dask_params.pop("client", None) + assert from_disk_params == dask_params + else: + assert model_from_disk.get_params() == dask_model.get_params() + assert local_model_from_disk.get_params() == local_model.get_params() + + # fitted model should survive pickling round trip, and pickling + # shouldn't have side effects on the model object + if set_client: + dask_model.fit(dX_1, dy_1, group=dg_1) + else: + dask_model.fit(dX_2, dy_2, group=dg_2) + local_model = dask_model.to_local() + + assert "client" not in local_model.get_params() + with pytest.raises(AttributeError): + local_model.client + local_model.client_ + + tmp_file2 = str(tmp_path / "model-2.pkl") + _pickle( + obj=dask_model, + filepath=tmp_file2, + serializer=serializer + ) + fitted_model_from_disk = _unpickle( + filepath=tmp_file2, + serializer=serializer + ) + + local_tmp_file2 = str(tmp_path / "local-model-2.pkl") + _pickle( + obj=local_model, + filepath=local_tmp_file2, + serializer=serializer + ) + local_fitted_model_from_disk = _unpickle( + filepath=local_tmp_file2, + serializer=serializer + ) + + if set_client: + assert dask_model.client == client1 + assert dask_model.client_ == client1 + else: + assert dask_model.client is None + assert dask_model.client_ == default_client() + assert dask_model.client_ == client2 + + assert isinstance(fitted_model_from_disk, model_factory) + assert fitted_model_from_disk.client is None + assert fitted_model_from_disk.client_ == default_client() + assert fitted_model_from_disk.client_ == client2 + + # client will always be None after unpickling + if set_client: + from_disk_params = fitted_model_from_disk.get_params() + from_disk_params.pop("client", None) + dask_params = dask_model.get_params() + 
dask_params.pop("client", None) + assert from_disk_params == dask_params + else: + assert fitted_model_from_disk.get_params() == dask_model.get_params() + assert local_fitted_model_from_disk.get_params() == local_model.get_params() + + if set_client: + preds_orig = dask_model.predict(dX_1).compute() + preds_loaded_model = fitted_model_from_disk.predict(dX_1).compute() + preds_orig_local = local_model.predict(X_1) + preds_loaded_model_local = local_fitted_model_from_disk.predict(X_1) + else: + preds_orig = dask_model.predict(dX_2).compute() + preds_loaded_model = fitted_model_from_disk.predict(dX_2).compute() + preds_orig_local = local_model.predict(X_2) + preds_loaded_model_local = local_fitted_model_from_disk.predict(X_2) + + assert_eq(preds_orig, preds_loaded_model) + assert_eq(preds_orig_local, preds_loaded_model_local) def test_warns_and_continues_on_unrecognized_tree_learner(client): @@ -964,26 +957,14 @@ def collection_to_single_partition(collection): return collection.rechunk(*collection.shape) return collection.repartition(npartitions=1) - if task == 'ranking': - X, y, w, g, dX, dy, dw, dg = _create_ranking_data( - output=output, - group=None - ) - dask_model_factory = lgb.DaskLGBMRanker - local_model_factory = lgb.LGBMRanker - else: - X, y, w, dX, dy, dw = _create_data( - objective=task, - output=output - ) - g = None - dg = None - if task == 'classification': - dask_model_factory = lgb.DaskLGBMClassifier - local_model_factory = lgb.LGBMClassifier - elif task == 'regression': - dask_model_factory = lgb.DaskLGBMRegressor - local_model_factory = lgb.LGBMRegressor + X, y, w, g, dX, dy, dw, dg = _create_data( + objective=task, + output=output, + group=None + ) + + dask_model_factory = task_to_dask_factory[task] + local_model_factory = task_to_local_factory[task] dX = collection_to_single_partition(dX) dy = collection_to_single_partition(dy) @@ -1022,24 +1003,16 @@ def test_network_params_not_required_but_respected_if_given(client, task, output if task == 'ranking' and output == 'scipy_csr_matrix': pytest.skip('LGBMRanker is not currently tested on sparse matrices') - if task == 'ranking': - _, _, _, _, dX, dy, _, dg = _create_ranking_data( - output=output, - group=None, - chunk_size=10, - ) - dask_model_factory = lgb.DaskLGBMRanker - else: - _, _, _, dX, dy, _ = _create_data( - objective=task, - output=output, - chunk_size=10, - ) - dg = None - if task == 'classification': - dask_model_factory = lgb.DaskLGBMClassifier - elif task == 'regression': - dask_model_factory = lgb.DaskLGBMRegressor + client.wait_for_workers(2) + + _, _, _, _, dX, dy, _, dg = _create_data( + objective=task, + output=output, + chunk_size=10, + group=None + ) + + dask_model_factory = task_to_dask_factory[task] # rebalance data to be sure that each worker has a piece of the data if output == 'array': @@ -1096,30 +1069,21 @@ def test_machines_should_be_used_if_provided(task, output): pytest.skip('LGBMRanker is not currently tested on sparse matrices') with LocalCluster(n_workers=2) as cluster, Client(cluster) as client: - if task == 'ranking': - _, _, _, _, dX, dy, _, dg = _create_ranking_data( - output=output, - group=None, - chunk_size=10, - ) - dask_model_factory = lgb.DaskLGBMRanker - else: - _, _, _, dX, dy, _ = _create_data( - objective=task, - output=output, - chunk_size=10, - ) - dg = None - if task == 'classification': - dask_model_factory = lgb.DaskLGBMClassifier - elif task == 'regression': - dask_model_factory = lgb.DaskLGBMRegressor + _, _, _, _, dX, dy, _, dg = _create_data( + objective=task, + 
output=output, + chunk_size=10, + group=None + ) + + dask_model_factory = task_to_dask_factory[task] # rebalance data to be sure that each worker has a piece of the data if output == 'array': client.rebalance() n_workers = len(client.scheduler_info()['workers']) + assert n_workers > 1 open_ports = [lgb.dask._find_random_open_port() for _ in range(n_workers)] dask_model = dask_model_factory( n_estimators=5, @@ -1138,6 +1102,17 @@ def test_machines_should_be_used_if_provided(task, output): s.bind(('127.0.0.1', open_ports[0])) dask_model.fit(dX, dy, group=dg) + # an informative error should be raised if "machines" has duplicates + one_open_port = lgb.dask._find_random_open_port() + dask_model.set_params( + machines=",".join([ + "127.0.0.1:" + str(one_open_port) + for _ in range(n_workers) + ]) + ) + with pytest.raises(ValueError, match="Found duplicates in 'machines'"): + dask_model.fit(dX, dy, group=dg) + @pytest.mark.parametrize( "classes", @@ -1195,22 +1170,14 @@ def test_training_succeeds_when_data_is_dataframe_and_label_is_column_array( task, client, ): - if task == 'ranking': - _, _, _, _, dX, dy, dw, dg = _create_ranking_data( - output='dataframe', - group=None - ) - model_factory = lgb.DaskLGBMRanker - else: - _, _, _, dX, dy, dw = _create_data( - objective=task, - output='dataframe', - ) - dg = None - if task == 'classification': - model_factory = lgb.DaskLGBMClassifier - elif task == 'regression': - model_factory = lgb.DaskLGBMRegressor + _, _, _, _, dX, dy, dw, dg = _create_data( + objective=task, + output='dataframe', + group=None + ) + + model_factory = task_to_dask_factory[task] + dy = dy.to_dask_array(lengths=True) dy_col_array = dy.reshape(-1, 1) assert len(dy_col_array.shape) == 2 and dy_col_array.shape[1] == 1 @@ -1228,6 +1195,45 @@ def test_training_succeeds_when_data_is_dataframe_and_label_is_column_array( client.close(timeout=CLIENT_CLOSE_TIMEOUT) +@pytest.mark.parametrize('task', tasks) +@pytest.mark.parametrize('output', data_output) +def test_init_score(task, output, client): + if task == 'ranking' and output == 'scipy_csr_matrix': + pytest.skip('LGBMRanker is not currently tested on sparse matrices') + + _, _, _, _, dX, dy, dw, dg = _create_data( + objective=task, + output=output, + group=None + ) + + model_factory = task_to_dask_factory[task] + + params = { + 'n_estimators': 1, + 'num_leaves': 2, + 'time_out': 5 + } + init_score = random.random() + # init_scores must be a 1D array, even for multiclass classification + # where you need to provide 1 score per class for each row in X + # https://github.com/microsoft/LightGBM/issues/4046 + size_factor = 1 + if task == 'multiclass-classification': + size_factor = 3 # number of classes + + if output.startswith('dataframe'): + init_scores = dy.map_partitions(lambda x: pd.Series([init_score] * x.size * size_factor)) + else: + init_scores = dy.map_blocks(lambda x: np.repeat(init_score, x.size * size_factor)) + model = model_factory(client=client, **params) + model.fit(dX, dy, sample_weight=dw, init_score=init_scores, group=dg) + # value of the root node is 0 when init_score is set + assert model.booster_.trees_to_dataframe()['value'][0] == 0 + + client.close(timeout=CLIENT_CLOSE_TIMEOUT) + + def sklearn_checks_to_run(): check_names = [ "check_estimator_get_tags_default_keys", From 87e333f30506955de967ee4d3eac752cce49aefe Mon Sep 17 00:00:00 2001 From: Akshita Dixit <56997545+akshitadixit@users.noreply.github.com> Date: Wed, 24 Mar 2021 08:49:59 +0530 Subject: [PATCH 7/7] Apply suggestions from code review Co-authored-by: 
Nikita Titov --- docs/Features.rst | 2 +- docs/GPU-Performance.rst | 2 +- docs/GPU-Windows.rst | 36 ++++++++++++++++++------------------ 3 files changed, 20 insertions(+), 20 deletions(-) diff --git a/docs/Features.rst b/docs/Features.rst index 6c7b07a8c813..64be80ba8b26 100644 --- a/docs/Features.rst +++ b/docs/Features.rst @@ -54,7 +54,7 @@ Leaf-wise may cause over-fitting when ``#data`` is small, so LightGBM includes t .. image:: ./_static/images/leaf-wise.png :align: center - :alt: A diagram depicting leaf wise tree growth in which only the node with the highest loss change is split and not bother with the rest of the nodes in the same level. This results in an asymmetrical tree where subsequent splitting is happenning only on one side of the tree. + :alt: A diagram depicting leaf wise tree growth in which only the node with the highest loss change is split and not bother with the rest of the nodes in the same level. This results in an asymmetrical tree where subsequent splitting is happening only on one side of the tree. Optimal Split for Categorical Features ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/GPU-Performance.rst b/docs/GPU-Performance.rst index 186c5b7c4b18..ab7ff4137cf8 100644 --- a/docs/GPU-Performance.rst +++ b/docs/GPU-Performance.rst @@ -163,7 +163,7 @@ We record the wall clock time after 500 iterations, as shown in the figure below .. image:: ./_static/images/gpu-performance-comparison.png :align: center :target: ./_static/images/gpu-performance-comparison.png - :alt: A performance chart which is a record of the wall clock time after 500 iterations on G P U for Higgs, epsilon, Bosch, Microsoft L T R, Expo and Yahoo L T R and bin size of 63 performs comparitively better. + :alt: A performance chart which is a record of the wall clock time after 500 iterations on G P U for Higgs, epsilon, Bosch, Microsoft L T R, Expo and Yahoo L T R and bin size of 63 performs comparatively better. When using a GPU, it is advisable to use a bin size of 63 rather than 255, because it can speed up training significantly without noticeably affecting accuracy. On CPU, using a smaller bin size only marginally improves performance, sometimes even slows down training, diff --git a/docs/GPU-Windows.rst b/docs/GPU-Windows.rst index 58ea43b08ab4..90772ddaf2c3 100644 --- a/docs/GPU-Windows.rst +++ b/docs/GPU-Windows.rst @@ -46,21 +46,21 @@ To modify PATH, just follow the pictures after going to the ``Control Panel``: .. image:: ./_static/images/screenshot-system.png :align: center :target: ./_static/images/screenshot-system.png - :alt: A screenshot of the System option under System and Security of the Control Panel + :alt: A screenshot of the System option under System and Security of the Control Panel. Then, go to ``Advanced`` > ``Environment Variables...``: .. image:: ./_static/images/screenshot-advanced-system-settings.png :align: center :target: ./_static/images/screenshot-advanced-system-settings.png - :alt: A screenshot of the System Properties window + :alt: A screenshot of the System Properties window. Under ``System variables``, the variable ``Path``: .. image:: ./_static/images/screenshot-environment-variables.png :align: center :target: ./_static/images/screenshot-environment-variables.png - :alt: A screenshot of the Environment variables window with variable path selected under the system variables + :alt: A screenshot of the Environment variables window with variable path selected under the system variables. 
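
Once ``Path`` has been edited, it is worth confirming from a freshly opened shell that the directories you added are actually picked up. The check below is a minimal sketch, assuming a working Python installation; the tool names are illustrative examples of what this guide later adds to ``PATH``, not part of the patch itself::

    # Print where each build tool resolves from, or flag it as missing from PATH.
    import shutil

    for tool in ("gcc", "g++", "cmake", "git"):
        print(tool, "->", shutil.which(tool) or "NOT FOUND on PATH")
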
-------------- @@ -108,7 +108,7 @@ You may choose a version other than the most recent one if you need a previous M .. image:: ./_static/images/screenshot-mingw-installation.png :align: center :target: ./_static/images/screenshot-mingw-installation.png - :alt: A screenshot of the Min G W installation setup settings window + :alt: A screenshot of the Min G W installation setup settings window. Then, add to your PATH the following (to adjust to your MinGW version): @@ -127,7 +127,7 @@ You can check which MinGW version you are using by running the following in a co .. image:: ./_static/images/screenshot-r-mingw-used.png :align: center :target: ./_static/images/screenshot-r-mingw-used.png - :alt: A screenshot of the administrator command prompt where G C C version is being checked + :alt: A screenshot of the administrator command prompt where G C C version is being checked. To check whether you need 32-bit or 64-bit MinGW for R, install LightGBM as usual and check for the following: @@ -225,7 +225,7 @@ This is what you should (approximately) get at the end of Boost compilation: .. image:: ./_static/images/screenshot-boost-compiled.png :align: center :target: ./_static/images/screenshot-boost-compiled.png - :alt: A screenshot of the command prompt that ends with text that reads - updated 14621 targets + :alt: A screenshot of the command prompt that ends with text that reads - updated 14621 targets. If you are getting an error: @@ -249,7 +249,7 @@ Installing Git for Windows is straightforward, use the following `link`_. .. image:: ./_static/images/screenshot-git-for-windows.png :align: center :target: ./_static/images/screenshot-git-for-windows.png - :alt: A screenshot of the website to download git that shows various versions of git compatible with 32 bit and 64 bit Windows separately + :alt: A screenshot of the website to download git that shows various versions of git compatible with 32 bit and 64 bit Windows separately. Now, we can fetch LightGBM repository for GitHub. Run Git Bash and the following command: @@ -277,7 +277,7 @@ Installing CMake requires one download first and then a lot of configuration for .. image:: ./_static/images/screenshot-downloading-cmake.png :align: center :target: ./_static/images/screenshot-downloading-cmake.png - :alt: A screenshot of the binary distributions of C Make for downloading on 64 bit Windows + :alt: A screenshot of the binary distributions of C Make for downloading on 64 bit Windows. - Download `CMake`_ (3.8 or higher) @@ -296,12 +296,12 @@ Installing CMake requires one download first and then a lot of configuration for .. image:: ./_static/images/screenshot-create-directory.png :align: center :target: ./_static/images/screenshot-create-directory.png - :alt: A screenshot with a pop-up window that reads - Build directory does not exist, should I recreate it? + :alt: A screenshot with a pop-up window that reads - Build directory does not exist, should I create it? .. image:: ./_static/images/screenshot-mingw-makefiles-to-use.png :align: center :target: ./_static/images/screenshot-mingw-makefiles-to-use.png - :alt: A screenshot that asks to sepcify the generator for the project which should be selected as Min G W makefiles and selected as the use default native compilers option + :alt: A screenshot that asks to specify the generator for the project which should be selected as Min G W makefiles and selected as the use default native compilers option. 
- Lookup for ``USE_GPU`` and check the checkbox @@ -317,7 +317,7 @@ Installing CMake requires one download first and then a lot of configuration for .. image:: ./_static/images/screenshot-configured-lightgbm.png :align: center :target: ./_static/images/screenshot-configured-lightgbm.png - :alt: A screenshot of the C Make window after clicking on the configure button + :alt: A screenshot of the C Make window after clicking on the configure button. :: @@ -378,7 +378,7 @@ You can do everything in the Git Bash console you left open: .. image:: ./_static/images/screenshot-lightgbm-with-gpu-support-compiled.png :align: center :target: ./_static/images/screenshot-lightgbm-with-gpu-support-compiled.png - :alt: A screenshot of the git bash window with Light G B M successfully installed + :alt: A screenshot of the git bash window with Light G B M successfully installed. If everything was done correctly, you now compiled CLI LightGBM with GPU support! @@ -395,7 +395,7 @@ You can now test LightGBM directly in CLI in a **command prompt** (not Git Bash) .. image:: ./_static/images/screenshot-lightgbm-in-cli-with-gpu.png :align: center :target: ./_static/images/screenshot-lightgbm-in-cli-with-gpu.png - :alt: A screenshot of the command prompt where a binary classification model is being trained using Light G B M + :alt: A screenshot of the command prompt where a binary classification model is being trained using Light G B M. Congratulations for reaching this stage! @@ -411,7 +411,7 @@ Now that you compiled LightGBM, you try it... and you always see a segmentation .. image:: ./_static/images/screenshot-segmentation-fault.png :align: center :target: ./_static/images/screenshot-segmentation-fault.png - :alt: A screenshot of the command prompt where a segmentation fault has occurred while using Light G B M + :alt: A screenshot of the command prompt where a segmentation fault has occurred while using Light G B M. Please check if you are using the right device (``Using GPU device: ...``). You can find a list of your OpenCL devices using `GPUCapsViewer`_, and make sure you are using a discrete (AMD/NVIDIA) GPU if you have both integrated (Intel) and discrete GPUs installed. Also, try to set ``gpu_device_id = 0`` and ``gpu_platform_id = 0`` or ``gpu_device_id = -1`` and ``gpu_platform_id = -1`` to use the first platform and device or the default platform and device. @@ -426,7 +426,7 @@ You will have to redo the compilation steps for LightGBM to add debugging mode. .. image:: ./_static/images/screenshot-files-to-remove.png :align: center :target: ./_static/images/screenshot-files-to-remove.png - :alt: A screenshot of the Light G B M folder with 1 folder and 3 files selected to be removed + :alt: A screenshot of the Light G B M folder with 1 folder and 3 files selected to be removed. Once you removed the file, go into CMake, and follow the usual steps. Before clicking "Generate", click on "Add Entry": @@ -434,14 +434,14 @@ Before clicking "Generate", click on "Add Entry": .. image:: ./_static/images/screenshot-added-manual-entry-in-cmake.png :align: center :target: ./_static/images/screenshot-added-manual-entry-in-cmake.png - :alt: A screenshot of the Cache Entry popup where the name is set to C Make_Build_Type in all caps, the type is set to STRING in all caps and the value is set to Debug + :alt: A screenshot of the Cache Entry popup where the name is set to CMAKE_BUILD_TYPE in all caps, the type is set to STRING in all caps and the value is set to Debug. In addition, click on Configure and Generate: .. 
image:: ./_static/images/screenshot-configured-and-generated-cmake.png :align: center :target: ./_static/images/screenshot-configured-and-generated-cmake.png - :alt: A screenshot of the C Make window after clicking on configure and generate + :alt: A screenshot of the C Make window after clicking on configure and generate. And then, follow the regular LightGBM CLI installation from there. @@ -455,7 +455,7 @@ open a command prompt and run the following: .. image:: ./_static/images/screenshot-debug-run.png :align: center :target: ./_static/images/screenshot-debug-run.png - :alt: A screenshot of the command prompt after the command above is run + :alt: A screenshot of the command prompt after the command above is run. Type ``run`` and press the Enter key.
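
With a working (release or debug) build in place, the GPU setup can also be smoke-tested from Python rather than the CLI. The snippet below is a minimal sketch, assuming the ``lightgbm`` Python package on the machine was itself built with GPU support; it trains a tiny binary model using the small ``max_bin`` recommended for GPU training::

    # Tiny GPU smoke test on synthetic data; assumes a GPU-enabled LightGBM build.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.RandomState(42)
    X = rng.rand(1000, 10)
    y = rng.randint(0, 2, size=1000)

    params = {"objective": "binary", "device": "gpu", "max_bin": 63, "verbosity": 1}
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
    print("Trained", booster.current_iteration(), "iterations on device:", params["device"])

If this raises an OpenCL-related error, the same ``gpu_platform_id`` / ``gpu_device_id`` hints from the troubleshooting section above apply.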