From 5bd1793b17780bb04f5ce11fae1ee0bba8bfb7ad Mon Sep 17 00:00:00 2001
From: James Bourbeau
Date: Thu, 30 Jul 2020 11:03:54 -0500
Subject: [PATCH 1/4] Add imports to code snippet

---
 doc/tutorials/dask.rst | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
index 796b37459352..2e496a033997 100644
--- a/doc/tutorials/dask.rst
+++ b/doc/tutorials/dask.rst
@@ -33,8 +33,11 @@ illustrates the basic usage:
 
 .. code-block:: python
 
-    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
-    client = Client(cluster)
+    import xgboost as xgb
+    import dask.distributed
+
+    cluster = dask.distributed.LocalCluster(n_workers=4, threads_per_worker=1)
+    client = dask.distributed.Client(cluster)
 
     dtrain = xgb.dask.DaskDMatrix(client, X, y)  # X and y are dask dataframes or arrays
 
@@ -44,8 +47,8 @@ illustrates the basic usage:
                             dtrain,
                             num_boost_round=4, evals=[(dtrain, 'train')])
 
-Here we first create a cluster in single-node mode wtih ``distributed.LocalCluster``, then
-connect a ``client`` to this cluster, setting up environment for later computation.
+Here we first create a cluster in single-node mode wtih ``dask.distributed.LocalCluster``, then
+connect a ``dask.distributed.Client`` to this cluster, setting up environment for later computation.
 Similar to non-distributed interface, we create a ``DMatrix`` object and pass it to
 ``train`` along with some other parameters. Except in dask interface, client is an extra
 argument for carrying out the computation, when set to ``None`` XGBoost will use the
@@ -88,7 +91,7 @@ will override the configuration in Dask. For example:
 
 .. code-block:: python
 
-  with LocalCluster(n_workers=7, threads_per_worker=4) as cluster:
+  with dask.distributed.LocalCluster(n_workers=7, threads_per_worker=4) as cluster:
 
 There are 4 threads allocated for each dask worker. Then by default XGBoost will use 4
 threads in each process for both training and prediction. But if ``nthread`` parameter is
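For readers who want to run the snippet this patch touches end to end: ``X`` and ``y`` are
assumed to already exist as dask collections. A self-contained sketch of the same workflow,
using synthetic dask arrays (the shapes, chunk sizes, and worker counts below are arbitrary
placeholders, not values from this patch), could look like:

.. code-block:: python

    import xgboost as xgb
    import dask.array as da
    import dask.distributed

    cluster = dask.distributed.LocalCluster(n_workers=4, threads_per_worker=1)
    client = dask.distributed.Client(cluster)

    # Synthetic stand-ins for real dask dataframes or arrays.
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.random(100_000, chunks=10_000)

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(client, {'verbosity': 2, 'tree_method': 'hist'},
                            dtrain,
                            num_boost_round=4, evals=[(dtrain, 'train')])

    booster = output['booster']  # the trained model
    history = output['history']  # per-round evaluation results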
From 1c701e3856bd15ef9be589ba1a83d335f428ccec Mon Sep 17 00:00:00 2001
From: James Bourbeau
Date: Thu, 30 Jul 2020 11:40:12 -0500
Subject: [PATCH 2/4] Minor updates to dask documentation

---
 doc/tutorials/dask.rst | 92 +++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 46 deletions(-)

diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
index 2e496a033997..8ab5cc32b74e 100644
--- a/doc/tutorials/dask.rst
+++ b/doc/tutorials/dask.rst
@@ -3,12 +3,12 @@ Distributed XGBoost with Dask
 #############################
 
 `Dask <https://dask.org>`_ is a parallel computing library built on Python. Dask allows
-easy management of distributed workers and excels handling large distributed data science
+easy management of distributed workers and excels at handling large distributed data science
 workflows. The implementation in XGBoost originates from `dask-xgboost
 <https://github.com/dask/dask-xgboost>`_ with some extended functionalities and a
 different interface. Right now it is still under construction and may change (with proper
-warnings) in the future. The tutorial here focus on basic usage of dask with CPU tree
-algorithm. For an overview of GPU based training and internal working, see `A New,
+warnings) in the future. The tutorial here focuses on basic usage of dask with CPU tree
+algorithms. For an overview of GPU based training and internal workings, see `A New,
 Official Dask API for XGBoost `_.
 
@@ -16,20 +16,21 @@ Official Dask API for XGBoost
 Requirements
 ************
 
-Dask is trivial to install using either pip or conda. `See here for official install
-documentation <https://docs.dask.org/en/latest/install.html>`_. For accelerating XGBoost
-with GPU, `dask-cuda <https://github.com/rapidsai/dask-cuda>`_ is recommended for creating
-GPU clusters.
+Dask can be installed using either pip or conda (see the dask `installation
+documentation <https://docs.dask.org/en/latest/install.html>`_ for more information). For
+accelerating XGBoost with GPUs, `dask-cuda <https://github.com/rapidsai/dask-cuda>`_ is
+recommended for creating GPU clusters.
 
 ********
 Overview
 ********
 
-There are 3 different components in dask from a user's perspective, namely a scheduler,
-bunch of workers and some clients connecting to the scheduler. For using XGBoost with
-dask, one needs to call XGBoost dask interface from the client side. A small example
-illustrates the basic usage:
+A dask cluster consists of three different components: a centralized scheduler, one or
+more workers, and one or more clients which act as the user-facing entry point for submitting
+tasks to the cluster. When using XGBoost with dask, one needs to call the XGBoost dask interface
+from the client side. Below is a small example which illustrates basic usage of running XGBoost
+on a dask cluster:
 
 .. code-block:: python
 
     import xgboost as xgb
     import dask.distributed
 
     cluster = dask.distributed.LocalCluster(n_workers=4, threads_per_worker=1)
     client = dask.distributed.Client(cluster)
 
     dtrain = xgb.dask.DaskDMatrix(client, X, y)  # X and y are dask dataframes or arrays
 
     output = xgb.dask.train(client, {'verbosity': 2, 'tree_method': 'hist'},
                             dtrain,
                             num_boost_round=4, evals=[(dtrain, 'train')])
 
-Here we first create a cluster in single-node mode wtih ``dask.distributed.LocalCluster``, then
-connect a ``dask.distributed.Client`` to this cluster, setting up environment for later computation.
-Similar to non-distributed interface, we create a ``DMatrix`` object and pass it to
-``train`` along with some other parameters. Except in dask interface, client is an extra
-argument for carrying out the computation, when set to ``None`` XGBoost will use the
-default client returned from dask.
+Here we first create a cluster in single-node mode with ``dask.distributed.LocalCluster``, then
+connect a ``dask.distributed.Client`` to this cluster, setting up an environment for later computation.
+
+We then create a ``DMatrix`` object and pass it to ``train``, along with some other parameters,
+much like XGBoost's normal, non-dask interface. The primary difference with XGBoost's dask interface is
+that we pass our dask client as an additional argument for carrying out the computation. Note that if
+client is set to ``None``, XGBoost will use the default client returned by dask.
 
 There are two sets of APIs implemented in XGBoost. The first set is functional API
-illustrated in above example. Given the data and a set of parameters, `train` function
-returns a model and the computation history as Python dictionary
+illustrated in the above example. Given the data and a set of parameters, the ``train`` function
+returns a model and the computation history as a Python dictionary:
 
 .. code-block:: python
 
   {'booster': Booster,
    'history': dict}
 
-For prediction, pass the ``output`` returned by ``train`` into ``xgb.dask.predict``
+For prediction, pass the ``output`` returned by ``train`` into ``xgb.dask.predict``:
 
 .. code-block:: python
 
     prediction = xgb.dask.predict(client, output, dtrain)
 
 Or equivalently, pass ``output['booster']``:
 
 .. code-block:: python
 
     prediction = xgb.dask.predict(client, output['booster'], dtrain)
 
 Here ``prediction`` is a dask ``Array`` object containing predictions from model.
 
-Another set of API is a Scikit-Learn wrapper, which mimics the stateful Scikit-Learn
-interface with ``DaskXGBClassifier`` and ``DaskXGBRegressor``. See ``xgboost/demo/dask``
-for more examples.
+Alternatively, XGBoost also implements the Scikit-Learn interface with ``DaskXGBClassifier``
+and ``DaskXGBRegressor``. See ``xgboost/demo/dask`` for more examples.
 
 *******
 Threads
 *******
@@ -112,39 +113,38 @@ XGBoost will use 8 threads in each training process.
 
 *****************************************************************************
 Why is the initialization of ``DaskDMatrix`` so slow and throws weird errors
 *****************************************************************************
 
-The dask API in XGBoost requires construction of ``DaskDMatrix``. With ``Scikit-Learn``
-interface, ``DaskDMatrix`` is implicitly constructed for each input data during `fit` or
-`predict`. You might have observed its construction is taking incredible amount of time,
-and sometimes throws error that doesn't seem to be relevant to `DaskDMatrix`. Here is a
-brief explanation for why. By default most of dask's computation is `lazy
+The dask API in XGBoost requires construction of ``DaskDMatrix``. With the Scikit-Learn
+interface, ``DaskDMatrix`` is implicitly constructed for all input data during the ``fit`` or
+``predict`` steps. You might have observed that ``DaskDMatrix`` construction can take large amounts of time,
+and sometimes throws errors that don't seem to be relevant to ``DaskDMatrix``. Here is a
+brief explanation for why. By default most dask computations are `lazily
 evaluated `_, which
-means the computation is not carried out until you explicitly ask for result, either by
-calling `compute()` or `wait()`. See above link for details in dask, and `this wiki
-<https://en.wikipedia.org/wiki/Lazy_evaluation>`_ for general concept of lazy evaluation.
-The `DaskDMatrix` constructor forces all lazy computation to materialize, which means it's
+means that computation is not carried out until you explicitly ask for a result by, for example,
+calling ``compute()``. See the previous link for details in dask, and `this wiki
+<https://en.wikipedia.org/wiki/Lazy_evaluation>`_ for information on the general concept of lazy evaluation.
+The ``DaskDMatrix`` constructor forces lazy computations to be evaluated, which means it's
 where all your earlier computation actually being carried out, including operations like
-`dd.read_csv()`. To isolate the computation in `DaskDMatrix` from other lazy
-computations, one can explicitly wait for results of input data before calling constructor
-of `DaskDMatrix`. Also dask's `web interface
-<https://distributed.dask.org/en/latest/web.html>`_ can be used to monitor what operations
-are currently being performed.
+``dd.read_csv()``. To isolate the computation in ``DaskDMatrix`` from other lazy
+computations, one can explicitly wait for results of input data before constructing a ``DaskDMatrix``.
+Also dask's `diagnostics dashboard <https://distributed.dask.org/en/latest/web.html>`_ can be used to
+monitor what operations are currently being performed.
 
 ***********
 Limitations
 ***********
 
-Basic functionalities including training and generating predictions for regression and
-classification are implemented. But there are still some other limitations we haven't
-addressed yet.
+Basic functionality including model training and generating classification and regression predictions
+has been implemented. However, there are still some other limitations we haven't
+addressed yet:
 
-- Label encoding for Scikit-Learn classifier may not be supported. Meaning that user need
+- Label encoding for the ``DaskXGBClassifier`` classifier may not be supported. So users need
   to encode their training labels into discrete values first.
-- Ranking is not supported right now.
+- Ranking is not yet supported.
 - Empty worker is not well supported by classifier. If the training hangs for classifier
   with a warning about empty DMatrix, please consider balancing your data first. But
   regressor works fine with empty DMatrix.
 - Callback functions are not tested.
-- Only ``GridSearchCV`` from ``scikit-learn`` is supported for dask interface. Meaning
-  that we can distribute data among workers but have to train one model at a time. If you
-  want to scale up grid searching with model parallelism by ``dask-ml``, please consider
-  using normal ``scikit-learn`` interface like `xgboost.XGBRegressor` for now.
+- Only ``GridSearchCV`` from Scikit-Learn is supported. Meaning that we can distribute data
+  among workers but have to train one model at a time. If you want to scale up grid searching with
+  model parallelism with `Dask-ML <https://ml.dask.org>`_, please consider using XGBoost's non-dask
+  Scikit-Learn interface, for example ``xgboost.XGBRegressor`, for now.
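Two passages in this patch lend themselves to short examples. First, the Scikit-Learn style
wrapper: a minimal sketch, reusing the ``client``, ``X``, and ``y`` from the sketch after
patch 1 (the hyperparameter values are placeholders):

.. code-block:: python

    import xgboost as xgb

    regressor = xgb.dask.DaskXGBRegressor(n_estimators=10, tree_method='hist')
    regressor.client = client          # optional; defaults to the current dask client
    regressor.fit(X, y)
    prediction = regressor.predict(X)  # a lazy dask array
    print(prediction.compute()[:5])

Second, the advice about isolating ``DaskDMatrix`` construction from other lazy computations
can be made concrete with ``persist`` and ``wait``. A sketch, where the ``dd.read_csv``
pattern and the ``label`` column name are hypothetical:

.. code-block:: python

    import dask.dataframe as dd
    from dask.distributed import wait

    df = dd.read_csv('data-*.csv')            # lazy; nothing is read yet
    X = df.drop(columns=['label']).persist()  # start computing in the background
    y = df['label'].persist()
    wait([X, y])                              # block until inputs are materialized

    # DaskDMatrix construction below now reflects only its own work, not read_csv.
    dtrain = xgb.dask.DaskDMatrix(client, X, y)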
From fa047e3d6cbb5c279a2d5a41794eee8c2f0585af Mon Sep 17 00:00:00 2001
From: James Bourbeau
Date: Thu, 30 Jul 2020 11:52:30 -0500
Subject: [PATCH 3/4] More updates

---
 doc/tutorials/dask.rst | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
index f899c0da4909..cf461c5a54f9 100644
--- a/doc/tutorials/dask.rst
+++ b/doc/tutorials/dask.rst
@@ -121,21 +121,21 @@ Working with asyncio
 
 .. versionadded:: 1.2.0
 
-XGBoost dask interface supports the new ``asyncio`` in Python and can be integrated into
+XGBoost's dask interface supports the new ``asyncio`` in Python and can be integrated into
 asynchronous workflows. For using dask with asynchronous operations, please refer to
-`dask example <https://examples.dask.org/applications/async-await.html>`_ and document in
-`distributed <https://distributed.dask.org/en/latest/asynchronous.html>`_. As XGBoost
-takes ``Client`` object as an argument for both training and prediction, so when
-``asynchronous=True`` is specified when creating ``Client``, the dask interface can adapt
-the change accordingly. All functions provided by the functional interface returns a
-coroutine when called in async function, and hence require awaiting to get the result,
-including ``DaskDMatrix``.
+`this dask example <https://examples.dask.org/applications/async-await.html>`_ and the documentation in
+`distributed <https://distributed.dask.org/en/latest/asynchronous.html>`_. To use XGBoost's
+dask interface asynchronously, the ``client`` which is passed as an argument for training and
+prediction must be operating in asynchronous mode by specifying ``asynchronous=True`` when the
+``client`` is created (example below). All functions (including ``DaskDMatrix``) provided
+by the functional interface will then return coroutines which can be awaited to retrieve
+their result.
 
 Functional interface:
 
 .. code-block:: python
 
-    async with Client(scheduler_address, asynchronous=True) as client:
+    async with dask.distributed.Client(scheduler_address, asynchronous=True) as client:
         X, y = generate_array()
         m = await xgb.dask.DaskDMatrix(client, X, y)
         output = await xgb.dask.train(client, {}, dtrain=m)
 
         with_m = await xgb.dask.predict(client, output, m)
 
         print(await client.compute(with_m))
 
-While for Scikit Learn interface, trivial methods like ``set_params`` and accessing class
+While for the Scikit-Learn interface, trivial methods like ``set_params`` and accessing class
 attributes like ``evals_result_`` do not require ``await``. Other methods involving actual
 computation will return a coroutine and hence require awaiting:
 
 .. code-block:: python
 
-    async with Client(scheduler_address, asynchronous=True) as client:
+    async with dask.distributed.Client(scheduler_address, asynchronous=True) as client:
         X, y = generate_array()
         regressor = await xgb.dask.DaskXGBRegressor(verbosity=1, n_estimators=2)
         regressor.set_params(tree_method='hist')  # trivial method, synchronous operation
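The async snippets above rely on a ``generate_array`` helper (defined in the xgboost dask
demos) and an already-running scheduler. A runnable variant of the functional-interface
example, using synthetic dask arrays and ``asyncio.run`` (the scheduler address is a
placeholder for a real deployment):

.. code-block:: python

    import asyncio

    import dask.array as da
    import dask.distributed
    import xgboost as xgb

    async def train_async(scheduler_address):
        async with dask.distributed.Client(scheduler_address, asynchronous=True) as client:
            X = da.random.random((10_000, 5), chunks=(1_000, 5))
            y = da.random.random(10_000, chunks=1_000)

            m = await xgb.dask.DaskDMatrix(client, X, y)  # awaitable in async mode
            output = await xgb.dask.train(client, {'tree_method': 'hist'}, dtrain=m)
            prediction = await xgb.dask.predict(client, output, m)
            return await client.compute(prediction)       # gather concrete results

    result = asyncio.run(train_async('tcp://127.0.0.1:8786'))  # placeholder address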
From 9895564b3718caaae7bffc297e3b537fc5f9dd03 Mon Sep 17 00:00:00 2001
From: James Bourbeau
Date: Thu, 30 Jul 2020 11:54:56 -0500
Subject: [PATCH 4/4] Add missing backtick

---
 doc/tutorials/dask.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
index cf461c5a54f9..264bceec47c1 100644
--- a/doc/tutorials/dask.rst
+++ b/doc/tutorials/dask.rst
@@ -207,4 +207,4 @@ addressed yet:
 - Only ``GridSearchCV`` from Scikit-Learn is supported. Meaning that we can distribute data
   among workers but have to train one model at a time. If you want to scale up grid searching with
   model parallelism with `Dask-ML <https://ml.dask.org>`_, please consider using XGBoost's non-dask
-  Scikit-Learn interface, for example ``xgboost.XGBRegressor`, for now.
+  Scikit-Learn interface, for example ``xgboost.XGBRegressor``, for now.
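A closing note on the limitation bullet this last patch touches: the workaround it suggests,
model-parallel grid search through Dask-ML over the plain (non-dask) estimator, could be
sketched as below. This assumes ``dask-ml`` is installed; the data and parameter grid are
placeholders:

.. code-block:: python

    import numpy as np
    import xgboost as xgb
    import dask.distributed
    from dask_ml.model_selection import GridSearchCV

    client = dask.distributed.Client()  # candidate models are fitted across workers

    # In-memory (non-dask) data, since the plain estimator is used here.
    X = np.random.random((1_000, 10))
    y = np.random.random(1_000)

    search = GridSearchCV(xgb.XGBRegressor(tree_method='hist'),
                          {'max_depth': [2, 4, 6], 'n_estimators': [50, 100]})
    search.fit(X, y)
    print(search.best_params_)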