Use DMatrix Proxy for implementing data callback. #5629

trivialfis · 2020-05-05T03:07:49Z

@RAMitchell @hcho3 The second prototype.

Please ignore the code in quantile; it's copied and pasted from hist_util.cu.

Depends on #5623. The first prototype is in #5630.

hcho3

I like the example in data_iterator.py. The custom data iterator will be quite useful for users who want to integrate a custom data source.

See my questions, especially about data erasure.
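For readers who have not opened data_iterator.py, the callback pattern discussed here can be sketched roughly as follows. This is a minimal illustration, not code from this PR: it assumes the user's `next` callback stages one batch at a time on a proxy object and returns 0 when the data is exhausted. The names `BatchProxySketch`, `DataCallbacksSketch`, `CountValues`, and `VectorBatches` are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical proxy: a non-owning view of whichever batch is current.
struct BatchProxySketch {
  float const* data{nullptr};
  std::size_t size{0};
  void SetData(float const* ptr, std::size_t n) {
    data = ptr;
    size = n;
  }
};

// Hypothetical callback pair supplied by the user.
// `next` stages a batch on the proxy and returns 1, or returns 0 at the end.
struct DataCallbacksSketch {
  std::function<int(BatchProxySketch*)> next;
  std::function<void()> reset;
};

// Hypothetical consumer: rewinds the iterator, then drains it batch by batch.
inline std::size_t CountValues(DataCallbacksSketch const& cb) {
  BatchProxySketch proxy;
  std::size_t total = 0;
  cb.reset();
  while (cb.next(&proxy) != 0) {
    total += proxy.size;  // A real consumer would read proxy.data here.
  }
  return total;
}

// Hypothetical user-side data source backed by in-memory batches.
class VectorBatches {
 public:
  explicit VectorBatches(std::vector<std::vector<float>> batches)
      : batches_(std::move(batches)) {}

  int Next(BatchProxySketch* proxy) {
    assert(proxy != nullptr);
    if (it_ == batches_.size()) {
      return 0;  // No more batches.
    }
    proxy->SetData(batches_[it_].data(), batches_[it_].size());
    ++it_;
    return 1;
  }

  void Reset() { it_ = 0; }

  // The returned callbacks capture `this`, so they must not outlive the object.
  DataCallbacksSketch Callbacks() {
    return {[this](BatchProxySketch* p) { return Next(p); },
            [this]() { Reset(); }};
  }

 private:
  std::vector<std::vector<float>> batches_;
  std::size_t it_{0};
};
```

Because the consumer calls `reset` before each pass, the same source can be traversed more than once, which an iterative builder would need if it makes several passes over the data.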

hcho3 · 2020-05-12T01:23:41Z

python-package/xgboost/core.py

+                next_callback,
+                ctypes.c_float(missing),
+                ctypes.c_int(nthread),
+                ctypes.c_int(256),


Are we okay with hardcoding 256 for max_bin here?

hcho3 · 2020-05-12T01:26:33Z

python-package/xgboost/dask.py

+    '''
+    def __init__(self, data, label=None, weight=None, base_margin=None,
+                 label_lower_bound=None, label_upper_bound=None):
+        '''Generate some random data for demostration.


Does this comment make sense here? This is not a demo.

hcho3 · 2020-05-12T01:39:37Z

src/data/iterative_device_dmatrix.cu

+#define DISPATCH_MEM(__Proxy, __Fn)                                     \
+  [](DMatrixProxy const* proxy) -> decltype(                            \
+      (dmlc::get<CupyAdapterBatch>(proxy->Value())).__Fn()) {           \
+    if (proxy->Value().type() == typeid(CupyAdapterBatch)) {            \


Is it correct to say that the proxy matrix performs type erasure for the batch, and now we're trying to recover the type of the batch dynamically?

From the way I see it, we have an implicit list of allowable types, namely CupyAdapterBatch and CudfAdapterBatch. It may be better to add a type ID field to the DMatrix proxy. Take a look at PackedFunc from the TVM project, which uses type erasure to expose a generic function type. PackedFunc uses a type ID to distinguish between the underlying types.

Also, is it possible to avoid a macro here? Why use a macro?

I will look into the pack function today.

Cool, I felt that PackedFunc may give us a hint for simplifying the proxy.
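One way the type-ID suggestion in this thread could look, without a macro, is sketched below. It is only an illustration under stated assumptions: the two batch structs are simplified stand-ins that borrow the names from the diff, and `BatchType`, `ProxySketch`, `Dispatch`, `NumRowsFn`, and `NumRows` are hypothetical names, not code from this PR or from TVM.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>

// Simplified stand-ins for the adapter batch types named in the review.
struct CupyAdapterBatch {
  std::size_t rows;
  std::size_t NumRows() const { return rows; }
};
struct CudfAdapterBatch {
  std::size_t rows;
  std::size_t NumRows() const { return rows; }
};

// The list of allowable batch types, made explicit.
enum class BatchType { kCupy, kCudf };

// Hypothetical proxy: stores the batch type-erased, next to its type ID.
class ProxySketch {
 public:
  void Set(CupyAdapterBatch batch) {
    type_ = BatchType::kCupy;
    value_ = std::make_shared<CupyAdapterBatch>(batch);
  }
  void Set(CudfAdapterBatch batch) {
    type_ = BatchType::kCudf;
    value_ = std::make_shared<CudfAdapterBatch>(batch);
  }
  BatchType Type() const { return type_; }

  // The caller must request the type that matches Type().
  template <typename T>
  T const& As() const {
    assert(value_ != nullptr);
    return *static_cast<T const*>(value_.get());
  }

 private:
  BatchType type_{BatchType::kCupy};
  std::shared_ptr<void> value_;
};

// Recover the concrete type from the ID and forward it to a functor.
// The functor must return the same type for every batch type.
template <typename Fn>
auto Dispatch(ProxySketch const& proxy, Fn fn)
    -> decltype(fn(std::declval<CupyAdapterBatch const&>())) {
  switch (proxy.Type()) {
    case BatchType::kCupy:
      return fn(proxy.As<CupyAdapterBatch>());
    case BatchType::kCudf:
      return fn(proxy.As<CudfAdapterBatch>());
  }
  throw std::runtime_error("Unknown batch type.");
}

// Example functor with a templated call operator (works in C++11).
struct NumRowsFn {
  template <typename Batch>
  std::size_t operator()(Batch const& batch) const {
    return batch.NumRows();
  }
};
```

With this shape, supporting a new batch type means adding an enumerator, a `Set` overload, and a `case` in `Dispatch`, so the set of allowable types is visible in one place rather than implied by `typeid` comparisons.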

hcho3 · 2020-05-12T01:47:39Z

src/data/ellpack_page.cuh

                           size_t row_stride);
+
+  template <typename AdapterBatch>
+  explicit EllpackPageImpl(AdapterBatch batch, float missing, int device, bool is_dense, int nthread,


What's your rationale for adding a second interface for EllpackPageImpl()? Is it to specify custom cuts?

hcho3 · 2020-05-12T01:48:22Z

src/data/ellpack_page.cu

+    CopyDataRowMajor(batch, this, device, missing);
+  } else {
+    // CopyDataColumnMajor(adapter, batch, this, missing);
+    LOG(FATAL) << "Not implemented";


What makes the second EllpackPageImpl different from the first EllpackPageImpl, such that column major is not supported?

include/xgboost/c/callback.h

hcho3

Approving the general idea.

* Add new iterative DMatrix.
* Add new proxy DMatrix.
* Add dask interface.

trivialfis · 2020-07-07T18:03:07Z

The one last piece would be the dask interface; then the big series will be over.

trivialfis force-pushed the use-dmatrix-for-callback branch from d405ed8 to 6c33130 Compare May 6, 2020 12:07

hcho3 mentioned this pull request May 9, 2020

Callback functions for input data. #5630

Closed

hcho3 self-requested a review May 9, 2020 08:29

hcho3 reviewed May 12, 2020

hcho3 approved these changes May 12, 2020

trivialfis force-pushed the use-dmatrix-for-callback branch from 6c33130 to 155be2b Compare May 14, 2020 11:03

trivialfis added 2 commits May 21, 2020 17:36

Implement incremental building of device DMatrix.

094a3c7

* Add new iterative DMatrix.
* Add new proxy DMatrix.
* Add dask interface.

Initial commit for jvm.

c1d2c27

trivialfis force-pushed the use-dmatrix-for-callback branch from 87975fe to c1d2c27 Compare May 21, 2020 09:36

This was referenced Jun 2, 2020

Expose device sketching in header. #5747

Merged

Add helper for generating batches of data. #5756

Merged

Implement weighted sketching for adapter. #5760

Merged

hcho3 mentioned this pull request Jun 11, 2020

[Roadmap] 1.2.0 Roadmap #5734

Closed

14 tasks

trivialfis mentioned this pull request Jun 12, 2020

Accept iterator in device dmatrix. #5783

Merged

trivialfis closed this Jul 18, 2020

trivialfis deleted the use-dmatrix-for-callback branch July 18, 2020 01:24
