Optimized BuildHist function #5156
Conversation
Force-pushed from 57e6a93 to fc565ac.
Current performance, using the previous commit from #5138 as well:
@hcho3, @trivialfis, I have finalized the PR from my side. Could you please take a look?
Force-pushed from b8b7c67 to 48da1df.
One distributed test is stuck: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-5156/8/pipeline/112. I had to kill it by hand. I'm looking at the code now to see what went wrong; probably a worker is not calling AllReduce().
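For context, a hang like this typically means one worker skipped a collective call that the others made. Below is a minimal sketch of the hazard, using rabit's Allreduce as xgboost does; the helper function and flag are hypothetical:

```cpp
#include <vector>
#include <rabit/rabit.h>

// Hypothetical helper, not xgboost code: illustrates why every worker must
// reach the collective call, or the whole distributed job hangs.
void SyncHistograms(std::vector<double>* sums, bool has_local_rows) {
  // Buggy pattern: a worker with no local rows skips Allreduce, so the
  // other workers block inside the collective forever.
  //   if (has_local_rows) {
  //     rabit::Allreduce<rabit::op::Sum>(sums->data(), sums->size());
  //   }
  (void)has_local_rows;  // referenced only in the commented-out pattern above

  // Correct pattern: every worker calls Allreduce the same number of times,
  // contributing zeros if it has nothing to add.
  rabit::Allreduce<rabit::op::Sum>(sums->data(), sums->size());
}
```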
Force-pushed from e21888b to 8b7acd6.
@hcho3, I have fixed the issue. CC @trivialfis
LGTM. Also thanks for adding tests.
Reminder to myself: Write a follow-up PR so that we can control how many threads
Force-pushed from 46735da to f88b064.
@hcho3, I have addressed your comments and also added an nthreads parameter to common::ParallelFor2d. CI is green.
Yup. Will review tonight. Sorry for the long wait.
Huge thanks for the effort! Overall looks good to me. I will run some benchmarks tomorrow for memory usage, distributed environment etc. Will merge if no regression is found.
I'm a little bit concerned that it's possible for a user to change
@trivialfis I think Lines 208 to 210 in e526871 address this.
That's reassuring.
@SmirnovEgorRu Could you please take a look at this dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/url/url_svmlight.tar.gz ? It's extremely sparse, and my benchmark shows a regression in both training time and memory usage.
Before:
After:
@SmirnovEgorRu BTW, the memory usage is measured by https://github.com/gsauthof/cgmemtime .
Force-pushed from f88b064 to 5f7603c.
@trivialfis @hcho3
So now it is better in both execution time and memory consumption.
Force-pushed from 5f7603c to 02b7232.
Force-pushed from 02b7232 to 952c2aa.
@hcho3, @trivialfis, CI is also green. Do you see any blockers to merging the pull request?
@hcho3 @trivialfis, I have already created a new PR, #5244, which should finalize the efforts on reverting the optimizations. Could you please approve or provide new comments on the current PR, to enable review of the next one?
LGTM. Also thanks for splitting AddHistRows from BuildLocalHistograms.
```cpp
{
  size_t tid = omp_get_thread_num();
  size_t chunck_size = num_blocks_in_space / nthreads + !!(num_blocks_in_space % nthreads);

  size_t begin = chunck_size * tid;
  size_t end = std::min(begin + chunck_size, num_blocks_in_space);
  for (auto i = begin; i < end; i++) {
    func(space.GetFirstDimension(i), space.GetRange(i));
  }
}
```
Why are we manually splitting the loop range here? Is it because Visual Studio doesn't support size_t for the loop variable?
Just because I need to know which tasks are executed on which thread, so that the minimum possible number of histograms is allocated (this now helps to achieve even lower memory consumption on the URL data set).
As far as I know, the mapping of iterations to threads is not exactly defined in the OpenMP standard for "#pragma omp parallel for schedule(static)" and can differ between OMP implementations. So I implemented this explicitly with "#pragma omp parallel" to have the same behavior on every platform.
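To make the distinction concrete, here is a minimal sketch (not the xgboost code; DoBlock and the sizes are hypothetical): with schedule(static) the runtime picks the thread-to-iteration mapping, while the explicit variant fixes the mapping so per-thread histogram buffers can be planned ahead of time.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for func(space.GetFirstDimension(i), space.GetRange(i)).
static void DoBlock(std::size_t /*block_idx*/) {}

// Runtime-chosen mapping: which thread runs which iterations is up to the
// OpenMP implementation, so per-thread histogram reuse cannot be planned.
void ProcessRuntimeMapped(int num_blocks) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < num_blocks; ++i) {
    DoBlock(static_cast<std::size_t>(i));
  }
}

// Explicit mapping: block i always lands on thread i / chunk, identically on
// every platform, so the set of {tid, block} pairs is known in advance.
void ProcessExplicitlyMapped(std::size_t num_blocks, int nthreads) {
  #pragma omp parallel num_threads(nthreads)
  {
    const std::size_t tid = static_cast<std::size_t>(omp_get_thread_num());
    const std::size_t chunk = num_blocks / nthreads + !!(num_blocks % nthreads);
    const std::size_t begin = chunk * tid;
    const std::size_t end = std::min(begin + chunk, num_blocks);
    for (std::size_t i = begin; i < end; ++i) {
      DoBlock(i);
    }
  }
}
```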
```cpp
  hist_allocated_additionally++;
}
// map pair {tid, nid} to index of allocated histogram from hist_memory_
tid_nid_to_hist_[{tid, nid}] = hist_total++;
```
I'd like to hear your reasoning: why did you choose std::map for tid_nid_to_hist_ but std::vector for threads_to_nids_map_? Is it due to memory efficiency?
I agree that both of them could be implemented with either std::vector or std::map; there is no significant difference between them.
One reason I see to use std::map for tid_nid_to_hist_ instead of std::vector is this line:
const size_t idx = tid_nid_to_hist_.at({tid, nid});
If I had a std::vector here, I would need to add a check, something like
CHECK_NE(idx, std::numeric_limits<size_t>::max());
and initially fill all elements of tid_nid_to_hist_ with std::numeric_limits<size_t>::max().
With std::map, the .at() call simply throws an exception on a missing key, without any additional lines of code.
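For what it's worth, a small standalone sketch of that trade-off (the names and sizes here are illustrative, not the PR's actual members):

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

int main() {
  using Key = std::pair<std::size_t, std::size_t>;  // {tid, nid}

  // std::map variant: a missing key makes .at() throw std::out_of_range,
  // so no sentinel value or extra check is needed.
  std::map<Key, std::size_t> tid_nid_to_hist;
  tid_nid_to_hist[{0, 3}] = 0;
  const std::size_t idx = tid_nid_to_hist.at({0, 3});

  // std::vector variant: must be pre-filled with a sentinel and checked by hand.
  const std::size_t kUnset = std::numeric_limits<std::size_t>::max();
  const std::size_t nthreads = 2, nnodes = 4;
  std::vector<std::size_t> flat(nthreads * nnodes, kUnset);
  flat[0 * nnodes + 3] = 0;                   // flat[tid * nnodes + nid]
  const std::size_t idx2 = flat[0 * nnodes + 3];
  if (idx2 == kUnset) {
    throw std::runtime_error("histogram not allocated for this {tid, nid}");
  }

  return static_cast<int>(idx + idx2);
}
```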
Will look into this once we branch out 1.0. Thanks for your patience.
@trivialfis, let me understand: do we plan to include this change in the 1.0 version, as was originally discussed in #5008 (comment) with @hcho3?
I don't plan to add major changes at the last minute. It's fine, as we used to have a release every 2 or 3 months, and I would like to resume that pace once this 1.0 work is over. So your changes should be available shortly, even if not in the next release. Besides, there's a nightly build available for download. Will merge once we have a release branch.
@trivialfis I agree that, in general, we should not merge a major change right before a release. However, since you and I have approved the original version of this PR, can we merge this? The new version is only a little different from the original, and the difference is confined to a small part of the codebase. #5244 will have to wait, on the other hand.
Em... Currently I'm on holiday and only have a laptop available, so I can't run any meaningful tests. If you are confident, then I will let you decide.
@hcho3 @trivialfis, if you need any specific testing, such as running particular benchmarks or workloads, to gain more confidence, I'm happy to help here. I would be grateful for the chance to have this in the next XGB release.
@SmirnovEgorRu Basically memory usage, computation time, and accuracy (AUC, RMSE metrics) for representative datasets (like Higgs for dense wide columns, URL for sparse), with a restricted number of threads, max_depth, num_boosted_rounds, etc., and maybe setting the CPU affinity env for OMP manually (not necessary, but sometimes fun to see the difference). I usually do this myself so I can have a consistent environment for each run. For example, the numbers posted earlier seem to have been produced on different environments or with different parameters.
It would be great if you could run the Higgs and Airline datasets.
@trivialfis I am confident about this PR. I'm inclined to merge, as long as @SmirnovEgorRu runs some more benchmarks as requested.
@hcho3 Got it. My concerns are mostly around the consistency of the posted numbers across different PRs. As noted above, it would be nice to have benchmarks performed on the same platform with fixed parameters. The variance can be difficult to control.
@hcho3 @trivialfis, I prepared measurements on Higgs and Airline.
Higgs:
Log-loss in all cases (this PR / master):
(For this table, nthreads is fixed to 48.)
Airline + one-hot encoding:
Log-loss in all cases (this PR / master):
(For this table, nthreads is fixed to 48.)
HW: AWS c5.metal, CLX 8275 @ 3.0 GHz, 24 cores per socket, 2 sockets, HT on, 96 threads in total.
P.S. @trivialfis, yes, you're right, I used different HW parameters for the URL measurements, simply due to HW unavailability at the time. I will try to measure this on the 8275 too.
@SmirnovEgorRu For the niter table, what is the number of threads you used? And are all the numbers end-to-end time?
@hcho3, for the niter table I used 48 threads (to utilize only HW cores and not use HT).
And niter=1000 for the nthread table?
@hcho3, yes, I just used the default parameters in the benchmarks.
Thanks for the clarification.
For the whole URL data set, on the same c5.metal AWS instance, I obtained:
I use the following line to fit URL:
```python
output = xgb.train(params={'max_depth': 6, 'verbosity': 3, 'tree_method': 'hist'},
                   dtrain=dtrain, num_boost_round=10)
```
The accuracy metrics are the same as well. I hope this data is what you requested. Is it right?
@SmirnovEgorRu Yes, thanks for running the benchmarks.
Optimizations for histogram building; a part of issue #5104.
The PR contains changes from #5138 and will be rebased after #5138 is merged.