Optimized BuildHist function #5156
Conversation
Force-pushed from 57e6a93 to fc565ac.
Current performance, using the previous commit from #5138 as well:
@hcho3, @trivialfis, I have finalized the PR from my side. Could you please take a look?
Force-pushed from b8b7c67 to 48da1df.
One distributed test is stuck: https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-5156/8/pipeline/112. I had to kill it by hand. I'm looking at the code now to see what went wrong; probably a worker is not calling AllReduce().
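For context, a hang like this typically means one worker skipped a collective call that the others made. Below is a minimal sketch of the hazard, using rabit's Allreduce as xgboost does; the helper function and flag are hypothetical:

```cpp
#include <vector>
#include <rabit/rabit.h>

// Hypothetical helper, not xgboost code: illustrates why every worker must
// reach the collective call, or the whole distributed job hangs.
void SyncHistograms(std::vector<double>* sums, bool has_local_rows) {
  // Buggy pattern: a worker with no local rows skips Allreduce, so the
  // other workers block inside the collective forever.
  //   if (has_local_rows) {
  //     rabit::Allreduce<rabit::op::Sum>(sums->data(), sums->size());
  //   }
  (void)has_local_rows;  // referenced only in the commented-out pattern above

  // Correct pattern: every worker calls Allreduce the same number of times,
  // contributing zeros if it has nothing to add.
  rabit::Allreduce<rabit::op::Sum>(sums->data(), sums->size());
}
```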
Force-pushed from e21888b to 8b7acd6.
@hcho3, I have fixed the issue. CC @trivialfis
LGTM. Also thanks for adding tests.
Reminder to myself: Write a follow-up PR so that we can control how many threads
Force-pushed from 46735da to f88b064.
@hcho3, I have addressed your comments and also added an nthreads parameter to common::ParallelFor2d. CI is green.
Yup. Will review tonight. Sorry for the long wait.
Huge thanks for the effort! Overall looks good to me. I will run some benchmarks tomorrow for memory usage, distributed environment etc. Will merge if no regression is found.
I'm a little bit concerned that it's possible for a user to change
@trivialfis I think Lines 208 to 210 in e526871 address this.
That's reassuring.
@SmirnovEgorRu Could you please take a look at this dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/url/url_svmlight.tar.gz ? It's extremely sparse, and my benchmark shows a regression in both training time and memory usage.
Before:
After:
@SmirnovEgorRu BTW, the memory usage is measured by https://github.com/gsauthof/cgmemtime .
Force-pushed from f88b064 to 5f7603c.
@trivialfis @hcho3
So now it is better in both execution time and memory consumption.
Force-pushed from 5f7603c to 02b7232.
Force-pushed from 02b7232 to 952c2aa.
@hcho3, @trivialfis, CI is also green. Do you see any blockers to merging the pull request?
@hcho3 @trivialfis, I have already created a new PR, #5244, which should finalize the efforts on reverting the optimizations. Could you please approve or provide new comments on the current PR, to enable review of the next one?
LGTM. Also thanks for splitting AddHistRows from BuildLocalHistograms.
```cpp
{
  size_t tid = omp_get_thread_num();
  size_t chunck_size = num_blocks_in_space / nthreads + !!(num_blocks_in_space % nthreads);

  size_t begin = chunck_size * tid;
  size_t end = std::min(begin + chunck_size, num_blocks_in_space);
  for (auto i = begin; i < end; i++) {
    func(space.GetFirstDimension(i), space.GetRange(i));
  }
}
```
Why are we manually splitting the loop range here? Is it because Visual Studio doesn't support size_t for the loop variable?
Just because I need to know which tasks are executed on which thread, so that the minimum possible number of histograms is allocated (this now helps to achieve even lower memory consumption on the URL data set).
As far as I know, the mapping of iterations to threads is not exactly defined in the OpenMP standard for "#pragma omp parallel for schedule(static)" and can differ between OMP implementations. So I implemented this explicitly with "#pragma omp parallel" to have the same behavior on every platform.
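To make the distinction concrete, here is a minimal sketch (not the xgboost code; DoBlock and the sizes are hypothetical): with schedule(static) the runtime picks the thread-to-iteration mapping, while the explicit variant fixes the mapping so per-thread histogram buffers can be planned ahead of time.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for func(space.GetFirstDimension(i), space.GetRange(i)).
static void DoBlock(std::size_t /*block_idx*/) {}

// Runtime-chosen mapping: which thread runs which iterations is up to the
// OpenMP implementation, so per-thread histogram reuse cannot be planned.
void ProcessRuntimeMapped(int num_blocks) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < num_blocks; ++i) {
    DoBlock(static_cast<std::size_t>(i));
  }
}

// Explicit mapping: block i always lands on thread i / chunk, identically on
// every platform, so the set of {tid, block} pairs is known in advance.
void ProcessExplicitlyMapped(std::size_t num_blocks, int nthreads) {
  #pragma omp parallel num_threads(nthreads)
  {
    const std::size_t tid = static_cast<std::size_t>(omp_get_thread_num());
    const std::size_t chunk = num_blocks / nthreads + !!(num_blocks % nthreads);
    const std::size_t begin = chunk * tid;
    const std::size_t end = std::min(begin + chunk, num_blocks);
    for (std::size_t i = begin; i < end; ++i) {
      DoBlock(i);
    }
  }
}
```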
```cpp
  hist_allocated_additionally++;
}
// map pair {tid, nid} to index of allocated histogram from hist_memory_
tid_nid_to_hist_[{tid, nid}] = hist_total++;
```
I'd like to hear your reasoning: why did you choose std::map for tid_nid_to_hist_ but std::vector for threads_to_nids_map_? Is it due to memory efficiency?
I agree that both of them could be implemented with either std::vector or std::map; there is no significant difference between them.
One reason I see to use std::map for tid_nid_to_hist_ instead of std::vector is this line:
const size_t idx = tid_nid_to_hist_.at({tid, nid});
If I had a std::vector here, I would need to add a check, something like
CHECK_NE(idx, std::numeric_limits<size_t>::max());
and initially fill all elements of tid_nid_to_hist_ with std::numeric_limits<size_t>::max().
With std::map, the .at() call simply throws an exception on a missing key, without any additional lines of code.
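For what it's worth, a small standalone sketch of that trade-off (the names and sizes here are illustrative, not the PR's actual members):

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

int main() {
  using Key = std::pair<std::size_t, std::size_t>;  // {tid, nid}

  // std::map variant: a missing key makes .at() throw std::out_of_range,
  // so no sentinel value or extra check is needed.
  std::map<Key, std::size_t> tid_nid_to_hist;
  tid_nid_to_hist[{0, 3}] = 0;
  const std::size_t idx = tid_nid_to_hist.at({0, 3});

  // std::vector variant: must be pre-filled with a sentinel and checked by hand.
  const std::size_t kUnset = std::numeric_limits<std::size_t>::max();
  const std::size_t nthreads = 2, nnodes = 4;
  std::vector<std::size_t> flat(nthreads * nnodes, kUnset);
  flat[0 * nnodes + 3] = 0;                   // flat[tid * nnodes + nid]
  const std::size_t idx2 = flat[0 * nnodes + 3];
  if (idx2 == kUnset) {
    throw std::runtime_error("histogram not allocated for this {tid, nid}");
  }

  return static_cast<int>(idx + idx2);
}
```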
Will look into this once we branch out 1.0. Thanks for your patience.
@trivialfis, let me understand: do we plan to include this change in the 1.0 version, as was originally discussed in #5008 (comment) with @hcho3?
I don't plan to add major changes at the last minute. It's fine, as we used to have a release every 2 or 3 months, and I would like to resume that pace once this 1.0 work is over. So your changes should be available shortly, even if not in the next release. Besides, there's a nightly build available for download. Will merge once we have a release branch.
@trivialfis I agree that, in general, we should not merge a major change right before a release. However, since you and I have approved the original version of this PR, can we merge this? The new version is only a little different from the original, and the difference is confined to a small part of the codebase. #5244 will have to wait, on the other hand.
Em... Currently I'm on holiday and only have a laptop available, so I can't run any meaningful tests. If you are confident, then I will let you decide.
@hcho3 @trivialfis, if you need any specific testing, such as running particular benchmarks or workloads, to gain more confidence, I'm happy to help here. I would be grateful for the chance to have this in the next XGB release.
@SmirnovEgorRu Basically memory usage, computation time, and accuracy (AUC, RMSE metrics) for representative datasets (like Higgs for dense wide columns, URL for sparse), with a restricted number of threads, max_depth, num_boosted_rounds, etc., and maybe setting the CPU affinity env for OMP manually (not necessary, but sometimes fun to see the difference). I usually do this myself so I can have a consistent environment for each run. For example, the numbers posted earlier seem to have been produced on different environments or with different parameters.
It would be great if you could run the Higgs and Airline datasets.
@trivialfis I am confident about this PR. I'm inclined to merge, as long as @SmirnovEgorRu runs some more benchmarks as requested.
@hcho3 Got it. My concerns are mostly around the consistency of the posted numbers across different PRs. As noted above, it would be nice to have benchmarks performed on the same platform with fixed parameters. The variance can be difficult to control.
@hcho3 @trivialfis, I prepared measurements on Higgs and Airline.
Higgs:
Log-loss in all cases (this PR / master):
(For this table, nthreads is fixed to 48.)
Airline + one-hot encoding:
Log-loss in all cases (this PR / master):
(For this table, nthreads is fixed to 48.)
HW: AWS c5.metal, CLX 8275 @ 3.0 GHz, 24 cores per socket, 2 sockets, HT on, 96 threads in total.
P.S. @trivialfis, yes, you're right, I used different HW parameters for the URL measurements, simply due to HW unavailability at the time. I will try to measure this on the 8275 too.
@SmirnovEgorRu For the niter table, what is the number of threads you used? And are all the numbers end-to-end time?
@hcho3, for the niter table I used 48 threads (to utilize only HW cores and not use HT).
And niter=1000 for the nthread table?
@hcho3, yes, I just used the default parameters in the benchmarks.
Thanks for the clarification.
For the whole URL data set, on the same c5.metal AWS instance, I obtained:
I use the following line to fit URL:
```python
output = xgb.train(params={'max_depth': 6, 'verbosity': 3, 'tree_method': 'hist'},
                   dtrain=dtrain, num_boost_round=10)
```
The accuracy metrics are the same as well. I hope this data is what you requested. Is it right?
@SmirnovEgorRu Yes, thanks for running the benchmarks.
Optimizations for histogram building; a part of issue #5104.
The PR contains changes from #5138 and will be rebased after #5138 is merged.