
[ENV] Enable MXNET_EXEC_NUM_TEMP to control space replicas, update gu… #204

Merged (1 commit, Oct 4, 2015)
doc/env_var.md (16 additions, 0 deletions)
@@ -16,6 +16,10 @@ Usually you do not need to change these settings, but they are listed here for reference.
 * MXNET_EXEC_MATCH_RANGE (default=10)
   - The rough matching scale in the symbolic execution memory allocator.
   - Set this to 0 to disable memory sharing between graph nodes (for debugging purposes).
+* MXNET_EXEC_NUM_TEMP (default=4)
+  - Maximum number of temp workspaces that can be allocated to each device.
+  - Setting this to a small number can save GPU memory.
+  - It is also likely to decrease the level of parallelism, which is usually OK.
 * MXNET_ENGINE_TYPE (default=ThreadedEnginePerDevice)
   - The type of underlying execution engine of MXNet.
   - List of choices
@@ -27,3 +31,15 @@ Usually you do not need to change these settings, but they are listed here for reference.
 * MXNET_KVSTORE_BIGARRAY_BOUND (default=1e6)
   - The minimum size of a "big array".
   - When the array size is bigger than this threshold, MXNET_KVSTORE_REDUCTION_NTHREADS threads will be used for reduction.
+
+Settings for Minimum Memory Usage
+---------------------------------
+- Make sure ```min(MXNET_EXEC_NUM_TEMP, MXNET_GPU_WORKER_NTHREADS) = 1```.
+- The default setting satisfies this.
+
+Settings for More GPU Parallelism
+---------------------------------
+- Set ```MXNET_GPU_WORKER_NTHREADS``` to a larger number (e.g. 2).
+- You may want to set ```MXNET_EXEC_NUM_TEMP``` to a smaller value to reduce memory usage.
+- This may not speed things up, as the GPU can already be fully occupied with serialized jobs.
+
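To make the min() rule above concrete, here is a minimal stand-alone sketch. It is not MXNet code: std::getenv stands in for dmlc::GetEnv so the snippet compiles on its own, and EnvOrDefault is a hypothetical helper.

```cpp
// Sketch of how the two settings combine into the number of temp
// workspace replicas per device (assumption: min() of the two env vars,
// mirroring the rule stated in the doc above).
#include <algorithm>
#include <cstdlib>
#include <iostream>

int EnvOrDefault(const char* key, int default_value) {
  const char* val = std::getenv(key);
  return val != nullptr ? std::atoi(val) : default_value;
}

int main() {
  int num_temp   = EnvOrDefault("MXNET_EXEC_NUM_TEMP", 4);
  int num_worker = EnvOrDefault("MXNET_GPU_WORKER_NTHREADS", 1);
  // The effective count is the smaller of the two, so the defaults (4, 1)
  // already satisfy the minimum-memory rule min(...) = 1.
  std::cout << "temp workspaces per device: "
            << std::min(num_temp, num_worker) << std::endl;
  return 0;
}
```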
src/common/utils.h (9 additions, 0 deletions)

@@ -12,6 +12,7 @@
 #include <utility>
 #include <random>
 #include <thread>
+#include <algorithm>
 #endif  // DMLC_USE_CXX11

 #include <dmlc/logging.h>
@@ -27,6 +28,14 @@ inline int GetNumThreadPerGPU() {
   return dmlc::GetEnv("MXNET_GPU_WORKER_NTHREADS", 1);
 }

+// Heuristic to get the number of matching colors.
+// This decides how much parallelism we can get on each GPU.
+inline int GetExecNumMatchColor() {
+  // This is the resource-efficient option.
+  int num_match_color = dmlc::GetEnv("MXNET_EXEC_NUM_TEMP", 4);
+  return std::min(num_match_color, GetNumThreadPerGPU());
+}
+
 /*!
  * \brief Random Engine
  */
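For illustration, a quick check of the new heuristic under different settings. This is a sketch that assumes the header above is on the include path, that the helpers live in namespace mxnet::common like the rest of src/common, and that POSIX setenv is available:

```cpp
#include <cstdlib>         // setenv (POSIX)
#include <iostream>
#include "common/utils.h"  // the header changed in this PR (path assumed)

int main() {
  // Defaults: MXNET_EXEC_NUM_TEMP=4, MXNET_GPU_WORKER_NTHREADS=1 -> min is 1.
  std::cout << mxnet::common::GetExecNumMatchColor() << std::endl;  // prints 1

  // More worker threads raise the cap: min(8, 2) = 2 temp workspaces.
  setenv("MXNET_EXEC_NUM_TEMP", "8", 1);
  setenv("MXNET_GPU_WORKER_NTHREADS", "2", 1);
  std::cout << mxnet::common::GetExecNumMatchColor() << std::endl;  // prints 2
  return 0;
}
```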
src/symbol/graph_executor.cc (2 additions, 5 deletions)
@@ -456,8 +456,6 @@ void GraphExecutor::InitDataEntryMemory() {
 }

 void GraphExecutor::InitResources() {
-  // maximum amount of color allowed in coloring algorithm
-  const uint32_t kMaxNumColor = 8;
   // prepare for temp space allocation
   std::vector<uint32_t> req_temp_cnt(topo_order_.size(), 0);
   for (size_t i = 0; i < topo_order_.size(); ++i) {
@@ -471,9 +469,8 @@
     CHECK_LE(cnt, 1) << "Node can only have one temp space request";
     req_temp_cnt[nid] = cnt;
   }
-  // restrict allocation to maximum number of parallelism per device
-  uint32_t num_color = std::min(static_cast<uint32_t>(common::GetNumThreadPerGPU()),
-                                kMaxNumColor);
+
+  uint32_t num_color = static_cast<uint32_t>(common::GetExecNumMatchColor());
   std::vector<uint32_t> req_temp_color;
   // use graph coloring to find nodes that won't run in parallel
   num_color = graph::ColorNodeGroup(graph_, topo_order_, req_temp_cnt,
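The coloring step above caps how many temp workspaces can coexist per device. Below is a toy sketch of that idea only; it is not graph::ColorNodeGroup, whose actual algorithm uses the graph structure to give nodes that cannot run in parallel the same color, and the function name is ours:

```cpp
// Toy round-robin coloring of temp-space requests: nodes sharing a color
// share one workspace, so at most num_color workspaces coexist per device.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> ToyColorTempRequests(
    const std::vector<uint32_t>& req_temp_cnt,  // 1 if the node wants temp space
    uint32_t num_color) {
  std::vector<uint32_t> color(req_temp_cnt.size(), 0);
  uint32_t next = 0;
  for (size_t nid = 0; nid < req_temp_cnt.size(); ++nid) {
    if (req_temp_cnt[nid] == 0) continue;  // no temp request, color unused
    color[nid] = next;                     // same color => shared workspace
    next = (next + 1) % num_color;
  }
  return color;
}
```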
src/symbol/graph_memory_allocator.h (1 addition, 3 deletions)

@@ -119,9 +119,7 @@ GraphStorageAllocator::GraphStorageAllocator(
   // if we set this to 1, it means no color-based matching.
   // color-based matching usually costs a bit more memory
   // but also enables more parallelization.
-  num_match_color_ = dmlc::GetEnv("MXNET_EXEC_MATCH_NUM_COLOR", 4);
-  num_match_color_ = std::min(static_cast<uint32_t>(common::GetNumThreadPerGPU()),
-                              num_match_color_);
+  num_match_color_ = static_cast<uint32_t>(common::GetExecNumMatchColor());
   this->InitColor(topo_order);
 }

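The comments above describe the trade-off behind color-based matching: one color means one shared pool (least memory, least parallelism), while more colors let differently-colored nodes keep separate storage and run concurrently. A toy sketch of that idea, with hypothetical names rather than the real GraphStorageAllocator interface:

```cpp
// Toy color-based storage matching: one free pool per color, so buffers are
// only reused within a color and differently-colored nodes never alias.
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

class ToyColorPools {
 public:
  explicit ToyColorPools(uint32_t num_color) : free_(num_color) {}

  // Get a buffer id of at least `size` bytes for a node of `color`,
  // reusing a free buffer from that color's pool when one fits.
  uint64_t Alloc(uint32_t color, size_t size) {
    auto& pool = free_[color];
    auto it = pool.lower_bound(size);  // smallest free buffer that fits
    if (it != pool.end()) {
      uint64_t id = it->second;
      pool.erase(it);
      return id;
    }
    return next_id_++;                 // otherwise "allocate" fresh storage
  }

  // Return a buffer to its color's pool for later reuse.
  void Release(uint32_t color, size_t size, uint64_t id) {
    free_[color].emplace(size, id);
  }

 private:
  std::vector<std::multimap<size_t, uint64_t>> free_;  // one pool per color
  uint64_t next_id_ = 0;
};
```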