
Lazily create summary writer for TF2 logger. #5631

Merged
merged 3 commits into ray-project:master on Sep 4, 2019

Conversation

llan-ml
Contributor

@llan-ml llan-ml commented Sep 4, 2019

Why are these changes needed?

For TensorFlow 2.0 with GPU support, creating a summary writer initializes the visible GPU devices and other GPU-related settings, and afterwards they can no longer be modified (see this link).

When subclassing Trainable, we may need to apply GPU-related settings in self._setup. However, doing so raises errors, because Trainable invokes self._setup after the logger has been created.
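The fix can be sketched as follows. This is illustrative, not the exact ray.tune source: the writer factory is injected so the pattern runs without TensorFlow installed, whereas the real logger would call tf.summary.create_file_writer.

```python
class LazyTF2Logger:
    """Sketch of a TF2 logger that defers summary-writer creation."""

    def __init__(self, logdir, writer_factory):
        self.logdir = logdir
        self._writer_factory = writer_factory
        # Defer writer creation: creating it here would initialize the
        # GPUs before Trainable._setup has a chance to configure them.
        self._file_writer = None

    def on_result(self, result):
        if self._file_writer is None:
            # The first result arrives after _setup, so GPU settings
            # (e.g. memory growth) are already in place by now.
            self._file_writer = self._writer_factory(self.logdir)
        self._file_writer.record(result)

    def flush(self):
        if self._file_writer is not None:
            self._file_writer.flush()
```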

Related issue number

Checks

@pcmoritz
Contributor

pcmoritz commented Sep 4, 2019

Thanks, I rebased it! Can you have a look?

@@ -181,10 +180,12 @@ def on_result(self, result):
         self._file_writer.flush()

     def flush(self):
-        self._file_writer.flush()
+        if hasattr(self, "_file_writer"):
+            self._file_writer.flush()
Contributor

can we not use hasattr and instead set a class variable _file_writer = None?
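The reviewer's suggestion can be sketched as follows (illustrative names, not the exact ray.tune source):

```python
class TF2Logger:
    # Class-level default: every instance starts without a writer, so
    # callers can test "is None" instead of using hasattr.
    _file_writer = None

    def flush(self):
        if self._file_writer is not None:
            self._file_writer.flush()
```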

Contributor Author

updated

Contributor

thanks.

@richardliaw
Contributor

richardliaw commented Sep 4, 2019

What happens if you just do this in _init?

with tf.device("/cpu:0"):
    self._file_writer = tf.summary.create_file_writer(self.logdir)

@llan-ml
Contributor Author

llan-ml commented Sep 4, 2019

I'm not sure what you mean. The logger is created in Trainable.__init__. Why would we create a writer in _setup?

@richardliaw
Contributor

Sorry, I meant _init (edited above).

@llan-ml
Contributor Author

llan-ml commented Sep 4, 2019

Actually, the operations in create_file_writer are already wrapped in a tf.device("/cpu:0") context (see this link).

The problem is that any TF operation triggers GPU device initialization, as shown below:

In [1]: import tensorflow as tf

In [2]: with tf.device("/cpu:0"):
   ...:     writer = tf.summary.create_file_writer("XXX")
   ...:
2019-09-04 15:57:20.280618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-09-04 15:57:20.374666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2019-09-04 15:57:20.375411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2019-09-04 15:57:20.375693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-04 15:57:20.377132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-04 15:57:20.378458: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-09-04 15:57:20.378817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-09-04 15:57:20.380607: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-09-04 15:57:20.381984: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-09-04 15:57:20.386334: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-04 15:57:20.389086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2019-09-04 15:57:20.389989: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-09-04 15:57:20.431708: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2019-09-04 15:57:20.438480: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5580a65aba00 executing computations on platform Host. Devices:
2019-09-04 15:57:20.438537: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-09-04 15:57:21.367996: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5580a660e080 executing computations on platform CUDA. Devices:
2019-09-04 15:57:21.368060: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-09-04 15:57:21.368082: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-09-04 15:57:21.374333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2019-09-04 15:57:21.379041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2019-09-04 15:57:21.379117: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-04 15:57:21.379148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-04 15:57:21.379176: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-09-04 15:57:21.379204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-09-04 15:57:21.379231: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-09-04 15:57:21.379257: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-09-04 15:57:21.379285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-04 15:57:21.384333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2019-09-04 15:57:21.384407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-04 15:57:21.388546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-04 15:57:21.388575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1
2019-09-04 15:57:21.388589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N Y
2019-09-04 15:57:21.388602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   Y N
2019-09-04 15:57:21.392536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14926 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2019-09-04 15:57:21.394271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14926 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:af:00.0, compute capability: 7.0)

Once initialized, the physical devices can no longer be reconfigured. For example,

In [3]: gpus = tf.config.experimental.list_physical_devices('GPU')

In [4]: gpus
Out[4]:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

In [5]: for gpu in gpus:
   ...:     tf.config.experimental.set_memory_growth(gpu, True)

will raise RuntimeError: Physical devices cannot be modified after being initialized.
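The required ordering can be sketched as follows: apply GPU configuration before the first TF operation touches the devices, and only then create the writer. This is a hedged sketch (the import is guarded so the ordering stays readable even without TensorFlow installed; the logdir path is illustrative):

```python
try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow not installed; the sketch below is skipped.

writer = None
if tf is not None:
    # 1) Configure physical devices first (e.g. at the top of
    #    Trainable._setup), before any op initializes them.
    for gpu in tf.config.experimental.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)

    # 2) Only then create the summary writer; reversing the order
    #    raises "RuntimeError: Physical devices cannot be modified
    #    after being initialized", as shown above.
    writer = tf.summary.create_file_writer("/tmp/tb_logs")
```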

@richardliaw
Contributor

richardliaw commented Sep 4, 2019

that's terrible... OK; do you know if this happens with tensorboardx too?

(will merge when tests pass)

@richardliaw
Contributor

as always, @llan-ml thank you very much for the contribution!

@llan-ml
Contributor Author

llan-ml commented Sep 4, 2019

I'm not familiar with tensorboardX, but the issue is caused by the design of the TF 2.0 summary writer; also, TF 2.0 does not expose low-level APIs like tf.Summary.Value for writing summaries. So I guess tensorboardX does not suffer from this.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16775/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16773/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16776/

@pcmoritz pcmoritz merged commit 3ea9062 into ray-project:master Sep 4, 2019