
Lazily create summary writer for TF2 logger. #5631

Merged
merged 3 commits into ray-project:master on Sep 4, 2019

Conversation

llan-ml
Contributor

@llan-ml llan-ml commented Sep 4, 2019

Why are these changes needed?

For TensorFlow 2.0 with GPU support, creating a summary writer initializes the visible GPU devices and other GPU-related settings, and afterwards they can no longer be modified (see this link).

When subclassing Trainable, we may need to apply GPU-related settings in self._setup. However, doing so raises errors, because Trainable invokes self._setup after the logger has been created.
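The fix can be sketched as follows. This is illustrative, not the exact ray.tune source: the writer factory is injected so the pattern runs without TensorFlow installed, whereas the real logger would call tf.summary.create_file_writer.

```python
class LazyTF2Logger:
    """Sketch of a TF2 logger that defers summary-writer creation."""

    def __init__(self, logdir, writer_factory):
        self.logdir = logdir
        self._writer_factory = writer_factory
        # Defer writer creation: creating it here would initialize the
        # GPUs before Trainable._setup has a chance to configure them.
        self._file_writer = None

    def on_result(self, result):
        if self._file_writer is None:
            # The first result arrives after _setup, so GPU settings
            # (e.g. memory growth) are already in place by now.
            self._file_writer = self._writer_factory(self.logdir)
        self._file_writer.record(result)

    def flush(self):
        if self._file_writer is not None:
            self._file_writer.flush()
```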

Related issue number

Checks

@pcmoritz
Contributor

pcmoritz commented Sep 4, 2019

Thanks, I rebased it! Can you have a look?

@@ -181,10 +180,12 @@ def on_result(self, result):
         self._file_writer.flush()

     def flush(self):
-        self._file_writer.flush()
+        if hasattr(self, "_file_writer"):
+            self._file_writer.flush()
Contributor

can we not use hasattr and instead set a class variable _file_writer = None?
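The reviewer's suggestion can be sketched as follows (illustrative names, not the exact ray.tune source):

```python
class TF2Logger:
    # Class-level default: every instance starts without a writer, so
    # callers can test "is None" instead of using hasattr.
    _file_writer = None

    def flush(self):
        if self._file_writer is not None:
            self._file_writer.flush()
```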

Contributor Author

updated

Contributor

thanks.

@richardliaw
Contributor

richardliaw commented Sep 4, 2019

What happens if you just do this in _init?

with tf.device("/cpu:0"):
    self._file_writer = tf.summary.create_file_writer(self.logdir)

@llan-ml
Contributor Author

llan-ml commented Sep 4, 2019

I'm not sure what you mean. The logger is created in Trainable.__init__. Why would we create a writer in _setup?

@richardliaw
Contributor

Sorry, I meant _init (edited above).

@llan-ml
Contributor Author

llan-ml commented Sep 4, 2019

Actually, the operations in create_file_writer are already wrapped in a tf.device("/cpu:0") context (see this link).

The problem is that any TF operation triggers GPU device initialization, as shown below:

In [1]: import tensorflow as tf

In [2]: with tf.device("/cpu:0"):
   ...:     writer = tf.summary.create_file_writer("XXX")
   ...:
2019-09-04 15:57:20.280618: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-09-04 15:57:20.374666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2019-09-04 15:57:20.375411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2019-09-04 15:57:20.375693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-04 15:57:20.377132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-04 15:57:20.378458: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-09-04 15:57:20.378817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-09-04 15:57:20.380607: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-09-04 15:57:20.381984: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-09-04 15:57:20.386334: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-04 15:57:20.389086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2019-09-04 15:57:20.389989: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-09-04 15:57:20.431708: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2019-09-04 15:57:20.438480: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5580a65aba00 executing computations on platform Host. Devices:
2019-09-04 15:57:20.438537: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-09-04 15:57:21.367996: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5580a660e080 executing computations on platform CUDA. Devices:
2019-09-04 15:57:21.368060: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-09-04 15:57:21.368082: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-09-04 15:57:21.374333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2019-09-04 15:57:21.379041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2019-09-04 15:57:21.379117: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-04 15:57:21.379148: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-04 15:57:21.379176: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-09-04 15:57:21.379204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-09-04 15:57:21.379231: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-09-04 15:57:21.379257: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-09-04 15:57:21.379285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-04 15:57:21.384333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2019-09-04 15:57:21.384407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-09-04 15:57:21.388546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-04 15:57:21.388575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1
2019-09-04 15:57:21.388589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N Y
2019-09-04 15:57:21.388602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   Y N
2019-09-04 15:57:21.392536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14926 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2019-09-04 15:57:21.394271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14926 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:af:00.0, compute capability: 7.0)

Once initialized, the physical devices can no longer be reconfigured. For example,

In [3]: gpus = tf.config.experimental.list_physical_devices('GPU')

In [4]: gpus
Out[4]:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

In [5]: for gpu in gpus:
   ...:     tf.config.experimental.set_memory_growth(gpu, True)

will raise RuntimeError: Physical devices cannot be modified after being initialized.
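The required ordering can be sketched as follows: apply GPU configuration before the first TF operation touches the devices, and only then create the writer. This is a hedged sketch (the import is guarded so the ordering stays readable even without TensorFlow installed; the logdir path is illustrative):

```python
try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow not installed; the sketch below is skipped.

writer = None
if tf is not None:
    # 1) Configure physical devices first (e.g. at the top of
    #    Trainable._setup), before any op initializes them.
    for gpu in tf.config.experimental.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)

    # 2) Only then create the summary writer; reversing the order
    #    raises "RuntimeError: Physical devices cannot be modified
    #    after being initialized", as shown above.
    writer = tf.summary.create_file_writer("/tmp/tb_logs")
```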

@richardliaw
Contributor

richardliaw commented Sep 4, 2019

that's terrible... OK; do you know if this happens with tensorboardx too?

(will merge when tests pass)

@richardliaw
Contributor

as always, @llan-ml thank you very much for the contribution!

@llan-ml
Contributor Author

llan-ml commented Sep 4, 2019

I'm not familiar with tensorboardX, but the issue is caused by the design of the TF 2.0 summary writer; also, TF 2.0 does not expose low-level APIs like tf.Summary.Value for writing summaries. So I guess tensorboardX does not suffer from this.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16775/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16773/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16776/

@pcmoritz pcmoritz merged commit 3ea9062 into ray-project:master Sep 4, 2019