Pluggable device for TensorFlow

Status: Proposed
RFC #: 262
Author(s): Zhoulong Jiang ([email protected]), Yiqiang Li ([email protected]), Eric Lin ([email protected]), Jianhui Li ([email protected])
Sponsor: Anna Revinskaya ([email protected])
Updated: 2020-08-13

Objective

Implement a pluggable device mechanism that allows existing TensorFlow programs to run on a new device without requiring users to change their code. Users only need to install the plugin in a specified directory, and the mechanism discovers and plugs in the capabilities offered by the plugin.

This RFC is based on the Modular TensorFlow RFC, which aims at extending the TensorFlow design to plug in capabilities such as support for new devices. The modular device interface is based on the StreamExecutor C API RFC.

Motivation

When extending TensorFlow to support a new device, one needs to modify TensorFlow code and maintain a special TensorFlow build for the new device. The Modular TensorFlow RFC designs a plugin architecture for several TensorFlow components (networking, filesystems, kernels, graphs, and accelerator backends). This RFC describes the accelerator backend module on the TensorFlow proper side, introducing a pluggable device into the TensorFlow device class hierarchy.

The pluggable device discovery and initialization is transparent to end users. As long as a device plugin library follows the design described in this RFC, it can be plugged into TensorFlow proper and enable TensorFlow to run existing TensorFlow programs on the new device.

User Benefit

This RFC allows TensorFlow to transparently run TensorFlow programs on new devices, as long as users set up the system properly by installing the device plugin.

Design Proposal

Design Overview

This RFC extends the TensorFlow device class hierarchy with a standardized pluggable device named PluggableDevice, which is built on top of StreamExecutor. Third-party devices that want to integrate with the current TensorFlow stack only need to implement the StreamExecutor C API (shown in Diagram 1).

  • PluggableDevice is defined in TensorFlow proper and inherits from LocalDevice. It is built on top of the StreamExecutor C++ interface and manages PluggableDevice’s key abstractions such as StreamExecutor, streams, memory, and events.

  • PluggableDeviceExecutor implements StreamExecutor and is built on top of the StreamExecutor C API (addressed in a separate RFC).

  • The PluggableDeviceExecutor implementation lives inside the TensorFlow plugin and provides the implementations of the C functions defined in the StreamExecutor C API.

The pluggable device mechanism consists of a device discovery process and a device creation process. The discovery process loads the plugin and registers its device type as well as a DeviceFactory. The creation process creates a PluggableDevice object and finds or creates a PluggableDeviceExecutor object for each pluggable device.

With this RFC, existing TensorFlow GPU programs can run on a plugged device without users changing their code. Diagram 2 describes the workflow of TensorFlow with a device plugin and shows how a simple GPU program runs on the pluggable device.

Supported user scenarios of PluggableDevice

This section describes the user scenarios that are supported/unsupported for PluggableDevice.

  • Supported scenario: Single PluggableDevice registered as the "GPU" device type
    In the case of installing one plugin that registers its PluggableDevice as the "GPU" device type, the default GPUDevice is overridden by PluggableDevice when the plugin is loaded. When the user specifies the "GPU" device for ops with tf.device("gpu:0"), the registered PluggableDevice is selected to run those ops.
  • Supported scenario: Single PluggableDevice registered as a new device type.
    In the case of installing one plugin that registers its PluggableDevice as a new device type, e.g., the "XPU" device type, the user can specify the "XPU" device for ops with tf.device("xpu:0"), and the registered PluggableDevice is selected to run those ops.
  • Supported scenario: Multiple PluggableDevices registered as different device types.
    In the case of installing multiple plugins that register their PluggableDevices as different device types, e.g., one registered as the "GPU" device type and another registered as the "XPU" device type, all of these PluggableDevices can be registered successfully, and users can specify a registered device type to run ops on the corresponding hardware. As in scenario 1, when one PluggableDevice is registered as the "GPU" device type, the default GPUDevice is overridden.
  • Unsupported scenario: Multiple PluggableDevices registered as the same device type.
    In the case of installing multiple plugins that register their PluggableDevices as the same device type, e.g., more than one plugin registering its PluggableDevice as the "GPU" device type, those plugins' initialization fails due to the registration conflict. Users need to manually select which platform they want to use (either unload the conflicting plugin or reconfigure the plugins with a Python API).

Front-end device string

This section describes the device string for python users, pointing at previous user scenarios.

  • Device type
    Device type is user visible. Users can specify the device type for ops, e.g., "gpu", "xpu", "cpu".

    >> with tf.device("/gpu:0"):
       ...
    >> with tf.device("/xpu:0"):
       ...
    
  • Subdevice type
    Subdevice type is user visible and aims at differentiating GPU devices from different hardware vendors. Users can query the subdevice type, e.g., to learn whether the GPU device is "NVIDIA_GPU", "INTEL_GPU", or "AMD_GPU", by invoking list_physical_devices.

    This RFC proposes that the plugged GPU device overrides the default GPU device. When users invoke list_physical_devices("GPU"), they only see the plugged GPU device. If users want to use the default GPU device (CUDA), they need to manually uninstall the plugin.

    For example, with two GPUs in the same system, e.g., an NVIDIA GPU plus an X GPU, and the X GPU plugin installed, only the X GPU is visible and usable. If users want to use the NVIDIA GPU, they need to uninstall the X GPU plugin.

    >> gpu_device = tf.config.experimental.list_physical_devices("GPU")
    >> print(gpu_device)
    [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU', subdevice_type='X_GPU')]
    >> with tf.device("/gpu:0"):
    >>   ...  # place ops on PluggableDevice (X GPU)
    
  • Physical device name
    Physical device name is user visible. Users can query the physical device name (e.g., "Titan V") for a specified device instance through tf.config.experimental.get_device_details().

    >> gpu_device = tf.config.experimental.list_physical_devices("GPU")
    >> if gpu_device:
    >>    details = tf.config.experimental.get_device_details(gpu_device[0])
    >>    print(details.get("device_name"))
    "TITAN_V, XXX"
    

Device Discovery

Upon initialization, TensorFlow uses the platform-independent LoadLibrary() to load the dynamic plugin library. The plugin library should be installed in the default plugin directory "…python_dir.../site-packages/tensorflow-plugins". The Modular TensorFlow RFC describes the process of loading plugins.
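
A minimal sketch of this discovery step is shown below, assuming a POSIX dynamic loader; TensorFlow proper actually goes through its own platform-independent LoadLibrary()/symbol-lookup wrappers, and the LoadPluginsFrom helper here is hypothetical.

// Illustrative only: scan the plugin directory, load each shared library,
// and look up the SE_InitializePlugin entry point it must export.
#include <dirent.h>
#include <dlfcn.h>
#include <string>

struct SE_PlatformRegistrationParams;  // from the StreamExecutor C API
struct TF_Status;                      // from the TensorFlow C API
using SE_InitPluginFn = void (*)(SE_PlatformRegistrationParams*, TF_Status*);

void LoadPluginsFrom(const std::string& plugin_dir) {
  DIR* dir = opendir(plugin_dir.c_str());
  if (dir == nullptr) return;  // no plugin directory, nothing to do
  while (dirent* entry = readdir(dir)) {
    std::string file = entry->d_name;
    if (file.size() < 3 || file.compare(file.size() - 3, 3, ".so") != 0)
      continue;  // only consider shared libraries
    void* lib = dlopen((plugin_dir + "/" + file).c_str(), RTLD_NOW | RTLD_LOCAL);
    if (lib == nullptr) continue;  // skip libraries that fail to load
    auto init = reinterpret_cast<SE_InitPluginFn>(dlsym(lib, "SE_InitializePlugin"));
    if (init != nullptr) {
      // TensorFlow proper fills an SE_PlatformRegistrationParams and calls
      // init(...) so the plugin can register its platform (see below).
    }
  }
  closedir(dir);
}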

During plugin library initialization, TensorFlow proper calls the SE_InitializePlugin API (part of the StreamExecutor C API) to retrieve the necessary information from the plugin to instantiate a StreamExecutor platform (the se::Platform class) and registers the platform with the global object se::MultiPlatformManager. TensorFlow proper gets a device type and a subdevice type from the plugin through SE_InitializePlugin and then registers the PluggableDeviceFactory with the registered device type. The device type string is used to access the PluggableDevice with tf.device() in the Python layer. The subdevice type is used for low-level specialization of the GPU device (kernels, StreamExecutor, common runtime, grappler, placer, ...). If users care whether they are running on an NVIDIA GPU or an X GPU (Intel GPU, AMD GPU, ...), they can call a Python API (such as tf.config.list_physical_devices) to get the subdevice type for identification. Users can also call tf.config.get_device_details to get the real device name (e.g., "TITAN V") for a specified device.
Plugin authors need to implement SE_InitializePlugin and provide the necessary information:

void SE_InitializePlugin(SE_PlatformRegistrationParams* params, TF_Status* status) {
  int32_t visible_device_count = get_plugin_device_count();

  // static so the c_str() pointers remain valid after this function returns
  static const std::string name = "X_GPU"; // StreamExecutor platform name and subdevice type
  static const std::string type = "GPU";   // device type

  params->params.id = plugin_id_value;
  params->params.visible_device_count = visible_device_count;
  params->params.create_device = create_device;
  params->params.destroy_device = destroy_device;
  params->params.create_stream_executor = create_stream_executor;
  params->params.destroy_stream_executor = destroy_stream_executor;
  params->params.name = name.c_str();
  params->params.name_len = name.size();
  params->params.type = type.c_str();
  params->params.type_len = type.size();
}
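
On the TensorFlow proper side, the filled-in registration params might then be consumed roughly as follows. This is a hedged sketch: the PluggableDevicePlatform constructor arguments and the kPluggableDevicePriority constant are illustrative, not the actual implementation.

// Sketch of proper-side handling after SE_InitializePlugin returns;
// `params` is the SE_PlatformRegistrationParams filled in by the plugin.
std::string platform_name(params.params.name, params.params.name_len); // "X_GPU"
std::string device_type(params.params.type, params.params.type_len);   // "GPU"

// Wrap the plugin's callbacks in a StreamExecutor platform and make it
// discoverable by its platform name / subdevice type.
se::MultiPlatformManager::RegisterPlatform(
    absl::make_unique<PluggableDevicePlatform>(platform_name, params.params));

// Register a device factory so sessions can create devices of this type.
// kPluggableDevicePriority just needs to be higher than the default GPU
// factory's priority so that the plugin wins the "GPU" registration.
DeviceFactory::Register(device_type, new PluggableDeviceFactory(platform_name),
                        kPluggableDevicePriority);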

ListPhysicalDevices encodes the subdevice type string into the physical device name:

Status PluggableDeviceFactory::ListPhysicalDevices(std::vector<string>* devices) {
  se::Platform* platform = se::MultiPlatformManager::PlatformWithName(sub_device_type_);
  for (int i = 0; i < platform->VisibleDeviceCount(); i++) {
    const string device_name = strings::StrCat("/physical_device:", device_type_, "/", sub_device_type_, ":", i);
    devices->push_back(device_name);
  }
  return Status::OK();
}
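
With device_type_ set to "GPU" and sub_device_type_ set to "X_GPU", for example, this yields device strings of the form "/physical_device:GPU/X_GPU:0".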

GetDeviceDetails retrieves the physical device name of the hardware from the plugin:

Status PluggableDeviceFactory::GetDeviceDetails(int device_index, std::unordered_map<string, string>* details) {
  se::Platform* platform = se::MultiPlatformManager::PlatformWithName(sub_device_type_);
  auto desc = platform->DescriptionForDevice(device_index).ConsumeValueOrDie();
  (*details)["device_name"] = desc->name(); // e.g., "Titan V"
  ...
}

Device Creation

PluggableDeviceFactory is introduced to create PluggableDevices, following the LocalDevice design pattern. To support existing GPU programs running on a new device without users changing their code, plugin authors can register the "GPU" device type through SE_InitializePlugin, and TensorFlow proper then registers the PluggableDeviceFactory for the "GPU" type with a higher priority than the default GPU device.
Plugin:

void SE_InitializePlugin(SE_PlatformRegistrationParams* params, TF_Status* status) {
  ...
  std::string type = "GPU";
  params->params.type = type.c_str();
  ...
}

Proper:

  std::string platform_name_str(params.params.name, params.params.name_len);
  std::string type_str(params.params.type, params.params.type_len);
  DeviceFactory::Register(type_str, new PluggableDeviceFactory(platform_name_str), priority); 

For vendors who don't want to use the "GPU" type, registering a new device type is optional.
Limitation: when multiple plugins are installed, their registered device types should be different; otherwise the registrations conflict and device registration fails. TensorFlow proper can provide a Python API to let users select a plugin by specifying a higher priority.
For example:

  tf.load_plugin_with_highest_priority(path_to_plugin_lib)
  with tf.device("/GPU:0"):
    ...

When a session is created, PluggableDeviceFactory creates a PluggableDevice object for the plugged device. During the initialization of the PluggableDevice, the global object se::MultiPlatformManager finds the se::Platform through the platform name / subdevice type registered by the plugin. The StreamExecutor platform (se::Platform) then creates or finds a StreamExecutor object containing a PluggableDeviceExecutor, as well as multiple stream objects (a compute stream and several memory copy streams) supporting the StreamExecutor object.

The pseudocode below introduces the extensions inside TensorFlow proper for pluggable device creation. The implementation is based on the StreamExecutor C API RFC.

  1. PluggableDeviceFactory creates and initializes a set of PluggableDevice instances when the session is created.
   Status PluggableDeviceFactory::CreateDevices(const SessionOptions& options, const string& name_prefix, std::vector<std::unique_ptr<Device>>* devices) {
     for (int i = 0; i < options.device_count(); i++) {
       auto pluggable_device = CreatePluggableDevice(options, i); // sets the allocator
       pluggable_device->Init(options, pluggable_device_platform_name_);
       devices->push_back(std::move(pluggable_device));
     }
     return Status::OK();
   }
  2. The PluggableDevice object binds a StreamExecutor and creates a set of streams during its initialization. The streams include one compute stream and several memory copy streams.
   void PluggableDevice::Init(const SessionOptions& options, const string& platform_name) {
     se::Platform* platform = se::MultiPlatformManager::PlatformWithName(platform_name);
     stream_executor_ = platform->ExecutorForDevice(pluggable_dev_id_); // creates the StreamExecutor
     compute_stream_ = new se::Stream(stream_executor_);
     compute_stream_->Init();
     host_to_device_stream_ = new se::Stream(stream_executor_);
     host_to_device_stream_->Init();
     ...
   }
  3. PluggableDevicePlatform is responsible for StreamExecutor creation. It creates an SP_StreamExecutor and an SP_Device object through the create_stream_executor and create_device callbacks registered in SP_Platform, then constructs a PluggableDeviceExecutor from the SP_StreamExecutor and SP_Device objects (a forwarding sketch follows this list).
   StreamExecutor* PluggableDevicePlatform::ExecutorForDevice(int device_id) {
     auto config = get_plugin_config(device_id);
     SE_Options* se_options = get_se_option(device_id);
     TF_Status* status = TF_NewStatus();
     SP_StreamExecutor* se = new SP_StreamExecutor{ SP_STREAMEXECUTOR_STRUCT_SIZE };
     platform_->create_stream_executor(se, status); // create SP_StreamExecutor
     SP_Device* se_device = new SP_Device{ SP_DEVICE_STRUCT_SIZE };
     platform_->create_device(se_device, se_options, status); // create SP_Device
     auto executor = absl::make_unique<StreamExecutor>(
         this, absl::make_unique<PluggableDeviceExecutor>(config, se, se_device));
     return executor.release();
   }
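
To make the relationship between PluggableDeviceExecutor and the plugin's function table concrete, the sketch below shows how one StreamExecutor call might be forwarded. The method shown and its argument list are illustrative; the full method set is dictated by StreamExecutorInterface.

// Sketch only: forwards a device-memory allocation to the plugin's
// `allocate` callback installed in SP_StreamExecutor.
class PluggableDeviceExecutor : public internal::StreamExecutorInterface {
 public:
  PluggableDeviceExecutor(SP_StreamExecutor* executor, SP_Device* device)
      : executor_(executor), device_(device) {}

  DeviceMemoryBase Allocate(uint64_t size) {
    SP_DeviceMemoryBase mem;
    // The plugin decides how to allocate on its hardware; TensorFlow
    // proper only sees the opaque pointer and size handed back.
    executor_->allocate(device_, size, /*memory_space=*/0, &mem);
    return DeviceMemoryBase(mem.opaque, mem.size);
  }

 private:
  SP_StreamExecutor* executor_; // C function table filled by the plugin
  SP_Device* device_;           // opaque device handle from the plugin
};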

TensorFlow Proper

TensorFlow proper needs to be extended with a new class, PluggableDevice, representing a set of new third-party devices, and a new StreamExecutor platform (PluggableDevicePlatform) that creates the device and related resources from the information registered by the plugin.

Two sets of classes need to be defined in TensorFlow proper.

  • Set 1: PluggableDevice related classes
    • class PluggableDevice: represents a set of new third-party devices. Its device_type attribute describes what kind of device this is; it can be "GPU" or another device type string. It also has a subdevice_type attribute for low-level specialization of the GPU device, which becomes part of the kernel dispatch key to avoid conflicts with existing GPU (CUDA) kernels. The subdevice_type is also used to guard CUDA-specific logic in grappler and the common runtime when the device type is "GPU".
    • class PluggableDeviceFactory: a device factory that creates PluggableDevices.
    • class PluggableDeviceBFCAllocator: a PluggableDevice memory allocator that implements a ‘best fit with coalescing’ algorithm. It extends the BFC algorithm and is the counterpart of GPUBFCAllocator.
    • class PluggableDeviceMemAllocator: a suballocator for PluggableDevice memory.
    • class PluggableDeviceProcessState: manages per-process state when PluggableDevices are present.
    • class PluggableDeviceHostAllocator: an allocator for pinned CPU RAM that is made known to the PluggableDevice for the purpose of efficient DMA with the PluggableDevice.
    • class PluggableDeviceEventMgr: an object to keep track of pending Events in the StreamExecutor streams.
    • class PluggableDeviceContext: a wrapper of pluggable device specific context that can be passed to OpKernels.
  • Set 2: PluggableDevicePlatform related classes
    • class PluggableDevicePlatform: the PluggableDevice-specific platform. Its platform name is registered by the plugin through SE_InitializePlugin; the platform object contains a C struct, SP_Platform* platform_, which is its internal implementation and serves as the C interface registered by the device plugin.
    • class PluggableDeviceExecutor: the PluggableDevice implementation of the StreamExecutorInterface functionality. It contains two C structs, SP_StreamExecutor* executor_ and SP_Device* device_, which serve as the StreamExecutor C interface registered by the device plugin.
    • class PluggableDeviceStream: wraps a stream handle in order to satisfy the platform-independent StreamInterface. It holds an SP_Stream, which is treated as an opaque type by TensorFlow and whose structure is defined by the device plugin (a minimal sketch follows this list).
    • class PluggableDeviceTimer: wraps an opaque handle, SP_Timer, to satisfy the platform-independent TimerInterface.
    • class PluggableDeviceEvent: wraps an opaque handle, SP_Event, to satisfy the platform-independent EventInterface.
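
All of the Set 2 wrapper classes follow the same pattern: own the plugin-defined opaque handle and satisfy the corresponding platform-independent interface. Below is a minimal sketch for PluggableDeviceStream under that assumption; the real StreamInterface has more methods.

// Sketch: satisfies the platform-independent StreamInterface by owning
// the plugin's opaque SP_Stream handle.
class PluggableDeviceStream : public internal::StreamInterface {
 public:
  explicit PluggableDeviceStream(SP_Stream* stream) : stream_(stream) {}

  // TensorFlow proper never looks inside the handle; only the plugin
  // knows the layout of the struct behind SP_Stream.
  SP_Stream* Handle() const { return stream_; }

 private:
  SP_Stream* stream_; // opaque, defined by the device plugin
};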

TensorFlow Plugin

Plugin authors need to provide implementations of the C functions defined in the StreamExecutor C API.

  • SP_StreamExecutor is defined as a struct in the C API; both sides (TensorFlow proper and plugins) can access its members. TensorFlow proper creates an SP_StreamExecutor object and passes it to the plugin, and the plugin fills it with its C API implementations (a combined plugin-side sketch follows this list).
   void create_stream_executor(SP_StreamExecutor* se, TF_Status* status) {
     se->memcpy_from_host = my_device_memory_from_host_function;
     se->allocate = my_allocate_function;
     …
   } // fill in the StreamExecutor function table
  • SP_Device is defined as a struct in the C API; both sides (TensorFlow proper and plugins) can access its members. TensorFlow proper creates an SP_Device with a device ordinal, and the plugin fills in the corresponding opaque device handle and device name.
   void create_device(SP_Device* device, SE_Options* options, TF_Status* status) {
     device->device_handle = get_my_device_handle(options);
     ...
   }
  • SP_Stream is defined in the plugin and treated as an opaque struct in TensorFlow proper.
   void create_stream(SP_Device* device, SP_Stream* stream, TF_Status* status) {
     (*stream)->stream_handle = create_my_stream_handle(device);
     ...
   }
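
Putting the pieces together, the plugin-side callbacks referenced above might look like the following sketch. The exact callback signatures are owned by the StreamExecutor C API RFC, so treat them as illustrative; the vendor_* helpers stand in for hypothetical vendor runtime calls.

// Illustrative plugin-side implementations; signatures are simplified.
void my_allocate_function(const SP_Device* device, uint64_t size,
                          int64_t memory_space, SP_DeviceMemoryBase* mem) {
  // Allocate raw device memory through the vendor runtime (hypothetical).
  mem->opaque = vendor_malloc(device->device_handle, size);
  mem->size = size;
}

void my_device_memory_from_host_function(const SP_Device* device,
                                         SP_Stream* stream,
                                         SP_DeviceMemoryBase* device_dst,
                                         const void* host_src, uint64_t size,
                                         TF_Status* status) {
  // Enqueue an asynchronous host-to-device copy on the plugin's native stream.
  vendor_memcpy_h2d(stream->stream_handle, device_dst->opaque, host_src, size);
  TF_SetStatus(status, TF_OK, "");
}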

PluggableDevice kernel registration

This section shows an example of kernel registration for PluggableDevice. The kernel registration and implementation API is addressed in a separate RFC. To avoid kernel registration conflicts with existing GPU (CUDA) kernels, plugin authors need to provide a device type (such as "GPU") as well as a subdevice type (such as "X_GPU") to TensorFlow proper for kernel registration and dispatch. The device type indicates the device the kernel runs on; the subdevice type is for low-level specialization of the device.

void SE_InitializePlugin(SE_PlatformRegistrationParams* params, TF_Status* status) {
  ...
  std::string type = "GPU"; // front-end visible device type
  params->params.type = type.c_str();
  std::string name = "X_GPU"; // low-level specialization device type
  params->params.name = name.c_str();
  ...
}

void InitKernelPlugin() {
  TF_KernelBuilder* builder = TF_NewKernelBuilder(/*op_name*/"Convolution", "GPU", //"GPU" is device type
      "X_GPU", &Conv_Create, &Conv_Compute, &Conv_Delete); // "X_GPU" is sub device type
  TF_Status* status = TF_NewStatus();
  TF_RegisterKernelBuilder(/*kernel_name*/"Convolution", builder, status);
  if (TF_GetCode(status) != TF_OK) { /* handle errors */ }
  TF_DeleteStatus(status);
}
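
For completeness, the Conv_Create and Conv_Delete callbacks referenced in the registration above might be sketched as follows. The ConvState type is hypothetical and stands in for whatever per-kernel state a vendor keeps.

// Hypothetical kernel lifecycle callbacks matching the builder above.
struct ConvState {
  int filter_format; // example of per-kernel configuration (illustrative)
};

void* Conv_Create(TF_OpKernelConstruction* ctx) {
  auto* state = new ConvState;
  // Op attributes could be read from `ctx` via the kernel C API here.
  return state; // handed back to Conv_Compute / Conv_Delete as `kernel`
}

void Conv_Delete(void* kernel) {
  delete static_cast<ConvState*>(kernel);
}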

Using stream inside PluggableDevice kernel

The following code shows a convolution kernel implementation using the stream handle. The streams are created during pluggable device creation. The placer decides which device to use for each op in the graph; the streams associated with that device are then used to construct the OpKernelContext for the op computations during graph execution.

void Conv_Compute(void* kernel, TF_OpKernelContext* context) {
  TF_GetInput(context, input_index, &input, &status);
  TF_GetInput(context, filter_index, &filter, &status);
  auto output = TF_AllocateOutput(context, output_index, TF_FLOAT, dims, num_dims, len, status);
  SP_Stream* se_stream = TF_GetStream(context);
  auto native_stream = static_cast<native_stream_type>(se_stream->stream_handle);
  my_conv_impl(input, filter, output, native_stream);
}

The kernel and op registration and implementation API RFC needs to be extended to retrieve streams/device contexts from TF_OpKernelContext, in addition to inputs and outputs.

Alternatives Considered

  • Without this RFC, end users need to change their Python code to import the third-party device plugin.

  • Without this RFC, a third-party device vendor would have to implement the LocalDevice interface, which is not a C API interface and may run into C++ ABI incompatibility issues.

Performance Implications

  • We don’t expect a performance impact from this RFC. The functions described by this RFC are exercised only at the initialization stage.

Dependencies

  • This RFC doesn’t add new dependencies to external libraries.

  • It depends on three modular TensorFlow related RFCs:

    • Modular TensorFlow RFC

    • StreamExecutor C interface RFC

    • Kernel and op registration and implementation API RFC

Engineering Impact

  • The impact on binary size / startup time / build time / test times is minimal.

  • The TensorFlow team will maintain this code.

Platforms and Environments

  • The pluggable device mechanism is based on LoadLibrary() and so should work on all platforms supported by LoadLibrary. The other enhancements to TensorFlow proper are platform independent.

Best Practices

  • This RFC works with Modular TensorFlow, which will be the only way to integrate new third-party devices into the current TensorFlow stack.

Compatibility

The RFC strengthens the current TensorFlow ecosystem, as it supports plugging new devices into TensorFlow.

We don't expect this proposal to impact other parts of the TensorFlow ecosystem. It doesn't support TFLite. It should not impede distribution strategies and should not interact with tf.function and SavedModel.