Model Weight Refit #2900
Initiative
After a model is compiled to an FX graph module, we may still want to make modifications, such as updating the weights or even changing the original structure. We want to design a handy way for users to interact with and easily modify compiled graph modules.
Design Concept
We create a wrapper around the compiled module. Conceptually, this module would present just as an nn.Module to the user in all ways, except that when it is run it calls the compiled module under the hood. Users can also use inheritance and create custom subclasses that achieve custom nn.Module behaviors.
Example Usecase 1
Users can load a LoRA into the nn.Module of a Stable Diffusion pipeline (which will be in host memory), and that automatically triggers a refit; alternatively, users can initiate the refit manually.
Example Usecase 2
In model training, after a backward propagation the state_dict gets updated, and that automatically triggers a refit.
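A minimal sketch of that wrapper idea, assuming a hypothetical class name (RefittableModule) and using load_state_dict as the refit trigger; none of these names are settled API:

```python
import torch.nn as nn

class RefittableModule(nn.Module):
    """Hypothetical wrapper: looks like an nn.Module in every way, but
    forward() dispatches to the compiled TensorRT graph module."""

    def __init__(self, pytorch_module: nn.Module, compiled_module: nn.Module):
        super().__init__()
        self.pytorch_module = pytorch_module    # original weights, host memory
        self.compiled_module = compiled_module  # TRT-accelerated graph module

    def forward(self, *args, **kwargs):
        # Execution always runs through the compiled module under the hood
        return self.compiled_module(*args, **kwargs)

    def load_state_dict(self, state_dict, strict: bool = True):
        # Loading new weights (e.g. after merging a LoRA) triggers a refit
        result = self.pytorch_module.load_state_dict(state_dict, strict=strict)
        self.refit()
        return result

    def refit(self):
        # Placeholder: propagate self.pytorch_module's weights into the
        # TensorRT engines inside self.compiled_module (can also be
        # called manually by the user)
        ...
```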
Model Weight Refit
TL;DR
TensorRT supports updating engine weights after compilation via the nvinfer1::IRefitter class (available in both the C++ and Python APIs). This could be a beneficial feature to bring into Torch-TensorRT, specifically the FX path, since models which are pre-compiled and saved can easily be refitted with new training weights, so long as the model architecture is unchanged. This can save hours of compilation time and enables extensions such as LoRA and serving pre-compiled TensorRT engines from the cloud.
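For reference, a minimal sketch of the underlying mechanism in the TensorRT Python API; the engine must have been built with the REFIT builder flag, and the weight names and arrays here are placeholders:

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def refit_engine(engine: trt.ICudaEngine, new_weights: dict) -> bool:
    """Push new values into an already-built engine without recompiling.

    The engine must have been built with config.set_flag(trt.BuilderFlag.REFIT).
    new_weights maps engine weight names to numpy arrays of matching
    shape and dtype.
    """
    refitter = trt.Refitter(engine, TRT_LOGGER)
    for name, array in new_weights.items():
        # Stage the new value for the named weight
        refitter.set_named_weights(name, trt.Weights(np.ascontiguousarray(array)))
    # Apply all staged weights; returns True on success
    return refitter.refit_cuda_engine()
```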
User scenario
Alex is a digital artist who frequently uses Stable Diffusion to generate AI-powered artwork. To achieve various artistic styles, Alex utilizes different LoRA (Low-Rank Adaptation) configurations. Each LoRA configuration provides a unique style, enabling Alex to create diverse visual effects and aesthetics in his artwork. To maximize creative efficiency, he uses Torch-TensorRT to accelerate Stable Diffusion and make the creation process faster.
However, every time he applies a new LoRA, recompiling the TensorRT module takes around 10 minutes. This significantly slows down Alex's workflow: instead of focusing on creating new art, Alex spends a considerable amount of time waiting for the model to recompile.
With the proposed engine refit feature, Alex can apply a LoRA within one minute. This significantly cuts the wait time, and he can switch between different combinations of LoRAs whenever he wants.
Problem
Building a TensorRT engine is time-consuming due to complex procedures like kernel auto-tuning. For instance, compiling a large language model can take several hours. When model weights are frequently updated, such as during A/B testing of different versions or when adding adapters for various purposes, the need to repeatedly recompile the engine becomes highly inefficient. This can potentially cause significant delays in both development and deployment workflows.
Motivations and usecases
Model weight refit will reduce time spent compiling models with Torch-TensorRT by 80%, since models would only need to be compiled once per architecture; subsequent weight updates can be propagated into the compiled model post-compilation, without the overhead of recompiling. This also enables extensions such as LoRA and serving pre-compiled TensorRT engines from the cloud.
Proposed APIs
Users of this feature would first have a compiled TensorRT graph module ready and would save it; a sketch of this first compilation follows below. Then, at a later time when loading the compiled model, if the weights have been updated from their original values, the user could call the model weight refit function to refit the stored weights in the TensorRT graph module.
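For example, a compile-and-save sketch; MyModel is a placeholder, and the make_refittable flag name is an assumption about how refit support would be requested at compile time:

```python
import torch
import torch_tensorrt

model = MyModel().eval().cuda()  # placeholder user model
inputs = [torch.randn(1, 3, 224, 224).cuda()]

# Compile once; the engine must be marked refittable so weights can be
# swapped later without rebuilding
trt_gm = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    make_refittable=True,  # assumed flag name for this proposal
)

torch.save(trt_gm, "trt_model.pt")  # illustrative; actual serialization may differ
```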
Example Workflow
The user can first load the previously compiled TensorRT graph module and apply the new weights using
trt_gm.load_state_dict(state_dict)
The additional step, as per the proposed API, would be to call the proposed refit_module_weights function:
This function would parse the new exported program, determine the mapping between weights in the exported program and weights in the TensorRT engines, and return a copy of the graph module with updated weights, using the TRT Python API for TRT-accelerated submodules and the Torch API for non-accelerated submodules.
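Putting it together, a sketch that continues the compile example above and assumes refit_module_weights accepts the previously compiled module plus a re-exported program carrying the new weights, returning an updated copy (the exact signature is part of this proposal, not a settled API):

```python
import torch
import torch_tensorrt

# Apply the new weights to the eager model and re-export it
model.load_state_dict(new_state_dict)
exported_program = torch.export.export(model, tuple(inputs))

# Proposed API: map the exported program's weights onto the stored
# engines and return a graph module with refitted weights
refitted_gm = torch_tensorrt.dynamo.refit_module_weights(
    trt_gm,            # previously compiled TensorRT graph module
    exported_program,  # carries the updated weight values
)
```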
Implementation Design
High Level Explanation
After the export of the PyTorch model, the newly exported graph module will first go through ATen tracing and lowering. After that, the graph will be partitioned into several subgraphs if there are any graph breaks resulting from unsupported operations. Each of these subgraphs is then converted to an INetworkDefinition. These INetworkDefinitions are eventually used to refit the weights in each TensorRT engine in the compiled graph module.
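In pseudocode, the refit path mirrors the compile path up to conversion and then diverges; every helper below is a hypothetical stand-in for an internal stage, not a public API:

```python
# Hypothetical stand-ins for internal stages of the Dynamo path
def aten_trace_and_lower(exported_program, settings): ...   # ATen tracing + lowering
def partition(graph_module, settings): ...                  # split at graph breaks
def convert_to_network_definition(subgraph, settings): ...  # build INetworkDefinition
def refit_engine_from_network(engine, network): ...         # push weights into engine

def refit_pipeline(trt_gm, exported_program, settings):
    gm = aten_trace_and_lower(exported_program, settings)
    subgraphs = partition(gm, settings)  # {submodule_name: subgraph}
    for name, subgraph in subgraphs.items():
        if name in trt_gm.trt_engines:   # TRT-accelerated submodule: refit engine
            network = convert_to_network_definition(subgraph, settings)
            refit_engine_from_network(trt_gm.trt_engines[name], network)
        else:                            # non-accelerated submodule: Torch API
            trt_gm.get_submodule(name).load_state_dict(subgraph.state_dict())
    return trt_gm
```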
Model interpretation
Model interpretation is mostly the same process as when the model is first compiled into a TRT engine. To make sure that the model with the new weights is interpreted under the same compilation settings as the originally compiled model, the old settings are stored and re-used; references to the settings are kept in the runtime graph modules (PythonTensorRTModule/TensorRTModule).
Mapping Construction
After the INetworkDefinition is constructed, the different layers of the INetworkDefinition are examined to extract the weights to be refitted. Specifically, weight-bearing values (for example, convolution kernels and biases or constant layer values) are extracted and mapped to the keys of the TensorRT engine weights.
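As a sketch of the mapping step, the TensorRT Python API exposes the engine side via Refitter.get_all_weights(); the pairing with network-extracted values below uses a plain exact-name match, which is an assumption about the real matching logic:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_weight_mapping(engine: trt.ICudaEngine, extracted_weights: dict) -> dict:
    """Pair each refittable engine weight name with a freshly extracted value.

    extracted_weights maps names gathered from the new INetworkDefinition's
    layers to their weight arrays.
    """
    refitter = trt.Refitter(engine, TRT_LOGGER)
    mapping = {}
    for weight_name in refitter.get_all_weights():  # engine-side weight names
        if weight_name in extracted_weights:        # exact-name match (assumed)
            mapping[weight_name] = extracted_weights[weight_name]
    return mapping
```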
Extensions Required to Core API Implementations
The existing library should not require many changes, as this add-on would simply add functionality while preserving existing core APIs.
One small change is that we want to store the settings used to compile a module and attach a reference to them on the module. In this way, during the second parsing of the model, we can re-use the settings.
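A sketch of that bookkeeping; the field names and the settings attribute are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class StoredCompilationSettings:
    """Illustrative subset of compile-time settings to re-use at refit time."""
    enabled_precisions: set = field(default_factory=lambda: {"fp16"})
    workspace_size: int = 0
    truncate_long_and_double: bool = False

def attach_settings(trt_module, settings: StoredCompilationSettings):
    # Keep a reference on the runtime module so a later refit pass can
    # reproduce identical lowering and partitioning without user input
    trt_module.settings = settings
    return trt_module
```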
Target Platforms
This feature targets Stable Diffusion refitting on Windows and LLM refitting on Linux.
Implementation Phases
Prototype - Small/Medium
MVP (1.4.0) - Medium
- refit_module_weights function, including refitting weights with multiple TRT-accelerated submodules and multiple Torch/FX non-accelerated submodules
Extension Phase 1 [Potential] - Medium
- LoRA for Stable Diffusion 3
- Refitting acceleration: directly map the weights between the state_dict and the TRT engine; if that is successful, no re-interpretation is needed (see the sketch after this list)
- LLM Parameter Efficient Fine Tuning
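A sketch of that acceleration: attempt a direct name match between the state_dict and the engine's refittable weights, and fall back to full re-interpretation only when the fast path fails; the fallback helper named below is hypothetical:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def try_fast_refit(engine: trt.ICudaEngine, state_dict: dict) -> bool:
    """Fast path: refit straight from the state_dict when names line up."""
    refitter = trt.Refitter(engine, TRT_LOGGER)
    engine_weight_names = refitter.get_all_weights()
    if not set(engine_weight_names).issubset(state_dict.keys()):
        return False  # names do not line up; caller must re-interpret
    for name in engine_weight_names:
        tensor = state_dict[name].detach().cpu().contiguous()
        refitter.set_named_weights(name, trt.Weights(tensor.numpy()))
    return refitter.refit_cuda_engine()

# Usage: only pay for re-interpretation when the fast path fails
# if not try_fast_refit(engine, new_state_dict):
#     reinterpret_and_refit(...)  # hypothetical slow path
```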