"Stable Remaster: Bridging the Gap Between Old Content and New Displays" Arxiv, 2023 Jun, paper. Code: GitHub - naston/StableRemaster: Adjusting aspect ratio of old animated content for new displays.


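Task: Aspect ratio conversion for animation videos. Cartoons in a 4:3 aspect ratio are shown with black bars on both sides, which looks poor, so the authors fill the bars with background information from other frames. They first segment the foreground, mask it out to obtain the background of each frame, build a total background for the whole video via image stitching, and generate the still-missing parts with Stable Diffusion.

The paper reads like an experiment report. It can be seen as a pipeline that combines several tasks; each module uses an existing API and no training is needed. The authors only run inference on old videos with a 4:3 aspect ratio.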


Related Work

  • "A content-aware tool for converting videos to narrower aspect ratios" IMX, 2022 Jun paper

  • "Imagen Video: High Definition Video Generation with Diffusion Models" Arxiv, 2022 Oct ⭐ paper

    Diffusion in video generation !!!!

  • "Object-Ratio-Preserving Video Retargeting Framework Based on Segmentation and Inpainting" WACV, 2023 paper

    retarget the old video screen ratio to a wider target aspect ratio by using segmentation & inpainting network

  • diffusers >> quick way to use a pretrained Stable Diffusion

Background Collapse & stitch

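Seam Carving -- a content-aware image resizing algorithm. Its advantage is that it preserves the proportions of the "important regions" of an image as much as possible and avoids the distortion caused by direct scaling. Resizing according to region importance gives a better visual result.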
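To make the idea concrete, here is a minimal sketch of removing one vertical seam from a grayscale image. It assumes a simple gradient-magnitude energy, and the function names (`energy_map`, `find_vertical_seam`, `remove_vertical_seam`) are illustrative, not taken from the paper or its code.

```python
import numpy as np

def energy_map(gray: np.ndarray) -> np.ndarray:
    """Gradient-magnitude energy: high values mark 'important' pixels."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.abs(gx) + np.abs(gy)

def find_vertical_seam(energy: np.ndarray) -> np.ndarray:
    """Return, for each row, the column of the minimum-energy vertical seam."""
    h, w = energy.shape
    cost = energy.copy()
    # Dynamic programming: each pixel adds the cheapest of its three upper neighbours.
    for y in range(1, h):
        prev = cost[y - 1]
        left = np.concatenate(([np.inf], prev[:-1]))
        right = np.concatenate((prev[1:], [np.inf]))
        cost[y] += np.minimum(np.minimum(left, prev), right)
    # Backtrack from the cheapest pixel in the last row.
    seam = np.empty(h, dtype=np.int64)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(x - 1, 0), min(x + 2, w)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))
    return seam

def remove_vertical_seam(gray: np.ndarray, seam: np.ndarray) -> np.ndarray:
    """Delete one pixel per row, shrinking the width by one."""
    h, w = gray.shape
    keep = np.ones((h, w), dtype=bool)
    keep[np.arange(h), seam] = False
    return gray[keep].reshape(h, w - 1)
```

Repeating the find/remove step narrows the image one column at a time while low-energy (background) columns are discarded first.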


  • Background Collapse involves identifying and reducing redundant background regions in images or videos, allowing for the preservation of important foreground elements while resizing or retargeting. It first identifies which parts are important.
  • Background Stitching combines parts of images or video frames to create a seamless, consistent output.


For videos with a 9:16 aspect ratio (portrait frames with black bars on both sides), the content of the black bars can be filled in by exploiting the motion in the video; the figures of the paper below show the effect. The authors of this paper use Stable Diffusion for video outpainting.

"Complete and Temporally Consistent Video Outpainting" CVPR, 2022 paper


  • scene identification and segmentation

    Calls PySceneDetect directly; the paper mostly compares the available APIs (much like an experiment report).

  • Foreground Masking

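    Based on the assumption that the foreground is the main object and the background does not change. Foreground objects are detected and masked out so that only background pixels need to be processed; these background pixels are then used for image stitching and generation.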

    • a bounding box is the minimum requirement; Mask R-CNN is used
  • Background Stitching

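    create a total background for the scene (the background of all frames of the video as a whole). This reduces the amount of Stable Diffusion generation, which is too computationally expensive. After the global background is obtained by SIFT keypoint matching, it is passed to the rest of the pipeline, which decides which pixels still need to be generated.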

    • SIFT as our method for keypoint detection and description >> image stitching
  • Outpaint Region Selection

    A mask marks which regions need pixels generated by Stable Diffusion in the next step.

  • Stable Diffusion Outpainting

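    The region mask from the previous step is fed to Stable Diffusion, which generates the missing pixels; these are added to the total background.

    The authors use an RTX 2070. Without the background stitching preprocessing step, generating a 20-minute video with Stable Diffusion alone would take 400 h (more than 16 days).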

  • Frame Resampling

    Uses the inverse of the affine transformation from background stitching to map back to the original viewpoint of each frame.


  • one RTX 2070 SUPER >> 8 GB of VRAM
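The outpaint region selection and frame resampling steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `coverage` is a hypothetical boolean map of total-background pixels that some frame already filled, `window` is a hypothetical target rectangle, and `M` is assumed to be a 2x3 affine matrix mapping frame coordinates to total-background coordinates.

```python
import numpy as np

def outpaint_region_mask(coverage: np.ndarray, window: tuple) -> np.ndarray:
    """Pixels inside the target window that no frame's background covered."""
    x0, y0, x1, y1 = window
    inside = np.zeros(coverage.shape, dtype=bool)
    inside[y0:y1, x0:x1] = True
    return inside & ~coverage.astype(bool)

def invert_affine(M: np.ndarray) -> np.ndarray:
    """Invert a 2x3 affine matrix [A | t], giving [A^-1 | -A^-1 t]."""
    A, t = M[:, :2], M[:, 2]
    A_inv = np.linalg.inv(A)
    return np.hstack([A_inv, (-A_inv @ t)[:, None]])

def apply_affine(M: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 2x3 affine matrix to an (N, 2) array of (x, y) points."""
    return pts @ M[:, :2].T + M[:, 2]
```

Only the pixels where the mask is True would be sent to the generator; the inverted matrix then maps total-background coordinates back into a given frame's view.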



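This paper mainly **adjusts the aspect ratio of old videos**. It can be understood as a pipeline of several task-specific models, so no labeled data is needed. Animations with a 4:3 aspect ratio were collected for validation.

  • gathered videos from the animated television show 'Avatar: The Last Airbender'

  • Evaluation: manual inspection; if anything seems out of place or violates the environmental rules set by the animator, the result is deemed incorrect.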


  • Mask R-CNN foreground segmentation is inaccurate
  • The generated and stitched backgrounds show visible seams and color differences

Code implementation

  • MaskRCNN

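    import torch
    import torchvision
    import torchvision.transforms as T

    # Pretrained Mask R-CNN for foreground detection
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights = "DEFAULT")
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    model.eval()  # inference mode, so the model returns predictions instead of losses
    transform = T.ToTensor()
    # frame: np.ndarray (H x W x 3, uint8)
    frame_input = transform(frame).to(device)
    with torch.no_grad():
        pred = model([frame_input])
    masks = pred[0]["masks"].cpu()    # (N, 1, H, W) soft masks in [0, 1]
    labels = pred[0]['labels'].cpu()  # (N,) class ids
    scores = pred[0]['scores'].cpu()  # (N,) confidence scores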
  • Image Stitching

    blog: SIFT matching python PythonSIFT github 👍


    import cv2
    import numpy as np

    FLANN_INDEX_KDTREE = 1  # FLANN's KD-tree index id, defined by hand because cv2 does not export it

    def stitch_images(image1, image2, mask1, mask2, scale=1):
        # Detect features and keypoints using SIFT (masks restrict detection to background pixels)
        sift = cv2.SIFT_create()
        keypoints1, descriptors1 = sift.detectAndCompute(image1, mask1)
        keypoints2, descriptors2 = sift.detectAndCompute(image2, mask2)
        # Match the features using FLANN-based matcher
        index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
        search_params = dict(checks=50)
        flann = cv2.FlannBasedMatcher(index_params, search_params)
        matches = flann.knnMatch(descriptors1, descriptors2, k=2)
        # Filter good matches using the ratio test
        good_matches = []
        for pair in matches:
            if len(pair) == 2:
                m, n = pair
                if m.distance < 0.5 * n.distance:
                    good_matches.append(m)
        if len(good_matches) < 4:  # Too few matches for a reliable transform estimate
            print("Not enough good matches found to stitch the images.")
            print("# of good matches:", len(good_matches))
            return image1, mask1, np.array([[0,0,0],[0,0,0]])
        # Extract the matched points
        src_pts = np.float32([keypoints1[m.queryIdx].pt for m in good_matches]).reshape(-1, 1, 2)
        dst_pts = np.float32([keypoints2[m.trainIdx].pt for m in good_matches]).reshape(-1, 1, 2)
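        # Compute the Affine Transformation using RANSAC
        # (method is passed by keyword; the third positional argument of estimateAffine2D is the inliers output)
        M, _ = cv2.estimateAffine2D(src_pts, dst_pts, method=cv2.RANSAC)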
        M_o = np.copy(M)
        x_offset = M[0,2]
        y_offset = M[1,2]
        # ...
        warped_image2 = cv2.warpAffine(image2, M, (w, h))
        warped_mask = cv2.warpAffine(mask2,M, (w,h))
        im_sum = np.sum([warped_image2, warped_mask], axis=0)
        return im_sum
  • diffusers.StableDiffusionInpaintPipeline


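    from diffusers import StableDiffusionInpaintPipeline
    import torch

    def get_sd_pipe():
        if torch.cuda.is_available():
            print('Device: CUDA')
            # The arguments of this call were truncated in these notes;
            # the checkpoint is assumed to be the same as in the CPU branch.
            pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting",)
            device = torch.device('cuda')
        else:
            print('Device: CPU')
            pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting",)
            device = torch.device('cpu')
        pipe = pipe.to(device)

        # Replace the safety checker with a pass-through
        def dummy(images, **kwargs):
            return images, False
        pipe.safety_checker = dummy
        return pipe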

    inference: inpaint the single total-background image

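    import numpy as np
    from PIL import Image

    sub_frame = Image.fromarray(sub_frame).convert("RGB")
    sub_mask = Image.fromarray(sub_mask).convert("RGB")
    # The end of this call was truncated in these notes; mask_image and .images[0]
    # are the minimum the pipeline needs and are filled in here as an assumption.
    new_sub_frame = pipe(prompt='animated background', image=sub_frame, mask_image=sub_mask).images[0]
    frame[y:(y+sample_height),x:(x+sample_width),:] = np.array(new_sub_frame)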
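The snippet above works on one `sub_frame` at position `(x, y)`. How those positions are chosen is not shown in these notes, so the following is only a sketch of one plausible approach: slide a fixed-size window over the total background and keep the windows that contain missing pixels. The tile size of 512, the stride of 256, and the helper names are assumptions for illustration.

```python
import numpy as np

def window_starts(length: int, tile: int, stride: int) -> list:
    """Start offsets of tile-sized windows covering [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last window is aligned with the border
    return starts

def tiles_needing_generation(mask: np.ndarray, tile: int = 512, stride: int = 256) -> list:
    """Return (x, y) top-left corners of windows that contain missing pixels.

    mask: 2D boolean array, True where the total background has no pixel yet.
    """
    h, w = mask.shape
    corners = []
    for y in window_starts(h, tile, stride):
        for x in window_starts(w, tile, stride):
            if mask[y:y + tile, x:x + tile].any():
                corners.append((x, y))
    return corners
```

In a real loop the mask would be updated after each tile is filled, so overlapping windows are not regenerated; skipping windows without missing pixels is what keeps the number of Stable Diffusion calls low.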

Summary 🌟

learn what & how to apply to our task

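  • The paper builds a pipeline out of existing models and force-fits it onto one application scenario (old animation with a 4:3 aspect ratio). Still, the way each module uses an existing API is worth learning:
    • PySceneDetect to obtain the scene segmentation results, saved as MP4
    • python-imutils for image stitching
    • the diffusers library to quickly build a Stable Diffusion pipeline
  • Stable Diffusion generation is slow, so preprocessing steps are added to reduce the number of pixels that must be generated
  • Get to know the pipeline for video aspect ratio adjustment (done by filling in background information)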