Diffusion Video Autoencoders

2023_CVPR_Diffusion-Video-Autoencoders--Toward-Temporally-Consistent-Face-Video-Editing-via-Disentangled-Video-Encoding.pdf code

Background

Face video editing (changing hair color, gender, glasses, etc.) bears the challenge of keeping temporal consistency among the edited frames.

Contributions

  1. Devise diffusion video autoencoders, based on diffusion autoencoders, that decompose a video into a single time-invariant feature and per-frame time-variant features for temporally consistent editing.
  2. With this decomposed representation, face video editing reduces to editing only the single time-invariant identity feature and decoding it together with the remaining original features.
  3. The framework can also handle exceptional cases such as partially occluded faces.
  4. It supports text-based editing via a CLIP loss.

Related Work

  • Diffusion Autoencoder

Methods

DVA_overview.jpg

  1. Following [STIT](#STIT ⭐), first crop & align all frames.

  2. For each frame, extract $Z_{face}^{(n)}$ (see the sketch after this list):

     $$
     Z_{id,rep} = \frac{1}{N}\sum_{n=1}^{N} Z_{id}^{(n)}, \qquad
     Z_{face}^{(n)} = \mathrm{MLP}\big(\mathrm{Concat}(Z_{id,rep},\ Z_{landmark}^{(n)})\big)
     $$

  3. Run the DDIM forward process conditioned on $Z_{face}^{(n)}$ to encode each frame into its noise map $Z_T^{(n)}$.

  4. Edit the ID feature to obtain $Z_{id,rep}^{edit}$, then compute the edited condition as before: $Z_{face}^{(n),edit} = \mathrm{MLP}(\mathrm{Concat}(Z_{id,rep}^{edit},\ Z_{landmark}^{(n)}))$. The edit direction comes from either:

    • a pretrained linear attribute classifier
    • CLIP
  5. Run the DDIM reverse process to obtain the edited frames.

  6. Paste back: segment the face with BiSeNetV2 and paste the edited region back into the original frames.
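A minimal PyTorch sketch of steps 2 and 4 (the feature fusion and the identity edit). The MLP architecture, the feature dimensions, and the placeholder edit direction are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class FaceFeatureFusion(nn.Module):
    """Fuse the shared identity feature with a per-frame landmark feature
    into the conditioning vector Z_face (dimensions are assumptions)."""
    def __init__(self, id_dim=512, lnd_dim=136, face_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(id_dim + lnd_dim, face_dim),
            nn.SiLU(),
            nn.Linear(face_dim, face_dim),
        )

    def forward(self, z_id_rep, z_lnd):
        # z_id_rep: (D_id,) shared across frames; z_lnd: (N, D_lnd) per frame
        n = z_lnd.shape[0]
        z_id = z_id_rep.unsqueeze(0).expand(n, -1)         # broadcast to every frame
        return self.mlp(torch.cat([z_id, z_lnd], dim=-1))  # (N, D_face)

# Step 2: Z_id_rep = mean of the per-frame ArcFace embeddings
z_id_frames = torch.randn(8, 512)        # stand-in for ArcFace outputs
z_id_rep = z_id_frames.mean(dim=0)
z_lnd = torch.randn(8, 136)              # stand-in for 68 (x, y) landmarks per frame
fusion = FaceFeatureFusion()
z_face = fusion(z_id_rep, z_lnd)         # condition for DDIM forward/reverse

# Step 4: edit the identity feature only; per-frame landmarks stay fixed
z_id_edit = z_id_rep + 0.5 * torch.randn(512)   # placeholder edit direction
z_face_edit = fusion(z_id_edit, z_lnd)
```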

Disentangled Video Encoding

Disentangle each frame into a face feature $Z_{face}^{(n)} = \mathrm{MLP}(\mathrm{Concat}(Z_{id},\ Z_{landmark}^{(n)}))$ and a noise map $Z_T^{(n)}$ that carries the background information.

Identity Feature

  1. $Z_{id}$ >> ArcFace to get the identity feature

    $Z_{id}$ is regarded as a time-invariant feature, so it is denoted without a frame index in the superscript.

    The identity feature is shared across all frames, but the per-frame identity features that ArcFace extracts differ slightly, so the paper simply averages them (see the sketch after this list) 🚨

    ❓ How does the average behave when the face is partially occluded, e.g. the person wears a mask??

  2. $Z_{landmark}^{(n)}$ >> extract facial landmarks with the pytorch Facial Landmarks code as the per-frame motion feature
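One subtlety with the averaging in step 1: ArcFace embeddings live on the unit hypersphere, so it may make sense to re-normalize after the mean. A minimal sketch; the re-normalization is my assumption, the note only specifies a plain average:

```python
import torch
import torch.nn.functional as F

def average_identity(z_id_frames: torch.Tensor) -> torch.Tensor:
    """Average per-frame ArcFace embeddings into one video-level identity.
    Re-normalizing after the mean is an assumption (ArcFace embeddings are
    unit-norm); the note itself only specifies a plain average."""
    z = F.normalize(z_id_frames, dim=-1)        # (N, 512) unit-norm embeddings
    return F.normalize(z.mean(dim=0), dim=-1)   # (512,) shared identity
```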

To purify $Z_{face}$, two loss terms are used: $L_{DVA} = L_{simple} + L_{reg}$ (a sketch follows the list below).

  1. $L_{simple}$: the standard DDPM noise-prediction loss.
  2. $L_{reg}$: sample two noises, multiply the predictions by the face mask, and compute an L1 loss on the extracted face region.
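A sketch of how $L_{DVA}$ might be computed from the description above. The conditional noise-prediction interface `eps_model(x_t, t, z_face)`, the schedule handling, and the equal weighting of the two terms are assumptions:

```python
import torch
import torch.nn.functional as F

def dva_loss(eps_model, x0, z_face, face_mask, alphas_cumprod):
    """L_DVA = L_simple + L_reg, following the note's description.
    eps_model(x_t, t, z_face) -> predicted noise is an assumed interface;
    masking and weighting details are a sketch, not the paper's exact code."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a = alphas_cumprod[t].view(b, 1, 1, 1)

    # L_simple: standard DDPM noise-prediction MSE
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    l_simple = F.mse_loss(eps_model(x_t, t, z_face), eps)

    # L_reg: two independent noise samples; the two predictions should
    # agree on the face region (L1 after multiplying by the face mask)
    eps1, eps2 = torch.randn_like(x0), torch.randn_like(x0)
    pred1 = eps_model(a.sqrt() * x0 + (1 - a).sqrt() * eps1, t, z_face)
    pred2 = eps_model(a.sqrt() * x0 + (1 - a).sqrt() * eps2, t, z_face)
    l_reg = F.l1_loss(face_mask * pred1, face_mask * pred2)

    return l_simple + l_reg
```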


  • Ablation study for $L_{reg}$

DVA_ablation_Loss_reg.jpg

ArcFace

2022_ArcFace-Additive-Angular-Margin-Loss-for-DeepFace-Recognition.pdf
2021_CVPR_One-Shot-Free-View-Neural-Talking-Head-Synthesis-for-Video-Conferencing.pdf

Compared with Softmax, ArcFace yields a more compact feature distribution with clearer decision boundaries: each class occupies its own arc on the hypersphere.
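For reference, the additive angular margin itself is easy to sketch: add a margin $m$ to the target-class angle before scaling, so each class is pushed onto its own arc. A minimal PyTorch version (the hyperparameters $s$, $m$ follow common defaults, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def arcface_logits(features, weight, labels, s=64.0, m=0.5):
    """Additive angular margin: add margin m to the target-class angle
    theta, then scale by s. features: (B, D), weight: (D, C), labels: (B,)."""
    cos = F.normalize(features) @ F.normalize(weight, dim=0)   # (B, C) cosines
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, cos.shape[1]).bool()
    cos_m = torch.where(target, torch.cos(theta + m), cos)    # margin on target class only
    return s * cos_m  # feed to cross-entropy as usual
```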

Facial Landmarks

pytorch Facial Landmarks code >> facial landmark features
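Assuming the linked code is the 1adrianb/face-alignment package, a minimal usage sketch (the enum is spelled `_2D` in older releases; the frame path is hypothetical):

```python
import face_alignment
from skimage import io

# Detect 68 2D landmarks per face; flattening them to a 136-d vector matches
# the landmark feature dimension assumed in the fusion sketch above.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device='cuda')
image = io.imread('frame_0001.png')      # hypothetical frame path
landmarks = fa.get_landmarks(image)[0]   # (68, 2) array for the first detected face
z_lnd = landmarks.reshape(-1)            # (136,) per-frame motion feature
```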

Editing 🔑

Classifier-based editing

DiffAE

Blog reference, code

ProgressiveGrowingGAN
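A sketch of DiffAE-style classifier-based editing: move the latent along the weight direction of the pretrained linear attribute classifier. The `scale` hyperparameter is an assumption:

```python
import torch
import torch.nn.functional as F

def edit_identity(z_id_rep, classifier_weight, scale=0.3):
    """Move the identity latent along the attribute direction given by a
    pretrained linear classifier's weight vector (DiffAE-style editing).
    `scale` controls edit strength and is an assumed hyperparameter."""
    direction = F.normalize(classifier_weight, dim=0)  # (D,) attribute normal
    return z_id_rep + scale * direction
```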

CLIP-based editing

StyleGAN-NADA

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators StyleGAN-NADA blog
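StyleGAN-NADA's key ingredient is a directional CLIP loss: the CLIP-space shift from the source image to the edited image should align with the shift from the source prompt to the target prompt. A sketch assuming the openai/CLIP package:

```python
import torch
import torch.nn.functional as F
import clip  # openai/CLIP, assumed installed

model, _ = clip.load("ViT-B/32", device="cpu")

def clip_directional_loss(img_src, img_edit, text_src, text_tgt):
    """Directional CLIP loss in the style of StyleGAN-NADA. Images are
    assumed to be CLIP-preprocessed (B, 3, 224, 224) tensors."""
    tokens = clip.tokenize([text_src, text_tgt])
    with torch.no_grad():
        txt = model.encode_text(tokens)
    text_dir = (txt[1] - txt[0]).unsqueeze(0)                        # (1, 512)
    img_dir = model.encode_image(img_edit) - model.encode_image(img_src)
    return 1 - F.cosine_similarity(img_dir, text_dir).mean()
```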