2023_CVPR_Diffusion-Video-Autoencoders--Toward-Temporally-Consistent-Face-Video-Editing-via-Disentangled-Video-Encoding.pdf code
Background
face video editing (changing hair color, gender, glasses, etc.) carries the challenge of keeping the edited frames temporally consistent.
Contributions
- devise diffusion video autoencoders, based on diffusion autoencoders, that decompose a video into a single time-invariant feature and per-frame time-variant features for temporally consistent editing.
- Based on the decomposed representation of the diffusion video autoencoder, face video editing can be conducted by editing only the single time-invariant identity feature and decoding it together with the remaining original features.
- our framework can also handle exceptional cases such as partially occluded faces
- text-based editing methods (CLIP loss)
- Diffusion Autoencoder
- StyleGAN-based methods edit each frame independently and cannot perfectly reconstruct the original image, especially when occlusion happens ❓
- Some methods tackle the imperfect reconstruction by further fine-tuning the GAN inversion, but this risks losing the original editability, and the problem is worse for the many frames of a video.
- video temporal consistency ❓
  smoothing the latent trajectory or the features cannot guarantee consistency, because it implicitly changes the motion feature
- pytorch Facial Landmarks code >> facial landmark features
- Progressive Growing of GANs for Improved Quality, Stability, and Variation
  https://github.com/tkarras/progressive_growing_of_gans.git
- StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
- Following [STIT](#STIT ⭐), first crop & align all frames
- For each frame, extract $Z_{face}^{(n)}$ (see the sketch after this list):
$$
Z_{id,rep} = \frac{1}{N} \sum_{n=1}^{N} Z_{id}^{(n)}, \qquad
Z_{face}^{(n)} = \mathrm{MLP}\big(\mathrm{Concat}(Z_{id,rep},\, Z_{landmark}^{(n)})\big)
$$
- DDIM forward process conditioned on $Z_{face}^{(n)}$ to obtain each frame's noise map $x_T^{(n)}$
- Edit the ID feature to get $Z_{id,rep}^{edit}$, then, as before, compute
  $Z_{face}^{(n),edit} = \mathrm{MLP}\big(\mathrm{Concat}(Z_{id,rep}^{edit},\, Z_{landmark}^{(n)})\big)$
  - pretrained linear attribute classifier
  - CLIP
- DDIM reverse process to generate the corresponding edited frames
- paste back: segment the face region with BiSeNetV2
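A minimal end-to-end sketch of the pipeline above, in PyTorch. The modules `id_encoder` (ArcFace-style), `landmark_encoder`, `fusion_mlp`, and the diffusion autoencoder `diffae` with `ddim_forward`/`ddim_reverse` methods are hypothetical placeholders standing in for the paper's pretrained components, not the authors' actual API:

```python
import torch

@torch.no_grad()
def edit_video(frames, id_encoder, landmark_encoder, fusion_mlp, diffae, edit_fn):
    """frames: (N, 3, H, W) cropped & aligned face frames of one clip."""
    # Per-frame features (both encoders are assumed pretrained and frozen).
    z_id = id_encoder(frames)                       # (N, D_id) identity embeddings
    z_lnd = landmark_encoder(frames)                # (N, D_lnd) motion features

    # Shared, time-invariant identity feature: average over all frames.
    z_id_rep = z_id.mean(dim=0, keepdim=True).expand(len(frames), -1)

    # Semantic condition per frame: Z_face^(n) = MLP(Concat(Z_id_rep, Z_lnd^(n))).
    z_face = fusion_mlp(torch.cat([z_id_rep, z_lnd], dim=1))

    # DDIM forward pass conditioned on Z_face -> per-frame noise maps x_T^(n).
    x_T = diffae.ddim_forward(frames, cond=z_face)

    # Edit only the shared identity feature, then rebuild the edited condition.
    z_id_edit = edit_fn(z_id_rep)
    z_face_edit = fusion_mlp(torch.cat([z_id_edit, z_lnd], dim=1))

    # DDIM reverse pass from the original noise maps with the edited condition.
    return diffae.ddim_reverse(x_T, cond=z_face_edit)
```

As an example `edit_fn`, the classifier-based branch can simply shift $Z_{id,rep}$ along the normalized weight vector $w$ of the pretrained linear attribute classifier, e.g. `lambda z: z + 0.3 * torch.nn.functional.normalize(w, dim=-1)`; the CLIP-based branch instead optimizes $Z_{id,rep}$ with a CLIP loss (see the last sketch at the end of these notes).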
Disentangle the video into
Identity Feature
- $Z_{id}$ >> ArcFace extracts the identity feature $Z_{id}$, which is regarded as the time-invariant feature and is therefore denoted without a frame index in the upper-right corner. The ID feature is shared by all frames, but the identity features ArcFace extracts differ slightly from frame to frame, so the paper simply averages them 🚨
  ❓ how well does averaging work when the person wears a mask??
- $Z_{landmark}^{(n)}$ >> extract facial landmarks with the pytorch Facial Landmarks code as the per-frame motion feature (see the extraction sketch below)
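A rough sketch of how the two per-frame features could be extracted, assuming an ArcFace-style embedding network (`arcface`, a hypothetical handle) and the 1adrianb/face-alignment package for the 68-point landmarks; flattening the normalized landmark coordinates into $Z_{landmark}^{(n)}$ is an illustrative choice, not necessarily the paper's exact encoding:

```python
import face_alignment  # pip install face-alignment (1adrianb/face-alignment)
import numpy as np
import torch

# 2D landmark detector; older releases name the enum LandmarksType._2D.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device="cuda")

@torch.no_grad()
def per_frame_features(frames_np, frames_tensor, arcface):
    """frames_np: list of HxWx3 uint8 images; frames_tensor: (N, 3, H, W) floats."""
    # Identity embeddings, one per frame (averaged later into Z_id_rep).
    z_id = arcface(frames_tensor)                          # (N, 512)

    # Motion feature: 68 facial landmarks per frame, normalized and flattened.
    # Assumes a face is detected in every frame.
    z_lnd = []
    for img in frames_np:
        pts = fa.get_landmarks(img)[0]                     # (68, 2), first detected face
        pts = pts / np.array([img.shape[1], img.shape[0]]) # normalize x by W, y by H
        z_lnd.append(torch.from_numpy(pts.reshape(-1)).float())
    return z_id, torch.stack(z_lnd)                        # (N, 512), (N, 136)
```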
To purify the disentangled features, training uses:
- DDPM loss
- $L_{reg}$ loss: sample two noises, multiply by the face mask to keep only the face region, and compute an L1 loss between them (see the sketch after this list)
- ablation study for $L_{reg}$
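A hedged sketch of the regularization term as described in the bullet above: predict the clean frame from two independently sampled noises under the same condition $Z_{face}$, then penalize their L1 difference inside the face-mask region, so that the face content is carried by $Z_{face}$ rather than by the noise. The `q_sample`/`predict_x0` methods are assumed placeholders for the diffusion model's forward-noising and denoising calls:

```python
import torch
import torch.nn.functional as F

def reg_loss(diffae, x0, z_face, face_mask, t):
    """x0: (B, 3, H, W) frames; face_mask: (B, 1, H, W) binary mask; t: timesteps."""
    # Two independent noise samples for the same frames and the same condition.
    noise_a, noise_b = torch.randn_like(x0), torch.randn_like(x0)
    x_t_a = diffae.q_sample(x0, t, noise=noise_a)   # forward-diffuse with noise A
    x_t_b = diffae.q_sample(x0, t, noise=noise_b)   # forward-diffuse with noise B

    # Clean-image predictions conditioned on the same Z_face.
    x0_a = diffae.predict_x0(x_t_a, t, cond=z_face)
    x0_b = diffae.predict_x0(x_t_b, t, cond=z_face)

    # L1 consistency restricted to the face region.
    return F.l1_loss(x0_a * face_mask, x0_b * face_mask)
```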
2022_ArcFace-Additive-Angular-Margin-Loss-for-DeepFace-Recognition.pdf
2021_CVPR_One-Shot-Free-View-Neural-Talking-Head-Synthesis-for-Video-Conferencing.pdf
Compared with Softmax, ArcFace yields a more compact feature distribution and clearer decision boundaries; one arc (angular region) corresponds to one class.
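For reference, a small sketch of the standard ArcFace logit: an additive angular margin m is inserted into the angle between the feature and its ground-truth class center before the scaled softmax/cross-entropy (generic formulation, independent of this paper's code):

```python
import torch
import torch.nn.functional as F

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """features: (B, D) embeddings; weights: (C, D) class centers; labels: (B,)."""
    # Cosine similarity between L2-normalized features and class weights.
    cos = F.normalize(features) @ F.normalize(weights).t()        # (B, C)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))            # angles in radians
    # Add the angular margin m only to the ground-truth class angle.
    target = F.one_hot(labels, num_classes=weights.size(0)).bool()
    theta = torch.where(target, theta + m, theta)
    return s * torch.cos(theta)   # scaled logits fed to cross-entropy
```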
pytorch Facial Landmarks code >> facial landmark features
Classifier-based editing
CLIP-based editing
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
StyleGAN-NADA blog
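For the CLIP-based editing branch referenced above, a sketch of the StyleGAN-NADA-style directional CLIP loss that could steer the identity-feature edit; the prompt strings are illustrative, and the images are assumed to already be CLIP-preprocessed (224x224, CLIP normalization):

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
import torch.nn.functional as F

model, _ = clip.load("ViT-B/32")   # OpenAI CLIP image/text encoders

def clip_directional_loss(img_orig, img_edit, src_text, tgt_text):
    """Directional CLIP loss (StyleGAN-NADA style): the image-space change should
    align with the text-space change from the source to the target prompt."""
    with torch.no_grad():
        t_src = model.encode_text(clip.tokenize([src_text]).to(img_orig.device))
        t_tgt = model.encode_text(clip.tokenize([tgt_text]).to(img_orig.device))
    text_dir = F.normalize(t_tgt - t_src, dim=-1)

    i_orig = model.encode_image(img_orig)   # images must be CLIP-preprocessed
    i_edit = model.encode_image(img_edit)
    img_dir = F.normalize(i_edit - i_orig, dim=-1)

    return (1 - F.cosine_similarity(img_dir, text_dir)).mean()
```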