Face swap is a technique that uses artificial intelligence (AI) to superimpose a person’s face onto a photo or video while preserving the original expressions, eye and mouth movements, pose, lighting, and background.
The goal is to maintain the context of the rest of the body and environment.
It can also be performed on videos, where the swapped face follows the facial expressions and the eye and mouth movements of the target frame by frame.
In the process of face-swapping, there are three objects involved:
(1) Target photo/video, which is the material where you want to replace the face while keeping all other elements consistent.
(2) Source photo, which provides the face you want to swap in.
(3) Result photo/video, which is the final product after the face has been swapped.
The process of face swapping is complicated. It involves several key steps:
Step 1: Face Detection
The first step is to detect and locate faces in the source and target images or videos. This is typically done using computer vision algorithms that identify facial features.
Several algorithms, especially CNNs, can be used for face detection:
Deep Learning (Convolutional Neural Networks — CNNs)
Training: CNNs are trained on large datasets of images containing faces. The network learns to recognize the complex patterns and structures of faces.
Inference: The trained CNN is used to predict the presence and location of faces in new images.
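Here is a minimal sketch of what CNN-based face detection looks like in practice, assuming the facenet-pytorch package and its pretrained MTCNN detector (the package choice and file names are illustrative, not a requirement):

```python
# A minimal face-detection sketch, assuming the facenet-pytorch package
# (pip install facenet-pytorch pillow). MTCNN is a cascaded CNN detector.
from facenet_pytorch import MTCNN
from PIL import Image

detector = MTCNN(keep_all=True)            # keep every detected face

image = Image.open("target.jpg")           # hypothetical input file
boxes, probs = detector.detect(image)      # bounding boxes + confidence scores

if boxes is not None:
    for (x1, y1, x2, y2), p in zip(boxes, probs):
        print(f"face at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), confidence {p:.2f}")
```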
Step 2: Facial Landmark Detection
Once faces are detected, the next step is to identify specific facial landmarks or key points.
CNNs or other deep learning models identify specific facial landmarks or key points, corresponding to important facial features like eyes, nose, mouth, and jawline.
Typical landmark detection models identify between 68 and 194 key points on a face.
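As a concrete example, here is a short sketch using dlib and its pretrained 68-point shape predictor (the model file name is dlib's standard download; the input file is hypothetical):

```python
# A minimal 68-point landmark detection sketch, assuming dlib and its
# pretrained "shape_predictor_68_face_landmarks.dat" model file.
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = dlib.load_rgb_image("target.jpg")  # hypothetical input file
for rect in detector(image):
    shape = predictor(image, rect)
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    print(f"{len(points)} landmarks found")  # 68 points: eyes, nose, mouth, jawline
```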
Step 3: Face Alignment
The detected faces are aligned based on the identified landmarks to ensure proper matching of facial features between the source and target.
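A common way to do this is to estimate a similarity transform that maps a few key landmarks onto a fixed template. The sketch below uses OpenCV and a commonly used 112x112 five-point template (eye centers, nose tip, mouth corners); the exact template values are approximate and only for illustration:

```python
# A minimal alignment sketch: estimate a similarity transform that maps the
# detected landmarks onto a canonical five-point template in a 112x112 crop.
import cv2
import numpy as np

# Approximate canonical positions (left eye, right eye, nose, mouth corners).
TEMPLATE = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                       [41.5, 92.4], [70.7, 92.2]])

def align_face(image, landmarks_5pt):
    """Warp the face so its landmarks match the canonical template."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks_5pt), TEMPLATE)
    return cv2.warpAffine(image, matrix, (112, 112))
```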
Step 4: Feature Extraction
Deep neural networks, often pre-trained on face recognition tasks, extract identity-related features from the source face.
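For instance, an ArcFace-style identity embedding can be obtained with the insightface package; this is a hedged sketch of that idea, not the exact pipeline any particular product uses:

```python
# A minimal identity-embedding sketch, assuming the insightface package
# (pip install insightface onnxruntime) and its bundled ArcFace models.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis()
app.prepare(ctx_id=0)                       # device id; CPU execution also works

source = cv2.imread("source.jpg")           # hypothetical input file
faces = app.get(source)
identity = faces[0].normed_embedding        # 512-d vector describing who the person is
print(identity.shape)                       # (512,)
```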
Step 5: Attribute Encoding
A separate network encodes attributes of the target face, such as pose, expression, and lighting conditions.
Step 6: Face Generation
A generator network, typically based on generative adversarial networks (GANs) or autoencoders, combines the identity features from the source face with the attributes of the target face to generate the swapped face.
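To make Steps 5 and 6 concrete, here is a deliberately simplified PyTorch sketch: an attribute encoder compresses the target face into a code, and a generator fuses that code with the 512-dimensional source identity vector to produce the swapped face. All layer sizes are illustrative assumptions, not any production architecture:

```python
# A simplified sketch of attribute encoding (Step 5) and face generation (Step 6).
# Layer sizes are illustrative only.
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                      # 3x112x112 -> 256-d attribute code
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 28 * 28, 256))

    def forward(self, target_face):
        return self.net(target_face)                   # pose / expression / lighting code

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(256 + 512, 64 * 28 * 28)   # fuse attributes + identity
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, attributes, identity):
        x = self.fc(torch.cat([attributes, identity], dim=1))
        return self.net(x.view(-1, 64, 28, 28))        # swapped face, 3x112x112

# swapped = Generator()(AttributeEncoder()(target_batch), identity_batch)
```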
Step 7: Blending
The generated face is blended with the target image, often using techniques like adaptive attentional denormalization (AAD) to integrate identity and attributes seamlessly.
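AAD is a learned blending layer inside the generator; as a simpler classical stand-in for the same idea, Poisson (seamless) cloning pastes the generated face back into the target frame along a mask boundary. This sketch is an assumption-laden alternative, not the AAD method itself:

```python
# A minimal classical blending sketch using OpenCV's Poisson (seamless) cloning.
# This is a simple stand-in for learned blending such as AAD, not the same method.
import cv2
import numpy as np

def blend(generated_face, target_image, face_mask, center):
    """Paste the generated face into the target frame along the mask boundary."""
    return cv2.seamlessClone(generated_face, target_image,
                             face_mask, center, cv2.NORMAL_CLONE)

# Example with hypothetical inputs: a 112x112 generated face pasted at pixel (300, 220).
# mask = np.full((112, 112), 255, dtype=np.uint8)
# result = blend(swapped_face, frame, mask, (300, 220))
```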
Step 8: Refinement
Additional networks may be used to refine the result, such as preserving occlusions or improving overall realism.
Step 9: Video Processing
For face swapping in videos, additional steps involve tracking facial movements frame by frame and ensuring temporal coherence.
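The per-frame loop itself is straightforward; the sketch below uses OpenCV, with a hypothetical swap_face() helper standing in for the full detect-align-generate-blend pipeline applied to each frame:

```python
# A minimal frame-by-frame video loop sketch with OpenCV; swap_face() is a
# hypothetical helper standing in for the full per-frame pipeline.
import cv2

cap = cv2.VideoCapture("target.mp4")                       # hypothetical input file
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("result.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(swap_face(frame))                            # process one frame at a time

cap.release()
out.release()
```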
Step 10: Training
The entire system is typically trained end-to-end using large datasets of face images, often employing techniques like progressive training to achieve high-resolution results.
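Training objectives vary from system to system, but they usually combine an adversarial term, an identity term, and a reconstruction term. The following is a compact, illustrative sketch of how those losses might be mixed; the weights and inputs are assumptions, not any specific published recipe:

```python
# A compact sketch of losses commonly combined when training a face-swap
# generator end-to-end. Weights and tensor shapes are illustrative only.
import torch
import torch.nn.functional as F

def generator_loss(swapped, target, disc_score_on_swap,
                   id_source, id_swapped, same_person):
    # Adversarial term: the discriminator should score the swap as real.
    adv = F.binary_cross_entropy_with_logits(
        disc_score_on_swap, torch.ones_like(disc_score_on_swap))
    # Identity term: the swapped face should keep the source identity.
    ident = 1.0 - F.cosine_similarity(id_source, id_swapped).mean()
    # Reconstruction term: when source == target, the output should match the target.
    recon = F.l1_loss(swapped, target) if same_person else torch.tensor(0.0)
    return adv + 10.0 * ident + 10.0 * recon
```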
There are many factors for evaluating a face swap, but one of the most important is measuring how similar the source face and the result are.
When we compare how similar two images are, we turn the images into a set of numbers (feature vectors) that represent their key features.
These numbers are like a special code that tells us what’s in the image.
We then compare these codes using something called cosine similarity, which is like measuring how close together two arrows are pointing.
If the arrows point in almost the same direction, it means the images are very similar.
If the arrows point in opposite directions, it means the images are very different.
When we say the cosine similarity is 0.9, it’s like saying the arrows are pointing very close to each other, which means the images have a lot of the same features and look quite alike.
So, a cosine similarity of 0.9 between two images tells us that they look very similar to each other.
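In code, the "arrows" are just feature vectors, and the comparison is a dot product divided by the vector lengths. Here is a tiny sketch with toy numbers, for illustration only:

```python
# A minimal cosine-similarity sketch over two feature vectors ("arrows").
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.9, 0.1, 0.4])   # toy feature vectors, for illustration only
b = np.array([0.8, 0.2, 0.5])
print(cosine_similarity(a, b))  # close to 1.0 -> the "arrows" point the same way
```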
Unlike other services and products that use open-source AI models, DeepSwap has developed its own AI model over many years.
It has been trained on more than 50,000 images from public datasets such as CelebA and LFW, using data augmentation techniques like rotation, cropping, and color transformation to enhance the diversity of the data.
When I tested the similarity of DeepSwap’s face swapping, I used the PyTorch implementation of InsightFace to compute the identity loss after the face swap.
After extensive testing, DeepSwap’s loss averages only 0.06, which means the similarity of its face-swapping results reaches 1 - 0.06 = 0.94, or 94%.
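This is not the exact script used for those measurements, but such a number could be reproduced roughly as follows: extract ArcFace embeddings of the source face and the swapped result with insightface, take their cosine similarity, and treat one minus that value as the loss (file names below are hypothetical):

```python
# A sketch of reproducing the identity-similarity measurement with insightface:
# cosine similarity between embeddings of the source face and the swapped result.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis()
app.prepare(ctx_id=0)

def identity_similarity(source_path, result_path):
    emb_src = app.get(cv2.imread(source_path))[0].normed_embedding
    emb_res = app.get(cv2.imread(result_path))[0].normed_embedding
    return float(np.dot(emb_src, emb_res))    # embeddings are L2-normalized

sim = identity_similarity("source.jpg", "result.jpg")    # hypothetical files
print(f"similarity {sim:.2f}, loss {1 - sim:.2f}")        # e.g. 0.94 -> loss 0.06
```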
Typically, if the similarity exceeds 80%, it becomes quite difficult for the human eye to recognize the differences.
If the number surpasses 90%, the result is virtually indistinguishable from the original person.
In contrast, general open-source models usually achieve a similarity of about 60% to 70%.