- Title: Neural Aggregation Network for Video Face Recognition
- Authors: Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, Gang Hua
- Link: https://arxiv.org/abs/1603.05474
- Tags: Neural Network, Face Recognition
- Year: 2016
- What
  - They propose a method to obtain a cumulative/aggregated embedding from a sequence of embeddings (i.e., a single face embedding vector from a video).
- How
  - Use an attention mechanism to weight the embeddings in a sequence.
  - They suggest two options (sketched in code after this list):
    - Single attention block – universal face feature quality measurement:
      e_k = q^T f_k,   a_k = exp(e_k) / Σ_j exp(e_j),   r = Σ_k a_k f_k,
      where f_k is the embedding of the k-th image in the sequence, a_k is the weight obtained for the k-th embedding, and r is the aggregated embedding.
      Trainable parameter: q (shape = embedding size x 1).
    - Cascaded two attention blocks – content-aware aggregation:
      q^1 = tanh(W r^0 + b),
      where r^0 is the aggregation produced by the first attention block. This q^1 replaces q in the formula above to compute new coefficients a_k and the final aggregation r^1.
      Trainable parameters: W (shape = embedding size x embedding size), b (shape = embedding size x 1).
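
A minimal PyTorch sketch of both aggregation variants (names such as `AttentionAggregation`, the initialization, and the use of an `nn.Linear` to hold W and b are my own choices, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAggregation(nn.Module):
    """Aggregates a variable number of face embeddings into a single vector."""

    def __init__(self, embed_dim: int, cascaded: bool = True):
        super().__init__()
        # Single attention block: a universal, data-independent quality kernel q.
        self.q = nn.Parameter(torch.randn(embed_dim) * 0.01)
        self.cascaded = cascaded
        if cascaded:
            # Second block: q^1 = tanh(W r^0 + b); the Linear layer holds W and b.
            self.transfer = nn.Linear(embed_dim, embed_dim)

    @staticmethod
    def _attend(features: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # features: (K, D). Significances e_k = q . f_k, weights a_k via softmax,
        # aggregation r = sum_k a_k * f_k.
        e = features @ q                                # (K,)
        a = F.softmax(e, dim=0)                         # (K,)
        return (a.unsqueeze(1) * features).sum(dim=0)   # (D,)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        r0 = self._attend(features, self.q)             # first (universal) block
        if not self.cascaded:
            return r0
        q1 = torch.tanh(self.transfer(r0))              # content-aware kernel q^1
        return self._attend(features, q1)               # second block -> final r^1
```

Usage: `AttentionAggregation(128)(torch.randn(30, 128))` turns 30 frame embeddings of size 128 into a single 128-d descriptor.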
  - The face embedder (this could be any CNN) and the attention blocks can be trained together end-to-end or separately, one after the other (see the wiring sketch below).
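
One possible end-to-end wiring, reusing the `AttentionAggregation` sketch above; `VideoFaceNet` and the freezing snippet are illustrative assumptions, with any face CNN plugged in as the embedder:

```python
import torch
import torch.nn as nn


class VideoFaceNet(nn.Module):
    """Any face CNN in front of the attention aggregation, trainable end-to-end."""

    def __init__(self, embedder: nn.Module, embed_dim: int):
        super().__init__()
        self.embedder = embedder                       # maps (K, C, H, W) -> (K, D)
        self.aggregate = AttentionAggregation(embed_dim, cascaded=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        features = self.embedder(frames)               # per-frame embeddings (K, D)
        return self.aggregate(features)                # one descriptor per video (D,)


# Stage-wise alternative: freeze a pre-trained embedder and train only the
# attention blocks, e.g.
#   for p in net.embedder.parameters():
#       p.requires_grad = False
```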
  - Training procedure (loss sketches below):
    - For the verification problem they used a siamese structure with a contrastive loss.
    - For the identification problem they used softmax with cross-entropy as the loss function.
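
Hedged sketches of the two objectives, applied to the aggregated descriptors; the margin, embedding size, and number of identities are placeholders, not values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_loss(r1: torch.Tensor, r2: torch.Tensor,
                     same_identity: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Verification: siamese pairs of aggregated descriptors r1, r2 of shape (B, D)."""
    d = F.pairwise_distance(r1, r2)                          # (B,)
    pull = same_identity * d.pow(2)                          # matching pairs closer
    push = (1 - same_identity) * F.relu(margin - d).pow(2)   # non-matching pairs apart
    return 0.5 * (pull + push).mean()


# Identification: a linear classifier over the aggregated descriptor, trained
# with softmax cross-entropy.
embed_dim, num_identities = 128, 1000          # placeholder sizes
classifier = nn.Linear(embed_dim, num_identities)
criterion = nn.CrossEntropyLoss()              # applies softmax internally
```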
  - There are no recurrent blocks, yet the model is still independent of the input size (number of frames).
  - The coefficients a_k from the first attention block correlate strongly with face quality and with how useful the face is for recognition.
- Results
  - Shows better results than collapsing the embeddings into a single one by taking the mean, the median, the l2/cosine-closest embedding, etc.
  - Shows state-of-the-art performance on the YouTube Faces and IJB-A datasets.