Homework 2 Part 2
Face Classification & Verification using CNN
11-785: Introduction to Deep Learning (Fall 2024)
DUE: 12th Oct, 2024
Writeup Version: 1.0.0
Start Here
• Collaboration policy:
– Youareexpectedtocomplywiththe University Policy on Academic Integrity and Plagiarism.
– You are allowed to talk and work with other students for homework assignments.
– You can share ideas but not code, you must submit your own code. All submitted
code will be compared against all code submitted this semester and in previous
semesters using MOSS.
– You are allowed to help your friends debug, however - you are not allowed to type
code for your friend
– You are not allowed to look at your friends’ code while typing your solution
– You are not allowed to copy and paste solutions off the internet
– Meeting regularly with your study group to work together is highly encouraged.
You can even see from each other’s solution what is effective, and what is ineffec-
tive. You can even “divide and conquer” to explore different strategies together
before piecing together the most effective strategies. However, the actual code
used to obtain the final submission must be entirely your own.
• Overview:
– Part 2: This section of the homework is an open-ended competition hosted on
Kaggle.com, a popular service for hosting predictive modeling and data analytics
competitions. The competition page can be found here.
– Part 2 Multiple Choice Questions: You need to take a quiz before you start
with HW2-Part 2. This quiz can be found on Canvas under HW2P2: MCQ
(Early deadline). It is mandatory to complete this quiz before the early
deadline for HW2-Part 2.
• Submission:
– Part 2: See the the competition page for details.
1
Homework objective
After this homework, you would ideally have learned:
• To implement CNNs for image data
– How to handle image data
– How to use augmentation techniques for images
– How to implement your own CNN architecture
– How to train the model
– How to optimize the model using regularization techniques
• To derive semantically meaningful representations
– To understand what semantic similarity means in the context of images
– To implement CNN architectures that are one of the many ways commonly used
for representation learning
– To identify similarity or distance metrics to compare the extracted feature repre-
sentations
– To measure the semantic similarity between two derived representations using
these appropriate similarity measures. Use this to generate discriminative and
generalizable feature representations for data. Explore different advanced loss
functions and architectures to improve the learned representations
– Learn how classification and verification are connected
• To explore architectures and hyperparameters for the optimal solution
– To identify and tabulate all the various design/architecture choices, parameters,
and hyperparameters that affect your solution
– Todevisestrategiestosearchthroughthisspaceofoptionstofindthebestsolution
• To engineer the solution using your tools
– To use objects from the PyTorch framework to build a CNN.
– To deal with issues of data loading, memory usage, arithmetic precision, etc. to
maximize the time efficiency of your training and inference
2
Checklist
Here is a checklist page that you can use to keep track of your progress as you go through
the write-up and implement the corresponding sections in your starter notebook.
1. Getting started
Join the Kaggle competition
Download the starter notebook
Load the libraries, install Kaggle API, and download Kaggle dataset files
2. Completethetrain, val, andtestdatasetclasses, initializethedatasetsanddataloaders
Explore and visualize your dataset
Explore and experiment with transformations and normalization
3. Build a model
4. Run training and evaluation
Set up everything for training
Save your model checkpoints
5. Run testing and submit final predictions to Kaggle
6. (Optional) Finetune the model using different losses
Refer to the additional notebook for more details
3
Contents
1 Introduction 6
1.1 Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Architectures of CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Face Recognition Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Problem Specifics 7
3 Data Description 9
3.1 Dataset Class - ImageFolder . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4 CNN Architectures and Data Augmentations 10
4.1 How do we differentiate faces? . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2 How do we train CNNs to produce multi-class classification? . . . . . . . . . 10
4.3 Transformations and Data Augmentation . . . . . . . . . . . . . . . . . . . . 11
4.3.1 Why Transformations Matter: . . . . . . . . . . . . . . . . . . . . . . 11
4.3.2 Important Considerations: . . . . . . . . . . . . . . . . . . . . . . . 12
4.4 Create deeper layers with residual networks . . . . . . . . . . . . . . . . . . 12
4.4.1 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.4.2 ConvNeXt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5 Face Recognition 15
5.1 Training Face Classifiers with Softmax Cross-Entropy . . . . . . . . . . . . . 15
5.2 A Closer Look: What Are We Learning with CE? . . . . . . . . . . . . . . . 16
5.3 Objectives of Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4 Feature-based Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.4.2 Triplet Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.4.3 N-Pair Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.4.4 Remark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.5 Margin-based Softmax Loss Functions . . . . . . . . . . . . . . . . . . . . . . 19
5.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.5.2 ArcFace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.5.3 Combined Margin-based Loss Functions . . . . . . . . . . . . . . . . 20
5.5.4 Remark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.6 A Unified View of Feature-based and Margin-based Softmax Loss Functions 21
5.6.1 Positives vs. Negatives . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.6.2 Circle Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.6.3 Loss Function Rewrite . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.6.4 Supervised Contrastive Loss . . . . . . . . . . . . . . . . . . . . . . . 22
5.7 Remark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6 Face Verification 24
6.1 Building upon the multi-class classification . . . . . . . . . . . . . . . . . . . 24
6.2 Verification Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4
7 Kaggle Competitions 25
7.1 File Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.1.1 Classification Dataset Folder . . . . . . . . . . . . . . . . . . . . . . . 25
7.1.2 Kaggle Verification dataset folder . . . . . . . . . . . . . . . . . . . . 25
7.1.3 Evaluation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8 Conclusion 26
Appendix A 27
A.1 List of relevant recitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5
1 Introduction
1.1 Overview
In this homework, we will build and train Convolutional Neural Networks (CNNs) for face
recognition tasks. The most successful face recognition systems today, as of early 2019 [1, 2],
can achieve more than 99% accuracy for faces of any identity. This remarkable accuracy
implies that face recognition systems do not necessarily need to be trained on face images
of all possible identities in the world but can still robustly conduct open-set inference, i.e.,
recognizing identities not present in the training set. Beyond face recognition, such systems
can be extended to many other applications, such as retail product recognition, surveillance
person identification, and vehicle identification, as long as they are trained on the specific
data relevant to these tasks.
In this homework, we will learn how to build our own face recognition systems from two
perspectives:
• How to build effective CNN architectures.
• How to build CNNs for face verification using different loss functions.
1.2 Architectures of CNNs
Convolutional Neural Networks (CNNs) have significantly evolved since their inception, be-
coming more sophisticated and efficient in handling complex tasks. Early architectures like
LeNet [3] and AlexNet [4] paved the way, focusing on simple layers and large fully connected
networks. Subsequent architectures, such as VGGNet [5], introduced deeper networks with
uniform layer structures, while ResNet [6] introduced residual connections to enable the
training of much deeper networks.
In this homework, we will focus on modern CNN architectures known for their effective-
ness in face recognition tasks, such as ResNet [6] and its variants [7, 8]. These architectures
are designed to extract robust features that are crucial for accurate face recognition.
1.3 Face Recognition Loss Functions
Facerecognitionpresentsuniquechallenges, includingvariationsinlighting, pose, expression,
and occlusion, all of which can significantly affect the performance of recognition systems.
To address these challenges, the loss functions used to train face recognition models must
ensure that the features extracted are not only discriminative but also invariant to these
variations.
We will explore several loss functions specifically designed to enhance the performance of
face recognition systems, including feature-based losses like Triplet Loss [9] and N-Pair Loss
[10], as well as margin-based Softmax losses like ArcFace [2]. These loss functions aim to
improve the separability of different identities in the feature space, ensuring that the system
performs well even under challenging conditions.
The trained networks will be validated via face verification, where given two face images,
your network needs to determine whether these two faces are from the same person identity.
6
2 Problem Specifics
In this assignment, you’ll explore how to extract important features from face images that
can be used to effectively verify identities. By the end of this homework, you’ll have a better
understanding of how face recognition systems work, particularly in distinguishing between
different faces.For this, you will have to implement the following:
• A face classifier that can extract feature vectors from face images. The face
classifier consists of two main parts:
– Feature extractor:
∗ Objective: Yourgoalistodesignamodelthatcanlearnandidentifydistinc-
tive facial features (like skin tone, hair color, nose size, etc.) from a person’s
faceimage. Thesefeatureswillberepresentedasafixed-lengthfeaturevector,
commonly known as a face embedding.
∗ How It Works: To achieve this, you will explore different architectures that
involve multiple convolutional layers. These layers will help your model break
down the image into various components:
· Low-Level Features: The first few layers might detect simple elements
like edges or lines.
· High-Level Features: As you stack more convolutional layers, the
model will start to recognize more complex patterns by combining these
low-level features, such as detecting shapes, textures, and specific facial
attributes.
∗ Why It Matters: This hierarchical decomposition of the image is essential
because it allows the model to understand and capture the detailed charac-
teristics that make each face unique.
– Classification Layer:
∗ Objective: Onceyouhaveobtainedthefeaturevector(faceembedding)from
the feature extractor, you’ll use it to classify the image into one of several
categories (e.g., identifying the person in the image from a set of known
identities).
∗ How It Works: The feature vector will be passed through a linear layer
or a Multi-Layer Perceptron (MLP), which will output the probability of
the image belonging to each of the ’N’ categories. The model will then use
cross-entropy loss during training to optimize its performance.
∗ Why It Matters: After training the model, these feature vectors can be
used not just for classification but also for face verification tasks, where you’ll
compare feature vectors from different images to determine if they belong to
the same person.
Your model needs to be able to learn facial features (e.g., skin tone, hair
color, nose size, etc.) from an image of a person’s face and represent them as
a fixed-length feature vector called face embedding. In order to do this, you
will explore architectures consisting of multiple convolutional layers.
7
• A verification system that computes the similarity between feature vectors
of two images. Essentially, the face verification system takes two images as input
and outputs a similarity score that represents how similar the two images are and if
they are of the same person or not. The verification consists of two steps:
1. Extracting the feature vectors from the images.
2. Comparing the feature vectors using a similarity metric.
A vanilla verification system looks like this:
1. image1 =⇒ feature extractor =⇒ feature vector1
2. image2 =⇒ feature extractor =⇒ feature vector2
3. feature vector1, feature vector2 =⇒ similarity metric =⇒ similarity score
NOTE:Forabetterunderstandingofwhatishappeningduringtrainingandinference.
Please refer to fig. 1
Figure 1: Face Identification and Verification setup
8
3 Data Description
For this homework, your training task (classification) and inference task (verification) would
be are different. Thus we will have two datasets.
• The first dataset for classification training used in this homework is a subset of
the VGGFace2 dataset. This dataset is very widely known and used in research and
industry. Images are downloaded from Google Image Search and have large variations
in pose, age, illumination, ethnicity, and profession (e.g., actors, athletes, politicians).
The classification dataset consists of 8631 identities with image resolution of 112×112.
• The second dataset is a verification dataset used only for validation and testing. It
consists of 6000 image pairs with 5749 identities in total.
3.1 Dataset Class - ImageFolder
Implementing the Dataset and Dataloader class for this homework is actually very straight-
forward: wewillbeusingtheImageFolderclassfromthetorchvisionlibraryandpassingitthe
pathtothetrainingandvalidationdataset. Theimagesinsubfoldersof classification data
are arranged in a way that is compatible with this dataset class. Since the folder names cor-
respond to the classes and the images of respective classes are placed in folders with the same
names, the ImageFolder class will automatically infer the labels and make a dataset object,
which we can then pass on to the dataloader. The only thing to keep in mind is to include
the image transforms when passing the dataset to ensure data augmentation is applied.
9
4 CNN Architectures and Data Augmentations
4.1 How do we differentiate faces?
Before diving into the implementation, let’s pause and ask an important question: How do
we differentiate faces?
You mightthink offeatures likeskin tone, eye shapes, nosesize, and othercharacteristics.
These are known as facial features. These features vary significantly from person to person
and are what make each of us unique. The primary goal of this assignment is to train a
Convolutional Neural Network (CNN) model to identify and represent these important facial
features from a person’s face image. The model will do this by extracting these features and
encoding them into a fixed-length vector called a face embedding.
Once your CNN model is capable of encoding sufficient and distinctive facial features
into these face embeddings, you can then use these embeddings for further tasks, such as:
• Face Classification: Assigning an ID or label to a given face.
• Face Recognition/Verification: Identifying or verifying a person based on their
face. The face could never appear in your training set.
In this assignment, you’ll perform face verification by modifying your trained model. During
inference, you’ll remove the classification layer and use the face embeddings generated during
training. These embeddings, which represent unique facial features, will be compared to
determine if two face images belong to the same person, allowing you to verify identities
instead of classifying them.
4.2 How do we train CNNs to produce multi-class classification?
Now comes our second question: how should we train your CNN to produce high-quality
face embeddings? It may sound fancy, but conducting face classification is just doing a
multi-class classification: the input to your system is a face’s image, and your model
needs to predict the ID of the face.
Suppose the labeled dataset contains a total of M images that belong to N different peo-
ple (where M > N). Your goal is to train your model on this dataset to produce “good” face
embeddings. You can do this by optimizing these embeddings to predict the face IDs from
the images. The resulting embeddings will encode a lot of discriminative facial features,
just as desired. This suggests an N-class classification task. A typical multi-class classifier
conforms to the following architecture:
Classic multi-class classifier = feature extractor(CNN) + classifier(FC)
More concretely, your network consists of several (convolutional) layers for feature ex-
traction. The core operation in a convolutional layer involves sliding a filter (also known as
kernel) over the input data (for example an image) to produce a output feature map. The
output of the last such feature extraction layers (i.e. the final output feature map) would
be the face embedding. You will pass this face embedding through a linear layer whose
dimension is embedding dim × num of face-ids to classify the image among the N (i.e., num
10
Figure 2: A typical face classification architecture
of face-ids) people. You can then use cross-entropy loss to optimize your network to predict
the correct person for every training image.
The ground truth will be provided in the training data (making it supervised learning).
You are also given a validation set for fine-tuning your model. Please refer to the Dataset
section where you can find more details about what dataset you are given and how it is or-
ganized. To understand how we (and you) evaluate your system, please refer to the System
Evaluation section.
4.3 Transformations and Data Augmentation
When working with images in deep learning, it’s essential to prepare and augment your
data to improve your model’s performance. PyTorch provides a module called torchvi-
sion.transforms that is specifically designed for this purpose. This module includes a va-
riety of pre-defined transformations that can be easily applied to images or entire datasets.
Some of the most commonly used transformations include resizing, cropping, flipping,
rotating, adjusting brightness/contrast, normalizing pixel values, and converting
images to tensors.
4.3.1 Why Transformations Matter:
Image transformations, often referred to as data augmentation techniques, play a crucial role
in training Convolutional Neural Networks (CNNs). While they don’t directly generate new
features, they offer several significant benefits:
• 1. Increasing Data Volume: By applying transformations to your images, you
effectively increase the size of your dataset. This helps the model learn from a wider
variety of examples, which can improve both training and generalization.
• 2. Preventing Overfitting: Overfitting occurs when a model memorizes the training
datainsteadoflearningtogeneralize. Byexposingthemodeltoslightlyalteredversions
11
of the same images, transformations help the model focus on learning robust features
rather than specific details of individual images.
• 3. Invariance: Transformations teach your model to recognize objects regardless of
their orientation, position, or other variations. This means that your model becomes
more versatile and capable of handling real-world scenarios where such variations are
common.
• 4. Better Generalization: A diverse training set, created through transformations,
exposes your model to a wider range of scenarios. This can lead to better performance
when the model encounters new, unseen data.
4.3.2 Important Considerations:
While transformations can be highly beneficial, they should be chosen carefully to suit your
specific task. Here are some key points to keep in mind:
• Appropriate Transformations: Ensure that the transformations you use are rel-
evant to the type of data you’re working with. For instance, if you’re working with
face images, applying vertical flips might not be appropriate, as upside-down faces are
uncommon in real-world data.
• Impact on Training Time: Some transformations, especially those that require
significant processing, can increase the training time per epoch. It’s important to
balance the benefits of augmentation with the computational cost.
• Normalization: Many models require the input data to be normalized. This involves
subtracting the mean from each pixel and dividing by the standard deviation.
This process ensures that the data is on a similar scale, which can help with model
convergence. For more on how to normalize images in PyTorch, you can refer to
this guide 1. PyTorch’s torchvision.transforms.Normalize() can be used for this
purpose, but make sure to check whether it takes in images or tensors as input.
However, it’s important to note that while transformations can be very beneficial, they
should be chosen carefully. There are potential downsides to using image transformations.
These include increased training time due to more data, the risk of inappropriate transfor-
mations depending on the context, possible distortion or loss of information, the need for
careful parameter choice, and not entirely resolving overfitting issues, especially in cases of
very small datasets or complex models.
4.4 Create deeper layers with residual networks
Having a network that is good at feature extraction and being able to efficiently train that
network is the core of the classification task. This homework requires training deep neural
1How to Normalize Images in PyTorch
12
networks and, as it turns out, deep neural networks are difficult to train, because they suffer
from vanishing and exploding gradients types of problems. Here we will learn about skip
connections that allow us to take the activations of one layer and suddenly feed it to an-
other layer, even much deeper in the network. Using that, we can build residual networks
(like ResNets), which enable us to train very deep neural networks, sometimes even networks
of over one hundred layers.
Figure 3: A Residual Block
Resnets are made of something called residual blocks, which are a set of layers that are
connected to each other, and the input of the first layer is added to the output of the last
layer in the block. This is called a residual connection. This identity mapping does not
have any parameters and is just there to add the input to the output of the block. This
allows deeper networks to be built and trained efficiently.
Several other blocks make use of residual blocks and residual connections and can be used
for the classification task, such as ResNet, SEResNet, ConvNext, MobilNet, etc. You
are encouraged to read their respective research papers to understand better how they work
and implement blocks from these architectures.
4.4.1 ResNet
ResNet models were proposed in “Deep Residual Learning for Image Recognition”. Here
we have the 5 versions of ResNet models, which contain 18, 34, 50, 101, and 152 layers,
respectively. Detailed model architectures can be found in the paper linked above.
4.4.2 ConvNeXt
ConvNeXt is a recent CNN architecture that uses inverted bottlenecks inspired by the Swin
Transformer, residual blocks, and depthwise separable convolutions instead of regular con-
volutions. A comparison of the ResNet-50 and ConvNeXt-T and the detailed architecture
can be found in “A ConvNet for the 2020s”.
Since you will be using blocks and customized versions of these architectures, the per-
formance may or may not match the expected outcomes based on the benchmarks in the
13
papers. Thus, you are encouraged to explore various architectures systematically to get you
to the high cut-off. That’s pretty much everything you need to know for your Classification
Kaggle competition. Go for it!
14
5 Face Recognition
To develop a robust face recognition system, it is essential to understand how the training
data and model architecture are defined and interconnected.
LetusdefinethetrainingdatasetasD = {(x ,y )} ,wherex ∈ RH×W×C indicates
train i i i∈[N] i
an input face image, y is the corresponding one-hot ground-truth identity for the image,
i
i.e., a one-hot class label, and N denotes the dataset size. We can define our face recognition
system as the stack of a CNN-based feature extractor z = f (x) with learnable parameters θ,
θ
which transforms the input images x into features z ∈ RD, where D indicates the dimension
of the feature space, and a linear projection head s = g (z) with learnable parameters ω,
ω
which maps the features z into scores s ∈ RC with dimension C.
To express the relationship between the feature vectors and the scores in matrix form,
we can write:
z = f (x )
i θ i
(1)
s = g (z ) = z W
i ω i i
Here, W ∈ RD×C is the weight matrix of the linear projection head, which maps the
feature vector z ∈ RD to the score vector s ∈ RC. Each element sj in the score vector
i i i
represents the unnormalized logit for class j, which will be used to compute the probability
of the input image belonging to each identity class through the Softmax function.
5.1 Training Face Classifiers with Softmax Cross-Entropy
In this part, we will show the first modeling of a face recognition system as a face identity
classifier. As you learned from the class, this is as easy as solving a classification problem
using a feature extractor of CNN and a classification head with learnable weights W ∈
RD×C2, by optimizing the Cross-Entropy (CE) loss with the Softmax function. Recall that,
in the classification setting, we usually term the score vectors s as logits, and the Softmax
function converts them into probabilities:
exp(s1) exp(sC)
p = Softmax(s) = [ ,..., ] ∈ RC, (2)
(cid:80)C
exp(sk)
(cid:80)C
exp(sk)
k=1 k=1
where each element in the vector indicates the (predicted) probability of this sample belong
to a class, e.g., pj = exp(s1) is the (predicted) probability of a sample belonging to class
(cid:80)C exp(sk)
k=1
k. We optimize the Cross-Entropy (CE) to encourage the probability of each sample x
i
belonging y to be maximized, i.e, the cross-entropy between the predicted probability and
i
the one-hot ground truth to be minimized:
N C
1 (cid:88)(cid:88)
L = −yjlogpj. (3)
CE N i i
i j
2Here we omit the bias term in the linear layer, but it can be easily incorporated into W.
15
Figure 4: Feature visualization of Softmax CE loss.
With proper CNN architecture and data augmentation, we can obtain a strong classification
baseline for this homework.
However, there are two main limitations of treating face recognition as a classification
problem. The first limitation is obvious. The trained face classifier can only recognize the
identities of the faces y that are present in the training data. During inference, we predict
the identity of the face as argmax pj for the testing input. This means the system only
j
works for closed-set identities, and when we want the system to recognize new open-set
identity that is not included in the training set, we need to re-train the whole system. This
becomes extremely un-affordable for million-scale and billion-scale face recognition systems,
the most common scenarios in reality. One can circumvent this limitation by performing face
recognition from the trained feature extractor only, without using the classification head.
This requires us to compare the features of the testing input with all the features of the
training data, and return the identity/label of the training sample with the largest similarity
ofthefeatures: argmax Similarity(z,z ). Thesecondlimitationisthatthefeatureextractor
yi i
trained by classification with CE loss and Softmax presents a very poor similarity measure,
which is less obvious and we will elaborate in section 5.2.
5.2 A Closer Look: What Are We Learning with CE?
C
(cid:88)
l = −yjlogpj
CE
j
(cid:88)C exp(sj)
= −yjlog
(cid:80)C
exp(sk)
j k=1
(cid:80)C exp(sk) (4)
k=1,k̸=y
= −log
exp(sy)
C
(cid:88)
= sy −log exp(sk)
k=1,k̸=y
≈ sy − max (sk)
k∈[C],k̸=y
The objective of Softmax CE loss: optimizing the ground-truth score sy to be larger than
the maximum of non-ground-truth score max (sk).
k∈[C],k̸=y
16
Wehavesj = zwT = ∥z∥∥wT∥cosθ . ThisindicatesthatSoftmaxCElosswillstrengthen
j j j
the length of feature vector and the weight vector, resulting in a radial feature space, as
shown in fig. 4. A radial feature space presents very limited discriminative features for
feature matching of face recognition. For example, z and z belong to the same class, but
1 2
it present small feature similarity compared to z and z due to the strengthed features.
1 3
5.3 Objectives of Face Recognition
The primary goals for an effective open-set face recognition system are as follows:
• Maximize Intra-Class Similarity: Ensure that face images of the same identity
have highly similar feature representations. This means that features extracted from
different images of the same person should be close together in the feature space, which
facilitates accurate identification.
• Minimize Inter-Class Similarity: Ensure that face images of different identities
have distinct feature representations. This involves making the features of different
individuals sufficiently different, thereby reducing the chance of mis-identification.
• Maintain a Large Margin between Intra-Class and Inter-Class Similarity:
Ensure that the similarity between features of the same class (intra-class similarity) is
significantly higher than the similarity between features of different classes (inter-class
similarity). This is crucial for robust face recognition, especially when dealing with
challenging cases such as similar-looking images taken under varying conditions.
To mathematically capture these objectives, we define a loss function that penalizes small
intra-class distances and large inter-class distances. A common formulation is:
L = max(sn −sp +m,0), (5)
where:
• sp = Similarity(z ,zp) represents the similarity score between the anchor feature z and
i j i
a positive feature zp from the same identity.
j
• sn = Similarity(z ,zn) represents the similarity score between the anchor feature z
i k i
and a negative feature zn from a different identity.
k
• m is a margin parameter that enforces a gap between the positive and negative simi-
larity scores.
The objective of this loss function is to ensure that sn (the similarity with a negative
sample) is at least m units less than sp (the similarity with a positive sample). If this
conditionisnotmet,thelosswillbepositiveandthemodelwillbepenalized,therebypushing
the features to satisfy this margin constraint. By optimizing this loss, the model learns to
create a feature space where same-identity faces are clustered together, and different-identity
facesarewellseparated. Also, fromthisobjective, wecanobservethatinpurelyCEtraining,
17
the loss functions only encourage the positive similarity sy to be larger than the maximal of
the negative similarities sk,k ∈ [C],k ̸= y, without explicitly encourage a margin.
In the next, we will revisit the two paradigms of achieving this objective: feautre-based
loss functions (also known as metric learning) and margin-based loss functions. We will also
go through a unified view of both paradigms at the end.
5.4 Feature-based Loss Functions
5.4.1 Overview
Feature-based loss functions, also known as metric learning losses, focus on directly opti-
mizing the feature space to enhance the discriminative power of the learned representations.
Instead of merely classifying faces into predefined categories, these loss functions aim to
structure the feature space such that similar identities are clustered closely together while
dissimilar identities are pushed far apart. This approach is particularly advantageous for
open-set face recognition, where the system encounters identities not seen during training.
5.4.2 Triplet Loss
Triplet loss [9] is one of the most widely used feature-based loss functions. It operates on
triplets of images: an anchor image x , a positive image xp (from the same identity as the
i i
anchor), and a negative image xn (from a different identity). The goal of the triplet loss is
i
to ensure that the distance between the anchor and the positive image is smaller than the
distance between the anchor and the negative image by at least a margin α. The triplet loss
function is defined as:
N
(cid:88)(cid:104) (cid:105)
L = ∥f(xa)−f(xp)∥2 −∥f(xa)−f(xn)∥2 +m , (6)
triplet i i 2 i i 2
+
i=1
where f(x) denotes the feature representation of image x learned by the model, ∥·∥2 is the
2
squared Euclidean distance (Cosine similarity can also be used, in which case the sign of the
similarity score would be flipped), and [·] denotes the hinge loss that only penalizes the loss
+
when it is positive.
This loss encourages the model to learn a feature space where:
• The features of images from the same identity (anchor and positive) are close together.
• The features of images from different identities (anchor and negative) are far apart by
at least the margin m.
Looking closer at the Triplet loss function, you will find it is very similar to what was defined
earlier as the objective for face recognition, i.e., eq. (5), by replacing the similarity metric as
the Euclidean distance and flipping the sign.
18
5.4.3 N-Pair Loss
N-Pairloss[10]generalizestheconceptoftripletlossbyconsideringmultiplenegativesamples
for each anchor-positive pair, rather than just one. This allows for more robust training by
incorporating a broader context of the feature space. The N-Pair loss is defined as:
(cid:34) (cid:35)
N
L =
(cid:88)
log
1+(cid:88) exp(cid:0) f(xa)Tf(xn)−f(xa)Tf(xp)(cid:1)
, (7)
N-Pair i j i i
i=1 j̸=i
where f(xa)Tf(xn) denotes the similarity between the anchor and a negative sample, and
i j
f(xa)Tf(xp) denotes the similarity between the anchor and the positive sample.
i i
The N-Pair loss effectively encourages the model to maximize the similarity between
the anchor and the positive pair while minimizing the similarity with multiple negatives
(compared to only one negative in triplet loss), providing a more comprehensive training
signal.
5.4.4 Remark
Feature-based loss functions are powerful for structuring the feature space, but they come
with certain challenges:
• Directly Optimizing the Objective: Youwillfindthesemetriclossdireclyoptimize
the objectives we listed earlier.
• Difficult to Tune: Choosing the right margin parameter α in triplet loss or selecting
the appropriate number of negative samples in N-Pair loss can be challenging and may
require extensive hyper-parameter tuning. Usually, it makes life easier to combine
the metric learning loss with CE loss.
• Hard Sample Mining: The effectiveness of feature-based losses heavily relies on the
selection of hard samples—triplets or pairs where the model struggles to differentiate
between positive and negative samples. Efficient mining strategies are crucial but can
be computationally expensive.
• Training Complexity: These loss functions often require careful batching and sam-
pling strategies to ensure the training process is stable and effective, adding complexity
to the model training pipeline.
Despite these challenges, feature-based loss functions are indispensable for applications
requiring robust face recognition, especially in open-set scenarios where the system must
generalize to unseen identities.
5.5 Margin-based Softmax Loss Functions
5.5.1 Overview
Margin-basedSoftmaxlossfunctionsaredesignedtoenhancethediscriminativepoweroffea-
tureslearnedbydeepneuralnetworks,particularlyinclassificationtaskslikefacerecognition.
19
These loss functions modify the traditional Softmax Cross-Entropy (CE) loss by introducing
a margin that separates different classes more distinctly in the feature space. This approach
is particularly useful in scenarios where fine-grained distinctions between classes (e.g., differ-
ent identities) are critical, as it encourages the model to maximize inter-class variance while
minimizing intra-class variance.
5.5.2 ArcFace
ArcFace is one of the most popular margin-based Softmax loss functions. It introduces
an angular margin to the Softmax function, which helps in learning more discriminative
features by pushing the decision boundaries further away from each other. The formulation
of ArcFace is as follows:
Given the feature vector z of a sample and the corresponding weight vector w for class
i j
j, the original Softmax function is defined as:
sj = z ·w = ∥z ∥∥w ∥cosθ , (8)
i j i j j
where θ is the angle between the feature vector z and the weight vector w .
j i j
ArcFace modifies this by adding an angular margin m to θ , effectively making it
j
harder for the model to classify samples correctly, thus pushing the model to learn more
discriminative features. To make the angular margin effective, the magnitude of both feature
and weight vectors are normalized to unit length and scaled by a fixed parameter α , which
transforms the logit (similarity score) to:
sj = γcos(θ +m) = α(cosθ cosm−sinθ sinm). (9)
j j j
The modified logits are then passed through the Softmax function as usual:
p = Softmax(sj), (10)
The key idea is that by introducing the margin m, the model is forced to learn features
that not only separate classes, but also create a margin between them in the angular space,
leading to more robust face recognition.
5.5.3 Combined Margin-based Loss Functions
In addition to ArcFace, other margin-based loss functions such as CosFace [1], SphereFace
[11], and AM-Softmax [12] have been proposed, each introducing different types of margin
(additive angular, multiplicative angular, etc.). These can be combined to create a unified
margin-based loss function that takes advantage of multiple margin types simultaneously.
For instance, a combined margin-based loss function could be formulated as:
sj = γ(cos(θ +m )+m ), (11)
j 1 2
where m and m are different types of margins (e.g., angular and additive). This allows for
1 2
more flexibility in training and can lead to improved performance in specific face recognition
tasks. You are encouraged to try different combination of margin and find out which one
works the best.
20
5.5.4 Remark
Margin-based Softmax loss functions have several advantages:
• Easier to Train: Compared to feature-based loss functions, margin-based Softmax
losses are easier to implement and train because they do not require complex sam-
ple mining strategies. They directly modify the logits before applying the Softmax
function, which simplifies the training pipeline.
• Effective for Fine-Grained Classification: By introducing a margin, these loss
functions make the decision boundaries between classes more distinct, which is crucial
for tasks like face recognition where classes (identities) can be very similar.
• Limited to One Positive Class: A limitation of these methods is that they are
designed for single-label classification scenarios, where each sample belongs to only
one class. They are less effective in multi-label scenarios or when the task requires
recognizing multiple faces simultaneously.
• Scalability: Margin-based Softmax loss functions scale well with the number of
classes, making them suitable for large-scale face recognition systems.
In summary, margin-based Softmax loss functions strike a balance between ease of train-
ing and the ability to learn highly discriminative features, making them a popular choice for
modern face recognition systems.
5.6 A Unified View of Feature-based and Margin-based Softmax
Loss Functions
5.6.1 Positives vs. Negatives
Both feature-based and margin-based Softmax loss functions share a common goal: to en-
hance the discriminative power of the learned features by maximizing the similarity between
samples of the same class (positives) and minimizing the similarity between samples of dif-
ferent classes (negatives). While they approach this objective differently, the underlying
principles are closely related.
Feature-based loss functions, such as Triplet Loss and N-Pair Loss, explicitly operate
on the distances or similarities between anchor-positive and anchor-negative pairs. They
directly optimize the feature space to ensure that positive pairs are close and negative pairs
are far apart.
Margin-based Softmax loss functions, like ArcFace, introduce a margin in the classifi-
cation layer to implicitly achieve a similar effect. By modifying the logits before applying
the Softmax function, these losses create a separation (or margin) between different classes,
pushing the model to learn more distinct features. The weight of the last linear classifier
can be viewed as an anchor for each classes (especially after being normalized), which thus
ties the margin-based Softmax loss functions closely with feature based loss functions. In
the next, we will introduce more modern loss functions that provides a unified view of both
paradigms.
21
5.6.2 Circle Loss
Circle Loss [13] is an example of a unified approach that combines elements of both feature-
based and margin-based losses. It introduces a unified perspective by directly optimizing
the similarity scores with a margin, while also ensuring that the optimization is focused on
hard samples (those that are difficult to classify). Circle Loss is defined as:
(cid:34) (cid:35)
(cid:88) (cid:88)
L = log 1+ exp(γ(cosθ −m)) exp(γ(cosθ +m)) , (12)
circle pos neg
pos neg
where γ is a scaling factor, m is the margin, θ and θ are the angles corresponding to
pos neg
positive and negative pairs, respectively.
Circle Loss unifies the optimization of intra-class and inter-class similarities by applying
different margins to positive and negative pairs, ensuring that the learned feature space is
well-structured for discriminative tasks like face recognition.
5.6.3 Loss Function Rewrite
Both feature-based and margin-based losses can be rewritten to highlight their commonali-
ties. For instance, the triplet loss can be seen as a special case of a margin-based loss when
we view the distance metric as a form of similarity score with a margin. Similarly, margin-
based Softmax losses can be interpreted as imposing a margin directly on the logit-level
comparisons that occur in feature-based losses.
This unified perspective allows us to see that both approaches are variations of the same
fundamental principle: enforcing a margin between positive and negative samples to improve
feature discriminability.
5.6.4 Supervised Contrastive Loss
Supervised Contrastive Loss (SupCon) [14] is another loss function that unifies the principles
of feature-based and margin-based approaches. It leverages contrastive learning, a self-
supervised technique, in a supervised setting to enhance intra-class compactness and inter-
class separability. The SupCon loss is defined as:
N
1 (cid:88) −1 (cid:88) exp(z ·z /τ)
i p
L = log , (13)
supcon N |P(i)| (cid:80)2N I exp(z ·z /τ)
i=1 p∈P(i) a=1 [a̸=i] i a
where z and z are the feature representations of the anchor and positive samples, τ is a
i p
temperature scaling parameter, and P(i) denotes the set of positives for anchor i.
SupervisedContrastiveLossencouragessamplesfromthesameclasstobepulledtogether
in the feature space, while samples from different classes are pushed apart. This aligns with
the goals of both feature-based and margin-based losses, providing a robust approach to
learning discriminative features.
22
5.7 Remark
Although these loss functions provide a unified view of both learning paradigms, they usually
require a very large batch size (>2048) to achieve reasonable performance. We suggest that
you try these loss functions with multiple GPU cards or in combination with previous loss
functions. You will be able to achieve a high enough score with previous loss functions only.
23
6 Face Verification
Now let us switch gears to face verification. After training of your model using the loss
functions, the input to your system will now be a pair of face images that may or may not
belong to the same person during inferece. Given a pair, your goal is to output a numeric
score that quantifies how similar the faces in the two images are. A higher score indicates a
higher confidence about whether the faces in the two images are of the same person.
6.1 Building upon the multi-class classification
If your model yields high accuracy in face classification, you might already have a good
Feature Extractor for free. That being said, if you remove the fully connected/linear layer,
thisleavesyouwithaCNNthat“can”(probably could wouldbemoreaccuratehere)generate
discriminative face embeddings, given arbitrary face images.
6.2 Verification Pipeline
We shall all agree that The face embeddings of the same person should be similar (i.e.,
the distance between feature vectors generated is small), even if they are extracted from
different images. After being trained explicitly to maximize the similarity of positive pairs,
minimize the similarity of negative pairs, and encourage some margin between positives and
negatives, your model is capable to generate accurate face embeddings. We only need to
compute a proper distance metric to evaluate how close the given face embeddings are.
If two face embeddings are close in distance, they are more likely to be from the same person.
If you follow this design, your system should look like the Figure below. Please notice
that the Feature Extractor in Figure 5 is the same one, even though it is drawn twice.
Figure 5: Face verification architecture
24
7 Kaggle Competitions
For this assignment, we will provide one kaggle competition for the face verification task.
But we will provide validation data for both classification and verification tasks. One for
testing your classifier on closed-set identities, i.e., same as your training set, and another
for testing your verification system on open-set identities, i.e., different from your training
set. The purpose of the first classification competition is for you to test your classifier and
get a sense that higher classification accuracy may not necessarily imply a good verification
system (as shown in the CE loss part).
• Face classification
– Goal: Given a person’s face, return the identity of the face.
• Face verification
– Goal: Given a list of known and unknown identities, map each unknown identity
to either a known identity or a special, ”no-correspondence” label.
– Kaggle: https://www.kaggle.com/competitions/11785-hw-2-p-2-face-verification-fall-2024
7.1 File Structures
The structure of the dataset folders is as follows:
7.1.1 Classification Dataset Folder
• Each sub-folder in train and dev contains images of one person, and the name of that
sub-folder represents their ID.
– train: You are supposed to use the train set to train your model both for the
classification task and verification task.
– dev: You are supposed to use dev to validate the classification accuracy.
7.1.2 Kaggle Verification dataset folder
• ver data: This is the folder of all face images with unknown identities.
• test pair.csv: This is the test file where each two sampled images are treated as
a pair. And you need to predict a score for each pair for the kaggle to compute the
metric.
• verification sample submission.csv: This is a sample submission file for the face
verification competition. The first column is the index of the image files. Your task is
to assign a label to each image and generate a submission file as shown here.
25
7.1.3 Evaluation System
• Face Classification
This is quite straightforward,
# correctly classified images
Accuracy =
total images
• Kaggle: Face Verification
In this task, the performance is evaluated using the Equal Error Rate (EER). The
EER is the point at which the rate of false acceptances (False Acceptance Rate, FAR)
equals the rate of false rejections (False Rejection Rate, FRR). These rates are defined
as follows:
# of false acceptances
FAR =
total # of impostor attempts
# of false rejections
FRR =
total # of genuine attempts
TheEERisthevaluewherethesetworatesareequal, indicatingthethresholdatwhich
the system’s error rates are balanced. A lower EER indicates better performance.
EER = FAR = FRR
8 Conclusion
Nicely done! Here is the end of HW2P2, and the beginning of a new world. As always, feel
free to ask on Piazza if you have any questions. We are always here to help.
Good luck and enjoy the challenge!
26
Appendix A
A.1 List of relevant recitations
Please review the below recitations for supplementary material that could be helpful for this
assignment -
• Pytorch Fundamentals
• OOPS Fundamentals
• Google Colab
• GCP
• Kaggle
• Data Loaders
• WandB
• Blocks coding
• Discriminative Losses
References
[1] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng
Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
5265–5274, 2018.
[2] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive an-
gular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages 4690–4699, 2019.
[3] Yann LeCun, L´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learn-
ing applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. Advances in neural information processing systems,
25, 2012.
[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.
27
[7] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
[8] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and
Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 11976–11986, 2022.
[9] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Similarity-
based pattern recognition: third international workshop, SIMBAD 2015, Copenhagen,
Denmark, October 12-14, 2015. Proceedings 3, pages 84–92. Springer, 2015.
[10] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective.
Advances in neural information processing systems, 29, 2016.
[11] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song.
Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.
[12] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for
face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[13] Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang,
and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 6398–6407, 2020.
[14] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip
Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning.
Advances in neural information processing systems, 33:18661–18673, 2020.
28