Computer Science 401 10 Nov 2024
St. George Campus University of Toronto
Homework Assignment #3
Due: Tuesday, 3 December 2024 at at 23h59 (11:59pm)
Email: [email protected]. Please prefix your subject with [CSC401 F24-A3].
All accessibility and extension requests should be directed to [email protected].
Lead TAs: Yushi Guan.
Instructors: Gerald Penn.
Overview
This assignment introduces you to Gaussian mixture modelling, recurrent neural networks, and dynamic
programmingforspeechdata. Theassignmentisdividedintotwosections. Inthefirst,youwillexperiment
with acoustic-based classification for 1) speaker identification, and 2) deception detection. In the second
section, you will 1) evaluate two speech recognition engines, and 2) perform a naive speaker verification.
Data for the deception detection task come from the CSC Deceptive Speech corpus, which was
developed by Columbia University, SRI International, and University of Colorado Boulder. It consists of
32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female)
recruited from the Columbia University student population and the community. The purpose of the
study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on
extracted features from the corpus.
All starter code is available on MarkUs.
Data are on the teach.cs servers in the /u/cs401/A3/data/ directory. Each sub-folder represents
speech from one speaker and contains raw audio, pre-computed MFCCs, and orthographic transcripts.
Further file descriptions are in Appendix A.
All code should be compatible with the default Python 3 environment on teach.cs. You should call
your scripts with the command python3.10.
Please check Piazza regularly for assignment updates and help.
1 Sequence classification: Speakers and lies [35 marks]
Speaker identification is the task of correctly identifying speaker s from among S possible speakers s
c i=1..S
given an input speech sequence X, consisting of a succession of d-dimensional real vectors. In the interests
of efficiency, d = 13 in this assignment. Each vector represents a small 25 ms unit of speech called a frame.
Speakers are identified by training data that are ascribed to them. This is a discrete classification task
(choosing among several speakers) that uses continuous-valued data (the vectors of real numbers) as input.
Gaussian Mixture Models
Gaussian mixture models (GMMs) are often used to generalize models from sparse data. They can tightly
constrain large-dimensional data by using a small number of components but can, with many more compo-
nents, model arbitrary density distributions. Sometimes, they are simply used because the domain being
modelled appears to have multiple modes.
Copyright © 2024 University of Toronto. All rights reserved.
1
GivenM components,GMMsaremodelledbyacollectionofparameters,θ = {ω ,µ ,Σ },
m=1..M m=1..M m=1..M
where ω is the probability that an observation is generated by the mth component. These are subject
m
(cid:80)
to the constraint that ω = 1 and 0 ≤ ω ≤ 1. Each component is a multivariate Gaussian distri-
m m m
bution, which is characterized by that component’s mean, µ , and covariance matrix, Σ . For reasons
m m
of computational efficiency, we will reintroduce some independence assumptions by assuming that every
component’s covariance matrix is diagonal, i.e.:
Σ [1] 0 ··· 0
m
0 Σ m[2] ··· 0
Σ m = . .
.
0 0 ··· Σ [d]
m
forsomevectorΣ⃗ . Therefore,onlydparametersarenecessarytocharacterizeacomponent’s(co)variance.
m
1.1 Utility functions [10 marks]
We need three utility functions for our GMM. Your first task is to complete the implementation of those
functions in a3 gmm.py.
First, implementlog b m x, whichimplementsthelogobservationprobabilityofx forthemth mixture
t
component, i.e., the log of:
exp(cid:34)
−1
(cid:88)d
(x t[n]−µ
m[n])2(cid:35)
2 Σ [n]
m
b (⃗x ) = n=1 (1)
m t (cid:113)
(2π)d/2 (cid:81)d Σ [n]
n=1 m
Next, implement log p m x, which is the log probability of m given x using model θ, i.e., the log of:
t
ω b (⃗x )
m m t
p(m|⃗x ;θ) = (2)
t (cid:80)M
ω b (⃗x )
k=1 k k t
Finally, implement logLik, which is the log likelihood of a set of data X, i.e.:
T
(cid:16) (cid:17) (cid:88)
logP X˜;θ = logp(⃗x ;θ ) (3)
s t s
t=1
where
M
(cid:88)
p(x⃗ ;θ) = ω b (x⃗ ) (4)
t m m t
m=1
and b is defined in Equation 1. For efficiency, we just pass θ and precomputed b (x⃗ ) to this function.
m m t
Please follow the instruction given in your starter code carefully. It should provide you a very detailed
guideline on expected input, output and function behaviour.
1.2 Training Gaussian mixture models [5 marks]
Now we train an M-component GMM for each of the speakers in the data set. Specifically, for each
speaker s, train the parameters θ = {ω ,µ ,Σ } according to the method described in
s m=1..M m=1..M m=1..M
Appendix B. In all cases, assume that covariance matrices Σ are diagonal. Start with M = 8. You’ll be
m
asked to experiment with that in Section 1.4. Complete the function train in a3 gmm.py.
2
1.3 Classification with Gaussian mixture models [5 marks]
Now we test each of the test sequences we’ve already set aside for you in the main function. I.e., we check
if the actual speaker is also the most likely speaker, sˆ:
(cid:16) (cid:17)
sˆ= argmax logP X˜;θ (5)
s
s=1,...,S
To complete the function test in a3 gmm.py. Run through a train-test cycle, and save the output that
this function writes to stdout, using the k = 5 top alternatives, to the file a3 gmmLiks.txt.
1.4 Experiments and discussion [10 marks]
Experiment with the settings of M and maxIter (or ϵ if you wish). For example, what happens to
classification accuracy as the number of components decreases? What about when the number of possible
speakers,S,decreases? Youwillbemarkedonthedetailwithwhichyouempiricallyanswerthesequestions
and whether you can devise one or more additional valid experiments of this type.
Additionally, your report should include short hypothetical answers to the following questions:
• How might you improve the classification accuracy of the Gaussian mixtures, without adding more
training data?
• When would your classifier decide that a given test utterance comes from none of the trained speaker
models, and how would your classifier come to this decision?
• Can you think of some alternative methods for doing speaker identification that don’t use Gaussian
mixtures?
Put your experimental analysis and answers to these questions in the file a3 gmmDiscussion.txt.
1.5 End-to-end truth-and-lie detection with GRUs [5 marks]
Each of the utterances has been labelled as either truthful or deceitful (see Appendix A). Your task is to
train and test models to tell these utterances apart using the provided data.
• UsingtheacousticsequencesfromthepreviousGMMquestionsasinput, usetorch.nn.GRU tocreate
asimpleunidirectionalGRUwithone hidden layer. ThisGRUtakesinMFCCvectorsasinputs,
and output a single output (truth or false). The starter code labels lies as 1 and truth as 0.
• For this part, modify the __init__ method of the LieDetector in a3 model.py to create a GRU as
specified above. In addition, fill in the code to create a linear layer to project the GRU’s outputs to
logits.
• Experiment by training the model using different hidden sizes of 5, 10, and 50.
• To train, run:
srun -p csc401 --pty python3.10 train.py --source /u/cs401/A3/data/ --hidden_size
If you wish, you can experiment with the hyperparameters by specifying --batch_size, --lr,
--epochs or --optimizer (either ‘adam’ or ‘sgd’; by default, it is ‘adam’).
The srun -p csc401 --pty part of the command requests a compute node on teach.cs. This will
provide you with dedicated CPUs and reduce congestion during peak times. The --pty flag enables
pseudo-terminal mode, which allows your breakpoints to work properly. However, if you want to run the
assignment locally, you should remove it.
3
Is there a trend in performance with different hidden sizes? Explain this trend or lack thereof in 1-2
sentences. Write your model configurations, the detection performance (accuracy) and answers to the
discussion questions in the file a3 detectionDiscussion.txt.
2 Dynamic Programming in Speech [35 marks]
2.1 Word Error Rate in Speech Recognition [15 marks]
Automatic speech recognition (ASR) is the task of correctly identifying a word sequence given an input
speech sequence X. To simplify your lives, we have run two popular ASR engines on our data: the open-
source and highly customizable Kaldi (specifically, a bi-directional LSTM model trained on the Fisher
corpus), and the neither-open-source-nor-particularly-customizable Google Speech API.
We want to see which of Kaldi and Google are the most accurate on our data. For each speaker in
our data, we have three transcript files: transcripts.txt (the gold-standard transcripts, from humans),
transcripts.Kaldi.txt (the ASR output of Kaldi), and transcripts.Google.txt (the ASR output of
Google); see Appendix A.
For this part of the assignment, we want you to complete the file a3 levenshtein.py. Specifically, we
define the Levenshtein function that accepts lists of words r (Reference) and h (hypothesis), and return a
4-item list containing the floating-point WER, the number of substitutions, the number of insertions, and
the number of deletions where
#Insertions+#Deletions+#Substitutions
WER =
#ReferenceWords
Recall from lecture how Dynamic Programming comes into the play. You should feel familiar with how
the modules are structured based on the lecture material. You should complete the pre-defined modules
initialize, step and finalize and also fill in the rest of Levenshtein function.
For this part, we assume that the cost of a substitution is 0 if the words are identical and 1 otherwise.
The costs of insertion and deletion are both 1. Keep this assumption in mind.
The main function iterates through each of the speakers and each line i of their transcripts as well. You
do not have to worry about this part because we wrote the preprocessing and the entire pipeline for you.
In the output file a3 levenshtein.out, you should see something like this for each line:
[SPEAKER] [SYSTEM] [i] [WER] I:[#Insertions], D:[#Deletions], S:[#Substitutions]
where [SYSTEM] is either ‘Kaldi’ or ‘Google’.
You should also see the average and standard deviation of WER for each of Kaldi and Google, respec-
tively.
2.2 Dynamic Time Warping in Speaker Verification [20 marks]
In speaker verification, the goal is to confirm or deny a speaker’s claimed identity based on their voice.
This involves comparing a spoken sample (the ”candidate” sample) against a stored template or model of
the claimed speaker’s voice (the ”reference” sample). Given the natural variations in speech—even from
the same speaker—due to factors like emotion, health, environment, and context, exact matches between
samples are improbable. DTW addresses this by aligning the temporal sequences of feature vectors (often
MFCCs extracted from the speech samples) in a way that minimizes the overall distance between them,
accounting for variations in speaking rate and pronunciation.
Dynamic Time Warping (DTW) is another DP algorithm used in the field of speech processing, par-
ticularly in applications like speaker verification, speech recognition, and voice activity detection. Its
4
primary function is to measure the similarity between two temporal sequences, which could vary in speed
or duration, making it especially suitable for dealing with the temporal variabilities inherent in human
speech.
For this part of the assignment, we want you to complete the file a3 dtw.py. Specifically, we define
the DTW function in a very similar manner to Levenshtein from last part. The goal is to relate two pieces
of MFCC data with DTW. In DTW, the similarity between two sequences X ∈ RN×L and Y ∈ RM×L is
defined by the DTW distance DTW(X,Y):
DTW(X,Y) = D(N,M)
Where D is the Accumulative Cost Matrix filled by the following DP value function:
c(x 1,y 1) n = 1,m = 1
(cid:80)n
c(x ,y ) n ∈ [2 : N],m = 1
D(n,m) = k=1 k 1
(cid:80)m
k=1c(x 1,y k) n = 1,m ∈ [2 : M]
min{D(n−1,m−1),D(n,m−1),D(n−1,m)}+c(x ,y ) 1 < n ≤ N,1 < m ≤ M
n m
and c(x,y) is the cost function. For this assignment, since x,y ∈ RL, we use the Euclidean distance
between two vectors as the cost function.
In addition to DTW(X,Y), we are also interested in tracing the Optimal Warping Path (OWP) p∗ =
{p ,...,p }, where each element p for l ∈ [1 : L] is defined as:
1 L l
(1,1) l = 1
(1,m−1) n = 1,m ∈ [2 : M]
p = (n−1,1) n ∈ [2 : N],m = 1 1 < l < L,p = (n,m)
l l+1
argmin{D(n−1,m−1),D(n,m−1),D(n−1,m)} 1 < n ≤ N,1 < m ≤ M
(N,M) 1 = L
We have selected a few voice segments as a simple test for your code. You should be able to identify
the speaker that matches the reference speaker. You can find the right answer by observing the path we
used to load the data. Getting an intuitive answer with the default test should be a good verification for
your code.
Youcanplayaroundthevaluesinthemainfunction. Forexample,youcanmodifythereferencespeaker
to see what happens when you switch to a different speaker; you can try to edit the timestamps to navigate
to a different segment of the same speaker and see how that affects your result. You can also listen to the
raw audio of those selected segments. Write briefly about 1) what you did, 2) what you have observed for
each change you made and 3) why you think the result is the way it is, in a3 dtwDiscussion.txt. You
will be marked mainly on the effort rather than the results.
5
Submission requirements
This assignment is submitted electronically. Submit your assignment on MarkUs. You should submit:
1. All your code for a3 gmm.py, a3 model.py, a3 levenshtein.py and a3 dtw.py.
2. The output file a3 gmmLiks.txt.
3. Yourdiscussionfilesa3 gmmDiscussion.txt,a3 detectionDiscussion.txtanda3 dtwDiscussion.txt.
4. The ID.txt file available on the course website, with your details filled in.
Do not tar or compress your files, and do not place your files in subdirectories.
Using your own computer
If you want to do some or all of this assignment on your laptop or other computer, you will have to do
the extra work of downloading and installing the requisite software and data. You take on the risk that
your computer might not be adequate for the task. You are strongly advised to upload regular backups
of your work to teach.cs, so that if your machine fails or proves to be inadequate, you can immediately
continue working on the assignment at teach.cs. When you have completed the assignment, you should
try your programs out on teach.cs to make sure that they run correctly there. A submission that does
not work on teach.cs will get zero marks.
6
A Appendix: Details on CSC data set
Each utterance is represented by the following file types:
*.wav The original speech waveform sampled at 16kHz.
*.mfcc.py The Mel-frequency cepstral coefficients obtained from an analysis of the waveform,
in numPy format. Each row represents a 25ms frame of speech and consists of 13
floating point values.
*.txt Labelandorthographictranscriptionofeachutterance,foreachofKaldiandGoogle
ASR, and human gold-standard.
Participants were told that they were participating in a communication experiment which sought to
identify people who fit the profile of top entrepreneurs in America. To this end, participants performed
tasks and answered questions in six areas; they were later told that they had received low scores in some
of those areas and did not fit the profile. The subjects then participated in an interview where they were
told to convince the interviewer that they had actually achieved high scores in all areas and that they did
indeed fit the profile. The interviewer’s task was to determine how he thought the subjects had actually
performed, and he was allowed to ask them any questions other than those that were part of the subjects’
tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true
or contained any false information by pressing one of two pedals hidden from the interviewer under a table.
Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on
two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled
to 16 kHz. Interviews were orthographically transcribed by hand using the NIST EARS transcription
guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected
for alignment, and labels for global lies were annotated during transcription based on the known scores of
the subjects versus their reported scores.
MFCCs were obtained using the python speech features module using default parameters, i.e., 25
ms windows, 13 cepstral coefficients, and 512 fast Fourier transform coefficients
Each transcript file has the same format, where the ith line is:
[i] [LABEL] [TRANSCRIPT]
where i corresponds to i.wav and i.mfcc.npy, [LABEL] is the Global Lie label, and [TRANSCRIPT]
is the actual transcript orthography. Global Lie valence and the version of the pre-interview task for
the utterance appears before the colon (e.g., T/H) and the section name appears after the colon (e.g.,
INTERACTIVE).
Global Lie valence is indicated as: T == Truth; LU == Lie Up (subject claims better performance
than was actually achieved); and LD == Lie Down (subject claims worse performance). The task version
is indicated as: H == Hard; and E == Easy. So, for example, T/H:INTERACTIVE indicates that the
subject is telling the truth based on having performed the hard version of the Interactive task.
7
B Appendix: Training Gaussian mixture models
Input: MFCC data X, number of components M, threshold ϵ, and maxIter
begin
Initialize θ ;
i := 0 ;
prev L := −∞ ; improvement = ∞;
while i < maxIter and improvement >= ϵ do
ComputeIntermediateResults ;
L := ComputeLikelihood (X,θ) ;
θ := UpdateParameters (θ,X,L) ;
improvement := L−prev L ;
prev L := L ;
i := i+1 ;
end
end
Algorithm 1: GMM training algorithm.
For ComputeIntermediateResults, it is strongly recommended that you create two M × T numPy
arrays – one to store each value from Equation 1 and the other to store each value from Equation 2. In
fact, we’ve set up the function logLik to encourage you to do this, to avoid redundant computations. You
willusethesevaluesinbothComputeLikelihoodandupdateParameters, wherethelatterisaccomplished
thus:
(cid:80)T
p(m|⃗x ;θ)
ωˆ = t=1 t
m
T
(cid:80)T
p(m|⃗x ;θ)⃗x
µ⃗ˆ m = (cid:80)t= T1 p(m|⃗xt ;θ)t (6)
t=1 t
(cid:80)T p(m|⃗x ;θ)⃗x2
Σˆ = t=1 t t −µ⃗ˆ2
m (cid:80)T
p(m|⃗x ;θ)
m
t=1 t
In the third equation, the square of a vector on the right-hand side is defined as the component-wise square
of each dimension in the vector. Note that you don’t need to break up Algorithm 1 into separate functions
as implied – it is only written that way above to emphasize the sequence of steps
8