程序辅导案例 > Program >

代写辅导接单-CSC401

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Computer Science 401 10 Nov 2024

St. George Campus University of Toronto

Homework Assignment #3

Due: Tuesday, 3 December 2024 at at 23h59 (11:59pm)

Email: [email protected]. Please prefix your subject with [CSC401 F24-A3].

All accessibility and extension requests should be directed to [email protected].

Lead TAs: Yushi Guan.

Instructors: Gerald Penn.

Overview

This assignment introduces you to Gaussian mixture modelling, recurrent neural networks, and dynamic

programmingforspeechdata. Theassignmentisdividedintotwosections. Inthefirst,youwillexperiment

with acoustic-based classification for 1) speaker identification, and 2) deception detection. In the second

section, you will 1) evaluate two speech recognition engines, and 2) perform a naive speaker verification.

Data for the deception detection task come from the CSC Deceptive Speech corpus, which was

developed by Columbia University, SRI International, and University of Colorado Boulder. It consists of

32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female)

recruited from the Columbia University student population and the community. The purpose of the

study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on

extracted features from the corpus.

All starter code is available on MarkUs.

Data are on the teach.cs servers in the /u/cs401/A3/data/ directory. Each sub-folder represents

speech from one speaker and contains raw audio, pre-computed MFCCs, and orthographic transcripts.

Further file descriptions are in Appendix A.

All code should be compatible with the default Python 3 environment on teach.cs. You should call

your scripts with the command python3.10.

Please check Piazza regularly for assignment updates and help.

1 Sequence classification: Speakers and lies [35 marks]

Speaker identification is the task of correctly identifying speaker s from among S possible speakers s

c i=1..S

given an input speech sequence X, consisting of a succession of d-dimensional real vectors. In the interests

of efficiency, d = 13 in this assignment. Each vector represents a small 25 ms unit of speech called a frame.

Speakers are identified by training data that are ascribed to them. This is a discrete classification task

(choosing among several speakers) that uses continuous-valued data (the vectors of real numbers) as input.

Gaussian Mixture Models

Gaussian mixture models (GMMs) are often used to generalize models from sparse data. They can tightly

constrain large-dimensional data by using a small number of components but can, with many more compo-

nents, model arbitrary density distributions. Sometimes, they are simply used because the domain being

modelled appears to have multiple modes.

GivenM components,GMMsaremodelledbyacollectionofparameters,θ = {ω ,µ ,Σ },

m=1..M m=1..M m=1..M

where ω is the probability that an observation is generated by the mth component. These are subject

(cid:80)

to the constraint that ω = 1 and 0 ≤ ω ≤ 1. Each component is a multivariate Gaussian distri-

m m m

bution, which is characterized by that component’s mean, µ , and covariance matrix, Σ . For reasons

m m

of computational efficiency, we will reintroduce some independence assumptions by assuming that every

component’s covariance matrix is diagonal, i.e.:

 

Σ [1] 0 ··· 0

 0 Σ m[2] ··· 0 

Σ m =   . .  

 . 

0 0 ··· Σ [d]

forsomevectorΣ⃗ . Therefore,onlydparametersarenecessarytocharacterizeacomponent’s(co)variance.

1.1 Utility functions [10 marks]

We need three utility functions for our GMM. Your first task is to complete the implementation of those

functions in a3 gmm.py.

First, implementlog b m x, whichimplementsthelogobservationprobabilityofx forthemth mixture

component, i.e., the log of:

exp(cid:34)

−1

(cid:88)d

(x t[n]−µ

m[n])2(cid:35)

2 Σ [n]

b (⃗x ) = n=1 (1)

m t (cid:113)

(2π)d/2 (cid:81)d Σ [n]

n=1 m

Next, implement log p m x, which is the log probability of m given x using model θ, i.e., the log of:

ω b (⃗x )

m m t

p(m|⃗x ;θ) = (2)

t (cid:80)M

ω b (⃗x )

k=1 k k t

Finally, implement logLik, which is the log likelihood of a set of data X, i.e.:

(cid:16) (cid:17) (cid:88)

logP X˜;θ = logp(⃗x ;θ ) (3)

s t s

t=1

where

(cid:88)

p(x⃗ ;θ) = ω b (x⃗ ) (4)

t m m t

m=1

and b is defined in Equation 1. For efficiency, we just pass θ and precomputed b (x⃗ ) to this function.

m m t

Please follow the instruction given in your starter code carefully. It should provide you a very detailed

guideline on expected input, output and function behaviour.

1.2 Training Gaussian mixture models [5 marks]

Now we train an M-component GMM for each of the speakers in the data set. Specifically, for each

speaker s, train the parameters θ = {ω ,µ ,Σ } according to the method described in

s m=1..M m=1..M m=1..M

Appendix B. In all cases, assume that covariance matrices Σ are diagonal. Start with M = 8. You’ll be

asked to experiment with that in Section 1.4. Complete the function train in a3 gmm.py.

1.3 Classification with Gaussian mixture models [5 marks]

Now we test each of the test sequences we’ve already set aside for you in the main function. I.e., we check

if the actual speaker is also the most likely speaker, sˆ:

(cid:16) (cid:17)

sˆ= argmax logP X˜;θ (5)

s=1,...,S

To complete the function test in a3 gmm.py. Run through a train-test cycle, and save the output that

this function writes to stdout, using the k = 5 top alternatives, to the file a3 gmmLiks.txt.

1.4 Experiments and discussion [10 marks]

Experiment with the settings of M and maxIter (or ϵ if you wish). For example, what happens to

classification accuracy as the number of components decreases? What about when the number of possible

speakers,S,decreases? Youwillbemarkedonthedetailwithwhichyouempiricallyanswerthesequestions

and whether you can devise one or more additional valid experiments of this type.

Additionally, your report should include short hypothetical answers to the following questions:

• How might you improve the classification accuracy of the Gaussian mixtures, without adding more

training data?

• When would your classifier decide that a given test utterance comes from none of the trained speaker

models, and how would your classifier come to this decision?

• Can you think of some alternative methods for doing speaker identification that don’t use Gaussian

mixtures?

Put your experimental analysis and answers to these questions in the file a3 gmmDiscussion.txt.

1.5 End-to-end truth-and-lie detection with GRUs [5 marks]

Each of the utterances has been labelled as either truthful or deceitful (see Appendix A). Your task is to

train and test models to tell these utterances apart using the provided data.

• UsingtheacousticsequencesfromthepreviousGMMquestionsasinput, usetorch.nn.GRU tocreate

asimpleunidirectionalGRUwithone hidden layer. ThisGRUtakesinMFCCvectorsasinputs,

and output a single output (truth or false). The starter code labels lies as 1 and truth as 0.

• For this part, modify the __init__ method of the LieDetector in a3 model.py to create a GRU as

specified above. In addition, fill in the code to create a linear layer to project the GRU’s outputs to

logits.

• Experiment by training the model using different hidden sizes of 5, 10, and 50.

• To train, run:

srun -p csc401 --pty python3.10 train.py --source /u/cs401/A3/data/ --hidden_size

If you wish, you can experiment with the hyperparameters by specifying --batch_size, --lr,

--epochs or --optimizer (either ‘adam’ or ‘sgd’; by default, it is ‘adam’).

The srun -p csc401 --pty part of the command requests a compute node on teach.cs. This will

provide you with dedicated CPUs and reduce congestion during peak times. The --pty flag enables

pseudo-terminal mode, which allows your breakpoints to work properly. However, if you want to run the

assignment locally, you should remove it.

Is there a trend in performance with different hidden sizes? Explain this trend or lack thereof in 1-2

sentences. Write your model configurations, the detection performance (accuracy) and answers to the

discussion questions in the file a3 detectionDiscussion.txt.

2 Dynamic Programming in Speech [35 marks]

2.1 Word Error Rate in Speech Recognition [15 marks]

Automatic speech recognition (ASR) is the task of correctly identifying a word sequence given an input

speech sequence X. To simplify your lives, we have run two popular ASR engines on our data: the open-

source and highly customizable Kaldi (specifically, a bi-directional LSTM model trained on the Fisher

corpus), and the neither-open-source-nor-particularly-customizable Google Speech API.

We want to see which of Kaldi and Google are the most accurate on our data. For each speaker in

our data, we have three transcript files: transcripts.txt (the gold-standard transcripts, from humans),

transcripts.Kaldi.txt (the ASR output of Kaldi), and transcripts.Google.txt (the ASR output of

Google); see Appendix A.

For this part of the assignment, we want you to complete the file a3 levenshtein.py. Specifically, we

define the Levenshtein function that accepts lists of words r (Reference) and h (hypothesis), and return a

4-item list containing the floating-point WER, the number of substitutions, the number of insertions, and

the number of deletions where

#Insertions+#Deletions+#Substitutions

WER =

#ReferenceWords

Recall from lecture how Dynamic Programming comes into the play. You should feel familiar with how

the modules are structured based on the lecture material. You should complete the pre-defined modules

initialize, step and finalize and also fill in the rest of Levenshtein function.

For this part, we assume that the cost of a substitution is 0 if the words are identical and 1 otherwise.

The costs of insertion and deletion are both 1. Keep this assumption in mind.

The main function iterates through each of the speakers and each line i of their transcripts as well. You

do not have to worry about this part because we wrote the preprocessing and the entire pipeline for you.

In the output file a3 levenshtein.out, you should see something like this for each line:

[SPEAKER] [SYSTEM] [i] [WER] I:[#Insertions], D:[#Deletions], S:[#Substitutions]

where [SYSTEM] is either ‘Kaldi’ or ‘Google’.

You should also see the average and standard deviation of WER for each of Kaldi and Google, respec-

tively.

2.2 Dynamic Time Warping in Speaker Verification [20 marks]

In speaker verification, the goal is to confirm or deny a speaker’s claimed identity based on their voice.

This involves comparing a spoken sample (the ”candidate” sample) against a stored template or model of

the claimed speaker’s voice (the ”reference” sample). Given the natural variations in speech—even from

the same speaker—due to factors like emotion, health, environment, and context, exact matches between

samples are improbable. DTW addresses this by aligning the temporal sequences of feature vectors (often

MFCCs extracted from the speech samples) in a way that minimizes the overall distance between them,

accounting for variations in speaking rate and pronunciation.

Dynamic Time Warping (DTW) is another DP algorithm used in the field of speech processing, par-

ticularly in applications like speaker verification, speech recognition, and voice activity detection. Its

primary function is to measure the similarity between two temporal sequences, which could vary in speed

or duration, making it especially suitable for dealing with the temporal variabilities inherent in human

speech.

For this part of the assignment, we want you to complete the file a3 dtw.py. Specifically, we define

the DTW function in a very similar manner to Levenshtein from last part. The goal is to relate two pieces

of MFCC data with DTW. In DTW, the similarity between two sequences X ∈ RN×L and Y ∈ RM×L is

defined by the DTW distance DTW(X,Y):

DTW(X,Y) = D(N,M)

Where D is the Accumulative Cost Matrix filled by the following DP value function:



 c(x 1,y 1) n = 1,m = 1



 (cid:80)n

c(x ,y ) n ∈ [2 : N],m = 1

D(n,m) = k=1 k 1



(cid:80)m

k=1c(x 1,y k) n = 1,m ∈ [2 : M]



min{D(n−1,m−1),D(n,m−1),D(n−1,m)}+c(x ,y ) 1 < n ≤ N,1 < m ≤ M

n m

and c(x,y) is the cost function. For this assignment, since x,y ∈ RL, we use the Euclidean distance

between two vectors as the cost function.

In addition to DTW(X,Y), we are also interested in tracing the Optimal Warping Path (OWP) p∗ =

{p ,...,p }, where each element p for l ∈ [1 : L] is defined as:

1 L l



(1,1) l = 1





  (1,m−1) n = 1,m ∈ [2 : M]





p = (n−1,1) n ∈ [2 : N],m = 1 1 < l < L,p = (n,m)

l l+1



 argmin{D(n−1,m−1),D(n,m−1),D(n−1,m)} 1 < n ≤ N,1 < m ≤ M



(N,M) 1 = L

We have selected a few voice segments as a simple test for your code. You should be able to identify

the speaker that matches the reference speaker. You can find the right answer by observing the path we

used to load the data. Getting an intuitive answer with the default test should be a good verification for

your code.

Youcanplayaroundthevaluesinthemainfunction. Forexample,youcanmodifythereferencespeaker

to see what happens when you switch to a different speaker; you can try to edit the timestamps to navigate

to a different segment of the same speaker and see how that affects your result. You can also listen to the

raw audio of those selected segments. Write briefly about 1) what you did, 2) what you have observed for

each change you made and 3) why you think the result is the way it is, in a3 dtwDiscussion.txt. You

will be marked mainly on the effort rather than the results.

Submission requirements

This assignment is submitted electronically. Submit your assignment on MarkUs. You should submit:

1. All your code for a3 gmm.py, a3 model.py, a3 levenshtein.py and a3 dtw.py.

2. The output file a3 gmmLiks.txt.

3. Yourdiscussionfilesa3 gmmDiscussion.txt,a3 detectionDiscussion.txtanda3 dtwDiscussion.txt.

4. The ID.txt file available on the course website, with your details filled in.

Do not tar or compress your files, and do not place your files in subdirectories.

Using your own computer

If you want to do some or all of this assignment on your laptop or other computer, you will have to do

the extra work of downloading and installing the requisite software and data. You take on the risk that

your computer might not be adequate for the task. You are strongly advised to upload regular backups

of your work to teach.cs, so that if your machine fails or proves to be inadequate, you can immediately

continue working on the assignment at teach.cs. When you have completed the assignment, you should

try your programs out on teach.cs to make sure that they run correctly there. A submission that does

not work on teach.cs will get zero marks.

A Appendix: Details on CSC data set

Each utterance is represented by the following file types:

*.wav The original speech waveform sampled at 16kHz.

*.mfcc.py The Mel-frequency cepstral coefficients obtained from an analysis of the waveform,

in numPy format. Each row represents a 25ms frame of speech and consists of 13

floating point values.

*.txt Labelandorthographictranscriptionofeachutterance,foreachofKaldiandGoogle

ASR, and human gold-standard.

Participants were told that they were participating in a communication experiment which sought to

identify people who fit the profile of top entrepreneurs in America. To this end, participants performed

tasks and answered questions in six areas; they were later told that they had received low scores in some

of those areas and did not fit the profile. The subjects then participated in an interview where they were

told to convince the interviewer that they had actually achieved high scores in all areas and that they did

indeed fit the profile. The interviewer’s task was to determine how he thought the subjects had actually

performed, and he was allowed to ask them any questions other than those that were part of the subjects’

tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true

or contained any false information by pressing one of two pedals hidden from the interviewer under a table.

Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on

two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled

to 16 kHz. Interviews were orthographically transcribed by hand using the NIST EARS transcription

guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected

for alignment, and labels for global lies were annotated during transcription based on the known scores of

the subjects versus their reported scores.

MFCCs were obtained using the python speech features module using default parameters, i.e., 25

ms windows, 13 cepstral coefficients, and 512 fast Fourier transform coefficients

Each transcript file has the same format, where the ith line is:

[i] [LABEL] [TRANSCRIPT]

where i corresponds to i.wav and i.mfcc.npy, [LABEL] is the Global Lie label, and [TRANSCRIPT]

is the actual transcript orthography. Global Lie valence and the version of the pre-interview task for

the utterance appears before the colon (e.g., T/H) and the section name appears after the colon (e.g.,

INTERACTIVE).

Global Lie valence is indicated as: T == Truth; LU == Lie Up (subject claims better performance

than was actually achieved); and LD == Lie Down (subject claims worse performance). The task version

is indicated as: H == Hard; and E == Easy. So, for example, T/H:INTERACTIVE indicates that the

subject is telling the truth based on having performed the hard version of the Interactive task.

B Appendix: Training Gaussian mixture models

Input: MFCC data X, number of components M, threshold ϵ, and maxIter

begin

Initialize θ ;

i := 0 ;

prev L := −∞ ; improvement = ∞;

while i < maxIter and improvement >= ϵ do

ComputeIntermediateResults ;

L := ComputeLikelihood (X,θ) ;

θ := UpdateParameters (θ,X,L) ;

improvement := L−prev L ;

prev L := L ;

i := i+1 ;

end

Algorithm 1: GMM training algorithm.

For ComputeIntermediateResults, it is strongly recommended that you create two M × T numPy

arrays – one to store each value from Equation 1 and the other to store each value from Equation 2. In

fact, we’ve set up the function logLik to encourage you to do this, to avoid redundant computations. You

willusethesevaluesinbothComputeLikelihoodandupdateParameters, wherethelatterisaccomplished

thus:

(cid:80)T

p(m|⃗x ;θ)

ωˆ = t=1 t

(cid:80)T

p(m|⃗x ;θ)⃗x

µ⃗ˆ m = (cid:80)t= T1 p(m|⃗xt ;θ)t (6)

t=1 t

(cid:80)T p(m|⃗x ;θ)⃗x2

Σˆ = t=1 t t −µ⃗ˆ2

m (cid:80)T

p(m|⃗x ;θ)

t=1 t

In the third equation, the square of a vector on the right-hand side is defined as the component-wise square

of each dimension in the vector. Note that you don’t need to break up Algorithm 1 into separate functions

as implied – it is only written that way above to emphasize the sequence of steps