Dynamic Programming
in Speech
CSC401/2511 2024 A3 Tutorial 02. Presenter: Ken Shi
1
Agenda
A few remarks for GMM from Piazza
-
Dynamic Programming
-
Word Error Rate with Levenshtein Distance
-
Dynamic Time Warping in Speaker Verification
-
2
GMM: Practical Tips cont’d
How do we get the in from ?
A good module that does this:
scipy.special.logsumexp
-
3
GMM: Takeaways!!
Probability of observing x in the mth Gaussian:
-
t
Prior probability of the mth Gaussian:
-
Probability of the mth Gaussian, given x :
-
t
Probability of x in the GMM:
-
t
4
Dynamic Programming (DP) Recap
Applicable problem: one that can be recursively defined into sub-problem in a way
-
that satisfies Bellman Equation.
In human language: Problems that can be divided into smaller problem of the same
-
type via induction.
E.g. : Fibonacci Sequence: A[n] = A[n - 1] + A[n - 2]
-
- Use Recursion: A[n] = fib(n - 1) + fib(n - 2) —>O(n2)
- Use DP: A[n] = A[n - 1] + A[n - 2] —>O(n)
5
2D DP
Maintain a 2D table
-
Find a formula for each cell of the table. This formula can depend on itself of an earlier
-
step
Initialize state independent cells (Those that do not depend on others)
-
Iterate through the table and complete the rest
-
Take the desired value (usually the final step of the iteration)
-
6
Word Error Rate
For ASR Evaluation
7
Levenshtein Distance
In its origin:
8
Levenshtein in Word Error Rate (WER)
Two strings: Reference Sentences (Groundtruth) and Hypothesis Sentences (Transcript)
-
Four states describing a pair of words, one from each:
-
- Match: Two words are the same
- Insertion: A new hypothesis word added to the sequence
- Deletion: An existing reference word missing
- Substitution: A complete replacement of the word
WER formula:
-
9
WER Example
Consider this sentence pair:
Reference: how to recognize speech
-
Hypothesis: how to wreck a nice beach
-
How To Wreck A Nice Beach
0 1 2 3 4 5 6
How 1 0 1 2 3 4 5
To 2 1 0 1 2 3 4
Recognize 3 2 1 1 2 3 4
Speech 4 3 2 2 2 3 4
10
WER Example
Substitution Deletion
Consider this sentence pair: / Match
Reference: how to recognize speech
-
Insertion Destination
Hypothesis: how to wreck a nice beach
-
How To Wreck A Nice Beach
S = 2
0 1 2 3 4 5 6 I = 2
D = 0
How 1 0 1 2 3 4 5 WER = (2 + 2 + 0) / 4
= 100%!
To 2 1 0 1 2 3 4
Recognize 3 2 1 1 2 3 4
Speech 4 3 2 2 2 3 4
11
Recall: A Monotonic Forward Algorithm
Remember this from the lecture?
12
Your Job
Implement initialize, step and finalize
-
- Initialize: defines the table, initializes the helper row/column
- Step: forward calculate values of the table
- Finalize: backtrace which operation has been done through the iteration and
computes WER
13
Practical Tip: Priority Operations
What happen when there is a tie?
-
Convention: Match > Substitution > Insertion > Deletion
-
It is really a choice, but for the purpose of the assignment, we want you to follow this
-
particular convention.
14
Your Task
WER is used as a measure to evaluate different ASR system (in this case, Kaldi and
-
Google)
You shouldn’t have to modify the main function to see how it is used in action
-
(provided that your implementation is correct)
Please follow the instruction in the assignment sheet carefully, as our test may look
-
into every component of your function to evaluate your operations
Interesting thing to think about: Does the cost has to be one across the board? What if
-
they don’t?
15
Dynamic Time
Warping
In Speaker Verification
16
Task: Speaker Verification
Given: Audio samples of somebody saying a specific line/word
-
Goal: Identify if the samples are coming from the same speaker.
-
Approach: Dynamic Time Warping!
-
17
Dynamic Time Warping (DTW)
Given: two temporal sequence X and Y
-
Goal: find the optimal warping path that minimizes the cost that the path take
-
Approach: Use DP to maintain an Accumulative Cost Matrix
-
Three Conditions (Constraints):
-
- Boundary condition: start by pairing the first elements from both, end by pairing
the last elements from both.
- Monotonicity condition: every step must push forward
- Step size condition: increment in forward should be bounded by certain range
18
DTW in Speaker Verification
Follows the exact recipe like WER, but instead of dealing with strings, we deal with
-
MFCCs.
The above should automatically maintain Boundary condition, Monotonicity condition
-
The above should also make the step size 1 (upper, left or upper left)
-
In terms of cost, we will use the euclidean distance between each MFCC frame
-
For the specific usage, see example from Lecture Slides!
-
19
A few tips
The main function is written for you. It chops a few voice segments from the dataset
-
and runs the algorithm. You should be able to tell what is the correct answer just by
looking at how the segments are loaded.
The default test given to you should give you an intuitive result if it does work.
-
You are welcomed to investigate further by playing around with different segments
-
you find or listen to segments of the specified time stamp
Obviously this is not a perfect algorithm. Based on lecture material, can you think of
-
why it could sometimes underperform? Does it have to do with how this task is set up?
20