CSCI 4022, Fall 2021
More Clustering: GMMs and k-means
Mullen: k-means and Mixtures, Fall 2021

Opening example: Consider the data set (0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3). Use hierarchical clustering and Euclidean distance to group the data into 2 clusters. If there are ties in distance, merge first the data points with lower x-coordinates.

Announcements and To-Dos
Announcements:
1. HW 2 is up! Due Wednesday instead of Monday, since I failed to post it by last Monday...
Minute forms:
1. Similarity without a whole data set? (Short answer: no, but maybe.)
2. Too much given code? (I'll make you really hash later, I promise.)

Clustering Recap
Clustering applications: the data could be...
- Different characteristics of songs. Goal: cluster similar songs into genres.
- Vehicle weights, mileages, and other characteristics. Goal: cluster similar vehicles into classes (SUV, sedan, hybrid, ...).
- Sky-object radiation intensities across frequency ranges. Goal: cluster into groups of similar objects.
- Words in a document. Goal: cluster into groups of similar topics.
Clustering issues:
- Clustering looks easy in two dimensions.
- Clustering small amounts of data looks easy.
- In most cases, looks are not deceiving...
- ...but many applications have not 2 but 10 or 10,000 dimensions. What does that even look like?
- High-dimensional spaces look different: almost all pairs of points are at about the same distance!

Hierarchical Clustering Method
Agglomerative approach: hierarchical clustering.
1. Each point starts as its own cluster.
2. Do: combine the two nearest clusters into one larger cluster.
3. Stop when a stopping condition is met. Common stopping conditions: a fixed number of final clusters, or perhaps a goal of "mean distance to cluster center" sufficiently small.
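The three steps above can be sketched in a few lines of Python. This is a minimal, illustrative implementation using centroid linkage; the `hierarchical` helper name and the tie-break via the lower centroid x-coordinate are assumptions, not course-provided code:

```python
import numpy as np

def hierarchical(points, k):
    """Agglomerative clustering with centroid linkage: repeatedly merge
    the two nearest clusters until only k remain. Ties in distance are
    broken (approximately, per the slides) by the lower x-coordinate."""
    clusters = [[p] for p in points]          # step 1: each point alone
    while len(clusters) > k:                  # step 3: stop at k clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = np.mean(clusters[i], axis=0)   # cluster centroids
                cj = np.mean(clusters[j], axis=0)
                d = np.linalg.norm(ci - cj)
                key = (d, min(ci[0], cj[0]))        # tie-break on lower x
                if best is None or key < best[0]:
                    best = (key, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]     # step 2: merge nearest
        del clusters[j]
    return clusters

pts = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
groups = hierarchical(pts, 2)
# One group should hold (0,0),(1,2),(2,1); the other (4,1),(5,0),(5,3).
```

On the opening example, this reproduces the two color groups derived later in the slides.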
Hierarchical Clustering: Cost
- At each step, we compute all distances between pairs of clusters, then merge the nearest pair.
- With N data points, this is O(N²) comparisons per step (more precisely, C(N, 2) = N(N − 1)/2).
- If we want k clusters at the end, and k ≪ N, then we need about N iterations to merge down to k clusters.
- This means O(N³) complexity... we hate that. With some cleverness, we can get this down to O(N² log N), but that's still rough for very large data sets.

Hierarchical Example
Example: Consider the data set (0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3). Use hierarchical clustering and Euclidean distance to group the data into 2 clusters. If there are ties in distance, merge first the data points with lower x-coordinates.

Solution: We might initially construct a full distance matrix:

            (1, 2)   (2, 1)   (4, 1)   (5, 0)   (5, 3)
  (0, 0)    √5       √5       √17      5        √34
  (1, 2)             √2       √10      √20      √17
  (2, 1)                      2        √10      √13
  (4, 1)                               √2       √5
  (5, 0)                                        3

There are a couple of √2 entries in there, so we apply the tiebreaker of lowest x-coordinates, which picks (1, 2) and (2, 1).
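The distance matrix above is easy to double-check with NumPy broadcasting. A quick sketch (variable names are illustrative):

```python
import numpy as np

pts = np.array([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], dtype=float)

# Full pairwise Euclidean distance matrix via broadcasting:
# entry D[i, j] is the distance between point i and point j.
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

# Upper-triangular entries are the C(6, 2) = 15 pairwise distances.
off_diag = D[np.triu_indices(6, k=1)]

# The smallest distance is sqrt(2), achieved twice -- by (1,2)-(2,1)
# and by (4,1)-(5,0) -- exactly the tie the slides break by x-coordinate.
print(np.isclose(off_diag, np.sqrt(2)).sum())  # 2
```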
Now we'd have a cluster inside our distance matrix: points (1, 2) and (2, 1) get folded into a cluster with centroid (3/2, 3/2). We could ostensibly recreate the matrix, now 5×5 instead of 6×6, but for this smaller problem let's proceed visually instead.

Solution, full combine order:
1. Combine (1, 2) and (2, 1) into group red.
2. Combine (4, 1) and (5, 0) into group blue.
3. Fold (0, 0) into the red group.
4. Fold (5, 3) into the blue group.
5. STOP: we're down to 2 groups, since every point is either red or blue.

k-means
We can save considerable amounts of time by instead using the k-means algorithm.
Setup for k-means:
1. Requires specification of a norm or distance measure: typically an L-norm or Euclidean distance.
2. Fix a value for k, the total number of clusters.
3. Initialize the clusters in some fashion, typically with only one point per cluster. Some options:
   3.1 Random plan: pick k points totally at random to each be in different clusters.
   3.2 Hierarchical plan: do hierarchical clustering on a (small) subset of the data to get to k clusters, then randomly select one point from each cluster.
   3.3 Pick a first point randomly, then pick each subsequent point to be as far as possible from the previous points (i.e.
append the point with maximal minimum distance to the set of already-chosen points).

Iteration Scheme for k-means:
1. For each point:
   1.1 Assign that point to the cluster whose centroid it is closest to.
2. Then, for each cluster:
   2.1 Update the centroid of that cluster to reflect any added or lost points.
3. Repeat until convergence:
   3.1 points may stop moving between clusters entirely...
   3.2 ...or centroids will stabilize and not move (much).

k-means Example
Suppose we have some 2-D data that mostly lies along a couple of lines, and that we "randomly" choose the 2nd and 5th points to initialize our clusters. [Figures: the algorithm alternates Step 1 (assign points to clusters) and Step 2 (update cluster centroids) over several rounds.] Eventually, nothing changes between rounds: whether our convergence check happens during Step 1 or Step 2, we break out of the loop at that stage.

k-means and k
In this example, we chose k = 2 without putting any thought into it. But how do we choose the correct value of k?
1.
Sometimes an educated guess will work if you can visualize the data (at most 3 features/columns).
2. In higher dimensions, try a few different k and look at measures of how tightly grouped the points within each cluster are.
- A common measure: the average distance between each data point and its cluster's centroid.
- This average distance always decreases as k increases, but it starts changing very little after the "right" k.

Choosing k
Approach: Try a few different k and look at the change in the average distance between each data point and its cluster's centroid. The average distance should decrease as k increases up to about the right k, then change very little.

Example: What do you think? k = 2? 3? 4? 8? A sanity check says: try k = 3. [Figure: the elbow plot for k-means.]

The elbow plot suggests that k = 3 was reasonable, as the curve levels off significantly there. Sometimes this curve is smoother, and that's OK: pick a reasonable value, or come up with a statistic for the "best" k that penalizes extra terms.

[Figures: results for k = 3 and for k = 4.] Visually, notice that there's not much benefit from the k = 3 → k = 4 transition. In practice, it mostly split the lower-right grouping in two! Since the other two clusters were unchanged, their mean-distance-to-centroid contributions didn't change either.

k-means and directions
k-means is a circular construction of clusters.
- Each cluster is uniquely defined by its center.
- Points are assigned to clusters based on distance (radius from the center) alone, without considering direction.
- This can be very dangerous: if there are linear trends in the data, points that are highly related can "look" far apart.

k-means and units
Units might also matter! Consider a data set with x₁: vehicle weight; x₂: vehicle mileage. These are on drastically different scales (lbs vs. mpg), which could make traditional distances fail entirely: what's the distance between (1200 lbs, 20 mpg), (1400 lbs, 19 mpg), and (1100 lbs, 100 mpg)?
- Option A: normalize the data: replace each column with the original column minus its mean, divided by its standard deviation.
- Option B: allow distances to somehow depend on direction. We can let direction matter by fitting ellipses instead of circles! Our favorite ellipse is the normal distribution.

The Gaussian (normal) distribution
You should recall the normal distribution, Gaussian distribution, or bell curve. In one dimension, it's a beautiful little function (a negative squared exponential). The normal distribution has two arguments, or parameters:
- µ: the mean or center of the curve, the most important location. A single scalar.
- σ²: the variance, or width, of the curve, the most common measure of dispersion or spread. A single positive scalar.
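Option A from the units slide, z-score normalization, can be sketched as follows. The three hypothetical cars are the ones from the slide; the `dist` helper is an illustrative assumption:

```python
import numpy as np

# Hypothetical cars from the slide: column 0 = weight (lbs), column 1 = mileage (mpg)
cars = np.array([[1200.0,  20.0],
                 [1400.0,  19.0],
                 [1100.0, 100.0]])

def dist(a, b):
    return np.linalg.norm(a - b)

# In raw units, weight dominates: the 100 mpg oddball (car 2) looks CLOSER
# to car 0 than the very similar 1400 lb car (car 1) does.
raw_closer = dist(cars[0], cars[2]) < dist(cars[0], cars[1])

# Option A: z-score each column -- subtract its mean, divide by its std.
z = (cars - cars.mean(axis=0)) / cars.std(axis=0)

# After normalizing, car 1 is the nearer neighbor, as intuition demands.
norm_closer = dist(z[0], z[1]) < dist(z[0], z[2])
print(raw_closer, norm_closer)  # True True
```

The flip in nearest neighbor is exactly the failure mode the slide warns about: the distance ranking depends entirely on the units chosen.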
[Figure: a 1-D normal pdf.]

The Gaussian (normal) distribution: formalism
pdf: The probability density function of the normal distribution is
   f(x; µ, σ) = (1 / √(2πσ²)) · exp(−(x − µ)² / (2σ²))
cdf: The integral of the pdf lets us answer probability questions like
   P(a < X < b) = ∫_a^b f(x) dx
- The normal is known for assigning low probabilities to outliers, points far from µ.

The 2-D Gaussian
In multiple dimensions, everything gets a bit more complex. Parameters:
- µ: the mean or center of the surface. This is a 2-D (x, y) tuple, so we may write µ = (µ₁, µ₂) ∈ R².
- Σ: a 2×2 covariance matrix,
      Σ = [ σ₁²           cov(x₁, x₂) ]
          [ cov(x₂, x₁)   σ₂²         ]
  where σ₁² is the variance in the first axis direction, σ₂² is the variance in the second axis direction, and cov(x₁, x₂) = E[(X₁ − E[X₁])(X₂ − E[X₂])].

[Figure: a 2-D normal's contour plot.]

This is actually 5 unique parameters:
- µ₁: the location of the center in the axis-1 dimension.
- µ₂: the location of the center in the axis-2 dimension.
- σ₁²: the variance in the axis-1 direction.
- σ₂²: the variance in the axis-2 direction.
- cov(x₁, x₂): the covariance of the data set (dimension 1's covariance with dimension 2). Usage: np.cov(x1, x2).

The 2-D Gaussian: variance and covariance
1. The off-diagonal entries of Σ lead to rotations. Result: if cov(x₁, x₂) = 0, then the Gaussian is an ellipse oriented vertically/horizontally (i.e., the semi-major and semi-minor axes are parallel to the coordinate axes).
2. σ₁² and σ₂² determine the widths in their respective coordinate directions. Result: if cov(x₁, x₂) = 0 and σ₁ = σ₂, the contours are circles... just like k-means. Allowing σ₁ and σ₂ to differ stretches the circle in the corresponding direction.
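The effect of Σ can be checked numerically. Below is a minimal sketch of the 2-D normal density written straight from its definition (the `gaussian2d_pdf` helper is illustrative, not course code): with Σ = I the density depends only on distance from µ, exactly like k-means, while a nonzero covariance makes direction matter.

```python
import numpy as np

def gaussian2d_pdf(x, mu, Sigma):
    """2-D normal density: (2*pi*sqrt(det Sigma))^-1 * exp(-0.5 d^T Sigma^-1 d)."""
    diff = x - mu
    norm_const = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])

# cov = 0 and sigma1 = sigma2: circular contours, so points at equal radius
# get equal density regardless of direction -- the k-means picture.
iso = np.eye(2)
p_east = gaussian2d_pdf(np.array([1.0, 0.0]), mu, iso)
p_north = gaussian2d_pdf(np.array([0.0, 1.0]), mu, iso)

# A positive covariance tilts the ellipse: a point along the correlated
# (1, 1) direction is now more probable than one the same distance away
# along the anti-correlated (1, -1) direction.
Sigma = np.array([[2.0, 0.8],    # [ sigma_1^2     cov(x1, x2) ]
                  [0.8, 1.0]])   # [ cov(x2, x1)   sigma_2^2   ]
p_along = gaussian2d_pdf(np.array([1.0, 1.0]), mu, Sigma)
p_across = gaussian2d_pdf(np.array([1.0, -1.0]), mu, Sigma)
```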
This gives us a new method of clustering: Gaussian Mixture Models.

The Gaussian Mixture Model (GMM): 1-D Example
Motivating example/cautionary tale: Suppose you go to Chuck E. Cheese's for your niece's birthday party. As you look around, you start to feel rather self-conscious, because there seem to be very few people around your age. Needing to pass the time, because you feel so, so awkward, you collect data on everyone's ages.

You are then kicked out of and permanently banned from Chuck E. Cheese's, because you were accosting children and asking about their ages... fair enough. Now that your afternoon is freed up, you plot a histogram of the age data you so creepily and painstakingly collected. [Figure: bimodal histogram of patrons' ages.]

The task: How can we best model this distribution? It's clearly bimodal, so a single Gaussian is not appropriate. Instead, we view it as possibly the combination of two distributions:
1. The kids' ages seem like they might be reasonably Gaussian, aside from the pesky fact of non-negativity of ages.
2. The parents' ages might also be modeled by a different Gaussian distribution.
3.
This would make the overall distribution of patrons' ages a Gaussian mixture model.

The whole model:
- Kids' ages: a normal, X₁ ~ N(µ₁, σ₁²).
- Parents' ages: a normal, X₂ ~ N(µ₂, σ₂²).
- Who's who: a patron could be either of these, at some mixing proportion. Define the Bernoulli random variable ∆, where π is the probability that a patron is an adult (∆ = 1). Then:
  All patrons' ages: X = (1 − ∆) · X₁ + ∆ · X₂

Let's unpack, since this is secretly just adding up all the possible ways we can observe a specific age: the indicator (1 − ∆) selects a child (probability 1 − π), whose age comes from X₁, while ∆ selects an adult (probability π), whose age comes from X₂.

Definition: The GMM is a generative model, since it specifies the probabilities for new data points. To generate a point:
1. Sample or simulate a ∆ with a coin flip or np.random.choice.
2. Based on ∆, sample a random normal from:
   2.1 X₁ ~ N(µ₁, σ₁²) if ∆ = 0, OR
   2.2 X₂ ~ N(µ₂, σ₂²) if ∆ = 1.

Our task is sometimes to generate, but first we have to estimate the underlying parameters used in the model. To use the model, we have 5 things to estimate or choose:
   Θ = (π, µ₁, σ₁², µ₂, σ₂²)
Assuming we actually know all 5 parameters, we can simply write down the full probability density function for our process.
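The two-step generative recipe above (flip ∆, then sample from the normal it selects) can be sketched as follows; the particular ages, variances, and π = 0.5 are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed Theta = (pi, mu1, sigma1^2, mu2, sigma2^2):
# kids' ages ~ N(6, 2^2), adults' ages ~ N(35, 5^2), mixing proportion pi = 0.5
pi_, mu1, sig1, mu2, sig2 = 0.5, 6.0, 2.0, 35.0, 5.0

n = 10_000
# Step 1: flip the Bernoulli coin Delta for each patron (Delta = 1 -> adult).
delta = rng.random(n) < pi_
# Step 2: sample each patron's age from the Gaussian that Delta selects.
ages = np.where(delta, rng.normal(mu2, sig2, n), rng.normal(mu1, sig1, n))

# Sanity check: the sample mean should sit near the mixture mean
# (1 - pi) * mu1 + pi * mu2 = 20.5, between the two modes.
print(ages.mean())
```

A histogram of `ages` reproduces the bimodal shape of the Chuck E. Cheese's data: one bump per Gaussian, with relative heights set by π.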
If we denote by φ(x | µ, σ²) the normal pdf with mean µ and variance σ², the model is now
   f(x | Θ) = (1 − π) · φ(x | µ₁, σ₁²) + π · φ(x | µ₂, σ₂²)

GMM Example: Varying Θ
[Figures: several pdfs for different choices of the parameter set Θ. One choice has too many adults, so we go back to π = 0.5; others vary σ₁ and then σ₂.]

GMM Example: Underview
The whole model: X = (1 − ∆) · X₁ + ∆ · X₂. We need to estimate Θ = (π, µ₁, σ₁², µ₂, σ₂²).
1. It would be tedious and hand-wavy to try to estimate those parameters manually, even in one dimension.
2. Note that in 2-D, each variance becomes a 2×2 covariance matrix. This problem grows in size quickly: with the 5 unique parameters per 2-D Gaussian counted earlier, 2 Gaussians in 2-D is already 11 unknowns (5 per Gaussian plus a mixture probability).

Next time: some theory on how to estimate the parameters!

Acknowledgments
Some material is adapted/adopted from Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeff Ullman (Stanford University), http://www.mmds.org. Special thanks to Tony Wong for sharing his original adaptation and adoption of slide material.