CMDA-3654 Fall 2019

Homework 9

Your name here

Due Nov 22nd as a .pdf upload


Instructions:

Delete the Instructions section from your write-up!!

I have given you this assignment as an .Rmd (R Markdown) file.

• Change the name of the file to: Lastname_Firstname_CMDA_3654_HW9.Rmd; your output file should therefore match but with a .pdf extension.

• You need to edit the R Markdown file by filling in the chunks appropriately with your code. Output will be generated automatically when you compile the document.

• You also need to add your own text before and after the chunks to explain what you are doing or to interpret the output.

• Feel free to add additional chunks if needed. I will not be providing assignments to you like this for the entire semester, just long enough for you to learn how to do it for yourself.

Required: The final product that you turn in must be a .pdf file.

• You can Knit this document directly to a PDF if you have LaTeX installed (which is preferred).

• If you absolutely can’t get LaTeX installed and/or working, then you can compile to a .html first, by clicking on the arrow button next to Knit and selecting Knit to HTML.

• You must then print your .html file to a .pdf by first opening it in a web browser and then printing to a .pdf.


Problem 1: [30 pts] k-means clustering

Consider the Hotdog dataset in hotdogs.csv. Load in the data but ignore the first column, as we’ll pretend we know nothing other than the Sodium and Calorie content of the hotdogs.

a. Carry out k-means clustering with 2, 3, 4, and 5 clusters. Don’t forget to scale the data first using:

hotdogs.scaled <- scale(hotdogs[, 2:3])
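One possible sketch of part (a). The data here are simulated stand-ins (the real values come from hotdogs.csv; the column names Calories and Sodium are assumptions), so only the pattern of calls matters:

```r
# Hypothetical stand-in for hotdogs.csv: first column ignored, columns 2:3 numeric
set.seed(1)
hotdogs <- data.frame(Type     = sample(c("Beef", "Meat", "Poultry"), 30, replace = TRUE),
                      Calories = rnorm(30, mean = 145, sd = 25),
                      Sodium   = rnorm(30, mean = 425, sd = 90))

hotdogs.scaled <- scale(hotdogs[, 2:3])  # standardize before clustering

# One kmeans() fit per choice of k; nstart > 1 guards against bad random starts
km.list <- lapply(2:5, function(k) kmeans(hotdogs.scaled, centers = k, nstart = 25))
names(km.list) <- paste0("k", 2:5)

sapply(km.list, function(km) km$tot.withinss)  # within-cluster SS shrinks as k grows
```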

b. Plot the clusters from part (a) using ggplot(), assigning different colors and plot characters to the clusters found using kmeans().

Hint: If km.result <- kmeans(....), then the cluster assignments are in km.result$cluster; simply add them to a new data.frame:

hotdogs2.scaled <- data.frame(hotdogs.scaled, cluster = as.factor(km.result$cluster))

and then use ggplot() accordingly to make the plot. (Note that data.frame() is used rather than cbind() here: cbind() on the scaled matrix would silently coerce the factor back to its integer codes.)
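A self-contained sketch of one way to complete the hint, again on stand-in data (the real assignment uses the scaled hotdog measurements):

```r
library(ggplot2)

# Stand-in for the scaled hotdog data
set.seed(1)
hotdogs.scaled <- scale(data.frame(Calories = rnorm(30, 145, 25),
                                   Sodium   = rnorm(30, 425, 90)))
km.result <- kmeans(hotdogs.scaled, centers = 3, nstart = 25)

# data.frame() (not cbind() on the matrix) keeps cluster as a factor
hotdogs2.scaled <- data.frame(hotdogs.scaled,
                              cluster = as.factor(km.result$cluster))

ggplot(hotdogs2.scaled, aes(Sodium, Calories, color = cluster, shape = cluster)) +
  geom_point(size = 2)
```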

c. Install and enable the following R libraries: cluster, NbClust, factoextra. Then use the fviz_cluster() function to visualize the clusters you made in part (a). Here is an example of how to use it.

fviz_cluster(km.result, data = hotdogs.scaled)

Determining the optimal number of clusters.

Recall that the basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total within-cluster variation, or total within-cluster sum of squares, is minimized. That is,

$$\text{minimize}\left(\sum_{k=1}^{K} W(C_k)\right)$$

There are a number of methods that we can use to determine the optimal number of clusters. One such method is called the Elbow Method, which plots the total within-cluster sum of squares versus the number of clusters.

We can obtain the total within-cluster sum of squares using km.result$tot.withinss.

The elbow method suggests that you find the “elbow” of this plot, where the total within-cluster sum of squares essentially stops decreasing significantly as the number of clusters grows.
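Done "the long way", the elbow curve can be computed by hand: fit kmeans() for each candidate k and record km$tot.withinss. A sketch with stand-in data (the real assignment uses the scaled hotdog measurements):

```r
# Stand-in for the scaled hotdog data
set.seed(1)
hotdogs.scaled <- scale(data.frame(Calories = rnorm(30, 145, 25),
                                   Sodium   = rnorm(30, 425, 90)))

# Total within-cluster SS for each k = 1..10
wss <- sapply(1:10, function(k) kmeans(hotdogs.scaled, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")  # look for the bend (the "elbow")
```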

Thankfully I can save you from the long way by telling you about a function that can do this automatically.

d. Use the function fviz_nbclust(hotdogs.scaled, kmeans, method = "wss") to produce a plot using the Elbow method. Using this plot, determine how many clusters you think should be used.

Discussion: There are other, more sophisticated methods for determining the optimal number of clusters, such as the average silhouette method and the gap statistic method, but these are beyond the scope of this course.


Problem 2: [50 pts] Hierarchical Clustering.

Consider the mtcars dataset.

a. Using Euclidean distance as the dissimilarity measure, perform hierarchical clustering on the data with (i) Complete Linkage, (ii) Average Linkage, and (iii) Single Linkage. Don’t forget that you need to scale the data and compute the distances before using the hclust() function.
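One way the three fits in part (a) might be set up (a sketch; the object names are just suggestions):

```r
# Scale, compute pairwise Euclidean distances, then cluster with each linkage
mtcars.scaled <- scale(mtcars)
mtcars.dist   <- dist(mtcars.scaled, method = "euclidean")

mtcars.hclust.complete <- hclust(mtcars.dist, method = "complete")
mtcars.hclust.average  <- hclust(mtcars.dist, method = "average")
mtcars.hclust.single   <- hclust(mtcars.dist, method = "single")
```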

b. For all three methods in (a), cut the hierarchical clustering tree at 4 clusters and report the two-way table of the car name and the cluster it belongs to.
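A sketch of part (b) for the complete-linkage case (repeat for the other two linkages):

```r
# Refit the complete-linkage tree, cut at 4 clusters, and tabulate car vs. cluster
mtcars.scaled <- scale(mtcars)
mtcars.hclust.complete <- hclust(dist(mtcars.scaled), method = "complete")

clusters.complete <- cutree(mtcars.hclust.complete, k = 4)
table(rownames(mtcars), clusters.complete)  # two-way table of car name vs. cluster
```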

c. We can plot the dendrograms easily enough by doing the following:

plot(mtcars.hclust.complete, labels = rownames(mtcars),
     main = "Cluster Dendrogram (Complete Linkage)")

The above plots the dendrogram for the Complete Linkage case. Provide this plot and repeat it for the other 2 cases from part (a). Alternatively, we can use the ggdendro library, which makes use of ggplot2.

# Vertical
ggdendrogram(mtcars.hclust.complete) + labs(title = "Cluster Dendrogram (Complete Linkage)")

# or horizontal with a different theme
ggdendrogram(mtcars.hclust.complete, rotate = TRUE, theme_dendro = FALSE) +
  labs(title = "Cluster Dendrogram (Complete Linkage)")

Some alternative tools for plotting and customizing dendrograms can be found here: http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning

Section 5 has some really cool examples of how to colorize clusters, etc.

Another awesome example can be found here: https://www.r-graph-gallery.com/340-custom-your-dendrogram-with-dendextend/

d. Use the elbow method to determine the optimal number of clusters that you should use. This works the same basic way as in Problem 1, but the call is slightly different because it needs the hcut function (passed by name, without the ()) as an option, as seen below.

fviz_nbclust(mtcars.scaled, hcut, method = "wss")

e. Add colored rectangles around the clusters you determined in part (d) to the dendrogram in part (c). You can simply run the following line after the plot in (c) if you used the first method (other methods require other functions).

rect.hclust(mtcars.hclust.complete,
            k = 4,  # replace with whatever you decided based upon (d)
            border = c("red", "blue", "purple", "magenta"))

f. Use cutree() to obtain the cluster assignments using your decision in (d). Recode this as a factor object. Plot mpg versus wt, using different colors according to your cluster assignment.
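A sketch of part (f); the k = 4 below is just a placeholder for whatever you decided in (d):

```r
library(ggplot2)

# Refit the complete-linkage tree and cut it
mtcars.scaled <- scale(mtcars)
mtcars.hclust.complete <- hclust(dist(mtcars.scaled), method = "complete")

mtcars2 <- mtcars
mtcars2$cluster <- as.factor(cutree(mtcars.hclust.complete, k = 4))  # placeholder k

ggplot(mtcars2, aes(wt, mpg, color = cluster)) +
  geom_point(size = 2)
```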

Discussion: While you can see some clustering that makes sense, some of the clusters seem to be intermixed with each other. Remember, the clustering was determined not using the mpg and wt variables alone, but using the information from all of the variables in the dataset. So the clusters are formed in multidimensional space, which can be difficult to visualize when the number of dimensions is bigger than 3.

One solution is to rotate the multidimensional variable coordinate system into the principal component coordinate system, as we will show in the next problem. If most of the variation is in the first 2 principal components, it may be possible to see the higher-dimensional clusters more easily in this lower-dimensional representation.


Problem 3: [20 pts] PCA + Hierarchical Clustering.

Consider the mtcars dataset once again.

a. Use PCA to rotate the observations from the mtcars dataset into a new coordinate system using the principal components. Remember that you have to set either scale. = TRUE if using prcomp() or cor = TRUE if using princomp(). We want the component scores, which will be either pca.result$x if you used prcomp() or pca.result$scores if you used princomp().
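A minimal sketch of part (a) with prcomp():

```r
# PCA on the correlation scale; the component scores live in pca.result$x
pca.result <- prcomp(mtcars, scale. = TRUE)
scores <- pca.result$x

dim(scores)          # one score per car per component (32 x 11)
summary(pca.result)  # proportion of variance explained by each component
```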

b. For the principal component “variables” for the cars (which are the component scores), use the hierarchical clustering techniques that you learned in Problem 2. Use this to determine the optimal number of clusters, show the dendrogram, and put a box around these clusters. Use Complete Linkage only. How does this compare with your result from the previous problem when you used complete linkage on the original variables?

c. Using k = 4 for the number of clusters, plot PC2 versus PC1 and use different colors to show the clusters. You will need to use cutree() to get the cluster assignments. Can you see the clusters a bit better than when you plotted mpg versus wt in the previous problem?
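A sketch of part (c), clustering the scores with complete linkage (as in part (b)) and coloring the k = 4 assignment:

```r
library(ggplot2)

pca.result <- prcomp(mtcars, scale. = TRUE)
pc.scores <- as.data.frame(pca.result$x)

# Hierarchical clustering on the component scores, cut at k = 4
pc.hclust <- hclust(dist(pca.result$x), method = "complete")
pc.scores$cluster <- as.factor(cutree(pc.hclust, k = 4))

ggplot(pc.scores, aes(PC1, PC2, color = cluster)) +
  geom_point(size = 2)
```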
