CMDA-3654 Fall 2019
Homework 9
Due Nov 22nd as a .pdf upload
Instructions:
Delete the Instructions section from your write-up!!
I have given you this assignment as an .Rmd (R Markdown) file.
• Change the name of the file to Lastname_Firstname_CMDA_3654_HW9.Rmd. Your output file should have the same
name but with a .pdf extension.
• You need to edit the R Markdown file by filling in the chunks appropriately with your code. Output will be generated
automatically when you compile the document.
• You also need to add your own text before and after the chunks to explain what you are doing or to interpret the output.
• Feel free to add additional chunks if needed. I will not be providing assignments to you like this for the entire semester,
just long enough for you to learn how to do it for yourself.
Required: The final product that you turn in must be a .pdf file.
• You can Knit this document directly to a PDF if you have LaTeX installed (which is preferred).
• If you absolutely can’t get LaTeX installed and/or working, then you can compile to a .html first, by clicking on the
arrow button next to knit and selecting Knit to HTML.
• You must then convert your .html file to a .pdf by first opening it in a web browser and then printing it to a .pdf.
Problem 1: [30 pts] k-means clustering
Consider the Hotdog dataset shown in hotdogs.csv. Load in the data but ignore the first column as we’ll pretend we know
nothing other than the Sodium and Calorie content of the hotdogs.
a. Carry out k-means clustering with 2, 3, 4, and 5 clusters. Don’t forget to scale the data first using:
hotdogs.scaled <- scale(hotdogs[, 2:3])
b. Plot the clusters from part (a) using ggplot() and assigning different colors and plot characters to the clusters found
using kmeans().
Hint: If km.result <- kmeans(....), then the cluster assignments are in km.result$cluster. Simply add them to a
new data.frame:
hotdogs2.scaled <- data.frame(hotdogs.scaled, cluster = as.factor(km.result$cluster))
and then use ggplot() accordingly to make the plot.
c. Install and enable the following R libraries: cluster, NbClust, factoextra. Then use the fviz_cluster() function to
visualize the clusters you made in part (a). Here is an example of how to use it.
fviz_cluster(km.result, data = hotdogs.scaled)
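As a rough illustration of the workflow in parts (a) and (b), the sketch below runs kmeans() for several values of k on simulated two-column data (not the hotdog data) and collects the cluster labels in a data frame. The object names sim, sim.scaled, km.list, and sim2.scaled, the seed, and nstart = 25 are assumptions made for this example only.

```r
# Illustration only: simulated data stand in for two numeric columns.
set.seed(3654)
sim <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)

# Standardize each column to mean 0 and standard deviation 1.
sim.scaled <- scale(sim)

# Fit k-means for k = 2, 3, 4, 5; nstart = 25 tries 25 random starts per k.
ks <- 2:5
km.list <- lapply(ks, function(k) kmeans(sim.scaled, centers = k, nstart = 25))
names(km.list) <- paste0("k", ks)

# Attach the k = 2 labels as a factor so a plotting function can map
# colour and shape to them.
sim2.scaled <- data.frame(sim.scaled, cluster = as.factor(km.list$k2$cluster))
head(sim2.scaled)
table(sim2.scaled$cluster)
```

Storing the fits in a named list keeps all four results available for later plotting without refitting.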
Determining the optimal number of clusters.
Recall that the basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total
within-cluster variation or total within-cluster sum of squares is minimized: That is,
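$$\text{minimize}\left(\sum_{i=1}^{k} W(C_i)\right)$$
where $C_i$ is the $i$-th cluster and $W(C_i)$ is its within-cluster variation.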
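One common choice for the within-cluster variation of a single cluster (stated here as an assumption, since the assignment does not spell it out) is the sum of squared Euclidean distances from each observation in the cluster to the cluster centroid. The symbols x and mu_i are introduced for this note only:

```latex
W(C_i) = \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

Summing this quantity over all k clusters gives the total within-cluster sum of squares.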
There are a number of methods that we can use to determine the optimal number of clusters that should be used. One such
method is called the Elbow Method which plots the total within-cluster sum of squares versus number of clusters.
We can obtain the total within-cluster sum of squares using km.result$tot.withinss.
The elbow method suggests that you find the “elbow” of this plot, where the total within-cluster sum of squares essentially
stops decreasing significantly as the number of clusters grows.
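Computing the elbow plot by hand can be sketched as follows: fit kmeans() for a range of k, record tot.withinss each time, and plot the values. The example uses the built-in USArrests data rather than the homework data; the names arrests.scaled, k.values, and wss, the range 1:8, and nstart = 25 are assumptions for illustration.

```r
# Illustration only: built-in USArrests data, not the homework data.
arrests.scaled <- scale(USArrests)

set.seed(3654)
k.values <- 1:8

# Total within-cluster sum of squares for each candidate k.
wss <- sapply(k.values, function(k) {
  kmeans(arrests.scaled, centers = k, nstart = 25)$tot.withinss
})

# Plot WSS against k and look for the bend ("elbow") in the curve.
plot(k.values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```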
Thankfully, I can save you from doing this the long way by telling you about a function that does it automatically.
d. Use the function fviz_nbclust(hotdogs.scaled, kmeans, method = "wss") to produce a plot using the Elbow
method. Using this plot, determine how many clusters you think should be used.
Discussion: There are other methods for determining the optimal number of clusters that are more sophisticated such as the
average silhouette method and the gap statistic method, but these are beyond the scope of this course.
Problem 2: [50 pts] Hierarchical Clustering.
Consider the mtcars dataset.
a. Using Euclidean distance as the dissimilarity measure, perform hierarchical clustering on the data, with (i) Complete
Linkage, (ii) Average Linkage, and (iii) Single Linkage. Don’t forget that you need to scale the data and compute the
distances before using the hclust() function.
b. For all three methods in (a), cut the hierarchical clustering tree at 4 clusters and report the two-way table of the car
name and the cluster it belongs to.
c. We can plot the dendrograms easily enough by doing the following:
plot(mtcars.hclust.complete, labels = rownames(mtcars),
main = "Cluster Dendrogram (Complete Linkage)")
The code above plots the dendrogram for the Complete Linkage case. Provide this plot and repeat the above for the
other 2 cases from part (a). Alternatively, we can use the ggdendro library, which builds on ggplot2.
library(ggplot2)   # provides labs()
library(ggdendro)  # provides ggdendrogram()
# Vertical
ggdendrogram(mtcars.hclust.complete) + labs(title = "Cluster Dendrogram (Complete Linkage)")
# or horizontal with a different theme
ggdendrogram(mtcars.hclust.complete, rotate = TRUE, theme_dendro = FALSE) +
  labs(title = "Cluster Dendrogram (Complete Linkage)")
Some alternative tools for plotting & customizing dendrograms can be found here:
http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning
Section 5 has some really cool examples of how to colorize clusters, etc.
Another awesome example can be found here: https://www.r-graph-gallery.com/340-custom-your-dendrogram-with-dendextend/
d. Use the elbow method to determine the optimal number of clusters that you should use. This works the same basic way
as in Problem 1, but the call is slightly different: you pass the hcut function by name (written without parentheses) as
an argument, as seen below.
fviz_nbclust(mtcars.scaled, hcut, method = "wss")
e. Add colored rectangles around the clusters you have determined in part (d) to the dendrogram in part (c). You can
simply run the following line after the plot in (c) if you used the first method (other methods require other functions).
rect.hclust(mtcars.hclust.complete,
  k = 100, # replace with whatever you decided based upon (d)
  border = c("red", "blue", "purple", "magenta")
)
f. Use cutree() to obtain the cluster assignments using your decision in (d). Recode this as a factor object. Plot mpg
versus wt and use different colors according to your cluster assignment.
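The hierarchical clustering steps in this problem (scale, compute distances, cluster, cut the tree, tabulate, draw rectangles) can be sketched on a different built-in dataset, USArrests. The object names and the choice of k = 3 are assumptions made for this illustration and are not a suggested answer for mtcars.

```r
# Illustration only: USArrests instead of mtcars.
arrests.scaled <- scale(USArrests)               # standardize the variables
arrests.dist <- dist(arrests.scaled, method = "euclidean")

# One tree per linkage method.
linkages <- c("complete", "average", "single")
trees <- lapply(linkages, function(m) hclust(arrests.dist, method = m))
names(trees) <- linkages

# Cut each tree into 3 clusters and store the labels as factors.
labels3 <- lapply(trees, function(tr) as.factor(cutree(tr, k = 3)))

# Cluster sizes for each linkage.
sapply(labels3, table)

# Two-way table comparing complete-linkage and single-linkage assignments.
table(complete = labels3$complete, single = labels3$single)

# Dendrogram with rectangles around the 3 complete-linkage clusters.
plot(trees$complete, labels = rownames(USArrests), cex = 0.6,
     main = "USArrests (Complete Linkage)")
rect.hclust(trees$complete, k = 3, border = c("red", "blue", "purple"))
```

Computing the distance matrix once and reusing it for every linkage keeps the three trees directly comparable.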
Discussion: While you can see some clustering that makes sense, some of the clusters seem to be intermixed with each other.
Remember, the clustering was not determined from the mpg and wt variables alone, but from the information in
all of the variables in the dataset. So the clusters are formed in multidimensional space. This can be difficult to visualize
when the number of dimensions is bigger than 3.
One solution is to rotate the multidimensional variable coordinate system into the principal component coordinate system as
we will show in the next problem. This means that if most of the variation is in the first 2 principal components, it might be
possible to more easily see the clusters in higher dimensional space using the lower dimensional representation.
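To get a feel for how much of the variation the first two principal components capture, one can inspect the variance proportions from prcomp(). The sketch uses the built-in USArrests data as a stand-in; the names arrests.pca and prop.var are assumptions for illustration.

```r
# Illustration only: PCA on standardized USArrests data.
arrests.pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of total variance explained by each principal component.
prop.var <- arrests.pca$sdev^2 / sum(arrests.pca$sdev^2)
round(prop.var, 3)

# Cumulative share captured by PC1 and PC2 together.
sum(prop.var[1:2])
```

If this cumulative share is large, a scatterplot of the first two component scores is a reasonable low-dimensional view of the clusters.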
Problem 3: [20 pts] PCA + Hierarchical Clustering.
Consider the mtcars dataset once again.
a. Use PCA to rotate the observations from the mtcars dataset into a new coordinate system using the principal components.
Remember to set either scale. = TRUE if using prcomp() or cor = TRUE if using princomp(). We
want the component scores, which are in pca.result$x if you used prcomp() or pca.result$scores if you used
princomp().
b. For the principal component “variables” for the cars (which are the component scores), use the hierarchical clustering
techniques that you learned in Problem 2. Use this to determine the optimal number of clusters, show the dendrogram
and put a box around these clusters. Use Complete Linkage only. How does this compare with your result
from the previous problem when you used complete linkage on the original variables?
c. Using k = 4 for the number of clusters, plot PC2 versus PC1 and use different colors to show the clusters. You will need
to use cutree() to get the cluster assignments. Can you see the clusters a bit better than you could when plotting
mpg versus wt in the previous problem?
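The combination of PCA and hierarchical clustering can be sketched as below, again on USArrests rather than mtcars. Base graphics are used to keep the example free of extra packages; the object names and k = 3 are assumptions for illustration, not a suggested answer.

```r
# Illustration only: USArrests instead of mtcars.
arrests.pca <- prcomp(USArrests, scale. = TRUE)
scores <- arrests.pca$x                      # component scores, one row per state

# Complete-linkage clustering on the component scores.
scores.hclust <- hclust(dist(scores), method = "complete")
cluster <- as.factor(cutree(scores.hclust, k = 3))

# PC2 versus PC1, coloured by cluster.
plot(scores[, "PC1"], scores[, "PC2"],
     col = as.integer(cluster), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "USArrests: clusters in principal component space")
legend("topright", legend = levels(cluster),
       col = seq_along(levels(cluster)), pch = 19, title = "Cluster")
```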