CSCC11: HW2 Due by 11:59pm Sunday, November 10, 2019
University of Toronto Scarborough
October 27, 2019
Please submit separate files for (a) the write-up in PDF, named write_up_hw2.pdf (you can convert a Word
document to PDF or, preferably, use LaTeX (https://www.overleaf.com/)), (b) Python files, and (c) figures (if you
choose to include them separately from the write-up). All files are to be submitted on MarkUs. Do not turn
in a hard copy of the write-up.
1. k-means clustering
Given the data in the file “customer.csv”, implement the k-means clustering algorithm (your own
implementation of k-means). The data has features (Gender, Age, Annual income) and a continuous label
(Spending Score). Remember that clustering is an unsupervised learning algorithm!
(a) Implement k-means (for k = 2, 3, 4, 5).
(b) Implement k-means++ (for k = 2, 3, 4, 5).
(c) Plot the data clusters (use a different color for each cluster).
• Input: numpy.ndarray
• Output: Cluster plots
• Libraries: Use any libraries and functions except built-in functions for k-means.
• Functions:
– my_kmeans(features, k) – returns the clusters
– my_kmeans_plot(clusters) – plots the clusters
• Python file name: kmeans_hw2.py
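The update loop and the k-means++ seeding described in (a) and (b) can be sketched with NumPy alone. This is a minimal sketch, not a reference solution: the keyword arguments plus_plus, max_iter and seed, the helper init_plus_plus, the choice to return (labels, centres), and the synthetic three-blob data standing in for customer.csv are all assumptions made for illustration.

```python
import numpy as np


def init_plus_plus(features, k, rng):
    """k-means++ seeding (hypothetical helper): each new centre is drawn with
    probability proportional to its squared distance to the nearest centre."""
    n = features.shape[0]
    centres = [features[rng.integers(n)]]
    for _ in range(1, k):
        diffs = features[:, None, :] - np.array(centres)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centres.append(features[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centres)


def my_kmeans(features, k, plus_plus=False, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centres); the extra keyword
    arguments are assumptions, not part of the assignment's signature."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    if plus_plus:
        centres = init_plus_plus(features, k, rng)
    else:
        # Plain k-means: pick k distinct data points uniformly at random.
        centres = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point.
        dists = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (an empty cluster keeps its centre).
        new_centres = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres


# Synthetic stand-in data: three 2-D blobs of 30 points each.
rng_demo = np.random.default_rng(1)
features = np.vstack([
    rng_demo.normal(loc=c, scale=0.3, size=(30, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])
for k in (2, 3, 4, 5):
    labels, centres = my_kmeans(features, k, plus_plus=True)
    print(k, np.bincount(labels, minlength=k))
```

With the real data, the categorical Gender column would need a numeric encoding before distances can be computed, and features on very different scales (Age versus Annual income) may need standardizing first.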
2. ROC, AUC and confusion matrix (no programming required for this question)
(a) What is an ROC curve and what is the AUC? Define and explain with examples. List Python
function(s) that can plot/calculate these.
(b) What is a confusion matrix? Define and explain with examples. List Python function(s) for it.
(c) What are balanced and imbalanced datasets? Define and explain with examples. List Python
function(s) that can be used to address the imbalanced dataset issue.
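As one illustration of the kind of functions this question asks for, scikit-learn provides roc_curve, roc_auc_score and confusion_matrix in sklearn.metrics, and compute_class_weight in sklearn.utils.class_weight. The sketch below uses a made-up four-sample example; the labels, scores and the 0.5 threshold are assumptions chosen only to keep the arithmetic easy to check by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.utils.class_weight import compute_class_weight

# Toy binary problem (made-up values): true labels and classifier scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve: false/true positive rates as the decision threshold varies.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC: probability that a random positive is scored above a random negative.
# Here 3 of the 4 positive/negative pairs are ordered correctly, so AUC = 0.75.
auc_value = roc_auc_score(y_true, y_score)

# Confusion matrix at an assumed threshold of 0.5 (rows: true, columns: predicted).
y_pred = (y_score >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Imbalanced labels (three negatives, one positive): "balanced" weights are
# n_samples / (n_classes * class_count), so the rare class gets the larger weight.
y_imbalanced = np.array([0, 0, 0, 1])
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_imbalanced
)

print(auc_value)
print(cm)
print(weights)
```

Resampling is another common way to handle imbalance; sklearn.utils.resample can oversample the minority class or undersample the majority class.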
3. Random forest
Implement a random forest classifier using any Python built-in functions on the Titanic survivor data
(train and test sets are provided in CSV files).
• Input: Train and test sets in numpy.ndarray
• Output: Confusion matrix
• Libraries: Any library, any function
• Functions: No restriction
• Python file name: random_forest_hw2.py
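The overall flow (fit on the training set, predict on the test set, report a confusion matrix) might look like the sketch below using scikit-learn's RandomForestClassifier. Because the Titanic CSV files are not reproduced here, the arrays come from make_classification as a synthetic stand-in; the sample counts, feature counts and hyperparameters are assumptions, not values from the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic train/test arrays (binary target).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the forest on the training set only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Confusion matrix on the held-out test set (rows: true, columns: predicted).
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
```

With the real data, categorical columns and missing values would have to be encoded or imputed before the arrays are passed to fit.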
4. PCA
Implement PCA (principal component analysis) on the Iris dataset and reduce it to 3 features and 2
features.
Load the Iris dataset (all instances) in Python using the commands
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
• Input: numpy.ndarray (150 by 4)
• Output: Two reduced dimension matrices (150 by 3) and (150 by 2) and data plot
• Libraries: Use any libraries and functions except built-in functions for PCA.
• Functions:
– my_pca(data_matrix, k) – this returns and prints low_dim_matrix
– my_pca_plot(low_dim_matrix) – plots the low-dimensional data
• Python file name: pca_hw2.py
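One common way to do the reduction without a built-in PCA routine is an eigendecomposition of the covariance matrix, sketched below with NumPy. This is an illustrative sketch under stated assumptions: a random 150 by 4 matrix stands in for X = iris.data so the block has no dependencies beyond NumPy, and the eigendecomposition route is only one of several valid approaches (an SVD of the centered data gives the same subspace).

```python
import numpy as np


def my_pca(data_matrix, k):
    """Project the data onto its top-k principal components."""
    X = np.asarray(data_matrix, dtype=float)
    # Center each feature so the covariance is taken about the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns are variables).
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    low_dim_matrix = X_centered @ eigvecs[:, order]
    print(low_dim_matrix)
    return low_dim_matrix


# Random stand-in with the same shape as the Iris feature matrix (150 by 4).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

low_dim_3 = my_pca(X, 3)  # shape (150, 3)
low_dim_2 = my_pca(X, 2)  # shape (150, 2)
```

The columns of the result are ordered by decreasing variance, so the 2-feature matrix matches the first two columns of the 3-feature one.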