辅导案例-DS300-001

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

DS300-001 Homework 3
Due on Nov 10, 2020
Overview – In this homework, you are required to implement a differentially private k-means
clustering algorithm. Most of the technical details can be found in Lectures 11 and 12. The
submission includes one .txt file (with the answer to Q1, Q2, and Q4), one completed .py file (with
the answer to Q3), and three .png files (with the clustering results in Q4) in a zipped folder.
Q1 – Consider a collection of 2-dimensional datapoints. Each datapoint x resides in the domain
of [−1, 1]× [−1, 1] (i.e., both the first and second dimensions of x lie within the interval of [−1, 1]).
If we apply k-means clustering over the dataset, what is the sensitivity of (i) computing the size
of each cluster and (ii) computing the sum of the datapoints in each cluster (note: the overall
sensitivity is the summation of the sensitivity along each dimension)?
Q2 – We now run the clustering algorithm for T iterations and apply the Laplace mechanism over
the process to achieve -differential privacy. What is the privacy budget for each iteration if we
allocate the budget equally among them? Assuming during each iteration, we allocate the same
amount of privacy budget to (i) computing the cluster size and (ii) computing the sum of datapoints
(note: we add independent noise to each dimension). What is the setting of λ for (i) and (ii)?
Q3 –Note: make sure you have a Python environment installed on your computer. If not, Anaconda
(https://www.anaconda.com) is strongly recommended as it includes all the packages necessary
for this homework.
In this question, you are required to implement differentially private k-means using the setting
of Q1 and Q2. You can run the script using the following command in a terminal:
python ./dp_kmeans.py -d data.npy -k 5 -t 10 -e 0.1
with the argument specification as follows:
• -d: the dataset to be clustered (default is “data.npy”);
• -k: the number of clusters (default is “5”);
• -t: the number of iterations (default is “10”);
• -e: the setting of (default is “None”).
You are required to complete the missing lines according to the setting of Q1 and Q2.
Q4 – Run the code (i) without providing the argument -e (i.e., the baseline k-means), (ii) -e 0.1,
and (iii) -e 1, and save the visualized clustering results (in .png format). Compare the results under
varying settings of . What is your observation about the privacy-utility tradeoff?
1

欢迎咨询51作业君