DS300-001 Homework 3 Due on Nov 10, 2020 Overview – In this homework, you are required to implement a differentially private k-means clustering algorithm. Most of the technical details can be found in Lectures 11 and 12. The submission includes one .txt file (with the answer to Q1, Q2, and Q4), one completed .py file (with the answer to Q3), and three .png files (with the clustering results in Q4) in a zipped folder. Q1 – Consider a collection of 2-dimensional datapoints. Each datapoint x resides in the domain of [−1, 1]× [−1, 1] (i.e., both the first and second dimensions of x lie within the interval of [−1, 1]). If we apply k-means clustering over the dataset, what is the sensitivity of (i) computing the size of each cluster and (ii) computing the sum of the datapoints in each cluster (note: the overall sensitivity is the summation of the sensitivity along each dimension)? Q2 – We now run the clustering algorithm for T iterations and apply the Laplace mechanism over the process to achieve -differential privacy. What is the privacy budget for each iteration if we allocate the budget equally among them? Assuming during each iteration, we allocate the same amount of privacy budget to (i) computing the cluster size and (ii) computing the sum of datapoints (note: we add independent noise to each dimension). What is the setting of λ for (i) and (ii)? Q3 –Note: make sure you have a Python environment installed on your computer. If not, Anaconda (https://www.anaconda.com) is strongly recommended as it includes all the packages necessary for this homework. In this question, you are required to implement differentially private k-means using the setting of Q1 and Q2. You can run the script using the following command in a terminal: python ./dp_kmeans.py -d data.npy -k 5 -t 10 -e 0.1 with the argument specification as follows: • -d: the dataset to be clustered (default is “data.npy”); • -k: the number of clusters (default is “5”); • -t: the number of iterations (default is “10”); • -e: the setting of (default is “None”). You are required to complete the missing lines according to the setting of Q1 and Q2. Q4 – Run the code (i) without providing the argument -e (i.e., the baseline k-means), (ii) -e 0.1, and (iii) -e 1, and save the visualized clustering results (in .png format). Compare the results under varying settings of . What is your observation about the privacy-utility tradeoff? 1
欢迎咨询51作业君