DSCI553 Foundations and Applications of Data Mining Spring 2021 Assignment 5 Clustering Deadline: Apr. 23rd 11:59 PM PST 1. Overview of the Assignment In Assignment 5, you will implement the K-Means and Bradley-Fayyad-Reina (BFR) algorithm. Doing the assignment helps you get familiar with clustering algorithms using various distance measurements. The datasets you are going to use are synthetic. 2. Assignment Requirements 2.1 Programming Language and Library Requirements a. You must use Python to implement all the tasks. You can only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed). Spark RDD is optional for Python. If you want to use Spark, please specify the following environment in your code: a. There will be 10% bonus for Scala implementation in each task. Spark RDD is required for Scala. You can get the bonus only when both Python and Scala implementations are correct. b. Spark DataFrame and DataSet are not allowed. 2.2 Programming Environment Python 3.6, Scala 2.11, and Spark 2.3.0 We will use Vocareum to automatically run and grade your submission. You must test your scripts on the local machine and the Vocareum terminal before submission. 2.3 Write your own code Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you may find Python implementations of some of the required functions on the Web. Please do not look for or at any such code! Plagiarism detection will combine all the code we can find from the Web (e.g., Github) as well as other students’ code from this and other (previous) sections. We will report all detected plagiarism to the university. 3. Dataset Since the BFR algorithm has a strong assumption that the clusters are normally distributed with independent dimensions, we generated synthetic datasets by the following steps: 1) Initializing some random centroids 2) For each centroid, sampling data points following the normal distribution in each dimension. Together, the mean of the normal distribution in each dimension is the centroid of the clusters, and the standard deviation is set manually 3) For each centroid, we added some data points as the outliers. Figure 1 shows an example of the data points (in CSV format). The first column is the data point indexes. The other columns represent the features/dimensions of the data points. Figure 1: An example of the data points with 3 dimensions You can access and download the following datasets either under the directory (resource/asnlib/publicdata/) on Vocareum or Google Drive (USC email only): https://drive.google.com/drive/folders/1mZgx9vHpsjwtjwbZGdnVjrVypKoZJaP9?usp=sharing a. Five folders are named from “test1” to “test5”. Each folder contains multiple files storing data points. We will treat these files as separate data chunks. In each iteration, you will load one file (one chunk of points) to the memory and process these data points with the BFR algorithm. b. Five files are named from “cluster1.json” to “cluster5.json”. Each file provides the ground truth cluster for the data points in the corresponding folder from “test1” to “test5”. The key is the data point index (as a string). The value is its corresponding cluster index. The cluster index of the outliers is -1. The number of clusters for each dataset is test1: 10, test2: 10, test3: 5, test4: 8, test5: 15. c. We have generated ten testing sets using the same method. Five of them are provided for your implementation. The rest of them are used for grading. Notice that the number of dimensions, the number of files, and the number of data points in each dataset could vary. 4. Task (8pts) You need to submit the following files on Vocareum: (all lowercase) a. [REQUIRED] Python scripts: bfr.py b. [REQUIRED FOR SCALA] Scala scripts: bfr.scala; one Jar package: hw5.jar c. [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions). 4.1 Task description You will write the K-Means and Bradley-Fayyad-Reina (BFR) algorithms from scratch. You should implement K-Means as the in-memory clustering algorithm that you will use in BFR. You will iteratively load the data points from a file and process these data points with the BFR algorithm. See below pseudocode for your reference. In BFR, there are three sets of points that you need to keep track of: Discard set (DS), Compression set (CS), Retained set (RS). For each cluster in the DS and CS, the cluster is summarized by: N: The number of points SUM: the sum of the coordinates of the points SUMSQ: the sum of squares of coordinates The conceptual steps of the BFR algorithm: Please refer to the slides. The implementation details of the BFR algorithm: Please refer to Section 7. 4.2 Execution commands Python command: $ python3 bfr.py
Scala command: $ spark-submit --class bfr hw5.jar Params : the folder containing the files of data points : the number of clusters : the output file of cluster results : the output file of intermediate results 4.3 Output format a. You must write your clustering results in the JSON format (see Figure 2). The key is the data point index (as string). The value is its corresponding cluster index. The indexes for both data points and clusters start from 0. Figure 2: An example of the output clustering results b. You must output the intermediate results in the CSV format (use the same headers in Figure 3). Each line in the intermediate results represents the following information about each iteration/data trunk: - round id (starting from 1): # of rounds must be # of data chunks in the folder. - the number of clusters in the discard set - the total accumulated number of the discarded points: The number should only go up with iterations. - the number of clusters in the compression set - the total number of the compressed points - the number of points in the retained set Figure 3: An example of the intermediate results 4.3 Grading We will use normalized mutual information (NMI) score to evaluate your clustering results. To obtain the full point(0.8pt) for each dataset, the NMI should be above 0.8, and the intermediate result is correct. If either criterion (i.e., NMI and intermediate result) does not meet, you will lose all points for a dataset. The submission report will tell you which part of your intermediate results is incorrect. We will grade your code on all ten datasets. 5. About Vocareum a. Your code can directly access the datasets under the directory: ../resource/asnlib/publicdata/ b. You should upload the required files under your workspace: work/ c. You must test your scripts on both the local machine and the Vocareum terminal before submission. d. During submission period, the Vocareum will run and evaluate the results for test1 and test2. e. You will receive a submission report after Vocareum finishes executing your scripts. The submission report shows NMI for each dataset. If the intermediate results are incorrect, the submission shows the reason(s). f. The total execution time of submission period should be less than 600 seconds. The execution time of grading period need to be less than 3,000 seconds. g. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade on your last submission. 6. Grading Criteria (% penalty = % penalty of possible points you get) a. You can use your free 5-day extension separately or together. You must submit a late-day request via https://forms.gle/6aDASyXAuBeV3LkWA. This form is recording the number of late days you use for each assignment. By default, we will not count the late days if no request submitted. b. There will be 10% bonus for each task if your Scala implementations are correct. Only when your Python results are correct, the bonus of Scala will be calculated. There is no partial point for Scala. c. There will be no point if your submission cannot be executed on Vocareum. d. There is no regrading. Once the grade is posted on the Blackboard, we will only regrade your assignments if there is a grading error. No exceptions. e. There will be 20% penalty for the late submission within one week and no point after that. 7. Appendix The implementation details of the BFR algorithm (you can/should have your own implementation; this is only for reference). Suppose the number of clusters is K and the number of dimensions is d. 1. Load the data points from one file. 2. Run K-Means on the data points or a random subset of the data points. For the implementation, you will apply K-means on a subset since you cannot handle all data points in the first file. Initialize the algorithm with a large number of centroids (e.g., 3 or 5 times of K) and use the Euclidean distance as the similarity measurement. 3. Among the result clusters from step 2, move all the clusters that contain only one or very few points (you define “very few”) to RS as the outliers. You will now have two groups of data points: the outlier data points and inlier data points. 4. Run the clustering algorithm again to cluster the inlier data points into K clusters. Use these K clusters as your DS. Discard these points and generate the DS statistics. 5. Run the clustering algorithm again to cluster the outlier data points using a large number of clusters (e.g., 3 or 5 times of K). Generate CS and their statistics from the clusters with more than one data point and use the remaining as your new RS. 6. The above steps finish the initialization of DS. So far, you have K DS (from step 4), some number of CS (from step 5), and some number of RS (from step 5). 7. Load the data points from another file (step 2 loads a portion of the first file, load the remaining data points from the first file). 8. For the new data points, compare them to each DS using the Mahalanobis Distance and assign them to the nearest DS clusters if the distance is < √ (e.g., 2√). 9. For the new data points which are not assigned to any DS cluster, compare them to each of the CS using the Mahalanobis Distance and assign them to the nearest CS clusters if the distance is < √ (e.g., 2√). 10. For the new data points which are not assigned to any DS or CS cluster, add them to your RS. 11. Run the clustering algorithm on the RS with a large number of centroids (e.g., 3 or 5 time of K). Generate CS and their statistics from the clusters with more than one data point and add them to your existing CS list. Use the remaining points as your new RS. 12. Merge CS clusters that have a Mahalanobis Distance < √ (e.g., 2√). 13. Output your intermediate result after all the data points in the data file have been evaluated. Repeat step 6 to 12. If this is the last run (after the last chunk of data), merge your CS and RS clusters into the closest DS clusters. 欢迎咨询51作业君