

CSE 347/447 Data Mining: Project 2 – Classification Algorithm

 Due on 11:59 PM, May 10, 2022 

Standard and General Requirements 

• This is a research project, so you are not required to program from scratch; you can call built-in functions or import packages. Do your utmost to learn and explore how the performance of different algorithms may vary with different datasets and parameter settings. For example, on small datasets like USFGait, a simple SVM may produce better results than DNN-based methods in terms of both effectiveness and efficiency.

• Please note that directly copying code/text from a source without citation constitutes plagiarism, which is forbidden and will result in an F. Partial credit will be given for partial solutions.

• It is important that you participate equally in the group work; if some team members do not keep their commitments, it becomes difficult for the others to complete the project. You are ALLOWED to change your team members before April 29, 2022; please send me an email to discuss it before removing anyone from the group. If you do so, you are NOT ALLOWED to add additional members to the team.

• No Late Policy: There is no late policy for this assignment unless another arrangement is agreed to before the hard deadline; late submission earns no credit. Exceptions will be made only in exceptional circumstances (a life-threatening illness, for example).

• Submission Instructions: Please submit your Code, Report, and Slides as a .zip file to CourseSite. Schedule a group meeting with the TA ([email protected]) after submission to demo your code.

• Group Project Presentation: Each group is required to give an 8-minute presentation to a panel of judges, followed by 2 minutes of Q&A. The presentations will be held 8:30 AM-2:00 PM on May 16, 2022. Please RESERVE your preferred time slot by filling out the online Excel file (sheet 2: Presentation Time).

• Presentation Template: A Presentation Example is provided for your reference.

Complete the following tasks:

• Please implement the following algorithms for classification:
  – K-Nearest Neighbors (KNN)
  – Support Vector Machine (SVM)
  – Convolutional Neural Network (CNN)

• Apply your algorithms to any three of the seven datasets described in Table 1 (of course, you may use all of them). You can download them here. For the CIFAR-10 and MNIST datasets, you can also import them from Keras with the given splits.

• You are required to use cross validation for hyperparameter tuning to achieve better performance. For the smaller datasets (Iyer, Cho, YaleB, and USFGait), use K-fold cross validation (by default, K = 3). For the larger datasets (PIE, CIFAR-10, and MNIST), you can use 10% of the training data as the validation set and the rest as the training set.

• Evaluate your classification algorithms with the Accuracy, F1 score, and AUC metrics.
  – Hint: For multi-class classification, the AUC score can be calculated with multi-class strategies; see Lecture 13, Page 67 for details on the AUC-ROC curve scoring function for multi-class classification. Another example for reference: AUC-ROC for Multi-Class Classification.

• For traditional machine learning algorithms, you may consider using PCA for dimensionality reduction before classification. See the example of PCA for dimensionality reduction in the code demo Lec9: Visualization Similarity Matrix & High-Dimensional Data.

Please note that, to reduce the influence of randomness, for the smaller datasets (i.e., Iyer, Cho, YaleB, and USFGait) you are required to run your algorithm t times (t = 3) and report the average along with the standard deviation (std) on the test sets. The YaleB and USFGait datasets provide training and testing sets, but Iyer and Cho do not, so you need to divide those datasets into training and test sets yourself. Be careful: when tuning the parameters of your algorithms, you may only use the training data; the testing data are only for testing.
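One possible sketch of this workflow, using scikit-learn with a synthetic dataset standing in for a small dataset such as Iyer or Cho (the PCA + KNN pipeline, the parameter grid, and the split ratio are illustrative assumptions, not required choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a small dataset (replace with real data).
X, y = make_classification(n_samples=300, n_features=16, n_informative=10,
                           n_classes=4, random_state=0)

accs, f1s, aucs = [], [], []
for t in range(3):  # run t = 3 times to average out randomness
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=t)

    # PCA for dimensionality reduction, then KNN; tune only on the
    # training data with K-fold cross validation (K = 3 by default).
    pipe = Pipeline([("pca", PCA(n_components=8)),
                     ("knn", KNeighborsClassifier())])
    grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=3)
    grid.fit(X_tr, y_tr)

    y_pred = grid.predict(X_te)
    accs.append(accuracy_score(y_te, y_pred))
    f1s.append(f1_score(y_te, y_pred, average="macro"))
    # Multi-class AUC via the one-vs-rest strategy.
    aucs.append(roc_auc_score(y_te, grid.predict_proba(X_te),
                              multi_class="ovr", average="macro"))

for name, vals in [("Accuracy", accs), ("F1", f1s), ("AUC", aucs)]:
    print(f"{name}: {np.mean(vals):.3f} ± {np.std(vals):.3f}")
```

The test data never enter `GridSearchCV`, matching the requirement that tuning uses only the training set; the final line reports mean ± std over the t = 3 runs.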
Your final submission should include the following:

• Code: Three classification algorithms. Your code is expected to let the user choose any one of the classification algorithms, as in Project 1, but this is optional for this project.

• Report: Describe the flow of all implemented algorithms. Compare the performance of these approaches on the selected datasets in terms of Accuracy, F1, and AUC. State the pros and cons of each algorithm and any findings from your experiments, such as a parameter sensitivity analysis. Submit your report as a PDF.

• Presentation: Complete your slides and prepare for an 8-minute presentation. Submit your slides as a PDF.

You may try to structure your report like a paper submitted to a journal/conference, but this is not required. In your report, you can use tables and figures to compare the results of different algorithms on different datasets. For example, Table 2 compares two dimensionality-reduction-based KNN classification methods, "MPCA" and "FMPCA", on several datasets.

Description of Datasets

Detailed information on the datasets is available in Table 1.

Table 1: Detailed information of the datasets.

Dataset     Feature Size      # Training   # Testing   # Classes
Iyer        12                –            –           11
Cho         16                –            –           5
YaleB       32 × 32           2186         228         38
PIE         32 × 32           10399        1155        68
USFGait     128 × 88 × 20     630          101         71
CIFAR-10    32 × 32 × 3       50000        10000       10
MNIST       28 × 28           50000        10000       10

YaleB: The YaleB dataset is a grey-scale human face dataset with 2414 face images of 38 people; the size of each face image is 32 × 32. Some sample face images of one person in this dataset are illustrated in Fig. 1. Based on the original dataset, we generate 3 different pairs of training/testing files, termed 'StTrainFile/StTestFile'; each training and testing file contains 2186 and 228 faces of the 38 people, respectively.

Figure 1: Sample images of one person in the Yale B dataset.

The size of each training/testing file is 2186 × 1025 and 228 × 1025, respectively.
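Assuming the training/testing files are plain numeric text, with each row holding 1024 pixel features followed by the label in the last column (the actual filenames, extension, and delimiter may differ), they could be loaded as follows; the demo file below is a small synthetic stand-in:

```python
import numpy as np

def load_split(path):
    """Load a YaleB-style file: each row is 1024 pixel features + 1 label."""
    data = np.loadtxt(path)  # add delimiter="," if the files are comma-separated
    X, y = data[:, :-1], data[:, -1].astype(int)
    return X, y

# Demo with a tiny synthetic file shaped like StTrainFile
# (the real file would be 2186 rows x 1025 columns).
rng = np.random.default_rng(0)
demo = np.hstack([rng.random((5, 1024)),
                  rng.integers(1, 39, (5, 1)).astype(float)])  # labels 1..38
np.savetxt("StTrainFile_demo.txt", demo)

X_train, y_train = load_split("StTrainFile_demo.txt")
print(X_train.shape, y_train.shape)  # (5, 1024) (5,)
```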
Each row represents the features and label of one sample, and the last column corresponds to the ground-truth label.

PIE: Similar to YaleB, the PIE dataset is also a grey-scale human face dataset, with 11554 face images of 68 people. The size of each face image is 32 × 32. Some sample face images of this dataset are illustrated in Fig. 2.

Figure 2: Sample images of one person in the PIE dataset.

USFGait: The USFGait dataset is a third-order gait recognition dataset; the size of each gait sequence is 128 × 88 × 20. Similar to YaleB, we also generate 3 pairs of training/testing data files based on the original dataset. One sample gait sequence is illustrated in Fig. 3.

Figure 3: Illustration of a gait silhouette sequence.

CIFAR-10: The CIFAR-10 dataset consists of 60000 32 × 32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another; between them, the training batches contain exactly 5000 images from each class. Some sample images of the CIFAR-10 dataset are illustrated in Fig. 4. More details on the CIFAR-10 dataset are available at https://www.cs.toronto.edu/~kriz/cifar.html.

Figure 4: Sample images of the CIFAR-10 dataset.

MNIST: The MNIST handwritten digit dataset has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. The dataset and a more
detailed description are available at http://yann.lecun.com/exdb/mnist/. Sample images of the MNIST dataset are illustrated in Fig. 5.

Figure 5: Sample images of the MNIST handwritten digit dataset.

Iyer and Cho: You can also use the Iyer or Cho dataset for classification; details of these two datasets are included in the first project.

Table 2: Classification accuracy results of MPCA- and FMPCA-based KNN on real face and gait datasets.
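For CIFAR-10 and MNIST, the 10% validation split mentioned in the tasks might look like the following sketch (dummy arrays stand in for the real data so this runs offline; with TensorFlow installed you would instead call `tf.keras.datasets.mnist.load_data()`, which returns the given 60000/10000 train/test split):

```python
import numpy as np

# Stand-ins shaped like the Keras MNIST training split:
# (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = np.zeros((60000, 28, 28), dtype=np.uint8)
y_train = np.zeros(60000, dtype=np.uint8)

# Hold out a random 10% of the training data for validation; train on the rest.
rng = np.random.default_rng(0)
idx = rng.permutation(len(x_train))
n_val = len(x_train) // 10
val_idx, tr_idx = idx[:n_val], idx[n_val:]

x_val, y_val = x_train[val_idx], y_train[val_idx]
x_tr, y_tr = x_train[tr_idx], y_train[tr_idx]
print(x_tr.shape, x_val.shape)  # (54000, 28, 28) (6000, 28, 28)
```

Shuffling before slicing matters here: the MNIST and CIFAR-10 training sets are not guaranteed to be in random class order, so taking the last 10% without a permutation could yield an unrepresentative validation set.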
