University of Waterloo
ECE 657A: Data and Knowledge Modeling and Analysis
Winter 2024
Assignment 1: Data Cleaning and Dimensionality Reduction
Due: Mar 25th, 2024 11:59pm
Overview
Assignment Type: Done in groups of up to three students.
Hand in: One report (PDF) or Python notebook per group, via the LEARN dropbox. Also submit the code / scripts needed to reproduce your work. (If you are submitting a PDF and don’t know LaTeX, you should try to use it; it’s good practice and it will make the project report easier.)
Objective: To gain experience in the use of classification.
Datasets
Available on LEARN
Dataset A: This data describes splice junctions in DNA sequences. The data set includes 2200 samples with 57 features, in the matrix 'fea'. It is a binary classification problem; the class labels are either +1 or -1, given in the vector 'gnd'. Parameter selection and classification tasks are conducted on this data set. (File: DataA.csv)
Dataset B: This data consists of sepal and petal measurements for 3 different types of irises (Setosa, Versicolour, and Virginica). The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length, and Petal Width. (File: DataB.csv)
Dataset C: Handwritten digits of 0, 1, 2, 3, and 4 (5 classes). This dataset contains 2066 samples with 784 features corresponding to a 28 x 28 gray-scale (0-255) image of the digit, arranged column-wise. This data is used to illustrate the difference between feature extraction methods. (File: DataC.csv)
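As an illustration of the column-wise layout described for Dataset C, the sketch below turns one 784-value row back into a 28 x 28 image. It assumes that "column-wise" means the first 28 values form the first image column (Fortran order in NumPy); the row here is synthetic, not taken from DataC.csv, and the variable names are hypothetical.

```python
import numpy as np

# Synthetic stand-in for one row of DataC.csv: 784 values, here just 0..783
# so the effect of the reshape is easy to inspect.
row = np.arange(784)

# Assumption: pixels are stored column by column, so the first 28 values
# are the first column of the image. order="F" fills columns first.
image = row.reshape(28, 28, order="F")

# The second stored value lands one row down in the first column,
# and the 29th stored value starts the second column.
print(image[1, 0], image[0, 1])
```

If the digits appear transposed when plotted, the file is probably stored row-wise instead, and the default `order="C"` would be the right choice.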
Guidelines
• No late submissions will be accepted.
• The answer sheets are checked for plagiarism.
• The code will be checked for plagiarism against online sources.
• For all random states, use seed=42.
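One possible way to apply the seed guideline in Python is sketched below. The data is synthetic and the 70/30 split is only an illustration, not something the assignment specifies; the point is that every source of randomness receives the same seed so results can be reproduced.

```python
import random

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # seed value required by the guidelines

# Seed Python's built-in generator and create a seeded NumPy generator.
random.seed(SEED)
rng = np.random.default_rng(SEED)

# Synthetic stand-in data (hypothetical shapes, not one of the datasets).
X_demo = rng.normal(size=(100, 4))
y_demo = rng.integers(0, 2, size=100)

# scikit-learn functions and estimators take the seed via random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=SEED
)
print(X_train.shape, X_test.shape)
```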
Nonlinear Dimensionality Reduction
Refer to DataA.csv
Apply the nonlinear dimensionality reduction methods Locally Linear Embedding (LLE) and ISOMAP to Dataset C, set the number of nearest neighbors to 5, the projected low dimension to be
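A minimal sketch of how LLE and ISOMAP with 5 nearest neighbors might be run using scikit-learn is given below. Several things here are assumptions rather than part of the assignment: the target dimension `N_COMPONENTS = 2` is a placeholder, scikit-learn's bundled 8 x 8 digits (classes 0 to 4) stand in for DataC.csv so the example runs without the course files, and the dense eigen-solver is chosen only for robustness on small data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, LocallyLinearEmbedding

SEED = 42
N_NEIGHBORS = 5   # number of nearest neighbors stated in the task
N_COMPONENTS = 2  # placeholder only; use the dimension the task asks for

# Stand-in data: bundled 8x8 digits restricted to classes 0-4,
# scaled from the 0-16 pixel range to [0, 1].
digits = load_digits()
mask = digits.target < 5
X = digits.data[mask] / 16.0
y = digits.target[mask]

# LLE: random_state only affects the ARPACK solver; it is passed anyway
# so the seed guideline is followed if the solver is changed.
lle = LocallyLinearEmbedding(
    n_neighbors=N_NEIGHBORS,
    n_components=N_COMPONENTS,
    eigen_solver="dense",
    random_state=SEED,
)
X_lle = lle.fit_transform(X)

# Isomap is deterministic and has no random_state parameter.
iso = Isomap(n_neighbors=N_NEIGHBORS, n_components=N_COMPONENTS)
X_iso = iso.fit_transform(X)

print(X_lle.shape, X_iso.shape)
```

With so few neighbors the neighborhood graph can be disconnected; recent scikit-learn versions warn about this for Isomap and connect the components automatically. The embeddings `X_lle` and `X_iso` can then be plotted against the labels `y` to compare the two methods.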
Binary Classification