THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY
Department of Computer Science and Engineering
COMP4331: Introduction to Data Mining
Fall 2020 Assignment 1
Due time and date: 11:59pm, Oct 12 (Mon), 2020.
IMPORTANT NOTES
• Your grade will be based on correctness and clarity.
• Late submission: 25 marks will be deducted for every 24 hours after the deadline.
• ZERO tolerance for plagiarism: all involved parties will receive zero marks.
In this assignment, you are required to preprocess the following data set on house prices:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
You can download this directly from Kaggle. Alternatively, you may run “pip install kaggle” to install the Kaggle
API, and then run “kaggle competitions download -c house-prices-advanced-regression-techniques” to obtain the data
set. Please note that you will first have to create a Kaggle account. Using these commands also requires you to first
obtain a Kaggle API token.
Tasks
Please use the file “train.csv” for the following tasks. Tasks that require Python code are marked
“(Code)”, while tasks to be done by hand are marked “(Write)”.
1. (Write) For each attribute, determine if it is nominal, ordinal or numeric. Write your answers in the report,
listing each attribute separated by a comma. (e.g. Nominal: [attribute1], [attribute2])
2. (Code) Show the histogram for “SalePrice” (using matplotlib or seaborn) in the report.
3. (Code) Draw a boxplot for every numeric attribute using matplotlib or seaborn. Save the results in the Python
notebook file. You do not need to include these plots in your report.
4. (Code) Delete a record if any of its numeric attribute values is an outlier from the boxplot. Show the number
of deleted records in the report.
5. (Code) Remove numeric attributes that have the same value for all records. Show the removed attributes in the
report.
6. (Code) From the remaining numeric attributes, report the top 5 numeric attributes (excluding “SalePrice”)
that are most correlated with attribute “BsmtFinSF1”. Be sure to take both positively and negatively correlated
attributes into account. Remove 2 of these 5 numeric attributes. Justify in the report why you pick these 2.
For the numeric attributes that are removed, show their scatter plots in your report.
7. (Write) Among the following three attributes
“Alley”, “BldgType”, “GarageCond”,
remove any attributes that are dependent on the attribute “GarageQual” at a significance level of 0.001.
In the report, show all your calculations and steps to justify why each removed attribute should be removed. A χ² table is
provided here for your reference.
8. (Code) For each of the remaining numeric attributes, fill in the missing values with the corresponding mean
value (missing entries are denoted “NA” in the data set). Show the filled-in values in the report.
9. (Code) For attributes “MasVnrType” and “Electrical”, fill in the missing values with the corresponding most
popular value in the data set. Show the filled-in values in the report.
10. (Code) Standardize each numeric attribute such that the mean is 0 and the standard deviation is 1 using the
built-in functions from “scikit-learn”.
11. (Code) For the remaining numeric attributes, find the smallest set of PCA features so that the proportion of
explained variance is at least 0.9. Plot the graph of the cumulative explained variance in your report. Using the
number of components found, perform PCA. Output and report the five-number summary for each component.
12. (Code) Output the resulting numeric components to a CSV file with the name “numeric_train.csv”, and the
categorical attributes to a CSV file with the name “categorical_train.csv” using “pandas”. (Be sure to preserve
the “NA” values for the categories).
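For task 4, the boxplot outlier rule can be sketched as follows. This is a minimal illustration on toy data, assuming the usual default whisker length of 1.5 × IQR used by matplotlib and seaborn; the function name `drop_boxplot_outliers` and the toy columns are hypothetical and not part of the assignment.

```python
import pandas as pd


def drop_boxplot_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any numeric column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons with missing values are False, so NaN entries are never flagged.
    below = numeric.lt(lower, axis="columns")
    above = numeric.gt(upper, axis="columns")
    is_outlier = (below | above).any(axis=1)
    return df.loc[~is_outlier]


# Toy data (hypothetical columns): only the last row has an extreme "area".
toy = pd.DataFrame({
    "area": [50, 55, 60, 65, 70, 500],
    "rooms": [2, 3, 3, 4, 4, 3],
    "kind": ["a", "b", "a", "b", "a", "b"],
})
cleaned = drop_boxplot_outliers(toy)
n_deleted = len(toy) - len(cleaned)
print("deleted records:", n_deleted)
```

Note that a column whose IQR is 0 flags every value different from its quartiles, so it is worth checking the per-column fences before deleting records.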
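For task 6, ranking by the absolute value of the Pearson correlation is one way to take both positively and negatively correlated attributes into account. The sketch below uses toy data; `top_correlated` and the column names are hypothetical, chosen only for illustration.

```python
import pandas as pd


def top_correlated(df: pd.DataFrame, target: str, n: int = 5,
                   exclude: tuple = ()) -> pd.Series:
    """Return the n numeric columns with the largest |Pearson r| against target."""
    corr = df.select_dtypes(include="number").corr()[target]
    # The target itself (r = 1) and any excluded columns are not candidates.
    corr = corr.drop(labels=[target, *exclude], errors="ignore")
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)  # signed r, ordered by |r|


# Toy data (hypothetical columns).
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5],
    "pos": [2, 4, 6, 8, 10],    # perfectly positively correlated
    "neg": [5, 4, 3, 1, 2],     # strongly negatively correlated
    "weak": [1, 3, 2, 3, 1],    # uncorrelated
    "price": [10, 20, 30, 40, 55],
})
top = top_correlated(toy, "target", n=2, exclude=("price",))
print(top)
```

Keeping the signed value in the result makes it easy to report the direction of each correlation alongside its rank.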
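Task 7 is meant to be done by hand, but the arithmetic of the χ² test of independence can be checked with a few lines of standard-library Python. The sketch below runs on toy labels, not on the assignment data; the helper name `chi_square_statistic` is hypothetical, and how “NA” entries are treated (as their own category or dropped) is a choice left to the reader.

```python
from collections import Counter


def chi_square_statistic(xs, ys):
    """Return (chi2, degrees_of_freedom) for two equal-length categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))   # contingency table counts
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n   # count expected under independence
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Toy 2 x 2 example: observed counts 8, 2, 2, 8; every expected count is 5.
quality = ["good"] * 10 + ["poor"] * 10
condition = ["good"] * 8 + ["poor"] * 2 + ["good"] * 2 + ["poor"] * 8
chi2, dof = chi_square_statistic(quality, condition)

# Critical value for 1 degree of freedom at significance level 0.001.
CRITICAL_DOF1_ALPHA_0001 = 10.828
dependent = chi2 > CRITICAL_DOF1_ALPHA_0001
print(chi2, dof, dependent)
```

Here χ² = 4 × (3² / 5) = 7.2, which is below 10.828, so independence is not rejected at the 0.001 level for this toy table. For larger tables the critical value must be looked up for the corresponding degrees of freedom.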
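For tasks 10 and 11, one possible flow is to standardize, fit a full PCA to read off the cumulative explained variance, and then refit with the smallest number of components that reaches the threshold. The sketch below uses synthetic data generated with NumPy; the data, the variable names, and the random seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 6 columns driven by 2 latent factors plus small noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# Standardize each column to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Cumulative proportion of explained variance over all components.
cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)

# Smallest k such that the first k components explain at least 90%.
threshold = 0.9
k = int(np.searchsorted(cumulative, threshold) + 1)

components = PCA(n_components=k).fit_transform(X_std)

# Five-number summary per component: min, Q1, median, Q3, max (one column each).
summary = np.percentile(components, [0, 25, 50, 75, 100], axis=0)
print("components kept:", k)
print(summary)
```

The `cumulative` array is what would be plotted against the component index (for example with `matplotlib.pyplot.plot`) to obtain the cumulative explained variance graph.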
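For task 12, note that `pandas.read_csv` parses the string “NA” as a missing value by default, so the label disappears unless it is written back explicitly. The sketch below shows one way to do that with the `na_rep` argument of `to_csv`; the rows are toy data and an in-memory buffer stands in for the output file.

```python
import io

import pandas as pd

# Toy rows in the same style as the data set; "NA" marks a missing category.
raw = "Id,Alley,LotArea\n1,NA,8450\n2,Grvl,9600\n3,NA,\n"

# Default parsing: the string "NA" and the empty field both become NaN.
frame = pd.read_csv(io.StringIO(raw))
print(frame["Alley"].isna().sum())

# Write the categorical columns with missing values rendered as "NA" again.
out = io.StringIO()
frame[["Id", "Alley"]].to_csv(out, index=False, na_rep="NA")
print(out.getvalue())
```

An alternative is to read the file with `keep_default_na=False`, which keeps “NA” as an ordinary string; in that case numeric columns with missing entries need separate handling, because their empty or “NA” fields are no longer converted to NaN.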
Submission Guidelines
Please submit a report (report.pdf), a Python notebook (assignment1.ipynb) for your code, and the output data
set files (numeric_train.csv and categorical_train.csv). Zip all the files into either [your student ID]_assignment1.zip
or [your student ID]_assignment1.tar.gz. Please submit the assignment by uploading the compressed file to Canvas.
Note that the assignment should be clearly legible; you may lose points if it is difficult
to read. Plagiarism will lead to zero points on this assignment.