ITS61504 Data Mining
Assignment (Individual)

DUE DATE  : 6th November 2020, 8pm
WEIGHTAGE : 20%
SEMESTER  : August 2020

STUDENT DECLARATION

1. I confirm that I am aware of the University’s Regulation Governing Cheating in a University Test and Assignment and of the guidance issued by the School of Computing and IT concerning plagiarism and proper academic practice, and that the assessed work now submitted is in accordance with this regulation and guidance.
2. I understand that, unless already agreed with the School of Computing and IT, assessed work may not be submitted that has previously been submitted, either in whole or in part, at this or any other institution.
3. I recognise that should evidence emerge that my work fails to comply with either of the above declarations, then I may be liable to proceedings under Regulation.

No   Student Name   Student ID   Date   Signature   Score
1

Marking Rubrics (Lecturer’s Use Only)
Attached as 2nd page in the report

Criteria              Weight   Score
Question 1 (a)        5
Question 1 (b)        5
Question 1 (c)        5
Question 1 (d)        5
Question 1 (e)        5
Question 1 (f-i)      5
Question 1 (f-ii)     5
Question 2 (a)        5
Question 2 (b)        10
Question 3 (a)        5
Question 3 (b-i)      5
Question 3 (b-ii)     5
Question 3 (b-iii)    5
Question 3 (b-iv)     5
Question 3 (c-i)      5
Question 3 (c-ii)     5
Question 4            15
Total Marks (100%)
Total Marks (20%)

Grading
Excellent   >= 90% of the marks
Good        < 90% to >= 75% of the marks
Fair        < 75% to >= 40% of the marks
Poor        < 40% of the marks

Remark: Softcopy Submission

Assignment 1 Marking Rubrics

Performance bands
Excellent   >= 90% of the marks
Good        < 90% to >= 75% of the marks
Average     < 75% to >= 40% of the marks
Poor        < 40% of the marks

Criteria              Weight (%)
Question 1 (a)        5
Question 1 (b)        5
Question 1 (c)        5
Question 1 (d)        5
Question 1 (e)        5
Question 1 (f-i)      5
Question 1 (f-ii)     5
  Excellent: The question is correctly solved and answered. The solution is clearly elaborated and presented step by step.
  Good:      The question is correctly solved and answered. The solution is NOT clearly elaborated and presented step by step.
  Average:   The question is NOT correctly solved and answered. The solution is clearly elaborated and presented step by step.
  Poor:      The question is NOT correctly solved and answered. The solution is NOT clearly elaborated and presented step by step.

Question 2 (a)        5
Question 2 (b)        10

Question 3 (a)        5
Question 3 (b-i)      5
Question 3 (b-ii)     5
Question 3 (b-iii)    5
Question 3 (b-iv)     5
Question 3 (c-i)      5
Question 3 (c-ii)     5

Question 4            15
  Excellent: An appropriate result comparison is clearly elaborated and discussed.
  Good:      An appropriate result comparison is not clearly elaborated and discussed.
  Average:   An inappropriate result comparison is given, with elaboration and discussion.
  Poor:      An inappropriate result comparison is given, without elaboration or discussion.

Instruction: Please read the following assignment notes, requirements and attached marking rubrics carefully. Where a task requires it, you may make your own assumptions, provided you give a valid justification.

Note 1: Copying, cheating, attempts to cheat, plagiarism, collusion and any other attempts to gain an unfair advantage in assessment will result in 0 marks being awarded to all parties concerned.
Note 2: The Turnitin similarity limit for this assessment is 20% overall and less than 5% from any single source, excluding program source code.
Note 3: All submitted documents will be cross-checked against other students’ reports from the current and previous semesters. Any similarity beyond what is allowed in Note 2 will be considered a violation of assessment rules, and a zero (0) mark will be given to all students involved.
Note 4: Severe disciplinary action will be taken against those caught violating assessment rules, such as colluding, plagiarizing or transcribing.
Note 5: The assignment submission document should be 10 to 20 pages in total, with 1.5 line spacing and 12pt Times New Roman font.
Note 6: Module Learning Outcomes. On completion of this alternative assessment, students should be able to:
MLO1: Describe data mining activities for various data mining techniques.
MLO2: Analyze and apply appropriate data mining techniques to achieve different purposes using various types of data.
MLO3: Demonstrate practical data mining skills using data mining tools/languages.

DATA SET

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year (Dua & Graff, 2017). Below is a modified sample of the dataset. To access more records of this modified dataset you can Click Here.

Age  Work_Class        Education     Marital_Status         Sex     Hours_Per_week  Income
39   State-gov         Bachelors     Never-married          Male    40              <=50K
50   Self-emp-not-inc  Bachelors     Married-civ-spouse     Male    13              <=50K
38   Private           HS-grad       Divorced               Male    40              <=50K
53   ?                 11th          Married-civ-spouse     Male    40              <=50K
28   Private           Bachelors     Married-civ-spouse     Female  40              <=50K
37   Private           Masters       Married-civ-spouse     Female  40              <=50K
49   Private           9th           Married-spouse-absent  Female  16              <=50K
52   Self-emp-not-inc  HS-grad       Married-civ-spouse     Male    45              >50K
31   ?                 Masters       Never-married          Female  50              >50K
42   Private           Bachelors     Married-civ-spouse     Male    40              >50K
37   Private           Some-college  Married-civ-spouse     Male    80              >50K
30   State-gov         Bachelors     Married-civ-spouse     Male    40              >50K
23   Private           Bachelors     Never-married          Female  30              <=50K
32   Private           Assoc-acdm    Never-married          Male    50              <=50K
40   Private           Assoc-voc     Married-civ-spouse     Male    40              >50K
34   Private           7th-8th       Married-civ-spouse     Male    45              <=50K
25   Self-emp-inc      HS-grad       Never-married          Male    35              <=50K
32   Private           HS-grad       Never-married          Male    40              <=50K
38   Private           11th          Married-civ-spouse     Male    50              <=50K

Important: All calculations must be clearly presented step by step.

1. Association rule mining
a) Explain what preprocessing techniques are needed for this dataset. Using those techniques, clean the data if necessary.
b) Discretize the attributes Age and Hours_Per_week into three discrete levels.
c) Manually calculate all item sets (from one item up to the maximum number of items you can find) using the Apriori method. Prune the item sets with a minimum support of 50% and a minimum confidence of 75%. You can add more records from the modified dataset if required.
d) Describe what will happen if you change the minimum support and minimum confidence.
e) List all association rules (strong rules) that match this meta template: * => “>50K”
f) Discuss the following issues:
   i) Measuring interestingness by confidence is criticized as giving misleading conclusions in some cases; give an example of this criticism.
   ii) R offers different metric types that can be used to sort the best rules; briefly explain them (e.g. confidence, lift, leverage and conviction) in plain English.

2. Decision tree
a) Give proper labels to the attributes whose values are ranges of numbers.
b) Construct a decision tree manually (based on information gain) and display your computation steps. Then extract classification rules from the tree that you have created.

3. Naïve Bayesian
a) Construct a probability table manually. Note that you are not going to discretize Hours_Per_week in this exercise (instead, you will calculate the probability by assuming that the data has a normal distribution).
b) Predict whether a person with the following characteristics would have an income higher than 50K or not (clearly display your computation steps):
   i) The person is 35 years old. He/she has a Bachelor degree and works 43 hours per week.
   ii) What will change if this person works for the government or works in a private company?
   iii) What is the effect of marital status on the Income?
   iv) The person is 55 years old, without a higher education degree (having less than a Bachelor degree). He/she is married and works 50 hours per week in a private company. Predict his/her level of income.
c) Discuss the following issues:
   i) How do you deal with missing values in Naïve Bayes?
   ii) Give example calculations using your own example (e.g. the attribute lucky may be missing from the data you want to classify).

4. Compare your result with RStudio
Confirm your computations for the above exercises using RStudio. Print out the RStudio outputs and compare them to your computations. Make sure you understand how the RStudio output is read (e.g. how to read the support and confidence from the rules reported by the Apriori package; how to read a prior and a likelihood in Naïve Bayes, etc.).

Submission Requirements

(1) The project report should be a softcopy in PDF or DOC format. The dataset (or the URL of the dataset) and the R script files must be submitted as well. Expected components are:
   a) Cover page
   b) Marking Rubric
   c) Report content (dataset, Q1, Q2, Q3 and Q4)
   d) R programming code (included in the project report document and also attached as R files in the submission)
   e) References (at the end of the report, list all the references that you have used; every reference should be cited in the report)
(2) The softcopy version is to be submitted online to TIMeS before 6th November 2020, 8pm.
(3) The student may need to present his/her work if clarification is needed on the submitted report or attached files. In this case, the student will be informed at least 24 hours before the presentation time.

References:
Dua, D., & Graff, C. (2017). UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml
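
Illustration of the extraction condition. The condition quoted in the DATA SET section, ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)), can be read as a row filter that keeps a record only when all four tests pass. The sketch below is a minimal illustration of that reading, not the original extraction script. It is written in Python purely for illustration (the assignment itself is to be done in R/RStudio). The class CensusRecord, the function is_clean, the field interpretations in the comments and all sample values are hypothetical and are not taken from the census data.

```python
from dataclasses import dataclass


@dataclass
class CensusRecord:
    # Field names mirror the variables in the quoted condition.
    # The meanings in these comments are assumptions, not stated in the brief.
    aage: int        # assumed: age in years
    agi: float       # assumed: adjusted gross income
    afnlwgt: float   # assumed: final sampling weight
    hrswk: int       # assumed: hours worked per week


def is_clean(r: CensusRecord) -> bool:
    """Return True when the record satisfies all four extraction conditions."""
    return r.aage > 16 and r.agi > 100 and r.afnlwgt > 1 and r.hrswk > 0


# Hypothetical records, made up for this illustration only.
records = [
    CensusRecord(aage=39, agi=30000.0, afnlwgt=77516.0, hrswk=40),  # passes all tests
    CensusRecord(aage=15, agi=500.0, afnlwgt=1200.0, hrswk=10),     # fails AAGE > 16
    CensusRecord(aage=45, agi=50.0, afnlwgt=3000.0, hrswk=38),      # fails AGI > 100
    CensusRecord(aage=30, agi=20000.0, afnlwgt=5000.0, hrswk=0),    # fails HRSWK > 0
]

kept = [r for r in records if is_clean(r)]
print(f"{len(kept)} of {len(records)} records kept")
```

Because the four tests are joined with a logical AND, a record is dropped as soon as any single test fails, which is why only the first hypothetical record survives.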