ITS61504 Data Mining
Assignment (Individual)

DUE DATE : 6th November 2020, 8pm
WEIGHTAGE: 20%
SEMESTER : August 2020

STUDENT DECLARATION
1. I confirm that I am aware of the University’s Regulation Governing Cheating in a University Test and
Assignment and of the guidance issued by the School of Computing and IT concerning plagiarism and
proper academic practice, and that the assessed work now submitted is in accordance with this regulation
and guidance.
2. I understand that, unless already agreed with the School of Computing and IT, assessed work may not be
submitted that has previously been submitted, either in whole or in part, at this or any other institution.
3. I recognise that should evidence emerge that my work fails to comply with either of the above declarations,
then I may be liable to proceedings under Regulation.
No | Student Name | Student ID | Date | Signature | Score
1 | | | | |

Marking Rubrics (Lecturer’s Use Only)
Attached as 2nd page in the report
Criteria Weight Score
Question 1 (a) 5
Question 1 (b) 5
Question 1 (c) 5
Question 1 (d) 5
Question 1 (e) 5
Question 1 (f-i) 5
Question 1 (f-ii) 5
Question 2 (a) 5
Question 2 (b) 10
Question 3 (a) 5
Question 3 (b-i) 5
Question 3 (b-ii) 5
Question 3 (b-iii) 5
Question 3 (b-iv) 5
Question 3 (c-i) 5
Question 3 (c-ii) 5
Question 4 15
Total Marks (100%)

Total Marks (20%)

Grading
Excellent >= 90% of the marks
Good < 90% to >= 75% of the marks
Fair < 75% to >= 40% of the marks
Poor < 40% of the marks

Remark

Softcopy
Submission

Assignment 1 Marking Rubrics

Criteria | Weight (%)
Question 1 (a) | 5
Question 1 (b) | 5
Question 1 (c) | 5
Question 1 (d) | 5
Question 1 (e) | 5
Question 1 (f-i) | 5
Question 1 (f-ii) | 5
Question 2 (a) | 5
Question 2 (b) | 10
Question 3 (a) | 5
Question 3 (b-i) | 5
Question 3 (b-ii) | 5
Question 3 (b-iii) | 5
Question 3 (b-iv) | 5
Question 3 (c-i) | 5
Question 3 (c-ii) | 5
Question 4 | 15

Performance levels for Questions 1 to 3:
Excellent (>= 90% of the marks): The question is correctly solved and answered. The solution is clearly elaborated and presented step by step.
Good (< 90% to >= 75% of the marks): The question is correctly solved and answered. The solution is NOT clearly elaborated and presented step by step.
Average (< 75% to >= 40% of the marks): The question is NOT correctly solved and answered. The solution is clearly elaborated and presented step by step.
Poor (< 40% of the marks): The question is NOT correctly solved and answered. The solution is NOT clearly elaborated and presented step by step.

Performance levels for Question 4:
Excellent (>= 90% of the marks): An appropriate result comparison is clearly elaborated and discussed.
Good (< 90% to >= 75% of the marks): An appropriate result comparison is not clearly elaborated and discussed.
Average (< 75% to >= 40% of the marks): An inappropriate result comparison, with elaboration and discussion.
Poor (< 40% of the marks): An inappropriate result comparison, without elaboration and discussion.

Instruction: Please read the following assignment notes, requirements and attached marking rubrics
carefully. Where a task requires it, you may make your own assumptions, provided that you give a valid
justification for them.

Note 1: Copying, cheating, attempts to cheat, plagiarism, collusion and any other attempt to gain an
unfair advantage in assessment will result in 0 marks being awarded to all parties concerned.

Note 2: The Turnitin similarity limit for this assessment is 20% overall and less than 5% from any single
source, excluding program source code.

Note 3: All submitted documents will be cross-checked against other students’ reports from the current and
previous semesters. Any similarity beyond the limits stated in Note 2 will therefore be considered a
violation of the assessment rules, and a Zero (0) mark will be given to all students concerned.

Note 4: Severe disciplinary action will be taken against those caught violating assessment rules such as
colluding, plagiarizing or transcribing.

Note 5: The assignment submission document should be 10 to 20 pages in total, with 1.5 line spacing
and 12pt Times New Roman font.

Note 6: Module Learning Outcome: On completion of this alternative assessment, students should be able
to:
MLO1: Describe data mining activities for various data mining techniques.
MLO2: Analyze and apply appropriate data mining techniques to achieve different purposes using various
types of data.
MLO3: Demonstrate practical data mining skills using data mining tools/languages.

DATA SET

The data was extracted by Barry Becker from the 1994 Census database. A set of reasonably clean
records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)
&& (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year (Dua &
Graff, 2017). Below is a modified sample of the dataset. More records of this modified dataset are
available via the link provided with the assignment.

Age Work_Class Education Marital_Status Sex Hours_Per_week Income
39 State-gov Bachelors Never-married Male 40 <=50K
50 Self-emp-not-inc Bachelors Married-civ-spouse Male 13 <=50K
38 Private HS-grad Divorced Male 40 <=50K
53 ? 11th Married-civ-spouse Male 40 <=50K
28 Private Bachelors Married-civ-spouse Female 40 <=50K
37 Private Masters Married-civ-spouse Female 40 <=50K
49 Private 9th Married-spouse-absent Female 16 <=50K
52 Self-emp-not-inc HS-grad Married-civ-spouse Male 45 >50K
31 ? Masters Never-married Female 50 >50K
42 Private Bachelors Married-civ-spouse Male 40 >50K
37 Private Some-college Married-civ-spouse Male 80 >50K
30 State-gov Bachelors Married-civ-spouse Male 40 >50K
23 Private Bachelors Never-married Female 30 <=50K
32 Private Assoc-acdm Never-married Male 50 <=50K
40 Private Assoc-voc Married-civ-spouse Male 40 >50K
34 Private 7th-8th Married-civ-spouse Male 45 <=50K
25 Self-emp-inc HS-grad Never-married Male 35 <=50K
32 Private HS-grad Never-married Male 40 <=50K
38 Private 11th Married-civ-spouse Male 50 <=50K

Important: All the calculations must be clearly presented step by step.

1. Association rule mining

a) Explain what preprocessing techniques are needed for this dataset. Using those techniques,
clean the data if necessary.
b) Discretize attributes Age and Hours_Per_week into three discrete levels.
c) Manually calculate all itemsets (from one item up to the maximum number of items you can
find) using the Apriori method. Prune the itemsets using a minimum support of 50% and a minimum
confidence of 75%. You can add more records from the modified dataset if required.
d) Describe what will happen if you change minimum support and minimum confidence.
e) List out all association rules (strong rules) with this meta template * => “>50K”
f) Discuss the following issues:
i) Measuring interestingness by confidence is criticized for giving misleading conclusions in
some cases; give an example of this criticism.
ii) R offers different metric types that can be used to sort the best rules; briefly explain them
(e.g. confidence, lift, leverage and conviction) in plain English.
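
The rule metrics named in Question 1 can be illustrated with a short sketch. The example below is written in Python rather than R, and it uses ten hypothetical transactions, not the assignment dataset; the function names are illustrative assumptions. It shows a rule that reaches 75% confidence while its lift is below 1, which is the usual form of the criticism raised in part (f-i).

```python
from fractions import Fraction

# Ten hypothetical transactions (NOT the assignment data).
transactions = (
    [{"Sex=Male", "Income=<=50K"}] * 3
    + [{"Sex=Male", "Income=>50K"}]
    + [{"Sex=Female", "Income=<=50K"}] * 5
    + [{"Sex=Female", "Income=>50K"}]
)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))

def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence relative to how common rhs already is; 1 means independence."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

def leverage(lhs, rhs, transactions):
    """Observed joint support minus the support expected under independence."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint - support(lhs, transactions) * support(rhs, transactions)

def conviction(lhs, rhs, transactions):
    """(1 - support(rhs)) / (1 - confidence); undefined when confidence is 1."""
    return (1 - support(rhs, transactions)) / (1 - confidence(lhs, rhs, transactions))

lhs, rhs = {"Sex=Male"}, {"Income=<=50K"}
print("confidence:", confidence(lhs, rhs, transactions))  # 3/4
print("lift:", lift(lhs, rhs, transactions))              # 15/16
print("leverage:", leverage(lhs, rhs, transactions))      # -1/50
print("conviction:", conviction(lhs, rhs, transactions))  # 4/5
```

In this toy data the consequent already appears in 80% of all transactions, so a confidence of 75% actually indicates a slightly negative association.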

2. Decision tree

a) Give proper labels to the attributes whose values are ranges of numbers.
b) Construct a decision tree manually (based on information gain) and display your computation
steps. Then extract classification rules from the tree that you have created.
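
The two steps in Question 2 (labelling numeric ranges, then choosing splits by information gain) can be sketched as follows. This is a minimal Python illustration, not R: the cut points 30 and 45, the labels and the four rows are hypothetical assumptions, and the helper names are made up for this example.

```python
from collections import Counter
from math import log2

def label_range(value, cut_points, labels):
    """Map a number to a label; cut_points are inclusive upper bounds."""
    for cut, label in zip(cut_points, labels):
        if value <= cut:
            return label
    return labels[-1]  # anything above the last cut point

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical cut points and labels for Age (assumptions, not prescribed).
age_cuts, age_labels = (30, 45), ("young", "middle", "senior")
print(label_range(39, age_cuts, age_labels))  # middle

# Four hypothetical rows (NOT the assignment data).
rows = [
    {"Sex": "Male", "Education": "Bachelors", "Income": ">50K"},
    {"Sex": "Female", "Education": "Bachelors", "Income": ">50K"},
    {"Sex": "Male", "Education": "HS-grad", "Income": "<=50K"},
    {"Sex": "Female", "Education": "HS-grad", "Income": "<=50K"},
]
# Education separates the classes perfectly here; Sex tells us nothing.
print(information_gain(rows, "Education", "Income"))
print(information_gain(rows, "Sex", "Income"))
```

The attribute with the highest gain would be chosen as the split at the current node, and the same calculation is repeated on each resulting subset.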

3. Naïve Bayesian

a) Construct a probability table manually. Note that you are not going to discretize the
Hours_Per_week in this exercise (instead, you will calculate probability by assuming that the
data has a normal distribution).
b) Predict whether a person with the following characteristics would have income higher than
50K or not (clearly display your computation steps):
i) The person is 35 years old, has a Bachelor's degree and works 43 hours per week.
ii) What changes if this person works for the government or for a private company?
iii) What is the effect of marital status on Income?
iv) The person is 55 years old, has no higher education degree (less than a Bachelor's
degree), is married and works 50 hours per week in a private company. Predict
this person's level of income.
c) Discuss the following issues:
i) How do you deal with missing values in Naïve Bayes?
ii) Give example calculations using your own example (e.g. the attribute lucky may be
missing from the record you want to classify).
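
The calculations in Question 3 can be sketched in Python (not R) as follows. All numbers below are hypothetical and are not taken from the assignment dataset, and the function names are illustrative. Two assumptions are made: the standard deviation is the sample standard deviation, and a missing attribute is handled by simply leaving its factor out of the product, which is one common convention rather than the only option.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical hours-per-week observations per class (NOT the assignment data).
hours_by_class = {">50K": [40, 45, 50], "<=50K": [20, 30, 40]}
# Hypothetical categorical likelihoods P(value | class).
cat_likelihood = {
    ">50K": {"Education": {"Bachelors": 0.6, "HS-grad": 0.4}},
    "<=50K": {"Education": {"Bachelors": 0.3, "HS-grad": 0.7}},
}
# Hypothetical class priors.
prior = {">50K": 0.5, "<=50K": 0.5}

def class_score(record, cls):
    """Unnormalised posterior: prior times the likelihood of each known value."""
    score = prior[cls]
    for attr, table in cat_likelihood[cls].items():
        value = record.get(attr)
        if value is not None:  # missing value: omit this factor (assumption)
            score *= table[value]
    hours = record.get("Hours_Per_week")
    if hours is not None:
        values = hours_by_class[cls]
        # Sample standard deviation (n - 1 denominator) is assumed here.
        score *= gaussian_pdf(hours, mean(values), stdev(values))
    return score

def predict(record):
    """Return the class with the largest unnormalised posterior."""
    return max(prior, key=lambda cls: class_score(record, cls))

print(predict({"Education": "Bachelors", "Hours_Per_week": 43}))
print(predict({"Hours_Per_week": 43}))  # Education is missing
```

Because the same factor is dropped for every class, the comparison between classes remains fair when a value is missing.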

4. Compare your result with RStudio

Confirm your computations for the above exercises using RStudio. Print the RStudio outputs and
compare them with your manual computations. Make sure you understand how to read the RStudio output
(e.g. how to read support and confidence from the rules reported by the Apriori package; how to read
the prior and likelihood in Naïve Bayes, etc.).

Submission Requirements

(1) The project report should be submitted in softcopy, in PDF or DOC format. The dataset
(or its URL) and the R script files must also be submitted. The expected
components are:
a) Cover page
b) Marking Rubric
c) Report content (dataset, Q1, Q2, Q3 and Q4).
d) R code (included in the project report and also attached as R files with the
submission)
e) References (list all the references that you have used at the end of the report, and
cite each of them in the text)

(2) Softcopy version to be submitted online to TIMeS before 6th November 2020, 8pm.
(3) The student may need to present his/her work if clarification is needed on the
submitted report or attached files. In this case, the student will be informed at least 24
hours before the presentation time.

References:
Dua, D., & Graff, C. (2017). UCI Machine Learning Repository. Retrieved from
http://archive.ics.uci.edu/ml
