# FIT3152 Data analytics – 2023

Faculty of Information Technology
Quiz and Practical Activity – Sample Answers
Your task: You will be given a set of multiple choice and longer questions to answer. The questions will cover topics taught during Weeks 1 – 9.

Value and Structure: This assignment is worth 20% of your total marks for the unit. It has 30 marks in total, comprising:
- 6 multiple choice questions of 1 Mark each,
- 3 free responses of 2 Marks each, and
- 3 grouped free responses of 6 Marks each.

Time: You will have 1 Hour during tutorial time to complete the test.

Due Date: Your scheduled tutorial during Week 11

Submission: Via Moodle Quiz

Generative AI Use: In this assessment, you must not use generative artificial intelligence (AI) to generate any materials or content in relation to the assessment task.

Late Penalties: This activity can only be deferred/re-scheduled on medical or other serious grounds with relevant documentation.

Instructions:
- Answer the questions on the Moodle Quiz.
- The activity is closed book, so lecture and tutorial notes and online references are not permitted.
- You may use any calculator (physical or digital).
- You must keep your camera on if you are in an online tutorial.

NOTE: You will be asked to stop this activity early and submit what you have done if:
- You are found to be using any software other than that permitted.
- You are found to be accessing web sites or online resources other than the Moodle Quiz.
- You are found to be communicating with any other student.
- You are found to be cheating in any way.
Multiple Choice (1 Mark)
The following points (P1 – P6) are to be clustered using hierarchical clustering and applying MIN to
the distance matrix below. Which pair of points are in the first merge?
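A. P2, P4
B. P3, P4
C. P1, P6
D. P1, P4
E. P4, P5

|    | P1  | P2  | P3  | P4  | P5  | P6  |
|----|-----|-----|-----|-----|-----|-----|
| P1 | 0.0 | 0.4 | 2.5 | 1.5 | 1.4 | 0.2 |
| P2 |     | 0.0 | 0.4 | 3.9 | 1.7 | 0.6 |
| P3 |     |     | 0.0 | 2.8 | 0.8 | 1.9 |
| P4 |     |     |     | 0.0 | 0.1 | 2.0 |
| P5 |     |     |     |     | 0.0 | 1.3 |
| P6 |     |     |     |     |     | 0.0 |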
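
Under MIN (single linkage), the first merge joins the pair with the smallest pairwise distance. The following is a minimal R sketch that rebuilds the distance matrix above and checks this with base R's `hclust`; the variable names (`pts`, `d`, `hc`, `first_merge`) are illustrative and not part of the quiz.

```r
# Point labels and the upper-triangular distances from the question.
pts <- paste0("P", 1:6)
d <- matrix(c(
  0, 0.4, 2.5, 1.5, 1.4, 0.2,
  0, 0,   0.4, 3.9, 1.7, 0.6,
  0, 0,   0,   2.8, 0.8, 1.9,
  0, 0,   0,   0,   0.1, 2.0,
  0, 0,   0,   0,   0,   1.3,
  0, 0,   0,   0,   0,   0
), nrow = 6, byrow = TRUE, dimnames = list(pts, pts))

# Mirror the upper triangle so the matrix is symmetric.
d <- d + t(d)

# Single linkage corresponds to MIN.
hc <- hclust(as.dist(d), method = "single")

# Row 1 of hc$merge holds the first merge; negative entries are single points.
first_merge <- sort(pts[abs(hc$merge[1, ])])
print(first_merge)
```

The smallest off-diagonal entry is 0.1, between P4 and P5, which is the pair the sketch reports.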
Multiple Choice (1 Mark)
The table below shows a classification model for 10 customers based on whether or not they did
buy a new product (did buy = 1, did not buy = 0), and the confidence level of the prediction.
| Customer | Confidence-buy | Did-buy |
|----------|----------------|---------|
| C01      | 0.8823         | 0       |
| C02      | 0.5547         | 0       |
| C03      | 0.6469         | 1       |
| C04      | 0.1252         | 0       |
| C05      | 0.7050         | 0       |
| C06      | 0.7065         | 1       |
| C07      | 0.1441         | 0       |
| C08      | 0.7398         | 1       |
| C09      | 0.7865         | 1       |
| C10      | 0.4874         | 0       |

What is the lift value if you target the top 50% of customers that the classifier is most confident of?
A. 0.2
B. 0.5
C. 1.5
D. 2.0
E. 2.5
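
Lift compares the purchase rate among the targeted customers with the purchase rate across all customers. A minimal R sketch of that calculation for the table above follows; the variable names are illustrative assumptions.

```r
# Confidence and outcome for customers C01..C10, copied from the table.
confidence <- c(0.8823, 0.5547, 0.6469, 0.1252, 0.7050,
                0.7065, 0.1441, 0.7398, 0.7865, 0.4874)
did_buy <- c(0, 0, 1, 0, 0, 1, 0, 1, 1, 0)

# Indices of the 5 customers (top 50%) with the highest confidence.
top <- order(confidence, decreasing = TRUE)[1:5]

rate_top <- mean(did_buy[top])  # buyers among the targeted customers: 3 of 5
rate_all <- mean(did_buy)       # buyers among all customers: 4 of 10
lift <- rate_top / rate_all
print(lift)
```

The targeted group contains C01, C09, C08, C06 and C05, of whom three bought, so the lift is 0.6 / 0.4, approximately 1.5.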
Multiple Choice (1 Mark)
The ROC chart for a classification problem is given below.
Give an estimate of classifier performance (AUC).
AUC = 0.7917 (Exact)
A. 0.1
B. 0.2
C. 0.5
D. 0.6
E. 0.8
Multiple Choice (1 Mark)
15 observations were sampled at random from the Iris data set. The dendrogram resulting from
clustering, based on their sepal and petal measurements, is below.
What is the smallest number of clusters that would put all of species Setosa (observations 1:50) in a cluster of their own?
A. 1
B. 2
C. 3
D. 5
E. 15
Multiple Choice (1 Mark)
Predict the output from the following commands:
> X <- c(1, 2)
> Y <- c(3, 4)
> X + Y
A. 4, 6
B. 3, 7
C. 10
D. 1234
E. 1, 2, 3, 4
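
R applies `+` element by element to vectors of equal length. The short sketch below runs the commands from the question and, for contrast, shows two operations that would produce some of the other options; the names `element_sum`, `total` and `combined` are illustrative.

```r
X <- c(1, 2)
Y <- c(3, 4)

element_sum <- X + Y   # element-wise: c(1 + 3, 2 + 4)
total <- sum(X, Y)     # a single number, shown for contrast
combined <- c(X, Y)    # concatenation, shown for contrast

print(element_sum)
```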
Multiple Choice (1 Mark)
An artificial neural network (ANN) is to be used to classify whether or not to Buy a certain product
based on Popularity, Sales and Performance. An extract of the data is below.
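(a) How many input nodes does the ANN require for this problem? [1 Mark]
Popularity is categorical with three levels (low, medium, high), so it needs 3 indicator nodes; Sales and Performance are numeric and need 1 node each.
Pop (3) + Sales (1) + Perf (1) = 5
A. 1
B. 2
C. 3
D. 4
E. 5

| ID  | Popularity | Sales  | Performance | Buy   |
|-----|------------|--------|-------------|-------|
| 1   | low        | 330000 | 0.87        | Maybe |
| 2   | medium     | 40000  | 0.22        | No    |
| 3   | low        | 50000  | NA          | Yes   |
| 4   | high       | 30000  | 0           | Yes   |
| 5   | low        | 100000 | 0.1         | No    |
| 6   | medium     | NA     | 0.06        | No    |
| ... | ...        | ...    | ...         | ...   |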
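
The input-node count can be checked by expanding the categorical attribute into indicator columns. This is a minimal sketch using only the six rows shown in the extract; the data frame name `extract` and the one-column-per-level encoding are assumptions made for illustration, not necessarily the encoding used in the unit.

```r
# The six rows shown in the extract (Buy is the target, so it is left out).
extract <- data.frame(
  Popularity  = factor(c("low", "medium", "low", "high", "low", "medium")),
  Sales       = c(330000, 40000, 50000, 30000, 100000, NA),
  Performance = c(0.87, 0.22, NA, 0, 0.1, 0.06)
)

# One 0/1 indicator column per Popularity level.
pop_indicators <- sapply(levels(extract$Popularity),
                         function(l) as.integer(extract$Popularity == l))

# Numeric attributes each contribute one input.
inputs <- cbind(pop_indicators,
                Sales = extract$Sales,
                Performance = extract$Performance)

n_input_nodes <- ncol(inputs)
print(n_input_nodes)
```

Three indicator columns plus two numeric columns give five inputs, matching the working above.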
Free Response (2 Marks)
The table below shows a classification model for 10 customers based on whether or not they did
buy a new product (did buy = 1, did not buy = 0), and the confidence level of the prediction.
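| Customer | Confidence-buy | Did-buy | 50%CL |
|----------|----------------|---------|-------|
| C01      | 0.8823         | 0       | 1     |
| C02      | 0.5547         | 0       | 1     |
| C03      | 0.6469         | 1       | 1     |
| C04      | 0.1252         | 0       | 0     |
| C05      | 0.7050         | 0       | 1     |
| C06      | 0.7065         | 1       | 1     |
| C07      | 0.1441         | 0       | 0     |
| C08      | 0.7398         | 1       | 1     |
| C09      | 0.7865         | 1       | 1     |
| C10      | 0.4874         | 0       | 0     |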
If a confidence level of 50% or greater is required for a positive classification, what is the Accuracy
of the model?
TP = 4; FP = 3; TN = 3; FN = 0 [1 Mark all correct]
Acc = (TP + TN)/(TP+FP+TN+FN) = 7/10 = 0.7 [1 Mark or H]
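
The confusion-matrix counts can be reproduced by thresholding the confidence values at 0.5. The following R sketch does this for the ten customers in the table; the variable names are illustrative.

```r
# Confidence and outcome for customers C01..C10, copied from the table.
confidence <- c(0.8823, 0.5547, 0.6469, 0.1252, 0.7050,
                0.7065, 0.1441, 0.7398, 0.7865, 0.4874)
did_buy <- c(0, 0, 1, 0, 0, 1, 0, 1, 1, 0)

# Positive classification when confidence is 50% or greater.
predicted <- as.integer(confidence >= 0.5)

TP <- sum(predicted == 1 & did_buy == 1)
FP <- sum(predicted == 1 & did_buy == 0)
TN <- sum(predicted == 0 & did_buy == 0)
FN <- sum(predicted == 0 & did_buy == 1)

accuracy <- (TP + TN) / (TP + FP + TN + FN)
print(c(TP = TP, FP = FP, TN = TN, FN = FN, accuracy = accuracy))
```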
Free Response (2 Marks)
A k-Means clustering algorithm is fitted to the iris data, as shown below.
rm(list = ls())
data("iris")
ikfit = kmeans(iris[,1:2], 4, nstart = 10)
ikfit
table(actual = iris\$Species, fitted = ikfit\$cluster)
Based on the R code and output below, answer the following questions.
> ikfit
K-means clustering with 4 clusters of sizes 24, 53, 41, 32
Cluster means:
Sepal.Length Sepal.Width
1 4.766667 2.891667
2 5.924528 2.750943
3 6.880488 3.097561
4 5.187500 3.637500
Within cluster sum of squares by cluster:
[1] 4.451667 8.250566 10.634146 4.630000
(between_SS / total_SS = 78.6 %)
> table(actual = iris\$Species, fitted = ikfit\$cluster)
fitted
actual 1 2 3 4
setosa 18 0 0 32
versicolor 5 34 11 0
virginica 1 19 30 0
If clustering was used to discriminate between the irises, what would be the accuracy of the
model? Explain your reasoning.
Assign each cluster to the species that contributes the greatest number of its members. Treat those members as correctly classified and then work out accuracy as usual. [1 Mark]
For example: assume C1 and C4 are setosa, C2 is versicolor, C3 is virginica. Correctly classified = 18 + 34 + 30 + 32 = 114; Total = 150; Accuracy = 114/150 = 0.76. Accept any reasonable similar approach. [1 Mark]
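
The majority-species approach can be applied directly to the contingency table printed above. This is a minimal sketch that re-enters the table by hand rather than re-running `kmeans` (cluster numbering can differ between runs); the names `tab`, `correct` and `accuracy` are illustrative.

```r
# Contingency table of actual species (rows) against fitted cluster (columns),
# copied from the output above.
tab <- matrix(c(18,  0,  0, 32,
                 5, 34, 11,  0,
                 1, 19, 30,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(actual = c("setosa", "versicolor", "virginica"),
                              fitted = 1:4))

# Label each cluster with its most common species and count those members.
correct <- sum(apply(tab, 2, max))
accuracy <- correct / sum(tab)
print(accuracy)
```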
Free Response (2 Marks)
Use the data below and Naïve Bayes classification to predict whether the following test instance
will be happy or not.
Test instance: (Age Range = young, Occupation = professor, Gender = F, Happy = ? )
YES: P(yes) × P(young|Y) × P(professor|Y) × P(F|Y) = 0.5 × 0.250 × 0.250 × 0.500 = 0.016
NO: P(no) × P(young|N) × P(professor|N) × P(F|N) = 0.5 × 0.250 × 0.250 × 0.750 = 0.023
Correct calculations [1 Mark]
The NO product is larger, so classify as Happy = No [1 Mark or H]
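| ID | Age Range   | Occupation | Gender | Happy |
|----|-------------|------------|--------|-------|
| 1  | Young       | Tutor      | F      | Yes   |
| 2  | Middle-aged | Professor  | F      | No    |
| 3  | Old         | Tutor      | M      | Yes   |
| 4  | Middle-aged | Professor  | M      | Yes   |
| 5  | Old         | Tutor      | F      | Yes   |
| 6  | Young       | Lecturer   | M      | No    |
| 7  | Middle-aged | Lecturer   | F      | No    |
| 8  | Old         | Tutor      | F      | No    |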
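
The Naïve Bayes products can be checked by counting within each class. The sketch below re-enters the eight rows in lower case and multiplies the class prior by the three conditional proportions; the data frame `survey` and the helper `nb_score` are hypothetical names introduced for illustration.

```r
# The eight training rows from the table (values lower-cased for matching).
survey <- data.frame(
  Age        = c("young", "middle-aged", "old", "middle-aged",
                 "old", "young", "middle-aged", "old"),
  Occupation = c("tutor", "professor", "tutor", "professor",
                 "tutor", "lecturer", "lecturer", "tutor"),
  Gender     = c("F", "F", "M", "M", "F", "M", "F", "F"),
  Happy      = c("Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No")
)

# Prior times the conditional proportions for the test instance
# (young, professor, F) within the given class.
nb_score <- function(class) {
  rows <- survey[survey$Happy == class, ]
  prior <- nrow(rows) / nrow(survey)
  prior *
    mean(rows$Age == "young") *
    mean(rows$Occupation == "professor") *
    mean(rows$Gender == "F")
}

scores <- c(Yes = nb_score("Yes"), No = nb_score("No"))
prediction <- names(which.max(scores))
print(round(scores, 3))
print(prediction)
```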
Free Response (6 Marks)
The DunHumby (DH) data frame records the Date a Customer shops at a store, the number of Days
since their last shopping visit, and the amount Spent, for 20 customers. The first 4 rows are shown below.
> head(DH)
customer_id visit_date visit_delta visit_spend
<int> <chr> <int> <dbl>
1 40 04-04-10 NA 44.8
2 40 06-04-10 2 69.7
3 40 19-04-10 13 44.6
4 40 01-05-10 12 30.4
The following R code is run:
DHY = DH[as.Date(DH\$visit_date,"%d-%m-%y") < as.Date("01-01-11","%d-%m-%y"),]
CustSpend = as.table(by(DHY\$visit_spend, DHY\$customer_id, sum))
CustSpend = sort(CustSpend, decreasing = TRUE)
CustSpend = head(CustSpend, 12)
CustSpend = as.data.frame(CustSpend)
colnames(CustSpend) = c("customer_id", "amtspent")
DHYZ = DHY[(DHY\$customer_id %in% CustSpend\$customer_id),]
write.csv(DHYZ, "DHYZ.csv", row.names = FALSE)
g = ggplot(data = DHYZ) + geom_histogram(mapping = aes(x = visit_spend)) +
facet_wrap(~ customer_id, nrow = 3)
Describe the data contained in the data frame “CustSpend.” [2 Marks]
Total spend for each customer on visits before 1 January 2011 [1 Mark]
For the top 12 customers by total spend [1 Mark]
Describe the data contained in the data frame “DHYZ.” [2 Marks]
Rows of the DH data frame with visit dates before 1 January 2011 [1 Mark]
For the top 12 customers (those in CustSpend) [1 Mark]
Describe the contents of the graphic shown by plot “g.” [2 Marks]
Histogram of visit spend [1 Mark]
Faceted by customer (for the top 12), in 3 rows [1 Mark]
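
To see what each step of the pipeline keeps, it can help to run the same filtering on a tiny data set. The sketch below uses hypothetical toy data (three made-up customers, not the real DH file), keeps the top 2 customers instead of the top 12, and uses `tapply` in place of `by` for brevity; the plotting step is omitted.

```r
# Hypothetical toy data, invented for illustration only.
DH <- data.frame(
  customer_id = c(1, 1, 2, 2, 3, 3),
  visit_date  = c("04-04-10", "02-01-11", "05-05-10",
                  "06-06-10", "07-07-10", "08-08-10"),
  visit_spend = c(10, 500, 20, 30, 5, 3)
)

# Keep only visits before 1 January 2011.
cutoff <- as.Date("01-01-11", "%d-%m-%y")
DHY <- DH[as.Date(DH$visit_date, "%d-%m-%y") < cutoff, ]

# Total spend per customer, largest first, then the top 2 customer ids.
totals <- sort(tapply(DHY$visit_spend, DHY$customer_id, sum), decreasing = TRUE)
top_ids <- names(head(totals, 2))

# All pre-2011 visit rows belonging to those top customers.
DHYZ <- DHY[DHY$customer_id %in% top_ids, ]
print(DHYZ)
```

In the toy data, customer 1's large 2011 purchase is excluded by the date filter, so the ranking is based only on the earlier visits.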
Free Response (6 Marks)
A World Health study is examining how life expectancy varies between men and women in different
countries and at different times in history. The table below shows a sample of the data that has
been recorded. There are approximately 15,000 records in all.
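| Country     | Year of Birth | Gender | Age at Death |
|-------------|---------------|--------|--------------|
| Australia   | 1818          | M      | 9            |
| Afghanistan | 1944          | F      | 40           |
| USA         | 1846          | F      | 12           |
| India       | 1926          | F      | 6            |
| China       | 1860          | F      | 32           |
| India       | 1868          | M      | 54           |
| Australia   | 1900          | F      | 37           |
| China       | 1875          | F      | 75           |
| England     | 1807          | M      | 15           |
| France      | 1933          | M      | 52           |
| Egypt       | 1836          | M      | 19           |
| USA         | 1906          | M      | 58           |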
Using one of the graphic types from the Visualization Zoo (see formulae and references for a list of
types), or another graph type of your choosing, suggest a suitable graphic to help the researcher
display as many variables as clearly as possible.
Explain your decision. Which graph elements correspond to the variables you want to display?
Appropriate main graphic by name [1 Mark]
For example, scatter plot or heat map. Accept another type with justification.
Mapping of variables to graphic (Country) [1 Mark]
Age at death and other variables are grouped by country using colour, position or labels. Other mapping with justification.
Mapping of variables to graphic (Year of birth) [1 Mark]
Year of birth is position or panel. Other mapping with justification.
Mapping of variables to graphic (Gender) [1 Mark]
Panel, position or colour. Other mapping with justification.
Mapping of variables to graphic (Age at death) [1 Mark]
Size, colour or position. Other mapping with justification.
Data reduction or summary calculation [1 Mark]
How data is grouped and reduced. Averaging etc.
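
One possible data reduction is to average age at death within groups before plotting. The sketch below does this by gender for the 12 sample rows only, using base R's `aggregate`; the data frame name `who` and the choice of grouping are assumptions, and with the full data set one would more likely group by country and period as well.

```r
# The 12 sample rows from the table.
who <- data.frame(
  Country = c("Australia", "Afghanistan", "USA", "India", "China", "India",
              "Australia", "China", "England", "France", "Egypt", "USA"),
  Year    = c(1818, 1944, 1846, 1926, 1860, 1868,
              1900, 1875, 1807, 1933, 1836, 1906),
  Gender  = c("M", "F", "F", "F", "F", "M", "F", "F", "M", "M", "M", "M"),
  Age     = c(9, 40, 12, 6, 32, 54, 37, 75, 15, 52, 19, 58)
)

# Mean age at death for each gender.
by_gender <- aggregate(Age ~ Gender, data = who, FUN = mean)
print(by_gender)
```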
Free Response (6 Marks)
A researcher wants to predict the prevalence of crime in towns, using the following data.
Crm: Crime rate in the town;
Ind: Proportion of the town zoned industrial;
Pol: Air pollution in the town (ppm);
Rms: Number of main rooms in the house;
Tax: Land tax paid (\$);
Str: Student to teacher ratio in local schools;
Zone: Socio-economic zone of house location;
Val: Value of the house (\$000).
> head(Cdata)
Crm Ind Pol Rms Tax Str Zone Val
1 0.00632 2.31 0.538 6 296 15.3 0 2400
2 0.02731 7.07 0.469 6 242 17.8 1 2160
3 0.02729 7.07 0.469 7 242 17.8 0 3470
4 0.03237 2.18 0.458 6 222 18.7 0 3340
Based on the R code and output below, answer the following questions.
> contrasts(Cdata\$Zone) = contr.treatment(3)
> Crime = lm(Crm~.,data = Cdata); summary(Crime)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.162875 5.324457 -0.97 0.333
Ind -0.160716 0.078481 -2.05 0.041 *
Pol 4.791271 4.443372 1.08 0.281
Rms 0.051432 0.500037 0.10 0.918
Tax 0.025699 0.002902 8.86 <2e-16 ***
Str 0.041439 0.177346 0.23 0.815
Zone1 -1.843825 1.198360 -1.54 0.125
Zone2 3.244316 1.702931 1.91 0.057 .
Val -0.001216 0.000582 -2.09 0.037 *
---
> contrasts(Cdata\$Zone)
2 3
0 0 0
1 1 0
How does the proportion of the town zoned industrial affect crime rate? How reliable is the
evidence?
Increasing the proportion zoned industrial reduces the crime rate (coefficient -0.161). [1 Mark]
Reasonably reliable: the coefficient is significant at the 5% level (p = 0.041). [1 Mark]
How does air pollution affect crime rates? How reliable is the evidence?
The coefficient is positive (4.79), but the effect cannot be distinguished from zero. [1 Mark]
Reliability is low: the coefficient is not significant (p = 0.281). [1 Mark]
Why is Zone ‘0’ not defined in the regression output? How is it included in the model?
Zone 0 is the baseline (reference) level of the treatment contrast, so its coefficient is fixed at 0. [1 Mark]
It is implicitly included in the intercept. [1 Mark]
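
The role of the baseline level can be seen in the design matrix that `lm` builds. The sketch below uses a small made-up factor with levels 0, 1 and 2 (not the Cdata values) and the same `contr.treatment(3)` coding; the names `zone` and `X` are illustrative.

```r
# Hypothetical zone values, invented for illustration only.
zone <- factor(c(0, 1, 2, 0, 1, 2))
contrasts(zone) <- contr.treatment(3)

# Design matrix: an intercept column plus one indicator per non-baseline level.
X <- model.matrix(~ zone)
print(X)

# Rows for the baseline level have zeros in every indicator column,
# so their fitted value comes from the intercept alone.
baseline_rows <- X[zone == "0", -1, drop = FALSE]
print(baseline_rows)
```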
Free Response (6 Marks) – Extra Example!!
The table below shows the survey results from 12 people, who were asked whether they would
accept a job offer based on the attributes: Salary, Distance, and Social. We want to build a
decision tree to assist with future decisions of whether a person would accept a Job or not.
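| ID | Salary | Distance | Social | Job |
|----|--------|----------|--------|-----|
| 1  | Medium | Far      | Poor   | No  |
| 2  | High   | Far      | Good   | Yes |
| 3  | Low    | Near     | Poor   | No  |
| 4  | Medium | Moderate | Good   | Yes |
| 5  | High   | Far      | Poor   | Yes |
| 6  | Medium | Far      | Good   | Yes |
| 7  | Medium | Moderate | Poor   | No  |
| 8  | Medium | Near     | Good   | Yes |
| 9  | High   | Moderate | Poor   | Yes |
| 10 | Medium | Near     | Poor   | Yes |
| 11 | Medium | Moderate | Poor   | Yes |
| 12 | Low    | Moderate | Good   | No  |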
What is the entropy of Job?
Yes = 8 instances, No = 4 instances. [1 Mark]
Entropy = -8/12 log2(8/12) - 4/12 log2(4/12) = 0.9183. [1 Mark]
Without calculating information gain, which attribute would you choose to be the root of the
decision tree? Explain why.
Salary [1 Mark]
Purest leaves: the High and Low branches are homogeneous (all Yes and all No respectively). [1 Mark]
What is the information gain of the attribute you chose for the previous question?
Entropy(Salary = high) = 0; Entropy(Salary = low) = 0
Entropy(Salary = medium) = -5/7 log2(5/7) - 2/7 log2(2/7) = 0.8631 [1 Mark]
Expected entropy(Salary) = 7/12 × 0.8631 = 0.5035 [1 Mark or H]
Information gain = 0.9183 - 0.5035 = 0.4148 [1 Mark or H up to max of 2 Marks]
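
The entropy and information-gain figures can be recomputed from the table. This is a minimal sketch; the helper functions `entropy` and `info_gain` are hypothetical names written for this example, not functions from an R package.

```r
# Salary and Job columns for IDs 1..12, copied from the table.
salary <- c("Medium", "High", "Low", "Medium", "High", "Medium",
            "Medium", "Medium", "High", "Medium", "Medium", "Low")
job <- c("No", "Yes", "No", "Yes", "Yes", "Yes",
         "No", "Yes", "Yes", "Yes", "Yes", "No")

# Entropy of a vector of class labels (character, so no empty classes).
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  -sum(p * log2(p))
}

# Parent entropy minus the size-weighted entropy of each attribute group.
info_gain <- function(attribute, labels) {
  groups <- split(labels, attribute)
  weights <- sapply(groups, length) / length(labels)
  child <- sapply(groups, entropy)
  entropy(labels) - sum(weights * child)
}

job_entropy <- entropy(job)
salary_gain <- info_gain(salary, job)
print(round(c(entropy = job_entropy, gain = salary_gain), 4))
```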
Formulas and references
The Visualization Zoo – Graphic Types
Time-Series Data: Index Charts, Stacked Graphs, Small Multiples, Horizon Graphs
Statistical Distributions: Stem-and-Leaf Plots, Q-Q Plots, SPLOM, Parallel Coordinates
Maps: Flow Maps, Choropleth Maps, Graduated Symbol Maps, Cartograms
Hierarchies: Node-Link Diagrams, Adjacency Diagrams, Enclosure Diagrams
Networks: Force-Directed Layouts, Arc Diagrams, Matrix Views
Entropy
If $S$ is an arbitrary collection of examples with a binary class attribute, then:
$$\mathrm{Entropy}(S) = -p_{c1}\log_2(p_{c1}) - p_{c2}\log_2(p_{c2}) = -\frac{n_{c1}}{n}\log_2\left(\frac{n_{c1}}{n}\right) - \frac{n_{c2}}{n}\log_2\left(\frac{n_{c2}}{n}\right)$$
where $c1$ and $c2$ are the two classes, $p_{c1}$ and $p_{c2}$ are the probabilities of being in Class 1 or Class 2 respectively, $n_{c1}$ and $n_{c2}$ are the numbers of examples in each class, and $n$ is the total number of examples.
Note: $\log_2 x = \frac{\log_{10} x}{\log_{10} 2} = \frac{\log_{10} x}{0.301}$
Information gain
The $\mathrm{Gain}(S, A)$ of an attribute $A$ relative to a collection of examples, $S$, with $v$ groups having $|S_v|$ elements is:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v)$$
Accuracy
$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}$$
ROC
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}$$
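
As a worked instance of the ROC rates, the sketch below plugs in the confusion-matrix counts from the 50% confidence-level question earlier in this paper (TP = 4, FP = 3, TN = 3, FN = 0). The function name `roc_point` is a hypothetical helper introduced for illustration.

```r
# True positive rate and false positive rate from confusion-matrix counts.
roc_point <- function(TP, FP, TN, FN) {
  c(TPR = TP / (TP + FN), FPR = FP / (FP + TN))
}

# Counts from the 50% confidence-level classification of the 10 customers.
point <- roc_point(TP = 4, FP = 3, TN = 3, FN = 0)
print(point)
```

Each classification threshold gives one such (FPR, TPR) point; the ROC curve is traced out by varying the threshold.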
Naïve Bayes’
For events $E_1, E_2, \dots, E_n$ and class $C$, the classification probability is
$$P(C_i \mid E_1 \cap E_2 \dots \cap E_n) = \frac{P(C_i) \cdot P(E_1 \cap E_2 \dots \cap E_n \mid C_i)}{P(E_1 \cap E_2 \dots \cap E_n)}$$
For Bayesian classification, a new point is classified to $C_i$ if $P(C_i) \cdot P(E_1 \mid C_i) \cdot P(E_2 \mid C_i) \cdot \dots \cdot P(E_n \mid C_i)$ is maximised.
Naïve Bayes assumes $P(A \cap B) = P(A) \cdot P(B)$ etc.
