Faculty of Information Technology
FIT3152 Data analytics – 2023
Quiz and Practical Activity – Sample Answers
Your task
You will be given a set of multiple choice and longer questions to answer.
The questions will cover topics taught during Weeks 1 – 9.
Value and Structure
This assignment is worth 20% of your total marks for the unit.
It has 30 marks in total, comprised of
6 multiple choice questions of 1 Mark each,
3 free responses of 2 Marks each, and
3 grouped free responses of 6 Marks each.
Time
You will have 1 Hour during tutorial time to complete the test.
Due Date
Your scheduled tutorial during Week 11
Submission
Via Moodle Quiz
Generative AI Use
In this assessment, you must not use generative artificial intelligence (AI) to
generate any materials or content in relation to the assessment task.
Late Penalties
This activity can only be deferred/re-scheduled on medical or other serious
grounds with relevant documentation.
Instructions
Answer the questions on the Moodle Quiz.
The activity is closed book, therefore lecture and tutorial notes or online
references are not permitted.
You may use any calculator (physical or digital).
You must keep your camera on if you are in an online tutorial.
NOTE
You will be asked to stop this activity early and submit what you have done if:
You are found to be using any software other than that permitted.
You are found to be accessing web sites or online resources other than the
Moodle Quiz.
You are found to be communicating with any other student.
You are found to be cheating in any way.
Multiple Choice (1 Mark)
The following points (P1 – P6) are to be clustered using hierarchical clustering and applying MIN to
the distance matrix below. Which pair of points are in the first merge?
A. P2, P4
B. P3, P4
C. P1, P6
D. P1, P4
E. P4, P5
     P1   P2   P3   P4   P5   P6
P1  0.0  0.4  2.5  1.5  1.4  0.2
P2       0.0  0.4  3.9  1.7  0.6
P3            0.0  2.8  0.8  1.9
P4                 0.0  0.1  2.0
P5                      0.0  1.3
P6                           0.0
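As a sketch, the merge order can be checked in R: `hclust` with `method = "single"` implements MIN linkage, applied here to the distance matrix from the question.

```r
# Distance matrix from the question, filled out symmetrically.
d <- matrix(c(0.0, 0.4, 2.5, 1.5, 1.4, 0.2,
              0.4, 0.0, 0.4, 3.9, 1.7, 0.6,
              2.5, 0.4, 0.0, 2.8, 0.8, 1.9,
              1.5, 3.9, 2.8, 0.0, 0.1, 2.0,
              1.4, 1.7, 0.8, 0.1, 0.0, 1.3,
              0.2, 0.6, 1.9, 2.0, 1.3, 0.0),
            nrow = 6, dimnames = list(paste0("P", 1:6), paste0("P", 1:6)))
fit <- hclust(as.dist(d), method = "single")  # "single" = MIN linkage
fit$merge[1, ]  # first merge is the pair at minimum distance: P4 and P5 (0.1)
```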
Multiple Choice (1 Mark)
The table below shows a classification model for 10 customers based on whether or not they did
buy a new product (did buy = 1, did not buy = 0), and the confidence level of the prediction.
Customer  Confidence-buy  Did-buy
C01       0.8823          0
C02       0.5547          0
C03       0.6469          1
C04       0.1252          0
C05       0.7050          0
C06       0.7065          1
C07       0.1441          0
C08       0.7398          1
C09       0.7865          1
C10       0.4874          0
What is the lift value if you target the top 50% of customers that the classifier is most confident of?
A. 0.2
B. 0.5
C. 1.5
D. 2.0
E. 2.5
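A sketch of the lift calculation in R: rank customers by confidence, take the top half, and compare the response rate in that target group to the overall response rate.

```r
# Confidence and actual purchase columns from the table above.
conf   <- c(0.8823, 0.5547, 0.6469, 0.1252, 0.7050,
            0.7065, 0.1441, 0.7398, 0.7865, 0.4874)
didbuy <- c(0, 0, 1, 0, 0, 1, 0, 1, 1, 0)
top  <- order(conf, decreasing = TRUE)[1:5]  # top 50% by confidence
lift <- mean(didbuy[top]) / mean(didbuy)     # rate in target / overall rate
lift  # (3/5) / (4/10) = 1.5
```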
Multiple Choice (1 Mark)
The ROC chart for a classification problem is given below.
Give an estimate of classifier performance (AUC).
AUC = 0.7917 (Exact)
A. 0.1
B. 0.2
C. 0.5
D. 0.6
E. 0.8
Multiple Choice (1 Mark)
15 observations were sampled at random from the Iris data set. The dendrogram resulting from
clustering, based on their sepal and petal measurements, is below.
What is the smallest number of clusters that would put all of the sampled Setosa
observations (labels in 1:50) in a cluster of their own?
A. 1
B. 2
C. 3
D. 5
E. 15
Multiple Choice (1 Mark)
Predict the output from the following commands:
> X <- c(1, 2)
> Y <- c(3, 4)
> X + Y
A. 4, 6
B. 3, 7
C. 10
D. 1234
E. 1, 2, 3, 4
Multiple Choice (1 Mark)
An artificial neural network (ANN) is to be used to classify whether or not to Buy a certain product
based on Popularity, Sales and Performance. An extract of the data is below.
(a)
How many input nodes does the ANN require for this problem? [1 Mark]
Pop (3) + Sales (1) + Perf (1) = 5
A. 1
B. 2
C. 3
D. 4
E. 5
ID  Popularity  Sales    Performance  Buy
1   low         330000   0.87         Maybe
2   medium      40000    0.22         No
3   low         50000    NA           Yes
4   high        30000    0            Yes
5   low         100000   0.1          No
6   medium      NA       0.06         No
... ...         ...      ...          ...
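The node count can be sketched by one-hot encoding the data in R with `model.matrix`: the three-level Popularity factor expands to three input columns, while the two numeric predictors take one each.

```r
# Toy rows with the same column types as the extract above.
df <- data.frame(Popularity  = factor(c("low", "medium", "high")),
                 Sales       = c(330000, 40000, 30000),
                 Performance = c(0.87, 0.22, 0))
# "- 1" drops the intercept, so the factor gets full one-hot (3 dummy columns).
X <- model.matrix(~ Popularity + Sales + Performance - 1, data = df)
ncol(X)  # 3 (Popularity) + 1 (Sales) + 1 (Performance) = 5 input nodes
```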
Free Response (2 Marks)
The table below shows a classification model for 10 customers based on whether or not they did
buy a new product (did buy = 1, did not buy = 0), and the confidence level of the prediction.
Customer  Confidence-buy  Did-buy  50%CL
C01       0.8823          0        1
C02       0.5547          0        1
C03       0.6469          1        1
C04       0.1252          0        0
C05       0.7050          0        1
C06       0.7065          1        1
C07       0.1441          0        0
C08       0.7398          1        1
C09       0.7865          1        1
C10       0.4874          0        0
If a confidence level of 50% or greater is required for a positive classification, what is the Accuracy
of the model?
TP = 4; FP = 3; TN = 3; FN = 0
[1 Mark all correct]
Acc = (TP + TN)/(TP+FP+TN+FN) = 7/10
[1 Mark or H]
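The counts can be verified with a short R sketch: threshold the confidence at 0.5 and cross-tabulate predictions against the Did-buy column.

```r
# Columns from the table above.
conf   <- c(0.8823, 0.5547, 0.6469, 0.1252, 0.7050,
            0.7065, 0.1441, 0.7398, 0.7865, 0.4874)
didbuy <- c(0, 0, 1, 0, 0, 1, 0, 1, 1, 0)
pred <- as.integer(conf >= 0.5)           # the 50%CL column
table(actual = didbuy, predicted = pred)  # TP = 4, FP = 3, TN = 3, FN = 0
mean(pred == didbuy)                      # Accuracy = 7/10 = 0.7
```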
Free Response (2 Marks)
A k-Means clustering algorithm is fitted to the iris data, as shown below.
rm(list = ls())
data("iris")
ikfit = kmeans(iris[,1:2], 4, nstart = 10)
ikfit
table(actual = iris$Species, fitted = ikfit$cluster)
Based on the R code and output below, answer the following questions.
> ikfit
K-means clustering with 4 clusters of sizes 24, 53, 41, 32
Cluster means:
Sepal.Length Sepal.Width
1 4.766667 2.891667
2 5.924528 2.750943
3 6.880488 3.097561
4 5.187500 3.637500
Within cluster sum of squares by cluster:
[1] 4.451667 8.250566 10.634146 4.630000
(between_SS / total_SS = 78.6 %)
> table(actual = iris$Species, fitted = ikfit$cluster)
fitted
actual 1 2 3 4
setosa 18 0 0 32
versicolor 5 34 11 0
virginica 1 19 30 0
If clustering was used to discriminate between the irises, what would be the accuracy of the
model? Explain your reasoning.
Assign each cluster to the species having the greatest number of
members in it. Assume these are the TPs and then work out accuracy as
usual. [1 Mark]
For example: assume C1 and C4 are setosa, C2 is versicolor, C3 is
virginica. Correctly classified = (18 + 34 + 30 + 32) = 114 out of a
total of 150, so Accuracy = 114/150 = 0.76. Accept any reasonable
similar approach. [1 Mark]
Free Response (2 Marks)
Use the data below and Naïve Bayes classification to predict whether the following test instance
will be happy or not.
Test instance: (Age Range = young, Occupation = professor, Gender = F, Happy = ? )
Class  p(class)  P(young|class)  P(professor|class)  P(F|class)  Product
Yes    0.5       0.250           0.250               0.500       0.016
No     0.5       0.250           0.250               0.750       0.023
Correct calculations [1 Mark]
So classify as Happy = No [1 Mark or H]
ID  Age Range    Occupation  Gender  Happy
1   Young        Tutor       F       Yes
2   Middle-aged  Professor   F       No
3   Old          Tutor       M       Yes
4   Middle-aged  Professor   M       Yes
5   Old          Tutor       F       Yes
6   Young        Lecturer    M       No
7   Middle-aged  Lecturer    F       No
8   Old          Tutor       F       No
Free Response (6 Marks)
The DunHumby (DH) data frame records the Date a Customer shops at a store, the number of Days
since their last shopping visit, and the amount Spent, for 20 customers. The first 4 rows are shown below.
> head(DH)
customer_id visit_date visit_delta visit_spend
<int> <chr> <int> <dbl>
1 40 04-04-10 NA 44.8
2 40 06-04-10 2 69.7
3 40 19-04-10 13 44.6
4 40 01-05-10 12 30.4
The following R code is run:
library(ggplot2)
DHY = DH[as.Date(DH$visit_date,"%d-%m-%y") < as.Date("01-01-11","%d-%m-%y"),]
CustSpend = as.table(by(DHY$visit_spend, DHY$customer_id, sum))
CustSpend = sort(CustSpend, decreasing = TRUE)
CustSpend = head(CustSpend, 12)
CustSpend = as.data.frame(CustSpend)
colnames(CustSpend) = c("customer_id", "amtspent")
DHYZ = DHY[(DHY$customer_id %in% CustSpend$customer_id),]
write.csv(DHYZ, "DHYZ.csv", row.names = FALSE)
g = ggplot(data = DHYZ) + geom_histogram(mapping = aes(x = visit_spend)) +
facet_wrap(~ customer_id, nrow = 3)
Describe the data contained in the data frame “CustSpend.” [2 Marks]
Total spend for each customer (before date)
[1 Mark]
For top 12 customers
[1 Mark]
Describe the data contained in the data frame “DHYZ.” [2 Marks]
Rows of the DHY data frame (same columns as DH)
[1 Mark]
For top 12 customers (in CustSpend)
[1 Mark]
Describe the contents of the graphic shown by plot “g.” [2 Marks]
Histogram of visit spend
[1 Mark]
Facetted by customer (for top 12)
[1 Mark]
Free Response (6 Marks)
A World Health study is examining how life expectancy varies between men and women in different
countries and at different times in history. The table below shows a sample of the data that has
been recorded. There are approximately 15,000 records in all.
Country      Year of Birth  Gender  Age at Death
Australia    1818           M       9
Afghanistan  1944           F       40
USA          1846           F       12
India        1926           F       6
China        1860           F       32
India        1868           M       54
Australia    1900           F       37
China        1875           F       75
England      1807           M       15
France       1933           M       52
Egypt        1836           M       19
USA          1906           M       58
Using one of the graphic types from the Visualization Zoo (see formulae and references for a list of
types), or another graph type of your choosing, suggest a suitable graphic to help the researcher
display as many variables as clearly as possible.
Explain your decision. Which graph elements correspond to the variables you want to display?
Appropriate main graphic by name
[1 Mark]
For example scatter plot or heat map. Accept another type with
justification.
Mapping of variables to graphic (Country)
[1 Mark]
Age at death and other vars are grouped by country using colour or
position or labels. Other mapping with justification.
Mapping of variables to graphic (Year of birth)
[1 Mark]
Year of birth is position or panel. Other mapping with
justification.
Mapping of variables to graphic (Gender)
[1 Mark]
Panel, position or colour. Other mapping with
justification.
Mapping of variables to graphic (Age at death)
[1 Mark]
Size, colour or position. Other mapping with justification.
Data reduction or summary calculation
[1 Mark]
How data is grouped and reduced. Averaging etc.
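One acceptable answer can be sketched in R with ggplot2: Year of Birth on x, Age at Death on y, Gender as colour, and Country as facet panel. The data frame `life`, its column names, and the tiny synthetic sample standing in for the ~15,000 records are all assumptions for illustration.

```r
library(ggplot2)

# Tiny synthetic stand-in for the survey data (values are illustrative only).
life <- data.frame(
  Country     = rep(c("Australia", "China", "USA"), each = 4),
  YearOfBirth = rep(c(1820, 1860, 1900, 1940), times = 3),
  Gender      = rep(c("M", "F"), length.out = 12),
  AgeAtDeath  = c(9, 37, 40, 52, 32, 75, 30, 61, 12, 6, 44, 58))

g <- ggplot(life, aes(x = YearOfBirth, y = AgeAtDeath, colour = Gender)) +
  geom_point() +            # each death record is a point
  facet_wrap(~ Country)     # one panel per country
g
```

This maps all four variables at once: position carries the two quantitative variables, colour the two-level Gender, and faceting the categorical Country.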
Free Response (6 Marks)
A researcher wants to predict the prevalence of crime in towns, using the following data.
Crm: Crime rate in the town;
Ind: Proportion of the town zoned industrial.
Pol: Air pollution in the town (ppm)
Rms: Number of main rooms in the house
Tax: Land tax paid ($)
Str: Student to teacher ratio in local schools
Zone: Socio-economic zone of house location
Val: Value of the house ($000)
> head(Cdata)
Crm Ind Pol Rms Tax Str Zone Val
1 0.00632 2.31 0.538 6 296 15.3 0 2400
2 0.02731 7.07 0.469 6 242 17.8 1 2160
3 0.02729 7.07 0.469 7 242 17.8 0 3470
4 0.03237 2.18 0.458 6 222 18.7 0 3340
Based on the R code and output below, answer the following questions.
> contrasts(Cdata$Zone) = contr.treatment(3)
> Crime = lm(Crm~.,data = Cdata); summary(Crime)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.162875 5.324457 -0.97 0.333
Ind -0.160716 0.078481 -2.05 0.041 *
Pol 4.791271 4.443372 1.08 0.281
Rms 0.051432 0.500037 0.10 0.918
Tax 0.025699 0.002902 8.86 <2e-16 ***
Str 0.041439 0.177346 0.23 0.815
Zone1 -1.843825 1.198360 -1.54 0.125
Zone2 3.244316 1.702931 1.91 0.057 .
Val -0.001216 0.000582 -2.09 0.037 *
---
> contrasts(Cdata$Zone)
  2 3
0 0 0
1 1 0
2 0 1
How does the proportion of the town zoned industrial affect crime rate? How reliable is the
evidence?
Increasing proportion of industrial reduces crime rate [1 Mark]
Reliable. Significance is high (p < 0.05)
[1 mark]
How does air pollution affect crime rates? How reliable is the evidence?
Has a positive coefficient, but we can't really tell.
[1 Mark]
Reliability is low: the p-value is large (0.281), so the
coefficient is not significant. [1 mark]
Why is Zone ‘0’ not defined in the regression output? How is it included in the model?
Zone 0 is the default contrast (having coefficient 0) [1 mark]
It is implicitly included in the intercept
[1 mark]
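The baseline behaviour can be seen by inspecting the contrast matrix directly; a sketch in base R:

```r
# Treatment contrasts for a 3-level factor: the first level is the baseline.
# Its row of the contrast matrix is all zeros, so it gets no separate
# coefficient and is absorbed into the intercept.
m <- contr.treatment(3)
m
#   2 3
# 1 0 0
# 2 1 0
# 3 0 1
```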
Free Response (6 Marks) – Extra Example!!
The table below shows the survey results from 12 people, who were asked whether they would
accept a job offer based on the attributes: Salary, Distance, and Social. We want to build a
decision tree to assist with future decisions of whether a person would accept a Job or not.
ID  Salary  Distance  Social  Job
1   Medium  Far       Poor    No
2   High    Far       Good    Yes
3   Low     Near      Poor    No
4   Medium  Moderate  Good    Yes
5   High    Far       Poor    Yes
6   Medium  Far       Good    Yes
7   Medium  Moderate  Poor    No
8   Medium  Near      Good    Yes
9   High    Moderate  Poor    Yes
10  Medium  Near      Poor    Yes
11  Medium  Moderate  Poor    Yes
12  Low     Moderate  Good    No
What is the entropy of Job?
Yes = 8 instances, No = 4 instances. [1 mark]
Entropy = -8/12 log2(8/12)-4/12 log2(4/12) = 0.9184. [1 mark]
Without calculating information gain, which attribute would you choose to be the root of the
decision tree? Explain why.
Salary
[1 Mark]
Purest leaves. (High/Low homogenous leaves)
[1 Mark]
What is the information gain of the attribute you chose for the previous question?
Entropy(Salary = high) = 0; Entropy(Salary = low) = 0
Entropy(Salary = medium) = -5/7 log2(5/7) - 2/7 log2(2/7) = 0.8632
[1 Mark]
Expected Entropy(Salary) = 7/12 x 0.8632 = 0.5035 [1 Mark or H]
Information gain = 0.9184 - 0.5035 = 0.4149
[1 Mark or H up to max of 2 Marks]
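The whole calculation can be sketched in a few lines of R, computing entropy from class counts and then the gain for Salary:

```r
# Entropy of a binary (or any) class split, given counts per class.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]            # 0 * log2(0) is taken as 0
  -sum(p * log2(p))
}
h_job <- entropy(c(8, 4))        # Job: 8 Yes, 4 No         ~ 0.918
h_med <- entropy(c(5, 2))        # Salary = Medium: 5 Yes, 2 No ~ 0.863
gain  <- h_job - (7/12) * h_med  # High and Low subsets are pure (entropy 0)
round(c(entropy_job = h_job, gain_salary = gain), 4)
```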
Formulas and references
The Visualization Zoo – Graphic Types
Time-Series Data: Index Charts, Stacked Graphs, Small Multiples, Horizon Graphs
Statistical Distributions: Stem-and-Leaf Plots, Q-Q Plots, SPLOM, Parallel Coordinates
Maps: Flow Maps, Choropleth Maps, Graduated Symbol Maps, Cartograms
Hierarchies: Node-Link Diagrams, Adjacency Diagrams, Enclosure Diagrams
Networks: Force-Directed Layouts, Arc Diagrams, Matrix Views
Entropy
If S is an arbitrary collection of examples with a binary class attribute, then:
Entropy(S) = -p_s1 log2(p_s1) - p_s2 log2(p_s2)
           = -(n_s1/n) log2(n_s1/n) - (n_s2/n) log2(n_s2/n)
where s1 and s2 are the two classes, p_s1 and p_s2 are the probabilities of being in
Class 1 or Class 2 respectively, n_s1 and n_s2 are the numbers of examples in each
class, and n is the total number of examples.
Note: log2(x) = log10(x) / log10(2) = log10(x) / 0.301
Information gain
The Gain(S, A) of an attribute A relative to a collection of examples, S, with v
groups having |S_v| elements is:
Gain(S, A) = Entropy(S) - SUM over v in Values(A) of (|S_v| / |S|) * Entropy(S_v)
Accuracy
Acc = (TP + TN) / (TP + FP + TN + FN)
ROC
TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
Naïve Bayes'
For events E1, E2, ..., En and class C, the classification probability is
P(C | E1 ∩ E2 ... ∩ En) = P(C) * P(E1 ∩ E2 ... ∩ En | C) / P(E1 ∩ E2 ... ∩ En)
For Bayesian classification, a new point is classified to Cj if
P(Cj) * P(E1 | Cj) * P(E2 | Cj) * ... * P(En | Cj) is maximised.
Naïve Bayes assumes P(A ∩ B) = P(A) * P(B) etc.