CITS3401 Final Exam Marking Semester 1, 2020

Q1: (3 marks) Suppose the balance of Jim and Lucy before March 20 is zero.

The transactional table of 3 days (20th-22nd March) is shown below. Assume that a daily

snapshot is taken at the end of each day. What is the corresponding daily snapshot table for

this transactional table? (You may write the answer in the format of comma separated rows.)

Answer:

Jim Mar 20 15

Lucy Mar 20 40

Jim Mar 21 15

Lucy Mar 21 40

Jim Mar 22 5

Lucy Mar 22 45

Q2: (5 marks) Suppose you are asked to design a data warehouse for the Fitness First gyms

with two fact tables and three dimension tables. The first fact table Visits is to keep track of

the check-in and check-out information of the customers to the club, which can be used to

analyse the duration of each visit. The second fact table Members is to store the customer

membership information for club income analysis.

Given three dimensions Time, Location and Gender, and a measure duration in the Visits

fact table, a query across all three dimensions could be "the average time of each visit for

WA branches by female gym goers".

1. If we roll up along the Gender dimension, what would be a sensible query based on

the original query?

2. Similarly, given a measure fee in the Members fact table, what would be a sensible

query?

3. What schema would you use for this data warehouse?

Sample Answer:

1. The average time of each visit for WA branches

2. What is the total fee of members in WA branches?

3. Galaxy Schema or Fact Constellation

Q3: (6 marks) Given the table below, and assume that a customer can be uniquely identified

by the date-of-birth and name; the city a customer lives in is a slowly changing

dimension, answer the following questions.

1. Given the history preserving strategy depicted in the table below, briefly explain how

to compute the total expenditure of each customer using SQL.

2. Choose an alternative strategy for handling slowly changing dimensions, and explain

how to compute the total expenditure of each customer.

3. Discuss the pros and cons of each strategy.

Sample Answer:

1. Select Name, Date of Birth, Sum(Expenditure) from CustomerTable group by Name, Date of

Birth.

2. Overwriting history. In this strategy, the name of the city where the customer previously

lived is replaced by the new city name. The “Expenditure” column is the total expenditure of

each customer.

3. Overwriting history is simple to implement and more space efficient. History preserving

strategy is more accurate in historical reporting, pre-computed aggregates are not affected,

but the table grows over time.

Q4: (6 marks) Given a 3-D cuboid ABC evenly divided into 64 chunks, the cardinality of A

is 1600, B is 400 and C is 40. The total number of memory units to store the 2-D cuboid BC

is 400x40, AB is 1600x400 and AC is 1600x40, respectively.

A goal in multi-way array aggregation is to minimise the memory consumption. Suppose

• only one chunk of ABC can be stored in the main memory at any time, and

• the ABC cuboid can only be scanned once.

1. How many memory units in total are needed to store the temporary aggregated results

of the AB, BC and AC cuboids?

2. Briefly explain how to compute the total memory units needed to store the temporary

aggregated results.

Sample Answer: 1.

72,000.

2. To minimise the memory consumption when performing the multi-way array aggregation, the

whole 2D cuboid for the smallest one, a chunk for the largest 2D cuboid and a column for the

medium one are needed to store the temporary aggregated results. That is “the whole BC + a

column of AC + a chunk of AB” = 400x40+400x40+400x100 = 72,000.

Q5: (10 marks) A supermarket has 5 transactions shown in the table below. Suppose

the minimum support of a frequent pattern is 3.

List two of the closed frequent patterns.

List a max-pattern.

Hints:

• A frequent pattern is closed if none of its immediate super-patterns has the same

support as .

• A frequent pattern is a max-pattern if is frequent and there exists no frequent

super-pattern ⊃.

Sample Answer: (note: students may use different orders, e.g. {Bread, Cake} and {Cake, Bread})

Closed patterns: {Bread, Cake}, {Break, Cake, Milk}

Max-pattern: {Bread, Cake, Milk}

Q6: (10 marks) Given the data set in the table below, we like to train a decision tree to predict

who is likely to buy a computer.

The formula of gini index is shown below.

where D is the data set and p j is the relative frequency of class j in D. If a data set D is split

on the attribute A into two subsets D 1 and D 2, the gini index giniA(D) is defined as:

1. Explain how to calculate the gini index of the attribute "Student".

2. What is the reduction in impurity if the root node uses the Student attribute?

Sample Answer:

1. Using the “Student” attribute splits the data into two subsets: one corresponds to customers

who are students and the other corresponds to those who are not. The gini index for student

customers is gini(D1)=(1 - 12)=0, and the gini index for not student customers is

gini(D2)=(1-1/2 * 1/2 – 1/2 * 1/2) = 0.5. So, the gini index ginistudent(D) = 1/2 * 0 + 1/2 * 0.5 =

0.25.

2. The reduction in impurity is gini(D) – ginistudent(D) = 0.125, where gini(D)=1-1/4 * 1/4 – 3/4 *

3/4 = 0.375.

Q7: (10 marks) A university would like to cluster undergraduate students into groups based

on the following attributes:

• Age (numerical)

• Gender (categorical)

• Residency (categorical)

• Degree (categorical)

• ATAR or other University Entry Test Score (numerical)

• WAM (Current Weighted Course Average) (numerical)

1. Given two clustering algorithms k-Means and DBScan, which would be more suitable

for this clustering task? Explain why.

2. Suggest distance measures to be used to calculate the similarity between two students.

Sample Answer:

1. Both k-means and DBScan are valid recommendation, even though DBScan may be better

because the number of groups is not limited and it doesn’t need to compute “means” which

may be tricky to compute for categorical attributes. (i) If recommend k-means, the university

has direct control on the number of groups they want to obtain. (ii) If recommend DBScan,

the university doesn’t need to specify the number of groups.

2. A distance measure that combines both categorical and numerical attributes. The similarity

of categorical attribute can be computed as follows for two students i and j.

Numerical attributes can be computed using Euclidean distance. And the final distance can

be measured by the following equation, where the weight of each type of attribute is

denoted by .

欢迎咨询51作业君