CITS3401 Final Exam Marking Semester 1, 2020
Q1: (3 marks) Suppose the balance of Jim and Lucy before March 20 is zero.
The transactional table of 3 days (20th-22nd March) is shown below. Assume that a daily
snapshot is taken at the end of each day. What is the corresponding daily snapshot table for
this transactional table? (You may write the answer in the format of comma separated rows.)
Answer:
Jim Mar 20 15
Lucy Mar 20 40
Jim Mar 21 15
Lucy Mar 21 40
Jim Mar 22 5
Lucy Mar 22 45
Q2: (5 marks) Suppose you are asked to design a data warehouse for the Fitness First gyms
with two fact tables and three dimension tables. The first fact table Visits is to keep track of
the check-in and check-out information of the customers to the club, which can be used to
analyse the duration of each visit. The second fact table Members is to store the customer
membership information for club income analysis.
Given three dimensions Time, Location and Gender, and a measure duration in the Visits
fact table, a query across all three dimensions could be "the average time of each visit for
WA branches by female gym goers".
1. If we roll up along the Gender dimension, what would be a sensible query based on
the original query?
2. Similarly, given a measure fee in the Members fact table, what would be a sensible
query?
3. What schema would you use for this data warehouse?
Sample Answer:
1. The average time of each visit for WA branches
2. What is the total fee of members in WA branches?
3. Galaxy Schema or Fact Constellation
Q3: (6 marks) Given the table below, and assume that a customer can be uniquely identified
by the date-of-birth and name; the city a customer lives in is a slowly changing
dimension, answer the following questions.
1. Given the history preserving strategy depicted in the table below, briefly explain how
to compute the total expenditure of each customer using SQL.
2. Choose an alternative strategy for handling slowly changing dimensions, and explain
how to compute the total expenditure of each customer.
3. Discuss the pros and cons of each strategy.
Sample Answer:
1. Select Name, Date of Birth, Sum(Expenditure) from CustomerTable group by Name, Date of
Birth.
2. Overwriting history. In this strategy, the name of the city where the customer previously
lived is replaced by the new city name. The “Expenditure” column is the total expenditure of
each customer.
3. Overwriting history is simple to implement and more space efficient. History preserving
strategy is more accurate in historical reporting, pre-computed aggregates are not affected,
but the table grows over time.
Q4: (6 marks) Given a 3-D cuboid ABC evenly divided into 64 chunks, the cardinality of A
is 1600, B is 400 and C is 40. The total number of memory units to store the 2-D cuboid BC
is 400x40, AB is 1600x400 and AC is 1600x40, respectively.
A goal in multi-way array aggregation is to minimise the memory consumption. Suppose
• only one chunk of ABC can be stored in the main memory at any time, and
• the ABC cuboid can only be scanned once.
1. How many memory units in total are needed to store the temporary aggregated results
of the AB, BC and AC cuboids?
2. Briefly explain how to compute the total memory units needed to store the temporary
aggregated results.
Sample Answer: 1.
72,000.
2. To minimise the memory consumption when performing the multi-way array aggregation, the
whole 2D cuboid for the smallest one, a chunk for the largest 2D cuboid and a column for the
medium one are needed to store the temporary aggregated results. That is “the whole BC + a
column of AC + a chunk of AB” = 400x40+400x40+400x100 = 72,000.
Q5: (10 marks) A supermarket has 5 transactions shown in the table below. Suppose
the minimum support of a frequent pattern is 3.
List two of the closed frequent patterns.
List a max-pattern.
Hints:
• A frequent pattern is closed if none of its immediate super-patterns has the same
support as .
• A frequent pattern is a max-pattern if is frequent and there exists no frequent
super-pattern ⊃.
Sample Answer: (note: students may use different orders, e.g. {Bread, Cake} and {Cake, Bread})
Closed patterns: {Bread, Cake}, {Break, Cake, Milk}
Max-pattern: {Bread, Cake, Milk}
Q6: (10 marks) Given the data set in the table below, we like to train a decision tree to predict
who is likely to buy a computer.
The formula of gini index is shown below.
where D is the data set and p j is the relative frequency of class j in D. If a data set D is split
on the attribute A into two subsets D 1 and D 2, the gini index giniA(D) is defined as:
1. Explain how to calculate the gini index of the attribute "Student".
2. What is the reduction in impurity if the root node uses the Student attribute?
Sample Answer:
1. Using the “Student” attribute splits the data into two subsets: one corresponds to customers
who are students and the other corresponds to those who are not. The gini index for student
customers is gini(D1)=(1 - 12)=0, and the gini index for not student customers is
gini(D2)=(1-1/2 * 1/2 – 1/2 * 1/2) = 0.5. So, the gini index ginistudent(D) = 1/2 * 0 + 1/2 * 0.5 =
0.25.
2. The reduction in impurity is gini(D) – ginistudent(D) = 0.125, where gini(D)=1-1/4 * 1/4 – 3/4 *
3/4 = 0.375.
Q7: (10 marks) A university would like to cluster undergraduate students into groups based
on the following attributes:
• Age (numerical)
• Gender (categorical)
• Residency (categorical)
• Degree (categorical)
• ATAR or other University Entry Test Score (numerical)
• WAM (Current Weighted Course Average) (numerical)
1. Given two clustering algorithms k-Means and DBScan, which would be more suitable
for this clustering task? Explain why.
2. Suggest distance measures to be used to calculate the similarity between two students.
Sample Answer:
1. Both k-means and DBScan are valid recommendation, even though DBScan may be better
because the number of groups is not limited and it doesn’t need to compute “means” which
may be tricky to compute for categorical attributes. (i) If recommend k-means, the university
has direct control on the number of groups they want to obtain. (ii) If recommend DBScan,
the university doesn’t need to specify the number of groups.
2. A distance measure that combines both categorical and numerical attributes. The similarity
of categorical attribute can be computed as follows for two students i and j.
Numerical attributes can be computed using Euclidean distance. And the final distance can
be measured by the following equation, where the weight of each type of attribute is
denoted by .  Email:51zuoyejun

@gmail.com