CITS3401 Final Exam Marking Semester 1, 2020 Q1: (3 marks) Suppose the balance of Jim and Lucy before March 20 is zero. The transactional table of 3 days (20th-22nd March) is shown below. Assume that a daily snapshot is taken at the end of each day. What is the corresponding daily snapshot table for this transactional table? (You may write the answer in the format of comma separated rows.) Answer: Jim Mar 20 15 Lucy Mar 20 40 Jim Mar 21 15 Lucy Mar 21 40 Jim Mar 22 5 Lucy Mar 22 45 Q2: (5 marks) Suppose you are asked to design a data warehouse for the Fitness First gyms with two fact tables and three dimension tables. The first fact table Visits is to keep track of the check-in and check-out information of the customers to the club, which can be used to analyse the duration of each visit. The second fact table Members is to store the customer membership information for club income analysis. Given three dimensions Time, Location and Gender, and a measure duration in the Visits fact table, a query across all three dimensions could be "the average time of each visit for WA branches by female gym goers". 1. If we roll up along the Gender dimension, what would be a sensible query based on the original query? 2. Similarly, given a measure fee in the Members fact table, what would be a sensible query? 3. What schema would you use for this data warehouse? Sample Answer: 1. The average time of each visit for WA branches 2. What is the total fee of members in WA branches? 3. Galaxy Schema or Fact Constellation Q3: (6 marks) Given the table below, and assume that a customer can be uniquely identified by the date-of-birth and name; the city a customer lives in is a slowly changing dimension, answer the following questions. 1. Given the history preserving strategy depicted in the table below, briefly explain how to compute the total expenditure of each customer using SQL. 2. Choose an alternative strategy for handling slowly changing dimensions, and explain how to compute the total expenditure of each customer. 3. Discuss the pros and cons of each strategy. Sample Answer: 1. Select Name, Date of Birth, Sum(Expenditure) from CustomerTable group by Name, Date of Birth. 2. Overwriting history. In this strategy, the name of the city where the customer previously lived is replaced by the new city name. The “Expenditure” column is the total expenditure of each customer. 3. Overwriting history is simple to implement and more space efficient. History preserving strategy is more accurate in historical reporting, pre-computed aggregates are not affected, but the table grows over time. Q4: (6 marks) Given a 3-D cuboid ABC evenly divided into 64 chunks, the cardinality of A is 1600, B is 400 and C is 40. The total number of memory units to store the 2-D cuboid BC is 400x40, AB is 1600x400 and AC is 1600x40, respectively. A goal in multi-way array aggregation is to minimise the memory consumption. Suppose • only one chunk of ABC can be stored in the main memory at any time, and • the ABC cuboid can only be scanned once. 1. How many memory units in total are needed to store the temporary aggregated results of the AB, BC and AC cuboids? 2. Briefly explain how to compute the total memory units needed to store the temporary aggregated results. Sample Answer: 1. 72,000. 2. To minimise the memory consumption when performing the multi-way array aggregation, the whole 2D cuboid for the smallest one, a chunk for the largest 2D cuboid and a column for the medium one are needed to store the temporary aggregated results. That is “the whole BC + a column of AC + a chunk of AB” = 400x40+400x40+400x100 = 72,000. Q5: (10 marks) A supermarket has 5 transactions shown in the table below. Suppose the minimum support of a frequent pattern is 3. List two of the closed frequent patterns. List a max-pattern. Hints: • A frequent pattern is closed if none of its immediate super-patterns has the same support as . • A frequent pattern is a max-pattern if is frequent and there exists no frequent super-pattern ⊃. Sample Answer: (note: students may use different orders, e.g. {Bread, Cake} and {Cake, Bread}) Closed patterns: {Bread, Cake}, {Break, Cake, Milk} Max-pattern: {Bread, Cake, Milk} Q6: (10 marks) Given the data set in the table below, we like to train a decision tree to predict who is likely to buy a computer. The formula of gini index is shown below. where D is the data set and p j is the relative frequency of class j in D. If a data set D is split on the attribute A into two subsets D 1 and D 2, the gini index giniA(D) is defined as: 1. Explain how to calculate the gini index of the attribute "Student". 2. What is the reduction in impurity if the root node uses the Student attribute? Sample Answer: 1. Using the “Student” attribute splits the data into two subsets: one corresponds to customers who are students and the other corresponds to those who are not. The gini index for student customers is gini(D1)=(1 - 12)=0, and the gini index for not student customers is gini(D2)=(1-1/2 * 1/2 – 1/2 * 1/2) = 0.5. So, the gini index ginistudent(D) = 1/2 * 0 + 1/2 * 0.5 = 0.25. 2. The reduction in impurity is gini(D) – ginistudent(D) = 0.125, where gini(D)=1-1/4 * 1/4 – 3/4 * 3/4 = 0.375. Q7: (10 marks) A university would like to cluster undergraduate students into groups based on the following attributes: • Age (numerical) • Gender (categorical) • Residency (categorical) • Degree (categorical) • ATAR or other University Entry Test Score (numerical) • WAM (Current Weighted Course Average) (numerical) 1. Given two clustering algorithms k-Means and DBScan, which would be more suitable for this clustering task? Explain why. 2. Suggest distance measures to be used to calculate the similarity between two students. Sample Answer: 1. Both k-means and DBScan are valid recommendation, even though DBScan may be better because the number of groups is not limited and it doesn’t need to compute “means” which may be tricky to compute for categorical attributes. (i) If recommend k-means, the university has direct control on the number of groups they want to obtain. (ii) If recommend DBScan, the university doesn’t need to specify the number of groups. 2. A distance measure that combines both categorical and numerical attributes. The similarity of categorical attribute can be computed as follows for two students i and j. Numerical attributes can be computed using Euclidean distance. And the final distance can be measured by the following equation, where the weight of each type of attribute is denoted by .
欢迎咨询51作业君