/CITS5504-Data-Warehousing, Sem1, 2020, Final Exam Date and Time of Exam Creation: 05/20/2020 6:19PM AWST | Total Exam Points: 50 | Est. Completion Time: 0mins Avg. Point Biserial: 0.00 | Upper 27%: 0.00 | Lower 27%: 0.00 | Disc. Index: 0.00 | Difficulty: 0.00 | Total Questions: 7 | Multiple Choice: 0 | Fill in the Blank: 1 | True/False: 0 | Essay: 6 1. (3 marks) Suppose the balance of Jim and Lucy before March 20 is zero. The transactional table of 3 days (20th-22nd March) is shown below. Assume that a daily snapshot is taken at the end of each day. What is the corresponding daily snapshot table for this transactional table? (You may write the answer in the format of comma separated rows.) (Please write your essay question on a separate piece of paper) 2. (5 marks) Suppose you are asked to design a data warehouse for the Fitness First gyms with two fact tables and three dimension tables. The first fact table Visits is to keep track of the check-in and check-out information of the customers to the club, which can be used to analyse the duration of each visit. The second fact table Members is to store the customer membership information for club income analysis. Given three dimensions Time, Location and Gender, and a measure duration in the Visits fact table, a query across all three dimensions could be "the average time of each visit for WA branches by female gym goers". 1. If we roll up along the Gender dimension, what would be a sensible query based on the original query? 2. Similarly, given a measure fee in the Members fact table, what would be a sensible query? 3. What schema would you use for this data warehouse? (Please write your essay question on a separate piece of paper) 3. (6 marks) Given the table below, and assume that a customer can be uniquely identified by the date-of-birth and name; the city a customer lives in is a slowly changing dimension, answer the following questions. 1. Given the history preserving strategy depicted in the table below, briefly explain how to compute the total expenditure of each customer using SQL. 2. Choose an alternative strategy for handling slowly changing dimensions, and explain how to compute the total expenditure of each customer. 3. Discuss the pros and cons of each strategy. (Please write your essay question on a separate piece of paper) Question ID: 4873 | Point Value: | Categories: Question ID: 4921 | Point Value: | Categories: Question ID: 4689 | Point Value: | Categories: /4. (6 marks) Given a 3-D cuboid ABC evenly divided into 64 chunks, the cardinality of A is 1600, B is 400 and C is 40. The total number of memory units to store the 2-D cuboid BC is 400x40, AB is 1600x400 and AC is 1600x40, respectively. A goal in multi-way array aggregation is to minimise the memory consumption. Suppose only one chunk of ABC can be stored in the main memory at any time, and the ABC cuboid can only be scanned once. 1. How many memory units in total are needed to store the temporary aggregated results of the AB, BC and AC cuboids? 2. Briefly explain how to compute the total memory units needed to store the temporary aggregated results. (Please write your essay question on a separate piece of paper) 5. (10 marks) A supermarket has 5 transactions shown in the table below. Suppose the minimum support of a frequent pattern is 3. List two of the closed frequent patterns: List a max-pattern: Hints: A frequent pattern is closed if none of its immediate super-patterns has the same support as . A frequent pattern is a max-pattern if is frequent and there exists no frequent super-pattern ⊃. Question ID: 5043 | Point Value: | Categories: Question ID: 3100 | Point Value: | Categories: 1 2 3 1. Cake, Bread, Bread, Cake 2. Cake, Bread, Milk, Bread, Cake, Milk, Cake, Milk, Bread, Milk, Cake, Bread, Milk, Bread, Cake, Bread, Milk, Cake 3. Cake, Milk, Bread, Milk, Cake, Bread, Cake, Bread, Milk, Bread, Milk, Cake, Milk, Bread, Cake, Bread, Cake, Milk /6. (10 marks) Given the data set in the table below, we like to train a decision tree to predict who is likely to buy a computer. The formula of gini index is shown below. where D is the data set and p is the relative frequency of class j in D. If a data set D is split on the attribute A into two subsets D and D , the gini index gini (D) is defined as: 1. Explain how to calculate the information gain of the attribute "Student". 2. What is the reduction in impurity if the root node uses the Student attribute? (Please write your essay question on a separate piece of paper) 7. (10 marks) A university would like to cluster undergraduate students into groups based on the following attributes: Age (numerical) Gender (categorical) Residency (categorical) Degree (categorical) ATAR or other University Entry Test Score (numerical) WAM (Current Weighted Course Average) (numerical) 1. Given two clustering algorithms k-Means and DBScan, which would be more suitable for this clustering task? Explain why. 2. Suggest distance measures to be used to calculate the similarity between two students. (Please write your essay question on a separate piece of paper) Question ID: 7452 | Point Value: | Categories: Question ID: 2073 | Point Value: | Categories: j 1 2 A
欢迎咨询51作业君