辅导案例-1COMP9318
1COMP9318: Data Warehousing and Data Mining — L2: Data Warehousing and OLAP — 2n Why and What are Data Warehouses? Data Analysis Problems n The same data found in many different systems n Example: customer data across different departments n The same concept is defined differently n Heterogeneous sources n Relational DBMS, OnLine Transaction Processing (OLTP) n Unstructured data in files (e.g., MS Excel) and documents (e.g., MS Word) 3 Data Analysis Problems (Cont’d) n Data is suited for operational systems n Accounting,billing,etc. n Do not support analysis across business functions n Data quality is bad n Missing data, imprecise data, different use of systems n Data are “volatile” n Data deleted in operational systems (6months) n Data change over time – no historical information 4 5Solution: Data Warehouse n Defined in many different ways, but not rigorously. n A decision support database that is maintained separately from the organization’s operational database n Support information processing by providing a solid platform of consolidated, historical data for analysis. n “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon n Data warehousing: n The process of constructing and using data warehouses 6Data Warehouse—Subject-Oriented n Organized around major subjects, such as customer, product, sales. n Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. n Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. 7Data Warehouse—Integrated n Constructed by integrating multiple, heterogeneous data sources n relational databases, flat files, on-line transaction records n Data cleaning and data integration techniques are applied. n Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources n E.g., Hotel price: currency, tax, breakfast covered, etc. n When data is moved to the warehouse, it is converted. 8Data Warehouse—Time Variant n The time horizon for the data warehouse is significantly longer than that of operational systems. n Operational database: current value data. n Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) n Every key structure in the data warehouse n Contains an element of time, explicitly or implicitly n But the key of operational data may or may not contain “time element”. 9Data Warehouse—Non-Volatile 1. A physically separate store of data transformed from the operational environment. 2. Operational update of data does not occur in the data warehouse environment. n Does not require transaction processing, recovery, and concurrency control mechanisms n Requires only two operations in data accessing: n initial loading of data and access of data. 10 Data Warehouse Architecture n Extract data from operational data sources n clean, transform n Bulk load/refresh n warehouse is offline n OLAP-server provides multidimensional view n Multidimensional-olap (Essbase, oracle express) n Relational-olap (Redbrick, Informix, Sybase, SQL server) 11 Data Warehouse Architecture Function-oriented systems Subject-oriented systems All subjects, integrated Advanced analysis 12 Why Separate Data Warehouse? n High performance for both systems n DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery n Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation. n Different functions and different data: n missing data: Decision support requires historical data which operational DBs do not typically maintain n data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources n data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled 13 Why OLAP Servers? n Different workload: n OLTP (on-line transaction processing) n Major task of traditional relational DBMS n Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. n OLAP (on-line analytical processing) n Major task of data warehouse system n Data analysis and decision making n Queries hard/infeasible for OLTP, e.g., n Which week we have the largest sales? n Does the sales of dairy products increase over time? n Generate a spread sheet of total sales by state and by year. n Difficult to represent these queries by using SQL ç Why? 14 OLTP vs. OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date detailed, flat relational isolated historical, summarized, multidimensional integrated, consolidated usage repetitive ad-hoc access read/write index/hash on prim. key lots of scans unit of work short, simple transaction complex query # records accessed tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response Comparisons 15 Databases Data Warehouses Purpose Many purposes; Flexible and general One purpose: Data analysis Conceptual Model ER Multidimensional Logical Model (Normalized) Relational Model (Denormalized) Star schema / Data cube/cuboids Physical Model Relational Tables ROLAP: Relational tables MOLAP: Multidimensional arrays Query Language SQL (hard for analytical queries) MDX (easier for analytical queries) Query Processing B+-tree/hash indexes, Multiple join optimization, Materialized views Bitmap/Join indexes, Star join, Materialized data cube 16 n The Multidimensional Model The Multidimensional Model n A data warehouse is based on a multidimensional data model which views data in the form of a data cube, which is a multidimensional generalization of 2D spread sheet. n Key concepts: n Facts: the subject it models n Typically transactions in this course; other types includes snapshots, etc. n Measures: numbers that can be aggregated n Dimensions: context of the measure n Hierarchies: n Provide contexts of different granularities (aka. grains) n Goals for dimensional modeling: n Surround facts with as much relevant context (dimensions) as possible ç Why? 17 Supermarket Example n Subject: analyze total sales and profits n Fact: Each Sales Transaction n Measure: Dollars_Sold, Amount_Sold, Cost n Calculated Measure: Profit n Dimensions: n Store n Product n Time 18 19 Visualizing the Cubes n A valid instance of the model is a data cube total Sales product p1 p2 p3 p4 ci ty NY $454 - - $925 LA $468 $800 - - SD $296 - $240 - SF $652 - $540 $745 product city 652 540 745 296 240 468 800 454 925 Concepts: cell, fact (=non-empty cell), measure, dimensions Q: How to generalize it to 3D? 3D Cube and Hierarchies pr od uc t cit y month ALL ALL ALL category region year product country quarter state month week city day store PRODUCT LOCATION TIME NY Book176 Fe b DIMENSIONS Sales of book176 in NY in Jan can be found in this cell Ja n M ar Concepts: hierarchy (a tree of dimension values), level 20 Hierarchies category product Concepts: hierarchy (a tree of dimension values), level ALL Food Beverage Book … … … ALL product category subcategory ALL brand Bo ok 17 6 Which design is better? Why? 21 The (city, moth) Cuboid pr od uc t cit y month ALL ALL ALL category region year product country quarter state month week city day store PRODUCT LOCATION TIME NY Book176 Fe b DIMENSIONS Sales of ALL_PROD in NY in Jan Ja n M ar 22 23 All the Cuboids Date Pr od uc t C ou nt ry TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico Assume: no other non-ALL levels on all dimensions. 24 All the Cuboids /2 Total annual sales of TV in U.S.A.Date Pr od uc t C ou nt ryall allTV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico all Assume: no other non-ALL levels on all dimensions. Lattice of the cuboids product, quarter, country product countryquarter quarter, countryproduct,quarter product, store Base cuboid 3-dim cuboid 2-dim cuboid 1-dim cuboid 0-dim cuboid n n-dim cube can be represented as (D1, D2, …, Dd), where Di is the set of allowed values on the i-th dimension. n if Di = Li (a particular level), then Di = all descendant dimension values of Li. n ALL can be omitted and hence reduces the effective dimensionality n A complete cube of d-dimensions consists of cuboids, where ni is the number of levels (excluding ALL) on i-th dimension. n They collectively form a lattice. dY i=1 (ni + 1) 25 Properties of Operations n All operations are closed under the multidimensional model n i.e., both input and output of an operation is a cube n So that they can be composed 26 Q: What’s the analogy in the Relational Model? 27 Common OLAP Operations n Roll-up: move up the hierarchy COMP9318: Data Warehousing and Data Mining pr od uc t cit y month ALL ALL ALL category region year product country quarter state month week city day store PRODUCT LOCATION TIME NY Book176 Q 2 DIMENSIONSSales of book176 in NY in Q1 here Q 1 Q 3 Q: what should be its value? pr od uc t cit y month NY Book176 Fe b Ja n M ar ALL ALL ALL category region year product country quarter state month week city day store PRODUCT LOCATION TIME DIMENSIONS 28 cit y pr od uc tNY Book176 quarter 29 Common OLAP Operations n Drill-down: move down the hierarchy n more fine-grained aggregation 30 Slice and Dice Queries n Slice and Dice: select and project on one or more dimension values cit y product month Product = Book176 The output cube has smaller dimensionality than the input cube 31 Pivoting n Pivoting: aggregate on selected dimensions n usually 2 dims (cross- tabulation) pivot on (city, month) Sales (of all products) in NY in Q1 =sum( ??? ) pr od uc t NY Book176 month Q 2 Q 1 Q 3 cit y month City … … … … … … … … NY Q2Q1 Q3 A Reflective Pause n Let’s review the definition of data cubes again. n Key message: n Disentangle the “object” from its “representation” or “implementation” 32 Modeling Exercise 1: Monthly Phone Service Billing 33 Theme: analyze the income/revenue of Telstra Solution n FACT n MEASURE n DIMENSIONS 34 35 n The Logical Model 36 Logical Models n Two main approaches: n Using relational DB technology: n Star schema, Snowflake schema, Fact constellation n Using multidimensional technology: n Just as multidimensional data cube 37 38 Universal Schema è Star Schema n Many data warehouses adopt a star schema to represent the multidimensional model n Each dimension is represented by a dimension-table n LOCATION (location_key, store, street_address, city, state, country, region) n dimension tables are not normalized n Transactions are described through a fact-table n each tuple consists of a logical pointer to each of the dimension- tables (foreign-key) and a list of measures (e.g. sales $$$) Store City State Prod Brand Category $Sold #Sold Cost S136 Syd NSW 76Ha Nestle Biscuit 40 10 18 S173 Melb Vic 76Ha Nestle Biscuit 20 5 11 The universal schema for supermarket The Star Schema time_key day day_of_the_week month quarter year TIME location_key store street_address city state country region LOCATION SALES product_key product_name category brand color supplier_name PRODUCT time_key product_key location_key units_sold amount Think why: (1) Denormalized once from the universal schema (2) Controlled redundancy 39 40 Typical Models for Data Warehouses n Modeling data warehouses: dimensions & measures n Star schema: A fact table in the middle connected to a set of dimension tables n Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake n Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 41 Example of Star Schema time_key day day_of_the_week month quarter year time location_key street city state_or_province country location Sales Fact Table time_key time_keyitem_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_type item branch_key branch_name branch_type branch 42 Example of Snowflake Schema time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item branch_key branch_name branch_type branch supplier_key supplier_type supplier city_key city state_or_province country city item_key item_keybranch_key 43 Example of Fact Constellation time_key day day_of_the_week month quarter year time location_key street city province_or_state country location Sales Fact Table time_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_type item branch_key branch_name branch_type branch Shipping Fact Table time_key item_key Shipper_key from_location to_location dollars_cost units_shipped shipper_key shipper_name location_key shipper_type shipper 44 Advantages of Star Schema n Facts and dimensions are clearly depicted n dimension tables are relatively static, data is loaded (append mostly) into fact table(s) n easy to comprehend (and write queries) “Find total sales per product-category in our stores in Europe” SELECT PRODUCT.category, SUM(SALES.amount) FROM SALES, PRODUCT,LOCATION WHERE SALES.product_key = PRODUCT.product_key AND SALES.location_key = LOCATION.location_key AND LOCATION.region=“Europe” GROUP BY PRODUCT.category Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category) n Query Language 45 Query Language n Two approaches: n Using relational DB technology: SQL (with extensions such as CUBE/PIVOT/UNPIVOT) n Using multidimensional technology: MDX 46 SELECT PRODUCT.category, SUM(SALES.amount) FROM SALES, PRODUCT,LOCATION WHERE SALES.product_key = PRODUCT.product_key AND SALES.location_key = LOCATION.location_key AND LOCATION.region=“Europe” GROUP BY PRODUCT.category SELECT {[PRODUCT].[category]} on ROWS, {[MEASURES].[amount]} on COLUMNS FROM [SALES] WHERE ([LOCATION].[region].[Europe]) Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt) n Physical Model + Query Processing Techniques 47 Physical Model + Query Processing Techniques n Two main approaches: n Using relational DB technology: ROLAP n Using multidimensional technology: MOLAP n Hybrid: HOLAP n Base cuboid: ROLAP n Other cuboids: MOLAP 48 49 Q1: Selection on low-cardinality attributes time_key day day_of_the_week month quarter year TIME location_key store street_address city state country region LOCATION SALES customer_key customer_name region type CUSTOMER time_key customer_key location_key units_sold amount {measures JO IN JOIN Sregion=“Europe” • Ignoring the final GROUP BY for now • Omitting the Product dimension Indexing OLAP Data: Bitmap Index (1) BI on dimension tables n Index on an attribute (column) with low distinct values n Each distinct values, v, is associated with a n-bit vector (n = #rows) n The i-th bit is set if the i-th row of the table has the value v for the indexed column n Multiple BIs can be efficiently combined to enable optimized scan of the table Cust Region Type C1 Asia Retail C2 Europe Dealer C3 Asia Dealer C4 America Retail C5 Europe Dealer Custom BI on Customer.Region Chap 4.2 in [JPT10] v bitmap Asia 1 0 1 0 0 Europe 0 1 0 0 1 America 0 0 0 1 0 50 Indexing OLAP Data: Bitmap Index /2 (1) Bitmap join index (BI on Fact Table Joined with Dimension tables) n Conceptually, perform a join, map each dimension value to the bitmap of corresponding fact table rows. https://docs.oracle.com/cd/B28359_01/server.111/b28313/indexes.htm#CIHGAFFF -- ORACLE SYNTAX – CREATE BITMAP INDEX sales_cust_region_bjix ON sales(customer.cust_region) FROM sales, customer WHERE sales.cust_id = customers.cust_id; 51 52 Indexing OLAP Data: Bitmap Index /3 Sales Cust Region Type C1 Asia Retail C2 Europe Dealer C3 Asia Dealer C4 America Retail C5 Europe Dealer Customer time customer loc Sale 101 C1 100 1 173 C1 200 2 208 C2 100 3 863 C3 200 5 991 C1 100 8 1001 C2 200 13 1966 C4 100 21 2017 C5 200 34 BI on Sales(Customer.Region) v bitmap Asia 11011000 Europe 00100101 America 00000010 53 Q2: Selection on high-cardinality attributes time_key day day_of_the_week month quarter year TIME location_key store street_address city state country region LOCATION SALES customer_key customer_name region type CUSTOMER time_key customer_key location_key units_sold amount {measures JO IN JOIN Scity=“Kingsford” 54 R102 R117 R118 R124 Indexing OLAP Data: Join Indices n Join index relates the values of the dimensions of a star schema to rows in the fact table. n a join index on city maintains for each distinct city a list of ROW-IDs of the tuples recording the sales in the city n Join indices can span multiple dimensions OR n can be implemented as bitmap- indexes (per dimension) n use bit-op for multiple-joins city = Sydney city = Randwick city = Kingsford city = Coogee 1 1 1 1 Chap 4.3 in [JPT10] SalesLocation Kingsford R102,R117,R118,R124 55 Q3: Arbitrary selections on Dimensions time_key day day_of_the_week month quarter year TIME location_key store street_address city state country region LOCATION SALES customer_key customer_name region type Customer time_key product_key location_key units_sold amount {measures JO IN JOIN Scity =~ /\S+ford/ SisSchoolHolidy(day) Sregion = “Asia” 56 Star Query and Star Join (Cont.) time 1 2 Cust 10 20 X loc 100 200 X time cust loc 1 10 100 1 10 200 1 20 100 1 20 200 2 10 100 2 10 200 2 20 100 2 20 200 time cust loc sold 1 10 100 7 1 10 150 13 1 20 150 2 2 20 200 16 … … … … 1000 2000 500 86 Sales Time Customer Loc thousands of tuples millions of tuples Sales !" σ1(Time) !" σ2(Cust) !" σ3(Loc) è Sales !" (σ1(Time) X σ2(Cust) X σ3(Loc)) Usually only part of the dim tables because of the selection predicates Chap 4.4 in [JPT10] 57 Q4: Coarse-grain Aggregations n “Find total sales per customer type in our stores in Europe” n Join-index will prune ¾ of the data (uniform sales), but the remaining ¼ is still large (several millions transactions) n Index is unclustered n High-level aggregations are expensive!!!!! region country state city store LOCATON ÞLong Query Response Times ÞPre-computation is necessary ÞPre-computation is most beneficial Cuboids = GROUP BYs n Multidimensional aggregation = selection on corresponding cuboid n Materialize some/all of the cuboids n A complex decision involving cuboid sizes, query workload, and physical organization 58 GB(type, city)( Sales !" σ1(Time) !" σ2(Cust) !" σ3(Loc)) ) GB(type, city)( σ1’2’3’(Cuboid(Year, Type, City)) ) σ1 selects some Years, σ2 selects some Brands, σ3 selects some Cities, Two Issues n How to store the materialized cuboids? n How to compute the cuboids efficiently? 59 60 Store Product_key sum(amout) 1 1 454 1 4 925 2 1 468 2 2 800 3 1 296 3 3 240 4 1 625 4 3 240 4 4 745 1 ALL 1379 2 ALL 1268 3 ALL 536 4 ALL 1937 ALL 1 1870 ALL 2 800 ALL 3 780 ALL 4 1670 ALL ALL 5120 CUBE BY in ROLAP Product Sales 1 2 3 4 ALL 1 454 - - 925 1379 2 468 800 - - 1268 3 296 - 240 - 536 4 652 - 540 745 1937 St or e ALL 1870 800 780 1670 5120 SELECT LOCATION.store, SALES.product_key, SUM (amount) FROM SALES, LOCATION WHERE SALES.location_key=LOCATION.location_key CUBE BY SALES.product_key, LOCATION.store 4 Group-bys here: (store,product) (store) (product) () • Need to write 4 queries!!! • Compute them independently 61 Top-down Approach n Model dependencies among the aggregates: most detailed “view” can be computed from view (product,store,quarter) by summing-up all quarterly sales product,store,quarter product storequarter none store,quarterproduct,quarter product, store A B M 1 1 10 1 3 20 1 2 30 1 1 40 2 4 50 Cuboid AB A B M 1 (all) ??? 2 (all) ??? Cuboid A A B M (all) 1 ??? (all) 2 ??? (all) 3 ??? (all) 4 ??? Cuboid B 62 Bottom-Up Approach (BUC) n BUC (Beyer & Ramakrishnan, SIGMOD’99) n Ideas n Compute the cube from bottom up n Divide-and-conquer n A simpler recursive version: n BUC-SR A B … 1 1 … 1 3 … 1 2 … 1 1 … 2 … … … Understanding Recursion /1 n Powerful computing/problem-solving techniques n Examples n Factorial: n f(n) = 1, if n = 1 n f(n) = f(n-1) * n, if n ≥ 1 n Quick sort: n Sort([x]) = [x] n Sort([x1, …, pivot, … xn]) = sort[ys] ++ sort[zs]), where ys = [ x | x in xi, x ≤ pivot ] zs = [ x | x <- xi, x > pivot ] 63 f(0) = 0! = ??? List comprehension in Haskell or python Understanding Recursion /2 n Let C(n, m) be the number of ways to select m balls from n numbered balls n Show that C(n, m) = C(n-1, m-1) + C(n-1, m) n Example: m = 3, n = 5 n Consider any ball in the 5 balls, e.g., ‘d’ 64 entire solution space ?1 ?2 ?3 d ?1 ?2 ?1 (no d) ?2 (no d) ?3 (no d) a b c d e?i in { } Key Points n Sub-problems need to be “smaller”, so that a simple/trivial boundary case can be reached n Divide-and-conquer n There may be multiple ways the entire solution space can be divided into disjoint sub-spaces, each of which can be conquered recursively. 65 n Reduce Cube(in 2D) to Cube(in 1D) 66 Geometric Intuition /1 M11 M12 M13 [Step 1] M21 M22 M23 [Step 1] [Step 2] [Step 2] [Step 2] [Step 3] a1 a2 b1 b2 b3 M11 M12 M13 [Step 1] b1 b2 b3 M21 M22 M23 [Step 1] [Step 2] [Step 2] [Step 2] [Step 3] [a1] ⨉ [a2] ⨉ [*] ⨉ * n Reduce Cube(in 3D) to Cube(in 2D) 67 Geometric Intuition /2 A B C A B C A B C n Reduce Cube(in 3D) to Cube(in 2D) 68 Geometric Intuition /3 A B C A B C A B C Algebraic Derivation n How to compute n-dim cube on (n+1)-dim base cuboid (array)? n What do output tuples look like? n How to compute (n+1)-dim cube on (n+1)-dim base cuboid (array)? n What else do we need? 69 [{r1-r5}, BC] r1 r2 r3 r4 r5 A B C M 1 1 1 10 1 1 2 20 1 2 1 30 1 3 1 40 2 1 1 50 [{r1-r5}, ABC] BUC-SR (Simple Recursion)* n BUC-SR(data, dims) n If (dims is empty) n Output (sum(data)) n Else n Dims = [dim1, rest_of_dims] n For each distinct value v of dim1 n slice_v = slice of data on “dim1 = v” n BUC-SR(slice_v, rest_of_dims) n data’ = Project(data, rest_of_dims) n BUC-SR(data’, rest_of_dims) 70 Boundary case: data is essentially a list of measure values General case: 1)Slice on dim1. Call BUC-SR recursively for each slice 2)Project out dim1, and call BUC-SR on it recursively You need to fix the incomplete output part of the algorithm! Example 71 [{r1-r5}, AB] [{r1-r4}, B] r1 r2 r3 r4 r5 B M 1 10 1 20 2 30 3 40 B M 1 30 2 30 3 40 * 100 A B M 1 1 30 1 2 30 1 3 40 1 * 100 [{r5}, B] B M 1 50 B M 1 50 * 50 A B M 2 1 50 2 * 50 [{r1’-r5’}, B] B M 1 80 2 30 3 40 B M 1 80 2 30 3 40 * 150 A B M * 1 80 * 2 30 * 3 40 * * 150 A B M 1 1 10 1 1 20 1 2 30 1 3 40 2 1 50 1 2 3 * 1 30 30 40 100 2 50 50 * 80 30 40 150 OutputInternal OutputInput Try a 3D-Cube by Yourself 7219/2/20 [{r1-r5}, ABC] r1 r2 r3 r4 r5 A B C M 1 1 1 10 1 1 2 20 1 2 1 30 1 3 1 40 2 1 1 50 MOLAP n (Sparse) array-based multidimensional storage engine n Pros: n small size (esp. for dense cubes) n fast in indexing and query processing n Cons: n scalability n conversion from relational data 73 74 Multidimensional Array time item dollars_sold Q1 home entertainment 605 Q2 home entertainment 680 Q3 home entertainment 812 Q4 home entertainment 927 Q1 computer 825 Q2 computer 952 Q3 computer 1023 Q4 computer 1038 Q1 phone 14 Q2 phone 31 Q3 phone 30 Q4 phone 38 Q1 security 400 Q2 security 512 Q3 security 501 Q4 security 580 Mappings time value Q1 0 Q2 1 Q3 2 Q4 3 item value home entertain ment 0 computer 1 phone 2 security 3 time item dollars_s old offset 0 0 605 0 1 0 680 4 2 0 812 8 3 0 927 12 0 1 825 1 1 1 952 5 2 1 1023 9 3 1 1038 13 0 2 14 2 1 2 31 6 2 2 30 10 3 2 38 14 0 3 400 3 1 3 512 7 2 3 501 11 3 3 580 15 f(time, item) = 4*time + item Step 1 Step 2: An injective map from cell to offset 75 Multidimensional Array offset dollars_sold 0 605 1 825 2 14 3 400 4 680 5 952 6 31 7 512 8 812 9 1023 10 30 11 501 12 927 13 1038 14 38 15 580 Dense MD array 605 825 14 400 680 952 31 512 812 1023 30 501 927 1038 38 580 n Think: how to decode a slot? Step 3: If dense, only need to store sorted slots 76 The Sparse Case time item dollars_sold Q1 home entertainment 605 Q2 home entertainment 680 Q3 home entertainment 812 Q4 home entertainment 927 Q1 computer 825 Q2 computer 952 Q3 computer 1023 Q4 computer 1038 Q1 phone 14 Q2 phone 31 Q3 phone 30 Q4 phone 38 Q1 security 400 Q2 security 512 Q3 security 501 Q4 security 580 Mappings time value Q1 0 Q2 1 Q3 2 Q4 3 item value home entertain ment 0 computer 1 phone 2 security 3 time item dollars_s old offset 0 0 605 0 1 0 680 4 2 0 812 8 3 0 927 12 0 1 825 1 1 1 952 5 2 1 1023 9 3 1 1038 13 0 2 14 2 1 2 31 6 2 2 30 10 3 2 38 14 0 3 400 3 1 3 512 7 2 3 501 11 3 3 580 15 f(time, item) = 4*time + item Step 1 Step 2: An injective map from cell to offset /* same table but with many rows deleted to make it sparse */ /* same table but with many rows deleted to make it sparse */ 77 Multidimensional Array offset dollars_sold 0 605 15 580 Dense MD array 605 - - - - - - - - - - - - - - 580 n Think: how to decode a slot? n Multidimensional array is typically sparse n Use sparse array (i.e., offset + value) n Could use chunk to further reduce the space n Space usage: n (d+1)*n*4 vs 2*n*4 n HOLAP: n Store all non-base cuboid in MD array n Assign a value for ALL Choice 2Choice 1