辅导案例-1COMP9318

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

1COMP9318: Data Warehousing
and Data Mining
— L2: Data Warehousing and OLAP —
2n Why and What are Data Warehouses?
Data Analysis Problems
n The same data found in many different systems
n Example: customer data across different
departments
n The same concept is defined differently
n Heterogeneous sources
n Relational DBMS, OnLine Transaction
Processing (OLTP)
n Unstructured data in files (e.g., MS Excel) and
documents (e.g., MS Word)
3
Data Analysis Problems (Cont’d)
n Data is suited for operational systems
n Accounting,billing,etc.
n Do not support analysis across business
functions
n Data quality is bad
n Missing data, imprecise data, different use of
systems
n Data are “volatile”
n Data deleted in operational systems (6months)
n Data change over time – no historical
information 4
5Solution: Data Warehouse
n Defined in many different ways, but not rigorously.
n A decision support database that is maintained separately from
the organization’s operational database
n Support information processing by providing a solid platform of
consolidated, historical data for analysis.
n “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
n Data warehousing:
n The process of constructing and using data warehouses
6Data Warehouse—Subject-Oriented
n Organized around major subjects, such as customer,
product, sales.
n Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
n Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process.
7Data Warehouse—Integrated
n Constructed by integrating multiple, heterogeneous data
sources
n relational databases, flat files, on-line transaction
records
n Data cleaning and data integration techniques are
applied.
n Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
n E.g., Hotel price: currency, tax, breakfast covered, etc.
n When data is moved to the warehouse, it is converted.
8Data Warehouse—Time Variant
n The time horizon for the data warehouse is significantly
longer than that of operational systems.
n Operational database: current value data.
n Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
n Every key structure in the data warehouse
n Contains an element of time, explicitly or implicitly
n But the key of operational data may or may not contain
“time element”.
9Data Warehouse—Non-Volatile
1. A physically separate store of data transformed from
the operational environment.
2. Operational update of data does not occur in the data
warehouse environment.
n Does not require transaction processing, recovery,
and concurrency control mechanisms
n Requires only two operations in data accessing:
n initial loading of data and access of data.
10
Data Warehouse Architecture
n Extract data from
operational data sources
n clean, transform
n Bulk load/refresh
n warehouse is offline
n OLAP-server provides
multidimensional view
n Multidimensional-olap
(Essbase, oracle
express)
n Relational-olap
(Redbrick, Informix,
Sybase, SQL server)
11
Data Warehouse Architecture
Function-oriented
systems
Subject-oriented
systems
All subjects,
integrated
Advanced
analysis
12
Why Separate Data Warehouse?
n High performance for both systems
n DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
n Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
n Different functions and different data:
n missing data: Decision support requires historical data which
operational DBs do not typically maintain
n data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
n data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
13
Why OLAP Servers?
n Different workload:
n OLTP (on-line transaction processing)
n Major task of traditional relational DBMS
n Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,
registration, accounting, etc.
n OLAP (on-line analytical processing)
n Major task of data warehouse system
n Data analysis and decision making
n Queries hard/infeasible for OLTP, e.g.,
n Which week we have the largest sales?
n Does the sales of dairy products increase over time?
n Generate a spread sheet of total sales by state and by year.
n Difficult to represent these queries by using SQL ç Why?
14
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date
detailed, flat relational
isolated
historical,
summarized, multidimensional
integrated, consolidated
usage repetitive ad-hoc
access read/write
index/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

Comparisons
15
Databases Data Warehouses
Purpose Many purposes; Flexible and
general
One purpose: Data analysis
Conceptual Model ER Multidimensional
Logical Model (Normalized) Relational Model (Denormalized) Star schema /
Data cube/cuboids
Physical Model Relational Tables ROLAP: Relational tables
MOLAP: Multidimensional arrays
Query Language SQL (hard for analytical
queries)
MDX (easier for analytical
queries)
Query Processing B+-tree/hash indexes, Multiple
join optimization, Materialized
views
Bitmap/Join indexes, Star join,
Materialized data cube
16
n The Multidimensional Model
The Multidimensional Model
n A data warehouse is based on a multidimensional data
model which views data in the form of a data cube, which
is a multidimensional generalization of 2D spread sheet.
n Key concepts:
n Facts: the subject it models
n Typically transactions in this course; other types includes
snapshots, etc.
n Measures: numbers that can be aggregated
n Dimensions: context of the measure
n Hierarchies:
n Provide contexts of different granularities (aka. grains)
n Goals for dimensional modeling:
n Surround facts with as much relevant context
(dimensions) as possible ç Why?
17
Supermarket Example
n Subject: analyze total sales and profits
n Fact: Each Sales Transaction
n Measure: Dollars_Sold, Amount_Sold, Cost
n Calculated Measure: Profit
n Dimensions:
n Store
n Product
n Time
18
19
Visualizing the Cubes
n A valid instance of the model is a data cube
total
Sales
product
p1 p2 p3 p4
ci
ty

NY $454 - - $925
LA $468 $800 - -
SD $296 - $240 -
SF $652 - $540 $745

product
city
652 540 745
296 240
468 800
454 925
Concepts: cell, fact (=non-empty cell), measure,
dimensions
Q: How to generalize it to 3D?
3D Cube and Hierarchies
pr
od
uc
t
cit
y
month
ALL ALL ALL
category region year
product country quarter
state month week
city day
store
PRODUCT LOCATION TIME
NY
Book176
Fe
b
DIMENSIONS
Sales of book176 in NY in Jan can be found in this cell
Ja
n
M
ar
Concepts: hierarchy (a tree of dimension values), level
20
Hierarchies
category
product
Concepts: hierarchy (a tree of dimension values), level
ALL
Food Beverage Book
… … …
ALL
product
category
subcategory
ALL
brand
Bo
ok
17
6
Which design is better? Why?
21
The (city, moth) Cuboid
pr
od
uc
t
cit
y
month
ALL ALL ALL
category region year
product country quarter
state month week
city day
store
PRODUCT LOCATION TIME
NY
Book176
Fe
b
DIMENSIONS
Sales of ALL_PROD in NY in Jan
Ja
n
M
ar
22
23
All the Cuboids
Date
Pr
od
uc
t
C
ou
nt
ry
TV
VCR
PC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
Assume: no other non-ALL
levels on all dimensions.
24
All the Cuboids /2
Total annual sales
of TV in U.S.A.Date
Pr
od
uc
t
C
ou
nt
ryall
allTV
VCR
PC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
all
Assume: no other non-ALL
levels on all dimensions.
Lattice of the cuboids
product, quarter, country
product countryquarter
quarter, countryproduct,quarter product, store
Base cuboid
3-dim cuboid
2-dim cuboid
1-dim cuboid
0-dim cuboid
n n-dim cube can be represented as (D1, D2, …, Dd), where Di is the set
of allowed values on the i-th dimension.
n if Di = Li (a particular level), then Di = all descendant dimension
values of Li.
n ALL can be omitted and hence reduces the effective
dimensionality
n A complete cube of d-dimensions consists of cuboids,
where ni is the number of levels (excluding ALL) on i-th dimension.
n They collectively form a lattice.
dY
i=1
(ni + 1)
25
Properties of Operations
n All operations are closed under the
multidimensional model
n i.e., both input and output of an operation is a
cube
n So that they can be composed
26
Q: What’s the analogy in the Relational Model?
27
Common OLAP Operations
n Roll-up: move up the
hierarchy
COMP9318: Data Warehousing and Data Mining
pr
od
uc
t
cit
y
month
ALL ALL ALL
category region year
product country quarter
state month week
city day
store
PRODUCT LOCATION TIME
NY
Book176
Q
2
DIMENSIONSSales of book176 in NY in Q1 here
Q
1
Q
3
Q: what should be its value?
pr
od
uc
t
cit
y
month
NY
Book176
Fe
b
Ja
n
M
ar
ALL ALL ALL
category region year
product country quarter
state month week
city day
store
PRODUCT LOCATION TIME
DIMENSIONS
28
cit
y
pr
od
uc
tNY
Book176
quarter
29
Common OLAP Operations
n Drill-down: move down
the hierarchy
n more fine-grained
aggregation
30
Slice and Dice Queries
n Slice and Dice: select and project on one or more
dimension values
cit
y
product
month
Product = Book176
The output cube has smaller
dimensionality than the input cube
31
Pivoting
n Pivoting: aggregate on
selected dimensions
n usually 2 dims (cross-
tabulation)
pivot on (city, month)
Sales (of all products) in
NY in Q1
=sum( ??? )
pr
od
uc
t
NY
Book176
month
Q
2
Q
1
Q
3
cit
y
month
City
… …
… …
… …
… …
NY
Q2Q1 Q3
A Reflective Pause
n Let’s review the definition of data cubes again.
n Key message:
n Disentangle the “object” from its
“representation” or “implementation”
32
Modeling Exercise 1: Monthly Phone
Service Billing
33
Theme: analyze the income/revenue of Telstra
Solution
n FACT
n MEASURE
n DIMENSIONS
34
35
n The Logical Model
36
Logical Models
n Two main approaches:
n Using relational DB technology:
n Star schema, Snowflake schema, Fact constellation
n Using multidimensional technology:
n Just as multidimensional data cube
37
38
Universal Schema è Star Schema
n Many data warehouses adopt a star schema to represent
the multidimensional model
n Each dimension is represented by a dimension-table
n LOCATION (location_key, store, street_address, city, state,
country, region)
n dimension tables are not normalized
n Transactions are described through a fact-table
n each tuple consists of a logical pointer to each of the dimension-
tables (foreign-key) and a list of measures (e.g. sales $$$)
Store City State Prod Brand Category $Sold #Sold Cost
S136 Syd NSW 76Ha Nestle Biscuit 40 10 18
S173 Melb Vic 76Ha Nestle Biscuit 20 5 11
The universal schema for supermarket
The Star Schema
time_key
day
day_of_the_week
month
quarter
year
TIME
location_key
store
street_address
city
state
country
region
LOCATION
SALES
product_key
product_name
category
brand
color
supplier_name
PRODUCT
time_key
product_key
location_key
units_sold
amount
Think why:
(1) Denormalized once from the universal schema
(2) Controlled redundancy 39
40
Typical Models for Data Warehouses
n Modeling data warehouses: dimensions & measures
n Star schema: A fact table in the middle connected to a
set of dimension tables
n Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
n Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
41
Example of Star Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
state_or_province
country
location
Sales Fact Table
time_key
time_keyitem_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch
42
Example of Snowflake Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
item
branch_key
branch_name
branch_type
branch
supplier_key
supplier_type
supplier
city_key
city
state_or_province
country
city
item_key
item_keybranch_key
43
Example of Fact Constellation
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
province_or_state
country
location
Sales Fact Table
time_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch
Shipping Fact Table
time_key
item_key
Shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_key
shipper_name
location_key
shipper_type
shipper
44
Advantages of Star Schema
n Facts and dimensions are clearly depicted
n dimension tables are relatively static, data is
loaded (append mostly) into fact table(s)
n easy to comprehend (and write queries)
“Find total sales per product-category in our stores in Europe”
SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT,LOCATION
WHERE SALES.product_key = PRODUCT.product_key
AND SALES.location_key = LOCATION.location_key
AND LOCATION.region=“Europe”
GROUP BY PRODUCT.category
Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category)
n Query Language
45
Query Language
n Two approaches:
n Using relational DB technology: SQL (with extensions
such as CUBE/PIVOT/UNPIVOT)
n Using multidimensional technology: MDX
46
SELECT PRODUCT.category,
SUM(SALES.amount)
FROM SALES, PRODUCT,LOCATION
WHERE SALES.product_key =
PRODUCT.product_key
AND SALES.location_key =
LOCATION.location_key
AND LOCATION.region=“Europe”
GROUP BY PRODUCT.category
SELECT
{[PRODUCT].[category]} on ROWS,
{[MEASURES].[amount]} on COLUMNS
FROM [SALES]
WHERE ([LOCATION].[region].[Europe])
Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt)
n Physical Model + Query Processing Techniques
47
Physical Model + Query Processing
Techniques
n Two main approaches:
n Using relational DB technology: ROLAP
n Using multidimensional technology: MOLAP
n Hybrid: HOLAP
n Base cuboid: ROLAP
n Other cuboids: MOLAP
48
49
Q1: Selection on low-cardinality attributes
time_key
day
day_of_the_week
month
quarter
year
TIME
location_key
store
street_address
city
state
country
region
LOCATION
SALES
customer_key
customer_name
region
type
CUSTOMER
time_key
customer_key
location_key
units_sold
amount
{measures
JO
IN
JOIN
Sregion=“Europe”
• Ignoring the final GROUP BY for now
• Omitting the Product dimension
Indexing OLAP Data: Bitmap Index
(1) BI on dimension tables
n Index on an attribute (column) with low distinct values
n Each distinct values, v, is associated with a n-bit vector (n =
#rows)
n The i-th bit is set if the i-th row of the table has the value v
for the indexed column
n Multiple BIs can be efficiently combined to enable optimized scan
of the table
Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer
Custom BI on Customer.Region
Chap 4.2 in [JPT10]
v bitmap
Asia 1 0 1 0 0
Europe 0 1 0 0 1
America 0 0 0 1 0
50
Indexing OLAP Data: Bitmap Index /2
(1) Bitmap join index (BI on Fact Table Joined with
Dimension tables)
n Conceptually, perform a join, map each dimension
value to the bitmap of corresponding fact table rows.
https://docs.oracle.com/cd/B28359_01/server.111/b28313/indexes.htm#CIHGAFFF
-- ORACLE SYNTAX –
CREATE BITMAP INDEX sales_cust_region_bjix
ON sales(customer.cust_region)
FROM sales, customer
WHERE sales.cust_id = customers.cust_id;
51
52
Indexing OLAP Data: Bitmap Index /3
Sales
Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer
Customer
time customer loc Sale
101 C1 100 1
173 C1 200 2
208 C2 100 3
863 C3 200 5
991 C1 100 8
1001 C2 200 13
1966 C4 100 21
2017 C5 200 34
BI on Sales(Customer.Region)
v bitmap
Asia 11011000
Europe 00100101
America 00000010
53
Q2: Selection on high-cardinality attributes
time_key
day
day_of_the_week
month
quarter
year
TIME
location_key
store
street_address
city
state
country
region
LOCATION
SALES
customer_key
customer_name
region
type
CUSTOMER
time_key
customer_key
location_key
units_sold
amount
{measures
JO
IN
JOIN
Scity=“Kingsford”
54
R102
R117
R118
R124
Indexing OLAP Data: Join Indices
n Join index relates the values of
the dimensions of a star schema
to rows in the fact table.
n a join index on city
maintains for each distinct
city a list of ROW-IDs of
the tuples recording the
sales in the city
n Join indices can span multiple
dimensions OR
n can be implemented as bitmap-
indexes (per dimension)
n use bit-op for multiple-joins
city = Sydney
city = Randwick
city = Kingsford
city = Coogee
1
1
1
1
Chap 4.3 in [JPT10]
SalesLocation
Kingsford R102,R117,R118,R124
55
Q3: Arbitrary selections on Dimensions
time_key
day
day_of_the_week
month
quarter
year
TIME
location_key
store
street_address
city
state
country
region
LOCATION
SALES
customer_key
customer_name
region
type
Customer
time_key
product_key
location_key
units_sold
amount
{measures
JO
IN
JOIN
Scity =~ /\S+ford/
SisSchoolHolidy(day)
Sregion = “Asia”
56
Star Query and Star Join (Cont.)
time
1
2
Cust
10
20
X
loc
100
200
X
time cust loc
1 10 100
1 10 200
1 20 100
1 20 200
2 10 100
2 10 200
2 20 100
2 20 200
time cust loc sold
1 10 100 7
1 10 150 13
1 20 150 2
2 20 200 16
… … … …
1000 2000 500 86
Sales
Time Customer Loc
thousands
of tuples
millions
of tuples
Sales !" σ1(Time) !" σ2(Cust) !"
σ3(Loc) è
Sales !" (σ1(Time) X σ2(Cust) X
σ3(Loc))
Usually only part of the dim tables
because of the selection predicates
Chap 4.4 in [JPT10]
57
Q4: Coarse-grain Aggregations
n “Find total sales per customer type in our stores in Europe”
n Join-index will prune ¾ of the data (uniform sales),
but the remaining ¼ is still large (several millions
transactions)
n Index is unclustered
n High-level aggregations are expensive!!!!! region
country
state
city
store
LOCATON
ÞLong Query Response Times
ÞPre-computation is necessary
ÞPre-computation is most beneficial
Cuboids = GROUP BYs
n Multidimensional aggregation = selection on
corresponding cuboid
n Materialize some/all of the cuboids
n A complex decision involving cuboid sizes,
query workload, and physical organization
58
GB(type, city)( Sales !" σ1(Time) !" σ2(Cust) !" σ3(Loc)) )
GB(type, city)( σ1’2’3’(Cuboid(Year, Type, City)) )
σ1 selects some Years, σ2 selects some Brands,
σ3 selects some Cities,
Two Issues
n How to store the materialized cuboids?
n How to compute the cuboids efficiently?
59
60
Store Product_key sum(amout)
1 1 454
1 4 925
2 1 468
2 2 800
3 1 296
3 3 240
4 1 625
4 3 240
4 4 745
1 ALL 1379
2 ALL 1268
3 ALL 536
4 ALL 1937
ALL 1 1870
ALL 2 800
ALL 3 780
ALL 4 1670
ALL ALL 5120
CUBE BY in ROLAP
Product
Sales
1 2 3 4 ALL
1 454 - - 925 1379
2 468 800 - - 1268
3 296 - 240 - 536
4 652 - 540 745 1937
St
or
e
ALL 1870 800 780 1670 5120

SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store
4 Group-bys here:
(store,product)
(store)
(product)
()
• Need to write
4 queries!!!
• Compute them
independently
61
Top-down Approach
n Model dependencies among the aggregates:
most detailed “view”
can be computed from view
(product,store,quarter) by
summing-up all quarterly sales
product,store,quarter
product storequarter
none
store,quarterproduct,quarter product, store
A B M
1 1 10
1 3 20
1 2 30
1 1 40
2 4 50
Cuboid AB
A B M
1 (all) ???
2 (all) ???
Cuboid A
A B M
(all) 1 ???
(all) 2 ???
(all) 3 ???
(all) 4 ???
Cuboid B
62
Bottom-Up Approach (BUC)
n BUC (Beyer & Ramakrishnan,
SIGMOD’99)
n Ideas
n Compute the cube from
bottom up
n Divide-and-conquer
n A simpler recursive version:
n BUC-SR
A B …
1 1 …
1 3 …
1 2 …
1 1 …
2 … …
…
Understanding Recursion /1
n Powerful computing/problem-solving
techniques
n Examples
n Factorial:
n f(n) = 1, if n = 1
n f(n) = f(n-1) * n, if n ≥ 1
n Quick sort:
n Sort([x]) = [x]
n Sort([x1, …, pivot, … xn]) = sort[ys] ++ sort[zs]), where
ys = [ x | x in xi, x ≤ pivot ]
zs = [ x | x <- xi, x > pivot ]
63
f(0) = 0! =
???
List comprehension
in Haskell or
python
Understanding Recursion /2
n Let C(n, m) be the number of ways to select m balls from n
numbered balls
n Show that C(n, m) = C(n-1, m-1) + C(n-1, m)
n Example: m = 3, n = 5
n Consider any ball in the 5 balls, e.g., ‘d’
64
entire solution space
?1 ?2 ?3
d ?1 ?2 ?1 (no d) ?2 (no d) ?3 (no d)
a b c d e?i in { }
Key Points
n Sub-problems need to be “smaller”, so that a
simple/trivial boundary case can be reached
n Divide-and-conquer
n There may be multiple ways the entire solution
space can be divided into disjoint sub-spaces,
each of which can be conquered recursively.
65
n Reduce Cube(in 2D) to Cube(in 1D)
66
Geometric Intuition /1
M11 M12 M13 [Step 1]
M21 M22 M23 [Step 1]
[Step 2] [Step 2] [Step 2] [Step 3]
a1
a2
b1 b2 b3
M11 M12 M13 [Step 1]
b1 b2 b3
M21 M22 M23 [Step 1]
[Step 2] [Step 2] [Step 2] [Step 3]
[a1] ⨉
[a2] ⨉
[*] ⨉
*
n Reduce Cube(in
3D) to Cube(in 2D)
67
Geometric Intuition /2
A
B
C
A
B
C
A
B
C
n Reduce Cube(in
3D) to Cube(in 2D)
68
Geometric
Intuition /3
A
B
C
A
B
C
A
B
C
Algebraic Derivation
n How to compute n-dim cube on (n+1)-dim base
cuboid (array)?
n What do output tuples look like?
n How to compute (n+1)-dim cube on (n+1)-dim
base cuboid (array)?
n What else do we need?
69
[{r1-r5}, BC]
r1
r2
r3
r4
r5
A B C M
1 1 1 10
1 1 2 20
1 2 1 30
1 3 1 40
2 1 1 50
[{r1-r5}, ABC]
BUC-SR (Simple Recursion)*
n BUC-SR(data, dims)
n If (dims is empty)
n Output (sum(data))
n Else
n Dims = [dim1, rest_of_dims]
n For each distinct value v of dim1
n slice_v = slice of data on “dim1 = v”
n BUC-SR(slice_v, rest_of_dims)
n data’ = Project(data, rest_of_dims)
n BUC-SR(data’, rest_of_dims)
70
Boundary case:
data is essentially a list
of measure values
General case:
1)Slice on dim1. Call
BUC-SR recursively for
each slice
2)Project out dim1, and
call BUC-SR on it
recursively
You need to fix the incomplete output part of the algorithm!
Example
71
[{r1-r5}, AB]
[{r1-r4}, B]
r1
r2
r3
r4
r5
B M
1 10
1 20
2 30
3 40
B M
1 30
2 30
3 40
* 100
A B M
1 1 30
1 2 30
1 3 40
1 * 100
[{r5}, B]
B M
1 50
B M
1 50
* 50
A B M
2 1 50
2 * 50
[{r1’-r5’}, B]
B M
1 80
2 30
3 40
B M
1 80
2 30
3 40
* 150
A B M
* 1 80
* 2 30
* 3 40
* * 150
A B M
1 1 10
1 1 20
1 2 30
1 3 40
2 1 50
1 2 3 *
1 30 30 40 100
2 50 50
* 80 30 40 150
OutputInternal OutputInput
Try a 3D-Cube by Yourself
7219/2/20
[{r1-r5}, ABC]
r1
r2
r3
r4
r5
A B C M
1 1 1 10
1 1 2 20
1 2 1 30
1 3 1 40
2 1 1 50
MOLAP
n (Sparse) array-based multidimensional storage
engine
n Pros:
n small size (esp. for dense cubes)
n fast in indexing and query processing
n Cons:
n scalability
n conversion from relational data
73
74
Multidimensional Array
time item dollars_sold
Q1
home
entertainment 605
Q2
home
entertainment 680
Q3
home
entertainment 812
Q4
home
entertainment 927
Q1 computer 825
Q2 computer 952
Q3 computer 1023
Q4 computer 1038
Q1 phone 14
Q2 phone 31
Q3 phone 30
Q4 phone 38
Q1 security 400
Q2 security 512
Q3 security 501
Q4 security 580
Mappings
time value
Q1 0
Q2 1
Q3 2
Q4 3
item value
home
entertain
ment 0
computer 1
phone 2
security 3
time item
dollars_s
old offset
0 0 605 0
1 0 680 4
2 0 812 8
3 0 927 12
0 1 825 1
1 1 952 5
2 1 1023 9
3 1 1038 13
0 2 14 2
1 2 31 6
2 2 30 10
3 2 38 14
0 3 400 3
1 3 512 7
2 3 501 11
3 3 580 15
f(time, item) = 4*time + item
Step 1
Step 2: An injective map from cell to offset
75
Multidimensional Array
offset dollars_sold
0 605
1 825
2 14
3 400
4 680
5 952
6 31
7 512
8 812
9 1023
10 30
11 501
12 927
13 1038
14 38
15 580
Dense MD array
605
825
14
400
680
952
31
512
812
1023
30
501
927
1038
38
580
n Think: how to decode a slot?
Step 3: If dense, only need to store sorted slots
76
The Sparse Case
time item dollars_sold
Q1
home
entertainment 605
Q2
home
entertainment 680
Q3
home
entertainment 812
Q4
home
entertainment 927
Q1 computer 825
Q2 computer 952
Q3 computer 1023
Q4 computer 1038
Q1 phone 14
Q2 phone 31
Q3 phone 30
Q4 phone 38
Q1 security 400
Q2 security 512
Q3 security 501
Q4 security 580
Mappings
time value
Q1 0
Q2 1
Q3 2
Q4 3
item value
home
entertain
ment 0
computer 1
phone 2
security 3
time item
dollars_s
old offset
0 0 605 0
1 0 680 4
2 0 812 8
3 0 927 12
0 1 825 1
1 1 952 5
2 1 1023 9
3 1 1038 13
0 2 14 2
1 2 31 6
2 2 30 10
3 2 38 14
0 3 400 3
1 3 512 7
2 3 501 11
3 3 580 15
f(time, item) = 4*time + item
Step 1
Step 2: An injective map from cell to offset
/* same table but with
many rows deleted to
make it sparse */
/* same table but with
many rows deleted to
make it sparse */
77
Multidimensional Array
offset dollars_sold
0 605
15 580
Dense MD array
605
-
-
-
-
-
-
-
-
-
-
-
-
-
-
580
n Think: how to decode a slot?
n Multidimensional array is
typically sparse
n Use sparse array (i.e.,
offset + value)
n Could use chunk to
further reduce the space
n Space usage:
n (d+1)*n*4 vs 2*n*4
n HOLAP:
n Store all non-base cuboid
in MD array
n Assign a value for ALL
Choice 2Choice 1