NoSQL
COMP9311 24T2; Week 9
By Zhengyi Yang, UNSW
Notice
- Additional consultation will be scheduled before the final exam
- Project 1: marks and sample solution released
- Assignment 2 is due at 10pm this Sunday
2
MyExperience Survey
➢ The UNSW MyExperience survey is open, participation is highly encouraged.
➢ “Please participate in the myExperience Survey and take the opportunity to
share your constructive thoughts on your learning experience. Your contributions
help your teachers and shape the future of education at UNSW.”
➢ You can access the survey by logging into Moodle or accessing
https://myexperience.unsw.edu.au directly.
➢ More information: https://www.student.unsw.edu.au/myexperience
3
Guest Lecture Next Week
• We have an exciting guest lecture scheduled for next Tuesday.
• Speaker: Google Cloud
• Free pizza and drinks will be provided during the break
4
Acknowledgements
Some parts of this slides are adopted from
- “Database System Concepts” - 7th Edition
- Slides by Prof. George Kollios (Boston University)
- Slides by Asst Prof. Risa B. Myers (Rice University)
- Slides by Dr. David Novak (Masaryk University)
- Slides by Prof. Ying Zhang (UTS)
- Slides by Dr. Longbin Lai (UNSW)
- Neo4j Educator Resources
- Slides by myself at Rust Meetup Sydeny 2020
5
NoSQL is Hot!
NoSQL stands for “not only SQL”:
➢ Non-tabular databases and store
data differently than relational
tables
➢ Firstly proposed in 2009
6
Big Data
▪ Very large volumes of data being collected
• Driven by growth of web, social media, and more recently internet-of-
things
• Web logs were an early source of data
▪ Analytics on web logs has great value for advertisements, web site
structuring, what posts to show to a user, etc
▪ Big Data: differentiated from data handled by earlier generation databases
• Volume: much larger amounts of data stored
• Velocity: much higher rates of insertions
• Variety: many types of data, beyond relational data
7
Some Old Numbers
• Facebook:
▪ 130TB/day: user logs
▪ 200-400TB/day: 83 million pictures
• Google: > 25 PB/day processed data
• Gene sequencing: 100M kilobases
per day per machine
▪ Sequence 1 human cell costs Illumina $1k
▪ Sequence 1 cell for every infant by 2015?
▪ 10 trillion cells / human body
• Total data created in 2010: 1.ZettaByte (1,000,000 PB)/year
▪ ~60% increase every year
8
Big Data is not only Databases
Big data is more about data analytics and on-line querying
Many components:
• Storage systems
• Database systems
• Data mining and statistical algorithms
• Visualization
9
Features of RDBMS
• Data stored in columns and tables
• Relationships represented by data
• Data Manipulation Language
• Data Definition Language
• Transactions (ACID)
• Abstraction from physical layer
• Applications specify what, not how
• Physical layer can change without modifying applications
10
The Value of Relational Databases
• A (mostly) standard data model
• Many well developed technologies
➢ physical organization of the data, search indexes, query optimization, search
operator implementations
• Good concurrency control (ACID)
➢ transactions: atomicity, consistency, isolation, durability
• Many reliable integration mechanisms
➢ “shared database integration” of applications
• Well-established: familiar, mature, support,...
11
Data Management: Trends & Requirements
Trends Requirements
• Volume of data • Real database scalability
massive
➢ database distribution
• Cloud comp. (IaaS)
➢ dynamic resource management
• Velocity of data
➢ horizontally scaling systems
• Big users
• Frequent update operations
• Variety of data • Massive read throughput
• Flexible database schema
➢ semi-structured data
12
RDBMS for Big Data
● relational schema
● but current data are
○ data in tuples naturally flexible
○ a priori known schema
● schema normalization ● inefficient for large data
● slow in distributed
○ data split into tables (3NF)
environment
○ queries merge the data
● transaction support ● full transactions very
inefficient in distributed
○ trans. management with ACID
enviroment.
○ Atomicity, Consistency, Isolation, Durability
○ safety first
13
CAP Theorem
At most two of the following three can be
maximized at one time
• Consistency
• Each client has the same view of the data
• Availability
• Each client can always read and write
• Partition Tolerance
• System works well across distributed physical networks
14
Summary: Querying Big Data
▪ Transaction processing systems that need very high scalability
• Many applications willing to sacrifice ACID properties and other
database features, if they can get very high scalability
▪ Query processing systems that
• Need very high scalability, and
• Need to support non-relation data
15
Processing Data - Terms
● OLTP: Online Transaction Processing (DBMSs)
○ Database applications
○ Storing, querying, multi-user access
● OLAP: Online Analytical Processing (Warehousing)
○ Answer multi-dimensional analytical queries
○ Financial/marketing reporting, budgeting,
forecasting, …
● HTAP: Hybrid Transaction/Analytical Processing
16
NoSQL Databases
● NoSQL: Database technologies that are (mostly):
○ Not using the relational model (nor the SQL language)
○ Designed to run on large clusters (horizontally scalable)
○ No schema - fields can be freely added to any record
○ Open source
○ Based on the needs of the current big data era
● Other characteristics (often true):
○ easy replication support (fault-tolerance, query efficiency)
○ Simple API
○ Eventually consistent (not ACID)
17
Just Another Temporary Trend?
● There have been other trends here before
○ object databases, XML databases, etc.
● But NoSQL databases:
○ are answer to real practical problems big companies have
○ are often developed by the biggest players
○ outside academia but based on solid theoretical results
■ e.g. old results on distributed processing
○ widely used
18
The End of RDBMS?
● Relational databases are not going away
○ are ideal for a lot of structured data, reliable, mature, etc.
● RDBMS became one option for data storage
Using different data stores in different circumstances
Two trends:
1. NoSQL databases implement standard RDBMS features
2. RDBMS are adopting NoSQL principles
19
NoSQL Properties
1. Flexible scalability
○ horizontal scalability instead of vertical
2. Dynamic schema of data
○ different levels of flexibility for different types of DB
3. Efficient reading
○ spend more time storing the data, but read fast
○ keep relevant information together
4. Cost saving
○ designed to run on commodity hardware
○ typically open-source (with a support from a company)
20
NoSQL Databases
● Key-value stores
● Document databases
● Column-family stores
● Graph databases
21
NoSQL Databases by Data Models
22
Key-Value Stores
● Come from a research paper by Amazon (Dynamo)
○ Global Distributed Hash Table (Key-Value Stores)
● A simple hash table (map), primarily used when all
accesses to the database are via primary key
○ key-value mapping
● In RDBMS world: A table with two columns:
○ ID column (primary key)
○ DATA column storing the value (unstructured binary large object)
23
Key-Value Stores - Operations
▪ Key-value stores support
• put(key, value): used to store values with an associated key,
• get(key): which retrieves the stored value associated with the specified
key
• delete(key): remove the key and its associated value
• scan(from, to): range scan
▪ Some systems also support range queries on key values
24
Key-Value Stores - Architecture
1. Embedded systems
a. the system is a library and the DB runs within your
system
2. Large-scale Distributed stores
a. distributed hash table (DHT)
25
Key-Value Stores
● Why?
○ Simple Data Model: Hash Table is well-studied
○ Good Scalability: Small System Cost, via good look-up locality and
caching
● Why not?
○ Poor to complex (interconnected) data
26
Key-Value Stores - Vendors
Project
Voldemort
Ranked list: http://db-engines.com/en/ranking/key-value+store
27
Document Stores
● Basic concept of data: Document
● Documents are self-describing pieces of data
○ Hierarchical tree data structures
○ Nested associative arrays (maps), collections, scalars
○ XML, JSON (JavaScript Object Notation), BSON, …
● Documents in a collection should be “similar”
○ Their schema can differ
● Documents stored in the value part of key-value
○ Key-value stores where the values are examinable
○ Building search indexes on various keys/fields
28
Document Stores - Example
key=3 -> { "personID": 3,
"firstname": "Martin",
"likes": [ "Biking","Photography" ],
"lastcity": "Boston",
"visited": [ "NYC", "Paris" ] }
key=5 -> { "personID": 5,
"firstname": "Pramod",
"citiesvisited": [ "Chicago", "London","NYC" ],
"addresses": [
{ "state": "AK",
"city": "DILLINGHAM" },
{ "state": "MH",
"city": "PUNE" } ],
"lastcity": "Chicago“ }
29
MongoDB
- humongous => Mongo
- Data are organized in collections. A collection stores a set of
documents.
- Collection like table and document like record
- but: each document can have a different set of attributes
even in the same collection
- Semi-structured schema!
- Only requirement: every document should have an “_id” field
30
Example MongoDB
{ "_id”:ObjectId("4efa8d2b7d284dad101e4bc9"),
"Last Name": ” Cousteau",
"First Name": ” Jacques-Yves",
"Date of Birth": ”06-1-1910" },
{ "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
"Last Name": "PELLERIN",
"First Name": "Franck",
"Date of Birth": "09-19-1983",
"Address": "1 chemin des Loges",
"City": "VERSAILLES" }
31
MongoDB - Features
● JSON-style documents
○ actually uses BSON (JSON's binary format)
● replication for high availability
● auto-sharding for scalability
● document-based queries
● can create an index on any attribute
● for faster reads
32
MongoDB vs RDBMS
RDBMS MongoDB Equivalent
database database
table collection
row document
attributes fields (field-name:value pairs)
primary key the ‘_id’ field, which is the key
associated with the document
33
Relationships in MongoDB
Two options:
1. store references to other documents using their _id
values
2. embed documents within other documents
34
Example
Her e is an example of embedded relationship:
{
"_id":ObjectId("52ffc33cd85242f436000001"),
"contact": "987654321",
And here an example of reference based
"dob": "01-01-1991",
"name": "Tom Benzamin",
{
"address": [
{ "_id":ObjectId("52ffc33cd85242f436000001"),
"building": "22 A, Indiana Apt", "contact": "987654321",
"pincode": 123456, "dob": "01-01-1991",
"city": "Los Angeles",
"name": "Tom Benzamin",
"state": "California"
"address_ids": [
},
ObjectId("52ffc4a5d85242602e000000"),
{
ObjectId("52ffc4a5d85242602e000001")
"building": "170 A, Acropolis Apt",
"pincode": 456789, ]
"city": "Chicago", }
"state": "Illinois"
}
]
}
35
MongoDB - Queries
● Query language expressed via JSON
● clauses: where, sort, count, sum, etc.
SQL: SELECT * FROM users
MongoDB: db.users.find()
-
SELECT * FROM users WHERE personID = 3
-
db.users.find( { "personID": 3 } )
-
SELECT firstname, lastcity FROM users WHERE personID = 5
-
db.users.find( { "personID": 5}, {firstname:1, lastcity:1} )
36
Document Stores - Vendors
MS Azure
DocumentDB
Ranked list: http://db-engines.com/en/ranking/document+store
37
Column-Family Stores
● Origin from Google’s BigTable
● Also known as wide-column or columnar
● Data model: rows that have many columns associated
with a row key
● Column families are groups of related data (columns) that
are often accessed together
38
Column-Family Stores - Main Idea
- Each table tends to have many attributes (from thousands ~ millions)
- In most applications (in OLAP) we are only interested in a few
attributes
- Traditional raw-based
- Store each record in a sequential file
- We need to read the whole record to access only one attribute
- Column-based
- Store the data by putting the same attribute in a sequential file
- Faster access and better compression
39
Column-Family Stores - Example 1
40
Column-Family Stores - Example 2
For a customer we typically access all
profile information at the same time, but
not customer’s orders.
41
BigTable
●
Google’s BigTable
○ Drives MapReduce, and the following: Apache Hadoop, Hadoop File System (HDFS), HBase,
Apache Cassandra
●
2008: Google published the Bigtable Paper
○ “BigTable = sparse, distributed, persistent, multi-dimensional sorted map indexed by (row_key,
column_key, timestamp)”
column names
“contents:html” “param:lang” “param:enc” “a:cnnsi.com” “a:ihned.cz”
t
2
t ...
6 t t t t
t ... 2 2 3 7
8
”com.ccn.www” ... EN UTF-8 CNN.com CNN
column column column column column
column family
row key row
42
Column-Family Stores
● Why?
○ Optimized for OLAP
○ Semi-Structured Data: Each column can define its own schema
● Why not?
○ Not good for
■ OLTP
■ Incremental Data Loading
■ Row-specific Queries
43
Column-Family Stores - Vendors
Ranked list: http://db-engines.com/en/ranking/wide+column+store
44
Graph Data Structure
- A graph is a structure in mathematics (graph theory)
- Famous problem: Seven Bridges of Königsberg
- Optimised for handling highly connected data
Edge
Vertex
45
Graph Database
Data Model
- Vertices (Nodes) -> Entities
- Edges -> Relations
Are we going to learn ER model again?
46
Graphs are Everywhere!
Road Networks
Internet
Social Networks
Knowledge Graphs
Biological Networks
47
Graphs are Large!
#Edges Ratio
#Vetices Ratio #Bytes Ratio >1 trillion
<10K 17.8% connections
<10K 17.3% <100MB 19.0%
10K-100K 17.1%
10K-100K 17.3% 100MB-1G 15.7%
100K-1M 10.1% >60 trillion
100K-1M 15.0% 1G-10G 20.7% URLs
1M-10M 6.9%
1M-10M 13.4% 10G-100G 14.1%
10M-100M 16.3% >60 billion
10M-100M 15.7% 100G-1T 16.5% edges every 30
100M-1B 16.3% days
>100M 21.3% >1T 14.0%
>1B 15.5%
Small to medium sized companies
48
Information vs Knowledge
49
History
50
Graph DBMS Landscape
DBMS popularity trend by database model
The graph database landscape in 2019
between 2013 and 2019 – DB-Engine
51
Advantages
• Performance
❑ Traditional Joins are inefficient
❑ Billion-scale data are common, e.g., Facebook social network , Google web graph
• Flexibility
❑ Real-world entities may not have a fixed schema. It is not feasible to design 1000
attributes for a table.
❑ Relationships among entities can be arbitrary. It is not feasible to use 1000 tables to
model 1000 types of relationships.
• Agility
❑ Business requirements changes over time
❑ Today’s development practices are agile, test-driven
52
Applications
- Social Network: Facebook, Twitter, LinkedIn …
- E-commerce: eBay, Amazon, Alibaba …
- Banking: JPMorgan, Citi, UBS …
- Telecom: Verizon, Orange, AT&T …
- IoT: nest
- Search Engine: Google
- Navigation: Google Maps
- Bioinformatics: DNAnexue
- …
53
Graph Technology Landspace
54
Neo4j
- The most popular Graph Database at present
- Cypher query language
- Developed in Java and open-source
- Resources:
- Neo4j Cypher Manual
- Neo4j Developer Resources
55
Property Graph Model
● Nodes
(Entities)
● Relationships
● Properties
● Labels
56
Graph DB Data Modeling
- Flexible & adaptive schema compared to RDBMs
- What you sketch = what you store in the database
57
Neo4j vs Relational Model
• Retains ACID transaction properties
• Foreign keys not necessary as they are
represented as relationships
• Relationships are stored/represented per relational
record/row instead of per table
58
Neo4j vs Relational Model - Example
Student Enrolls
netId FirstName netId Course
abc1 Albert abc1 COMP 430
def2 Danielle def2 COMP 430
ghi3 Gary abc1 COMP 431
stu7 Sandeep
yz10 Yusin
Majors
netId Major
ghi3 STAT
ghi3 COMP
abc1 COMP
def2 ECE
stu7 STAT
59
Property or Node?
• Dept for a course could be a property of course or a
node
• How to decide? It depends
• Standardization of terms
• How often will it be updated?
• Searchability
• Readability
• Reusability
60
MongoDB vs Neo4j
61
Choosing a DBMS
• Efficient data storage
• Structure (e.g. networks)
• Sparsity
• Performance
• Types of queries / analysis
• Need for visualization
• Built in algorithms or tools
• Reads vs. Writes
• Quantity of data
• Importance
• Objects
• Relationships
62
When to use a graph database?
●
If…
○
Your data has many M-M relationships
Relational (A) Graph (B)
○
Care about referential integrity
Relational Graph
○
You highly value the relationships of the data
(especially when you may consider the relationship to be more important than the elements
themselves)
Relational Graph
○
…case by case question
63
Which to use?
A Relational
1. …if you were given a set of well-structured
data
B NoSQL - Mongo
2. …if you are scraping data from the Internet
3. …if you are storing tax info C NoSQL – Neo4j
4. …if the dataset is extremely large
D It depends
5. …if you are trying to build a friend network
64
Nodes and Properties
●N ode
○
Typically used to represent entities
○
Has zero or more relationships
○
Each node is an instance of an entity
●Property
○
Key/value pairs
○
Belong to nodes and relationships
Type: student Type: fruit
Name: Jane Name:
Doe Pear
65
Property Types
Number (Integer and Float specifically)
●
String
●
Boolean
●
Temporal type - Date, Time, LocalTime, DateTime,
●
LocalDateTime and Duration
66
Relationships
● Edge that connects nodes
● Relationships must have a direction and a type
● Can have properties
67
Questions
1. What are some examples of relationships that are directional?
2. What are some examples of relationships that are non-
directional?
68
Questions
1. What are some examples of relationships that are directional?
○Think of Twitter
2. What are some examples of relationships that are non-
directional?
○Thinks of Facebook
69
Labels
●Grouping mechanism for nodes
●Used to define constraints and/or indexes
●Faster lookup compared to checking a property
Label
Student
netId: abc1
70
T/F Questions
1. There's a 1-1 mapping from the relational model to a
graph database
2. Each node must have the same properties as all other
nodes
3. Relationships cannot have properties
71
Cypher Query Language
An example of Cypher:
Find Sushi restaurants in New York
that Philip’s friends like
72
Cypher Introduction
● Declarative graph query language
● Shares many keywords and query structures with SQL
● Comments can be added with “//” or "/* */"
● Case insensitive except for
○ Labels
○ Property keys
○ Relationship types
73
Cypher Patterns
● Used to describe the shape of what you are looking for
● Node: ( )
● Relationship: -, ->, <-
● Relationship identifier: [ ]
● Labels :
● Variables: n, node, foo
● Not only for querying, also for creating new nodes,
relations etc.
74
Cypher Patterns 1
●Any directional relationship
MATCH (n:Student)-->(m:Student)
RETURN n,m;
Students with a relation with another
Student
n and m are variables
Student is a label
75
Cypher Patterns 2
Specific relationships
●
MATCH (n:Student)-[:MENTORS]->(m:Student)
RETURN DISTINCT n.netId;
netIds of students who mentor other students
● Note the text output
● Can return properties
76
Cypher Patterns 3
Nodes with different labels
●
MATCH (n:Student)-[:ENROLL]->(m:Course)
RETURN n, m;
Students enrolled in courses
77
Cypher clauses – MATCH/WHERE
Find pattern (MATCH), then filter results (WHERE)
●
○
WHERE is part of MATCH-WHERE clause
○
can be replaced with OPTIONAL MATCH, WITH for future constraints
Find node with firstname ‘Albert’
●
MATCH (n {name: 'Albert'})
RETURN n;
Also
●
MATCH (n)
WHERE n.name = 'Albert'
RETURN n;
78
Cypher clauses – RETURN
RETURN is equivalent to SELECT in SQL, it returns the
●
specified nodes or properties
RETURN netIds of students with relationships with other
●
students
MATCH (a:Student)-->(otherNode:Student)
RETURN a.netId, otherNode.netId;
79
Types of Graph Queries
- Graph Pattern Matching
- Given a graph pattern, find subgraphs in the database graph that match the query.
- Can be augmented with other (relational-like) features, such as projection.
MATCH (p:Person)-[:LIKES]->(:Language {name = "SQL"})
RETURN p.name
- Graph Navigation
- A flexible querying mechanism to navigate the topology of the data.
- Called path queries, since they require to navigate using paths (potentially
variable length).
MATCH (p:Person)-[:KNOWS*1..2]->(:Person {name = "Alice"})
RETURN p.name
80
More Cypher clauses
- CREATE
- DELETE
- SET
- ORDER BY
- LIMIT
- WITH
- Aggregations:
- COUNT
- COLLECT
- SUM
- …
- …
81
Graph Algorithms
● The real power of graph databases
● Can save huge amounts of programming effort
● Include
○ Centrality - node importance
○ Community detection – node connectivity and partitions
○ Path finding – routes through the network
○ Similarity – of nodes
○ Link prediction – closeness of nodes
● https://neo4j.com/docs/graph-data-science/current/
82
Neo4j
More resources can be found on
neo4j.com
83
Related Courses at UNSW CSE
- COMP9312: Data Analytics for Graphs
- COMP9313: Big Data Management
84
Learning Outcome
NoSQL vs RDBMS
-
Data Models
-
- Key-value
- Document
- Column-family
- Graph
85