COMP20008 - Revision notes from semester 2, 2019
Elements of Data Processing (University of Melbourne)
COMP20008
ELEMENTS OF DATA PROCESSING

LECTURE 1 – DATA WRANGLING

What is data wrangling?
• The process of organizing, converting and mapping data from one format into another

What activities does it encompass?
• Data integration, enrichment, aggregation, structuring, storage, visualization and publishing

Why is it useful?
• Raw data often contains wrong and missing values and is therefore not directly useful
• Connections are needed between datasets before analysis can be done (data wrangling connects datasets)

Why is it challenging?
• It requires reformatting and validating data (i.e. transformation), and the steps and methods needed to do
this are difficult to specify and evaluate

Why is it done?
• Make data more accessible
• Have better insights about data



LECTURE 2 – DATA FORMATS

What are relational databases and examples
• Database that provides access to datapoints that relate to one another (based on the idea of tables)
• Examples: MySQL, Oracle database

The role that relational databases play in data wrangling
• Highly structured -> easier to analyze, query, store, clean, process, and maintain consistency and security
(most importantly with many users)
• Data in a relational database is easy to wrangle, but it is hard to get the information into the database in
the first place

Regular expression
• For finding patterns in unstructured data (i.e. text)
• Metacharacters: special characters that describe patterns rather than matching literally; common ones
include . ^ $ * + ? { } [ ] ( ) | \
• Regular expression for email
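The exact pattern from the slides is not reproduced here; below is a minimal, simplified Python sketch of an email-matching regular expression for illustration only.

import re

# Simplified, illustrative email pattern: characters, an '@', a domain,
# and at least one dot-separated suffix (real-world email validation is messier)
email_pattern = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

for candidate in ["alice@example.com", "bob@unimelb.edu.au", "not-an-email"]:
    print(candidate, bool(email_pattern.match(candidate)))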


CSV? Spreadsheet? Differences?
• CSV – Comma separated values
• Spreadsheet – electronic documents in which data is arranged in rows and columns to show
relationships and can be manipulated and used in calculation
• Differences: CSV is human readable
• CSV lacks the formatting information of an xls file
• Lacks the integrity of relational databases

Motivation for XML and JSON
• XML allows complex schema definitions (via regular expressions)
o Allows formal validation
o Consider the data design more closely
• JSON is more streamlined, lightweight and compressed
o Appeals to programmers who look for speed and efficiency
o Used to store data in NoSQL databases

Differences between XML and HTML

HTML
• Tags are predefined
• Limited number of tags
• Case insensitive
• Designed purely for presentation
• Used for web browsers
• Error insensitive

XML
• Tags are user defined
• Unlimited number of tags
• Case sensitive
• Designed for description of data
• Used for data exchange and integration
• Error sensitive
XML attributes vs. elements and when to use which
• Attributes are designed to contain data related to specific element or provide additional information
about the elements
• Elements are basic building blocks for XML, can behave as containers to hold information
• e.g. <campus location="Parkville"/> (campus – element, location – attribute)
• Attribute is something that is self-contained (i.e. color, ID, name, etc)
• Element is something that does or could have attributes of its own or contain other elements

Well-formed and valid XML
• Well-formed – concerns syntax: the XML file contains no syntax errors (e.g. no unclosed tags)
• Valid – concerns semantics: an XML document is valid if it validates against a DTD (document type
definition), which defines the structure and the legal elements and attributes of an XML document

Differences between XML and JSON

XML
• Allows complex schema definitions (via regular expressions) and allows formal validation
• A schema tells what your XML file should look like

JSON
• Does not have tags; uses brackets to structure data instead
• Smaller, more compact and lightweight (designed to speed up the interaction)
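To make the comparison concrete, here is a minimal Python sketch writing the same hypothetical record (field names invented for illustration) in both formats using the standard library.

import json
import xml.etree.ElementTree as ET

# Hypothetical record, for illustration only
record = {"name": "COMP20008", "campus": "Parkville", "credit_points": 12.5}

# JSON: braces/brackets structure the data, no closing tags
print(json.dumps(record, indent=2))

# XML: the same data as a tree of user-defined tags, with one attribute
subject = ET.Element("subject", attrib={"campus": record["campus"]})
ET.SubElement(subject, "name").text = record["name"]
ET.SubElement(subject, "credit_points").text = str(record["credit_points"])
print(ET.tostring(subject, encoding="unicode"))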
LECTURE 3 – DATA CLEANING

Motivation behind data preprocessing and data cleaning
• To improve the data quality from these following perspectives:
o Accuracy
o Completeness
o Consistency
o Timeliness
o Believability
o Interpretability
Details about each step of data pre-processing
• Data cleaning: transform raw data to useful data
• Data integration: combine similar data from different datasets
• Data reduction: remove unwanted and redundant data
• Data transformation: transform or normalize data

Terminologies
• Features: the property of the data (the columns of the dataset)
• Attributes: same as features
• Instances: data items, normally represent the number of rows
• Object: same as instances

Differences between discrete/categorical and continuous
• Categorical data is descriptive data
• Discrete and continuous are both numerical data but discrete represents individually separable data

Data Cleaning:
• Types of data that need to be cleaned:
o Noisy data
▪ Truncated data (cut-off data, e.g. an address that exceeds a length limit)
▪ Incorrect splitting of text between cells
▪ Values that don’t really exist (e.g. salary = “-5”)
o Inconsistent data:
o Disguised data (ie [email protected])
o Missing and incomplete data

• Missing values (random/not random) and why?
o Random – some people just don’t know their IQ so cannot fill in the dataset
o Not random – people with low IQ are uncomfortable sharing their IQ (which therefore skews the data)
o Why?
▪ Malfunction of equipment (ie. sensor)
▪ Not recorded because of misunderstanding
▪ Deliberate
▪ Think that it was not important at the time

• Strategies to clean missing values
o Case deletion (delete all instances with missing values)
▪ Easy to implement and analyze
▪ Produces bias if the sample size is small or if the missing values reflect some structure
in the data
o Manual Correct:
▪ Fill in missing values using expert knowledge
▪ Time consuming
o Imputation:
▪ Replace with substitute values (mean, median)
▪ Simple
▪ Won’t break application
▪ Limited utility for analysis as:
• Reduced variation (with mean)
• Incorrect views of distribution
• Changes relationships with other features (these are drawbacks of mean imputation;
categorical imputation can be used instead)
o Categorical Imputation:
▪ Might become more precise
▪ Similar drawbacks with normal imputation
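A minimal pandas sketch of the strategies above (toy dataframe invented for illustration): case deletion, mean imputation for a numeric feature and mode imputation for a categorical feature.

import numpy as np
import pandas as pd

# Toy data with missing values, for illustration only
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["Melbourne", None, "Sydney", "Melbourne"]})

deleted = df.dropna()                                  # case deletion
df["age"] = df["age"].fillna(df["age"].mean())         # mean imputation (reduces variation)
df["city"] = df["city"].fillna(df["city"].mode()[0])   # mode (categorical) imputation

print(deleted)
print(df)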

• Inconsistent data (outliers)
o What is an outlier?
▪ A data object that deviates significantly from the normal objects as if it were
generated by a different mechanism (which can skew the data or bias any analysis
performed on the dataset)
o Applications
▪ Credit card fault detection
▪ Medical analysis
▪ Sports (doping)
o Difference between global outlier and contextual outlier
▪ Global: significantly deviate from the rest of the data
▪ Contextual: significantly deviate from the given context; for example, 5 degrees would be an
outlier for human body temperature
o Strategies to detect outliers
▪ 1D data:
• Boxplot, histogram, statistical tests
▪ 2D & 3D data:
• Scatter plot and eyeball
▪ >3D data:
• Statistical/algorithmic methods
o Tukey (boxplot) – see the sketch after this list
▪ Outliers: > Q3 + 3·IQR or < Q1 – 3·IQR
▪ Suspect outliers: > Q3 + 1.5·IQR or < Q1 – 1.5·IQR
o Histograms
▪ show clearly the distribution of the dataset
▪ easy to detect by eyeballing
▪ hard to choose appropriate bin size:
• too small: normal objects fall into rare or empty bins => False Positives
• too big: outliers hide inside frequent bins => False Negatives
o Statistical Methods
▪ Model of how we think the data is distributed (scatter plot)
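A minimal sketch of the Tukey rule on a single numeric feature (toy values with one planted outlier), assuming NumPy is available.

import numpy as np

values = np.array([3, 4, 5, 5, 6, 7, 8, 30])   # 30 is a planted outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

suspect = (values > q3 + 1.5 * iqr) | (values < q1 - 1.5 * iqr)
outlier = (values > q3 + 3 * iqr) | (values < q1 - 3 * iqr)

print("suspect outliers:", values[suspect])
print("outliers:", values[outlier])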




LECTURE 4 – DATA VISUALISATION

Motivation for data visualization
• Convert dataset into visual format
• Easier for human to visualize and analyze
• Helps tell a story – communication tool

2D scatter plot
Plot two features against each other; if there are more than 2 dimensions, colour the points by the
extra feature (e.g. class)

Can see correlation





Interpreting data with a histogram
• Use to show distribution of values (single variable)

Interpreting a heat map
• Can be useful when objects are sorted according to class
• Typically, features are normalized to prevent one attribute from dominating the plot
• Find pattern in data
• Visually determine the clustering structure

Parallel Coordinates
• Use to see trends in attributes in high dimensional data
• Each feature is placed along the x-axis and each instance (row) is shown as a line
• Scaling is important (affect visualization, may scale into [0,1] range)
• Downside: the placement of the attributes (features placed adjacent are easier to compare, features that
are far apart are more difficult to compare) => alternate the positions of features
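A minimal sketch of a parallel coordinates plot with pandas/matplotlib (toy data with an invented 'label' column), scaling each feature into [0, 1] first so no attribute dominates.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({"f1": [1.0, 2.0, 10.0, 11.0],
                   "f2": [5.0, 6.0, 1.0, 0.5],
                   "f3": [0.1, 0.2, 0.9, 1.0],
                   "label": ["A", "A", "B", "B"]})

features = ["f1", "f2", "f3"]
# Min-max scale each feature into [0, 1]
df[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())

parallel_coordinates(df, class_column="label", cols=features)
plt.show()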




LECTURE 5 & 6 – CLUSTERING (DATA VISUALISATION)

Why is it useful to perform clustering and what challenges are involved
• Visualizing high-dimensional data
• Clustering helps to gain knowledge on the data and can allow processing separately for separate groups

What are clusters
• Within a cluster: minimize the distance between the points
• Between clusters: maximize the distance from one cluster to another

Distance Metrics
• Euclidean distance
o d(x, y) = √( (x1 – y1)² + (x2 – y2)² + … + (xn – yn)² )
o Problems: if two attributes have vastly different scales, one attribute can dominate the other
=> need to scale the data before applying the distance metric

K-means
• Steps:
1) Select k seed points as initial cluster centres
2) Assign each object to the cluster with the nearest seed point
3) Compute new seed points as the centroids of the clusters of the current partitioning (the centroid is
the center, ie. mean point, of the cluster)
4) Go back to step 2, stop when the assignment does not change
• Weaknesses:
o Different runs of the algorithms will produce different results (local optimum)
o Different number of k can lead to different results
o When the dataset has clusters of varying different sizes and densities
o Non-globular shapes
• Application
o Outlier detection
• How to pick out k?
o Using VAT (inspecting a heat map)
▪ Finds distances between different objects
▪ Reorders the dissimilarity matrix:
1) Pick a point on the edge
2) Find the closest point using Euclidean distance, then keep finding the point
closest to the newly formed group of points
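A minimal sketch of the k-means steps above using scikit-learn (assuming it is available; toy 2-D data with two obvious groups).

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# Seeds are chosen, points assigned to the nearest centroid, centroids recomputed,
# and this repeats until the assignment no longer changes
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("labels:   ", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)
print("SSE:      ", kmeans.inertia_)   # sum of squared distances to centroids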

Hierarchical clustering
• Steps:
1) Compute the proximity matrix
2) Let each data point be a cluster
3) Repeat:
a. Merge the two closest clusters
b. Update the proximity matrix
until only a single cluster remains
• How to define the cluster similarity
o Single linkage:
▪ Minimum distance between two points, one in each cluster

▪ Can handle non-elliptical shapes
▪ Disadvantage: Sensitive to noise and outliers
o Complete linkage
▪ Maximum distance between two points, one in each cluster

▪ Less susceptible to noise and outliers
▪ Limitation: tends to break large clusters
• Advantage:
o Do not need to start off by assuming the value of k
o They can correspond to taxonomies
• Disadvantage:
o Depends on the choice of either single linkage or complete linkage (each one gives different
result)
o No objective function is directly minimized (does not have any measure to see how good the
clustering is supposed to be) – ie. with k-means, a small SSE means good clusters,
hierarchical does not have that
o Once a decision is made to combine two clusters, it cannot be undone
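A minimal SciPy sketch of agglomerative clustering on toy data, comparing single and complete linkage.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

for method in ("single", "complete"):
    Z = linkage(X, method=method)                    # build the merge hierarchy
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(method, "linkage clusters:", labels)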

Dimensionality reduction
• Motivations
i) Reduce amount of time and memory required by data processing algorithms
ii) Allow data to be more easily visualized
iii) Help eliminate irrelevant features or noise
• Concept of dimensionality reduction
o Input:
▪ A dataset with N features and K objects
o Transforming from N dimensions to n < N dimensions
▪ The transformation must preserve the characteristics of the data (e.g. the distances
between points)
• Methods
o Basic: select a subset of the original features (i.e. the features that have the most variation)
o In general: transformation (new features generated)

• Transformation method (PCA – Principal Components Analysis)
o First dimension chosen to capture as much of the variability as possible
o Second feature is orthogonal (perpendicular) to the first (capturing the remaining variability)
o Disadvantage: can lose much information without care
o Purpose:
▪ Allow the visualization of the general trend and identification of the outliers
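A minimal scikit-learn sketch of PCA on toy data with three correlated features, keeping the two components that capture the most variability.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
# Toy data: the second feature is (almost) a multiple of the first
X = np.hstack([base, 2 * base + rng.normal(scale=0.1, size=(50, 1)),
               rng.normal(size=(50, 1))])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)    # first component captures the most variance

print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_2d.shape)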


LECTURE 7 – DATA LINKAGE AND PRIVACY (DATA
INTEGRATION)

What is Data Linkage
• Combining related/equivalent records across data sources
o Information relating to the same entity (e.g. a person or place) is connected
o E.g. two hospitals want to link the same patients
o E.g. Centrelink and the ATO matching their data
▪ If a discrepancy (a surprising/illogical lack of compatibility) is detected => potentially
undeclared income

Challenges:
• How to define similarity
• Scalability
• How to merge/group records

Why matching a database against itself can be regarded as a data linkage task
• Business wishes to carry out an advertising campaign
• The customer database changes over time, people move address, change their names
• Duplicate records about individuals – the business wishes to know if the same person appears more
than once

Why data linkage can be difficult
• If it is not done right, it can cause incorrect detection of discrepancy
• E.g. Centrelink links their data with the ATO to detect false declaration of income, but people put
different names for the same company in the Centrelink database and the ATO database => false
positive cases of discrepancy detection

Data Linkage Pipeline
• Preprocessing: clean the data
• Blocking: represent complex records as simple values (blocks). Only score records that have a simple
value in common
• Scoring: compare two records and assess their similarity
o Using Jaccard similarity or dice similarity
• Matching: match sufficient similar records
• Merging: merge two groups, resolve conflicting attributes

1) Blocking
• Should be efficient
• Trade-off between block sizes: true matches being missed vs computational efficiency
o Can filter out large blocks
o Blocks can overlap but avoid redundancy
o Need to ensure that recall does not suffer
• Challenges
o To determine the size of blocks
▪ When blocks are too large => speed is not significantly improved
▪ When blocks are too small => true matches may fall into different blocks, so accuracy
(recall) drops dramatically
▪ For a good blocking strategy (uniform spread across blocks), accuracy may be
almost as good as a strategy without blocking, but much faster

2) Scoring Similarity
• Create a set of smaller strings from a string with these methods:
o Token/word
o N-words
o N-gram
• Set based similarity (ie. N-gram)
• String based Similarity
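A minimal sketch of set-based scoring with character 2-grams, using the standard Jaccard (|A∩B| / |A∪B|) and Dice (2|A∩B| / (|A|+|B|)) formulas.

def ngrams(s, n=2):
    """Return the set of character n-grams of a string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    A, B = ngrams(a), ngrams(b)
    return len(A & B) / len(A | B)

def dice(a, b):
    A, B = ngrams(a), ngrams(b)
    return 2 * len(A & B) / (len(A) + len(B))

print(jaccard("melbourne", "melburn"), dice("melbourne", "melburn"))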


Privacy
• If data matching is conducted within a single organization, it is not a concern generally
• Problems occur if:
o Matched data is being passed to another organization or being made public
o Data matching is conducted across databases from different organization
• Objective of privacy preserving
o Do data matching without revealing any information about individuals who do not get linked
across the databases
• One-way hashing (Non-invertible hash function)
o Each organization applies a one-way hash function to the attribute used to join the
databases and shares its hashed values with the other organization. Each organization checks
which hashes match; these are the linked records.
o Disadvantages:
▪ A small difference in input results in a completely different output, so only exact
matches can be found
▪ Dictionary attack: an organization could ‘invert’ the hash function by hashing a large
dictionary of likely values and comparing the results

3rd Party Protocol
• Introduce a trusted 3rd party (organization C)
• A and B send hashed values to C, who check for matches
• But what if C is malicious? C could mount a dictionary attack
=> A and B can perform ‘dictionary-attack-resistant’ hashing
• A and B concatenate a secret word (a salt) to every value in their data before hashing. C does not know
what this word is and thus can’t perform a dictionary attack to ‘reverse’ the hash values
• However, this still may not be robust to frequency attacks
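A minimal hashlib sketch of the salted one-way hashing idea (the salt string is invented for illustration and would be known only to A and B).

import hashlib

SECRET_SALT = "shared-secret-between-A-and-B"   # unknown to the third party C

def salted_hash(value):
    """One-way hash of a cleaned value concatenated with the secret salt."""
    data = (value.lower().strip() + SECRET_SALT).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

# A and B hash their join attribute locally and send only the hashes to C
print(salted_hash("John Smith"))
print(salted_hash("john smith"))   # same value after cleaning -> same hash
print(salted_hash("Jon Smith"))    # small difference -> completely different hash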
Frequency Attack
• The 3rd party compares the distribution of hashed values to some known distribution
o E.g. distribution of surname frequencies in a public database versus distribution of hash values
o May be able to guess some hashed values
• A and B could prevent this by adding random records to manipulate the frequency distribution

3) Approximate/ Similar Matching
• The hash-based technique using a 3rd party can only detect exact matches between strings in a
privacy preserving manner
• Components that are needed
o Computing approximate similarity of two record fields
o Representing a record field in a privacy preserving manner
o Computing approximate similarity of two records fields that have been represented in a
privacy preserving manner
• Method:
o Set-based string similarity + bloom filters
• Bloom filters:
o A bloom filter is an array of n bits, with all bits initially set to 0s. We may store strings in
the bloom filter by using hash functions H1, … Hk to turn on certain bits
o Each hash function maps an input string X to a value in the range [0, n-1]
o To store a set of strings X1, .., Xn in the bloom filter, each element Xi is hash coded using the
k hash functions and all bits having indices H1(Xi), ..., Hk(Xi) are set to 1
o If a bit was set to 1 before, no change is made
o For any string X, we can check if it is stored in the bloom filter by hashing X with the k hash
functions
▪ If at least one of the bits with index Hi(X) is set to 0, then X is not a member of the
bloom filter
▪ If all bits corresponding to the hash values are set to 1, then X appears to be a
member of the bloom filter. But we are not 100% certain (could be FP)

• Why is it approximating?
o There might be collisions with the hash functions (two different 2-grams can be hashed into the
same bit)
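A minimal bloom filter sketch; the bit-array size, the choice of k and deriving the k hash functions from salted SHA-256 are all arbitrary illustrative choices.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=64, k=3):
        self.n_bits, self.k = n_bits, k
        self.bits = [0] * n_bits                  # all bits start as 0

    def _positions(self, s):
        # k hash values in [0, n_bits - 1], derived by salting SHA-256 with j
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{s}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, s):
        for pos in self._positions(s):
            self.bits[pos] = 1                    # bits already set stay 1

    def might_contain(self, s):
        # any 0 bit -> definitely not stored; all 1s -> possibly stored (could be a FP)
        return all(self.bits[pos] for pos in self._positions(s))

bf = BloomFilter()
for gram in ("pe", "et", "te", "er"):             # store the 2-grams of "peter"
    bf.add(gram)
print(bf.might_contain("pe"), bf.might_contain("xy"))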

LECTURE 9 & 10 – RECOMMENDER SYSTEM

What is a recommender system
• A system that suggests what a customer might buy, based on what they have bought and what they
like
• Setting:
o Online based
o Large numbers of users and items
o Capture the interaction between users and items

Collaborative filtering
• Make predictions about a user’s missing data according to the collective behavior of many other users
o Look at users' collective behavior
o Look at active user history
• User-based methods: identify like-minded users
• Item based methods: identify similar items
• Model (matrix) based methods: solve an optimization problem and identify latent factors

• Item-based Intuition:
o Search for similarities among items and recommend similar items
o Measure items similarity (how to know if 2 items are similar)
o If a large group of people give similar ratings to two items, then those items are similar
o Similarity score based on Euclidean distance

o How to find with respect to specific users
o Find the user’s record of ratings and find similar items based on their history ratings

o Online vs. Offline Phases
▪ Phase 1 (Offline):
• Compute similarities between item j and the other items
• Can be done in batch, offline
• The cost is high: O(n²)
▪ Phase 2 (Online):
• Predict rating of item j by user a based on the k-most similar items
• Predicted rating = weighted average over the ratings of the k-most similar item
o Challenges: new items with no previous ratings (cold start problem)

• User-based Intuition
o Search for similarities among users
o Recommend items liked by users similar to the target user
o Mathematically similar to item-based methods
o In offline:
▪ Can use clustering to find similar users

• User-based vs. Item- based
o Item-based is better in many practical cases: movies, books, etc.
o User-based is dynamic, item-based is relatively static:
▪ User-based needs a high update frequency of the offline-calculated information
o User-based has a sparsity problem
o Both have problem with new recommendations

• 2 other similarity measures (less biased):
o Cosine similarity (see users as vectors and measure the angle between the two vectors):
▪ 0 can have very different meanings in different vector contexts
o Pearson similarity
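A minimal NumPy sketch comparing two users' rating vectors with cosine and Pearson similarity (toy ratings; 0 is used for 'not rated', which is one reason 0 is ambiguous).

import numpy as np

u = np.array([5.0, 3.0, 0.0, 4.0, 1.0])   # toy ratings of user u over 5 items
v = np.array([4.0, 2.0, 0.0, 5.0, 2.0])   # toy ratings of user v

# Cosine: angle between the two vectors
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson: centres each vector by its mean first, reducing per-user rating bias
pearson = np.corrcoef(u, v)[0, 1]

print("cosine:", round(cosine, 3), "pearson:", round(pearson, 3))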

• Matrix based model
o Matrix factorization
o Dimensionality reduction on users and items at the same time
o Can measure how good the approximation is by finding the error between the original matrix
and the approximated matrix
Content-based recommender systems
• Uses pre-determined features of an item to recommend similar items
• Similarity measured on the attribute of the items
• Feature vector of an item

Purpose of TF-IDF
• Term Frequency (TF)
o Higher frequency, higher TF
o Normalize by document length
• Inverse Document Frequency
o Logarithmic inverse of document frequency
o Rare words, higher IDF
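A minimal sketch of TF-IDF computed directly from the definitions above, on a tiny invented corpus.

import math

docs = [["data", "wrangling", "is", "fun"],
        ["data", "cleaning", "and", "data", "integration"],
        ["clustering", "is", "unsupervised"]]

def tf(term, doc):
    return doc.count(term) / len(doc)          # frequency, normalised by document length

def idf(term, docs):
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    return math.log(len(docs) / df)            # rarer term -> higher IDF

for term in ("data", "clustering"):
    print(term, [round(tf(term, d) * idf(term, docs), 3) for d in docs])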


LECTURE 11 – ASSESSING CORRELATION

Why identifying correlations is useful for data wrangling/analysis
• Can hint at potential causal relationship (change in one variable is the result of change in the other)
• Allow prediction and analysis
• Real-world datasets can have millions of features; if two features are correlated, one of them can
represent both, which allows feature reduction

What is correlation between a pair of features
• Correlation between a pair of features is the relationship between them (ie. positive, negative...)

How correlation can be identified using visualization
• e.g. with a scatter plot: points falling along a line or curve suggest a relationship between the two features

Linear relation vs. Non-linear relation
• Linear: one variable changes proportionally with the other (straight-line trend); non-linear: the
relationship follows a curve

Euclidean distance for computing correlation
• Problems:
o Cannot discover variables with similar behaviors/dynamic but at different scale (must
normalize)
o Cannot discover variables with similar behaviors/dynamic but in the opposite direction
(negative correlation)

Pearson correlation coefficient for correlation
• Can only assess linear correlation
• Range of rxy lies within [-1, 1]:
o 1 for perfect positive linear correlation
o -1 for perfect negative linear correlation
o 0 means no correlation
o Absolute value |r| indicates strength of linear correlation
• It is not clear-cut what value of |r| counts as a strong correlation
• In general, it depends on domain
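A minimal NumPy sketch showing that r picks up a perfect linear relationship but misses a perfect (symmetric) non-linear one.

import numpy as np

x = np.linspace(-3, 3, 50)

print("r(x, 2x + 1) =", round(np.corrcoef(x, 2 * x + 1)[0, 1], 3))   # ~ 1.0
print("r(x, x**2)   =", round(np.corrcoef(x, x ** 2)[0, 1], 3))      # ~ 0.0: missed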




LECTURE 12 – MUTUAL INFORMATION

Mutual information
• A correlation measure that detects non-linear relationships
o Operates on discrete features


Data discretization
• Domain knowledge: assign thresholds manually
• Equal-length bin:
o Divide the range of the continuous features into equal length intervals (bins)
• Equal-frequency bin
o Divide range of continuous feature into equal frequency intervals

Entropy
• Measure used to assess the amount of uncertainty in an outcome
• More random = higher entropy
• More uniform = lower entropy
• Steps (H(X) = – Σ p(x) log₂ p(x)):
o Count how many items are in each bin
o Find the total number of items
o Calculate the probability of each bin
o Calculate the logarithm of each probability
o Multiply each probability by its log value, sum them up and multiply by -1
• Conditional entropy
o If I know one thing about you, how surprised am I about the second thing? (If it is predictable,
I am not that surprised) – not surprised => high correlation, surprised => low correlation
o Measures how much information is needed to describe the outcome, given that one feature is
known

Mutual information
• Measure of correlation:
o The amount of information about X we gain by knowing about Y and vice versa
• MI(X, Y):
o Large: X and Y are highly correlated
o Small: X and Y have low correlation

Normalized mutual information
• NMI(X,Y) = MI(X,Y)/min(H(X), H(Y))
• Helps interpretation when ranking feature pairs (scales MI into [0, 1])
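A minimal NumPy sketch of entropy, MI and NMI computed from a toy contingency table of two discretised features (base-2 logs assumed; MI computed via MI = H(X) + H(Y) – H(X,Y)).

import numpy as np

# Toy joint counts for discretised features X (rows) and Y (columns)
counts = np.array([[30.0, 5.0],
                   [5.0, 30.0]])
p_xy = counts / counts.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_x, H_y, H_xy = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
MI = H_x + H_y - H_xy
NMI = MI / min(H_x, H_y)

print(round(H_x, 3), round(H_y, 3), round(MI, 3), round(NMI, 3))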

Advantages and disadvantages of mutual information:
• Advantages:
o Can detect both linear and non-linear dependencies (unlike Pearson)
o Applicable and very effective for use with discrete features
• Disadvantages
o If feature is continuous, it first must be discretized to compute mutual information, involves
making choices:
▪ Choices may not be obvious. Different bin choices can lead to different estimations
of mutual information


LECTURE 13 & 14– CLASSIFICATION AND REGRESSION
TECHNIQUES

What classification is and where it is useful
• Making predictions based on historical data
• Target variable: discrete

Task of regression and where it is useful
• Given a collection of records (training set), each record contains a set of attributes, one of them is
target variable
• Learn the predictive model from data
• Target variable: continuous real value
• Regression is better at filling in missing values

K-nearest neighbor
• Sorting things by similarity
• K-nearest neighbors of a record x are the k data points that are at smallest distance to x
• Require:
o Set of stored records
o Distance metric
o The value of k
• To classify an unknown record:
o Compute distance to other training records
o Identify k nearest neighbors
o Use class labels of nearest neighbors to determine the class label of unknown record (taking
majority vote or weight)
• Advantages and disadvantages:
o Advantage:
▪ Easier to calculate
▪ Can handle data with complex structure
o Disadvantages:
▪ Slow for large datasets
▪ Need to store training data
▪ Hard to determine the value of k
• Too small => sensitive to noise
• Too big => neighborhood may include points from other classes
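A minimal scikit-learn sketch of k-nearest-neighbour classification with k = 3 and the default Euclidean distance (toy 2-D data).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stored training records and their class labels (toy data)
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Each unknown record is labelled by a majority vote of its 3 nearest neighbours
print(knn.predict([[1.5, 1.5], [8.5, 8.5]]))   # -> ['A' 'B']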

Decision tree
• Multiple trees can be created from 1 data set
• Applying decision tree to data:
o Start at first feature
▪ Internal node => test on an attribute
▪ Branch => outcome of the test
▪ Leaf node => class labels
• Deciding on the tree:
o Hunt’s algorithm:
▪ Let Dt = the set of training records that reach node t:
• If all records in Dt have the same class => t is a leaf node labelled with that class
• If Dt is an empty set => t is a leaf node labelled with the default class
• If Dt contains more than one class => use an attribute test to split the data
o Attribute test condition:
▪ Depends on attribute types: nominal, ordinal, continuous
▪ Depends on number of ways to split: 2-way split, multi-way split
o Attribute types:
▪ Nominal:
• e.g. car type => multi-way: (family), (sport), (luxury); binary: (family, luxury), (sport)
▪ Ordinal:
• e.g. size => multi-way: (small), (medium), (large); binary: (small, medium), (large) –
binary splits must preserve the order
▪ Continuous:
• Discretization, binary decision
• Node splitting and entropy
o Nodes with homogeneous class distribution are preferred
o Need a measure of node impurity => entropy
▪ Low entropy => low impurity => high purity
▪ High entropy => high impurity => low purity
o Compare the entropy of the parent node before splitting with the impurity of the children nodes
after splitting
o To decide on a split – want to gain as much information as possible
▪ Info gain = mutual information between the class feature and the feature being split on

• Advantages and disadvantages:
o Advantages:
▪ Easy to interpret
▪ Relatively efficient
▪ Fast for making decision
o Disadvantages:
▪ Sometimes too simple for complex data
▪ For complex data the tree becomes big and not very comprehensible
▪ May behave strangely for some features (e.g. an ID-like feature such as studentID, which
splits the data perfectly but has no predictive value)
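A minimal scikit-learn sketch of a decision tree that uses entropy (information gain) to choose splits, on a tiny invented dataset.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: features are (age, is_student); the class follows is_student
X = np.array([[25, 0], [30, 1], [45, 0], [50, 1], [23, 1], [47, 0]])
y = np.array(["no", "yes", "no", "yes", "yes", "no"])

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)

print(export_text(tree, feature_names=["age", "is_student"]))
print(tree.predict([[40, 1]]))   # -> ['yes']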



LECTURE 15 – EXPERIMENTAL DESIGN

Supervised vs. Unsupervised
• Supervised:
o Have known results (ie. class labels)
o Can learn the right answer because the class labels supervise the learning
o Discrete: Classification/categorization; Continuous: regression
o Evaluation: Performance metric
• Unsupervised
o Don’t have class labels
o Don’t really know what the answer is
o Discrete: clustering; Continuous: dimensionality reduction
o Evaluation: Hard to find out

Performance metric

• Want to minimize both:
o False Positive: predict yes when it should be no
o False Negative: predict no when it should be yes
• Accuracy may be misleading here: on an imbalanced dataset, a classifier that never predicts the rare
class can still achieve very high accuracy (so TP and TN should be weighted differently)
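A minimal sketch computing accuracy, precision and recall from the confusion-matrix counts, using an invented imbalanced example where accuracy looks deceptively good.

# A classifier that always predicts "no" on an imbalanced dataset (toy labels)
y_true = ["yes"] * 5 + ["no"] * 95
y_pred = ["no"] * 100

TP = sum(t == "yes" and p == "yes" for t, p in zip(y_true, y_pred))
TN = sum(t == "no" and p == "no" for t, p in zip(y_true, y_pred))
FP = sum(t == "no" and p == "yes" for t, p in zip(y_true, y_pred))
FN = sum(t == "yes" and p == "no" for t, p in zip(y_true, y_pred))

accuracy = (TP + TN) / (TP + TN + FP + FN)       # 0.95 despite missing every "yes"
precision = TP / (TP + FP) if (TP + FP) else 0.0
recall = TP / (TP + FN) if (TP + FN) else 0.0    # 0.0: every positive is missed

print(accuracy, precision, recall)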

Training set and testing set
• Overfitting:
o The algorithm is very specific to the data but does not work with other data (ie. the exam is
the same as the workshop => can’t gain anything)
o Solution: Cross validation
▪ Addresses the problem with a single split into a training set and a testing set: because
the split is random, one split may be ‘lucky’, and there is uncertainty over which 70%
of the data to train on
▪ K-fold cross validation (e.g. 3-fold): split the data into k folds; each fold is used once
as the test set while the remaining folds are used for training; average the results
• More computationally expensive
▪ Bootstrapping:
• Draw k datasets from original data by sampling with replacement
• Run the algorithm for each dataset and compute accuracy
• Return the mean and standard deviation of the accuracies
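A minimal sketch of 3-fold cross-validation and bootstrapping with scikit-learn/NumPy (toy two-cluster data; the classifier choice is arbitrary).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
model = KNeighborsClassifier(n_neighbors=3)

# k-fold cross-validation: each fold is used once as the test set
scores = cross_val_score(model, X, y, cv=3)
print("CV accuracies:", scores, "mean:", scores.mean())

# Bootstrapping: sample with replacement, refit, record an accuracy each time
boot = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    boot.append(model.fit(X[idx], y[idx]).score(X, y))
print("bootstrap mean/std:", np.mean(boot), np.std(boot))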


LECTURE 16 & 17 – BLOCKCHAIN

Motivation for blockchain technology
• Blockchain is an infrastructure to support the sharing of large data from many parties, making it
accessible to all
o Based on some core computing technologies: peer to peer, hashing, public key cryptography
• What is it?
o A distributed database
o Stored on many computers
o No central point of control
• Problems it’s trying to solve
• Benefits:
o Remove middleman from holding private records (ie. bank, doctor/hospital, university)
▪ No central, trusted point of control
▪ Less administration, less bureaucracy
▪ Less expensive
▪ Faster transactions
▪ More control over records handed to users
▪ Ability to be anonymous
▪ Users can verify in the blockchain
▪ No single point of failure
▪ More secure solution (maybe)

Ledger
• A public record
o Can record events, facts, asset transfers
• Typically, public and shared between many parties
o Some of these parties may be hostile/untrustworthy
• Data cannot be altered if it is entered into the ledger
o No deletions, revisions, alterations, …
• The integrity of the ledger can be verified
o Ensure it hasn’t been tampered with
• Has consensus
o Parties in the network agree on the state

Hashing
• Takes a string as an input and creates an unpredictable output of fixed length
• Infeasible to recreate the input from a given output
• Input can be any file

How the chaining work
• The public ledger is called the blockchain (a file)
• File is a sequence of blocks, each block contains a header and some data
• Block ID = hash of its header
• A block’s header typically contains:
o ID of its parent block
o Timestamp of block’s creation
o Hash of the data inside a block
• A block’s header contains its parent’s ID, and a block’s ID is the hash of its own header => if a block’s
header changes, its ID changes (and later blocks no longer point to it)
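A minimal, illustrative sketch (not any real blockchain's format) of how block IDs chain together through hashing, so that tampering with an earlier block is detectable.

import hashlib
import json
import time

def block_id(header):
    """A block's ID is the hash of its header."""
    return hashlib.sha256(json.dumps(header, sort_keys=True).encode()).hexdigest()

def make_block(parent_id, data):
    header = {"parent": parent_id,
              "timestamp": time.time(),
              "data_hash": hashlib.sha256(data.encode()).hexdigest()}
    return {"header": header, "data": data, "id": block_id(header)}

genesis = make_block(parent_id="0" * 64, data="genesis")
block1 = make_block(parent_id=genesis["id"], data="Alice pays Bob 5")

# Tampering with genesis changes its ID, so block1 no longer points to it
genesis["header"]["data_hash"] = hashlib.sha256(b"tampered").hexdigest()
print(block1["header"]["parent"] == block_id(genesis["header"]))   # False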

Why it is virtually impossible to modify the content
• Advantages and Disadvantages
o Advantages:
▪ Content very hard to modify once it has been created
▪ No central authority
o Disadvantage:
▪ After created, if there is a mistake, it is virtually impossible to correct

How to make sure that it is not being manipulated
• Proof of work Consensus:
o Make it really hard to create a block

Cryptography (Hashing)
• Cryptography is used to reinforce the proof of work concept
• Public key cryptography:
o Used for security and privacy
o Private key: only known by 1 party
o Public key: shared with the public
• Digital signatures:
o Evidence that the content of the document has been authorized by the person holding the
private key
o Digital signature = the message encrypted with the private key
o Append this signature on the message

Applications of blockchain in health and education
• E.g. a university can sign a record (such as a degree) with its private key so that other people can verify
the fact using the university’s public key

LECTURE 19 – PUBLIC DATA RELEASE AND INDIVIDUAL
ANONYMITY

Sensitive, non-sensitive attribute, quasi identifier
• Sensitive attributes: information that people don’t wish to reveal (ie. medical condition)
• Non-sensitive attributes: information that people do not mind being revealed
• Quasi identifier: a combination of non-sensitive attributes that can be linked with external data to
identify an individual (e.g. gender, age, zip/post code)
• Explicit identifier: unique for an individual (e.g. passport no, medicare no.)

K-anonymity
• A table satisfies k-anonymity if every record in the table is indistinguishable from at least (k-1) other
records with respect to every set of quasi-identifier attribute
• Hence, for every combination of values of the quasi-identifiers in the k-anonymous table, there are at
least k records that share those values
• How to achieve:
o Generalization
▪ Make the quasi identifiers less specific
▪ Column level (e.g. race{Asian, black, white} to {person})
o Suppression
▪ Remove the quasi identifiers completely
▪ Moderate the generalization process
▪ Limited number of outliers
▪ Row, column and cell level
• E.g. removing the last two lines
• Generalizing zip code to 941**
• Generalizing race to person
o In the worst case, an adversary can only narrow a quasi-identifier down to a group of at least k
individuals
o The data publisher needs to determine the quasi-identifiers and choose the parameter k
• Homogeneity attack
o Lack of diversity in the sensitive attribute
• Background attack
o K-anonymity does not protect against attacks based on background knowledge
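A minimal pandas sketch that checks k (and, anticipating the next point, l) for a toy released table; the column names and values are invented.

import pandas as pd

# Quasi-identifiers: (age_range, postcode, gender); sensitive attribute: condition
df = pd.DataFrame({"age_range": ["20-30", "20-30", "20-30", "30-40", "30-40"],
                   "postcode": ["30**", "30**", "30**", "31**", "31**"],
                   "gender": ["F", "F", "F", "M", "M"],
                   "condition": ["flu", "flu", "cancer", "asthma", "heart"]})

quasi = ["age_range", "postcode", "gender"]

k = df.groupby(quasi).size().min()                    # smallest quasi-identifier group
l = df.groupby(quasi)["condition"].nunique().min()    # distinct sensitive values per group

print("k-anonymity:", k, " l-diversity:", l)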

l-diversity
• A solution to the homogeneity attack
• For each anonymous group, there are at least l different sensitive attribute values

Benefits of using and sharing people’s location data
• Allows learning about individuals’ travelling habits

Possible privacy concerns of people’s location data being shared
• Personal safety (e.g. stalking)
• Location-based profiling (e.g. Facebook)
• Intrusive inferences, e.g. individual’s political views, personal preferences, health, etc…

Maintaining spatial privacy
• Cloaking provides spatial k-anonymity
• Obfuscation ensures location imprecision

LECTURE 20 – DIFFERENTIAL PRIVACY

Differential privacy
• Global:
o We have a sensitive dataset, a trusted data owner Alice and a researcher Bob. Alice does
analysis on the raw data, adds noise to the answers, and reports the noisy answers to Bob
• Local:
o Each person is responsible for adding noise to their own data. Classic survey example: each
person is asked “Do you use drugs?”
▪ They flip a coin in secret, answer “yes” if it is heads, and tell the truth otherwise
▪ This gives plausible deniability about a “yes” answer
• How much noise to add?
o Depends on
▪ Privacy loss budget: how private we want the results to be
▪ Global sensitivity: how much difference the presence or absence of an individual
could make to the result
▪ However, differential privacy does not prevent attackers from drawing conclusions
about individuals from the aggregate results over the population

Computing global sensitivity G of counting queries
• The global sensitivity of a query is the maximum difference in answers that adding or removing any
individual from the dataset can cause (the maximum effect of an individual)
• For a counting query, G = 1, since one individual can change a count by at most 1

Role of privacy loss budget k
• We want the presence or absence of a user in the dataset to not have a considerable effect on the
released result

The amount of noise is dependent on the ratio G/k
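A minimal NumPy sketch of the global model for a counting query, adding Laplace noise with scale G divided by the privacy loss budget (notation as in these notes; data invented).

import numpy as np

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=1000)        # toy sensitive dataset

true_count = int(np.sum(ages > 65))           # counting query
G = 1.0                                       # global sensitivity of a count
budget = 0.5                                  # privacy loss budget ("k" in these notes)

noisy_count = true_count + rng.laplace(loc=0.0, scale=G / budget)
print(true_count, round(noisy_count, 1))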


LECTURE 21 – ETHICAL CONSIDERATIONS

Ethical considerations in a project

Who are the stakeholders
• Individuals: create big data
• Organizations: own and benefit from data
• Society: guide and regulate


Perspectives on big data
• Perspective 1: The ability to collect, store and process increasingly large and complex data sets from a
variety of sources, into competitive advantage (Technological perspective)
o 3V’s: Volume, Variety, and Velocity + Veracity (4Vs)
o Technological view (3Vs) does not help to understand unethical use
• Perspective 2: Data is contributed, collected, extracted, exchanged, sold, shared, and processed for the
purpose of predicting and modifying human behavior in the production of economic or social value
(Social perspective)
o 3 Factors: Individuals, organizations, society

10 simple rules for responsible big data research
• Acknowledge that data are people and can do harm
• Recognize that privacy is more than a binary value
• Guard against the reidentification of the data
• Practice ethical data sharing
• Consider the strengths and limitations of your data
• Debate the tough, ethical choices/issues
• Develop a code of conduct for your organization, research community or industry
• Design your data and systems for auditability
• Engage with the broader consequences of data and analysis practices
• Know when to break these rules
