Similarity Join for Transaction Logs (22 marks)
Background: Similarity join is an important task in data mining. Given a set of
records, it aims to find all the pairs of records with similarity no less than a
threshold. A record is a set of items. For example, in Amazon, a record could
represent the set of products that a customer purchases in a single transaction.
The problem can find various applications in different domains, such as data
cleaning and product recommendation.
Given a collection R of records, a similarity function sim() and threshold t, the
similarity join on R, is to find all the record pairs r and s from R, such that sim(r,
s)>=t. In this project, we use the Jaccard similarity function to compute
similarity. That is,
Problem definition: In this project, we are still going to use customer purchase
transaction logs. Each record (row) in the dataset has the following 4 fields (see
the sample dataset):
TransactionID: the unique ID to record one purchase transaction
ProductName: the name of the item in a transaction (a name can contain
multiple characters)
TransactionDate: the time of the transaction
UnitPrice: the price of a single item
Each transaction consists of the rows that have the same TransactionID. The
rows of one transaction may not be consecutive, e.g., row 10 for transaction 1 in
the sample dataset.
Each transaction contains a set of items purahced. In each transaction, there
could be multiple rows with the same ProductName, e.g., rows 7 and 11 in the
sample dataset. You need to convert it into a set before computing the similarity.
For example, in the sample dataset, we have transactoin 1 = {A, B, C}, trasaction
2 = {A, C, DD} and transaction 3 = {A, B, C, DD}. Your task is to utilize Spark to find
all the similar transaction pairs across different years (two transactions in
different years) OR within the same year but different months (two transaction
within the same year but different months).
Output Format: The output file contains all the similar transactions together
with their similarities. The output format is
"(TransactionID1,TransactionID2):similarity value". Each pair must have
TransactionID1 < TransactionID2. There should be no duplicates in the output
results. The pairs are sorted in ascending order (by the first and then the second).
Given the sample dataset above with the threshold 0.5, the output result should
be:
(1,2): 0.5 is not returned since transactions 1 and 2 are in the same year and
same month.
Code Format: The code template has been provided. Your code should take
three parameters: the input file, the output folder, and the similarity threshold
tau. You need to use the command below to run your code:
$ spark-submit project3.py input output tau
Some notes
• You need to design an exact approach to finding similar records (Please
revisit part 1 of Week 8 slides for more tips).
• Check the paper mentioned in slides if you want to know more details
Efficient Parallel Set-Similarity Joins Using MapReduce. SIGMOD’10
• You cannot compute the pairwise similarities direclty.
• Regular Python programming is not permitted in project3.
• When testing the correctness and efficiency of submissions, all the code
will be run with two local threads using the default setting of Spark.
Please be careful with your runtime and memory usage.
Optional (this will not be marked)
You can try to run it in Google Dataproc (Be careful about the bills, and make
sure you are using free quota if you want to try it) to see the power of distributed
computation, where your code should scale well with the number of nodes used
in a cluster. Create a project, test everything on your local computer, and finally
do it in Google Dataproc.
Create a bucket with the name “comp9313-
Create a folder “project3” in this bucket for holding the input files. You can create
three clusters in Dataproc to run the same job:
• Cluster1 - 1 master node and 2 worker nodes;
• Cluster2 - 1 master node and 4 worker nodes;
• Cluster3 - 1 master node and 6 worker nodes.
For both master and worker nodes, select n1-standard-2 (2 vCPU, 7.5GB
memory).
Unzip and upload the given large data set to your bucket and set τ to 0.85 to run
your program.
Marking Criteria
Your source code will be inspected and marked based on readability and ease of
understanding. The efficiency and scalability of this project are very important
and will be evaluated as well.
• Submission can be compiled and run on Spark => +6
• Accuracy (no unexpected pairs, no missing pairs, correct order, correct
similarity scores, correct format) => +5
• Efficiency (rules are shown as follows) => +9
--The rank of runtime (using two local threads):
--Correct results: 0.9 * (10 – floor((rank percentage-1)/10)), e.g., top 10% => 9
--Incorrect results: 0.4 * (10 – floor((rank percentage-1)/10))
• Code format and structure, readability, and documentation, including the
description of the optimization techniques used => +2