代写辅导接单-Similarity Join for Transaction Logs

欢迎使用51辅导,51作业君孵化低价透明的学长辅导平台,服务保持优质,平均费用压低50%以上! 51fudao.top

Similarity Join for Transaction Logs (22 marks)

Background: Similarity join is an important task in data mining. Given a set of

records, it aims to find all the pairs of records with similarity no less than a

threshold. A record is a set of items. For example, in Amazon, a record could

represent the set of products that a customer purchases in a single transaction.

The problem can find various applications in different domains, such as data

cleaning and product recommendation.

Given a collection R of records, a similarity function sim() and threshold t, the

similarity join on R, is to find all the record pairs r and s from R, such that sim(r,

s)>=t. In this project, we use the Jaccard similarity function to compute

similarity. That is,

Problem definition: In this project, we are still going to use customer purchase

transaction logs. Each record (row) in the dataset has the following 4 fields (see

the sample dataset):

TransactionID: the unique ID to record one purchase transaction

ProductName: the name of the item in a transaction (a name can contain

multiple characters)

TransactionDate: the time of the transaction

UnitPrice: the price of a single item

Each transaction consists of the rows that have the same TransactionID. The

rows of one transaction may not be consecutive, e.g., row 10 for transaction 1 in

the sample dataset.

Each transaction contains a set of items purahced. In each transaction, there

could be multiple rows with the same ProductName, e.g., rows 7 and 11 in the

sample dataset. You need to convert it into a set before computing the similarity.

For example, in the sample dataset, we have transactoin 1 = {A, B, C}, trasaction

2 = {A, C, DD} and transaction 3 = {A, B, C, DD}. Your task is to utilize Spark to find

all the similar transaction pairs across different years (two transactions in

different years) OR within the same year but different months (two transaction

within the same year but different months).

Output Format: The output file contains all the similar transactions together

with their similarities. The output format is

"(TransactionID1,TransactionID2):similarity value". Each pair must have

TransactionID1 < TransactionID2. There should be no duplicates in the output

results. The pairs are sorted in ascending order (by the first and then the second).

Given the sample dataset above with the threshold 0.5, the output result should

be:

(1,2): 0.5 is not returned since transactions 1 and 2 are in the same year and

same month.

Code Format: The code template has been provided. Your code should take

three parameters: the input file, the output folder, and the similarity threshold

tau. You need to use the command below to run your code:

$ spark-submit project3.py input output tau

Some notes

• You need to design an exact approach to finding similar records (Please

revisit part 1 of Week 8 slides for more tips).

• Check the paper mentioned in slides if you want to know more details

Efficient Parallel Set-Similarity Joins Using MapReduce. SIGMOD’10

• You cannot compute the pairwise similarities direclty.

• Regular Python programming is not permitted in project3.

• When testing the correctness and efficiency of submissions, all the code

will be run with two local threads using the default setting of Spark.

Please be careful with your runtime and memory usage.

Optional (this will not be marked)

You can try to run it in Google Dataproc (Be careful about the bills, and make

sure you are using free quota if you want to try it) to see the power of distributed

computation, where your code should scale well with the number of nodes used

in a cluster. Create a project, test everything on your local computer, and finally

do it in Google Dataproc.

Create a bucket with the name “comp9313-” in Dataproc.

Create a folder “project3” in this bucket for holding the input files. You can create

three clusters in Dataproc to run the same job:

• Cluster1 - 1 master node and 2 worker nodes;

• Cluster2 - 1 master node and 4 worker nodes;

• Cluster3 - 1 master node and 6 worker nodes.

For both master and worker nodes, select n1-standard-2 (2 vCPU, 7.5GB

memory).

Unzip and upload the given large data set to your bucket and set τ to 0.85 to run

your program.

Marking Criteria

Your source code will be inspected and marked based on readability and ease of

understanding. The efficiency and scalability of this project are very important

and will be evaluated as well.

• Submission can be compiled and run on Spark => +6

• Accuracy (no unexpected pairs, no missing pairs, correct order, correct

similarity scores, correct format) => +5

• Efficiency (rules are shown as follows) => +9

--The rank of runtime (using two local threads):

--Correct results: 0.9 * (10 – floor((rank percentage-1)/10)), e.g., top 10% => 9

--Incorrect results: 0.4 * (10 – floor((rank percentage-1)/10))

• Code format and structure, readability, and documentation, including the

description of the optimization techniques used => +2

51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: Fudaojun0228