Assignment 6
CS 769 - Spring 2021
In this assignment, you will implement the A Priori algorithm to find the frequent itemsets and
extract the association rules.
1 Data
The data set used in this assignment is in the following directory. Copy it to your personal
directory.
/CS769/assignment6/store_data.csv
Every row of the data set is a list of items purchased in one transaction, separated by commas.
Different transactions have different numbers of items, as shown below:
Listing 1: Dataset.
'whole wheat pasta,french fries'
'soup,light cream,shallot'
'frozen vegetables,spaghetti,green tea'
'french fries'
'eggs,pet food'
'cookies'
'turkey,burgers,mineral water,eggs,cooking oil'
You can use the following code to read the file:
Listing 2: Read Data File.
from pyspark.sql import SparkSession
# The spark object is the main gateway to your application.
# It is used to start a Spark session and define different parameters
# for your job.
spark = SparkSession.builder.appName("TestJob").getOrCreate()
path = '/user//store_data.csv'  # path to the file in HDFS
# read the file
dataset = spark.sparkContext.textFile(path)
After reading the data set, perform the following operations to make sure you have read the
file correctly:
Listing 3: Data Check.
dataset.count() # 7501
dataset.take(1)
'''
['shrimp,almonds,avocado,vegetables mix,green grapes,whole weat
flour,yams,cottage cheese,energy drink,tomato juice,low fat
yogurt,green tea,honey,salad,mineral water,salmon,antioxydant
juice,frozen smoothie,spinach,olive oil']
'''
Once you have the data set available in an RDD, you can start implementing the A Priori
algorithm.
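As a reference, here is one way to turn each line of the RDD into a basket of items; this is a minimal sketch, and the name transactions is illustrative:

# Split each transaction line on commas; frozenset makes the basket
# hashable and drops duplicate items within a transaction.
transactions = dataset.map(lambda line: frozenset(line.split(',')))
transactions.cache()  # the baskets are re-scanned in every pass

The sketches below assume this transactions RDD.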
2 A Priori.
Following the instructions of the A Priori algorithm, you start by building the frequency table of
items as singletons, i.e., the frequency of every unique item over the entire list of transactions.
Use the map and reduce functions to build this table. Then set the minimum support value to
100 and filter out any item that has a support value less than that:
Listing 4: Singleton Results.
('mineral water', 1788),
('eggs', 1348),
('spaghetti', 1306),
('french fries', 1282),
('chocolate', 1230)
....
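A minimal sketch of this singleton pass, assuming the transactions RDD from above (the names min_support and singletons are illustrative):

min_support = 100

# Emit (item, 1) for every item of every basket, sum the counts per
# item, and keep only the items that meet the minimum support.
singletons = (transactions
              .flatMap(lambda basket: [(item, 1) for item in basket])
              .reduceByKey(lambda a, b: a + b)
              .filter(lambda kv: kv[1] >= min_support))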
In the next iteration, for every item in the singleton table, you create its pairs with the other
singletons in the data set, i.e., (singleton_1, singleton_2), (singleton_1, singleton_3), ..., (singleton_2,
singleton_3), ..., etc. Note that (singleton_1, singleton_2) and (singleton_2, singleton_1) are the
same pair, so you can remove such replicas (this was discussed in class; you may use either of
the two methods to generate the candidates).
For each new pair, you need to check its support, i.e., whether there is a transaction in the data
set that includes all the items in the pair. For example, if the new pair is ('eggs', 'chocolate') but
no transaction in the data set contains both 'eggs' and 'chocolate', this pair will be removed.
Otherwise, you count the number of transactions that include both items, which gives you the
frequency of the pair.
Again, use the map and reduce functions to get the new frequencies, and don't forget to filter
the final counts by the minimum support.
Listing 5: Pair Results.
(('chocolate', 'shrimp'), 135),
(('green tea', 'spaghetti'), 199),
(('frozen smoothie', 'milk'), 107),
(('chocolate', 'olive oil'), 123)
....
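One possible sketch of the pair pass; it collects the surviving singletons to the driver and assumes the RDDs from the previous sketches (the names frequent_items and pairs are illustrative):

from itertools import combinations

# Frequent singletons as a plain Python set on the driver.
frequent_items = set(item for item, _ in singletons.collect())

# Restrict each basket to its frequent items and emit every sorted
# 2-combination; sorting the basket first means (a, b) and (b, a)
# can never both appear, which removes the replicas.
pairs = (transactions
         .flatMap(lambda basket:
                  [(pair, 1) for pair in
                   combinations(sorted(basket & frequent_items), 2)])
         .reduceByKey(lambda a, b: a + b)
         .filter(lambda kv: kv[1] >= min_support))

Note that counting a pair only when it occurs inside some basket handles the support check automatically: a pair whose items never co-occur never receives a count.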
You will follow the same pattern to generate the frequent 3-item set table, i.e., by creating new
combinations of pairs with singletons in the item sets. As an example, you create a triple item
set from ('chocolate', 'shrimp') in the pair list and 'olive oil' in the singleton list as ('chocolate',
'shrimp', 'olive oil'), then check the transactions and count the frequencies.
If you follow this pattern in a loop, you can generate the 4th, 5th, etc., tables. But since the
minimum support value is 100, there will be no new tables after the frequent item sets of size 3.
Therefore, the final frequent item set result you get is similar to this:
Listing 6: Triplet Results.
(('chocolate', 'milk', 'mineral water'), 105)
(('ground beef', 'mineral water', 'spaghetti'), 128)
....
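A sketch of the triple pass under the same assumptions; the pruning step keeps a candidate only if all of its 2-item subsets are frequent pairs, which is the A Priori property (the names frequent_pairs, candidate_triples, and triples are illustrative):

# Frequent pairs as a plain Python set on the driver.
frequent_pairs = set(pair for pair, _ in pairs.collect())

def candidate_triples(basket):
    # Emit each sorted 3-combination of a basket's frequent items,
    # pruned so that all three of its 2-subsets are frequent pairs.
    items = sorted(basket & frequent_items)
    for triple in combinations(items, 3):
        if all(sub in frequent_pairs
               for sub in combinations(triple, 2)):
            yield (triple, 1)

triples = (transactions
           .flatMap(candidate_triples)
           .reduceByKey(lambda a, b: a + b)
           .filter(lambda kv: kv[1] >= min_support))

The same body, parameterized over the itemset size, gives the 4th, 5th, etc., passes, which come back empty at this minimum support.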
Append all these frequent item set tables; this is the final result of the A Priori algorithm.
Listing 7: Final Results.
(['chocolate'], 1230)
(['light mayo'], 204)
(['frozen smoothie'], 475)
......
(('chocolate', 'olive oil'), 123)
(('chocolate', 'shrimp'), 135)
(('burgers', 'chocolate'), 128)
......
(('chocolate', 'milk', 'mineral water'), 105)
(('ground beef', 'mineral water', 'spaghetti'), 128)
....
3 Association Rules.
Once you have the frequent item set result, you can follow the instructions of Association
Rule mining to generate a list of rules of the form I → j : confidence_value, in which I is
the set of items already purchased, and j is the new item that the customer is likely to be
interested in buying, with the given confidence_value.
You need to create pairs between all values of the frequent item set list and calculate the
confidence level for each one, to arrive at a final result similar to the following table, where the
first list in a row is the purchased items, the second list is the recommended item, and the
final value is the confidence level:
Listing 8: Association Rule Results.
[['shrimp'], ['chocolate'], 25.18],
[['chocolate'], ['shrimp'], 10.97]
....
[['ground beef', 'mineral water'], ['spaghetti'], 41.69]
....
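A minimal sketch of the confidence computation; it assumes a Python dict support_of on the driver that maps every frequent itemset, stored as a sorted tuple (singletons as 1-tuples), to its support count, e.g., built by collecting the appended tables (the names support_of and rules are illustrative):

rules = []
for itemset, support in support_of.items():
    if len(itemset) < 2:
        continue  # a rule needs at least one item on each side
    for j in itemset:
        I = tuple(x for x in itemset if x != j)  # antecedent I
        # confidence(I -> j) = support(I + {j}) / support(I), in percent
        conf = 100.0 * support / support_of[I]
        rules.append([list(I), [j], round(conf, 2)])

For example, the confidence of ['chocolate'] → ['shrimp'] above is the pair support 135 divided by the 'chocolate' support 1230, roughly 10.98%, which matches the listed 10.97 up to rounding.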
4 Submission.
Implement both A Priori and Association Rules in one Python file and submit it along with
the final result of the Association Rules.
Note: You should expect finding the frequent item sets to be computationally expensive.
You can increase the number of Spark executors, the number of cores, and the memory with the
following options when submitting your Spark jobs:
Listing 9: Spark Options.
# --num-executors: total number of executors
# --executor-cores: number of cores for each executor
# --executor-memory: amount of memory for each executor
spark-submit \
  --num-executors 5 \
  --executor-cores 5 \
  --executor-memory 15G \
  apriori.py
In order to leave resources for everyone, keep your options at or below the memory and core
values given above; besides, using more memory space leads to higher garbage collection cost
and lowers performance.
Appendices
A Apache Spark API.
Python API for Apache Spark:
https://spark.apache.org/docs/2.4.0/api/python/index.html
