Assignment 6 CS 769 - Spring 2021 In this assignment, you will implement A Priori algorithm to nd the frequent itemsets and extract the association rules. 1 Data The data set used in this assignment is in the following directory. Copy it to your personal directory. /CS769/assignment6/store_data.csv Every row of the data set is a list of items purchased in a transaction separated commas. Dierent transactions have dierent number of items, see below: Listing 1: Dataset. ’whole wheat pasta,french fries’ ’soup,light cream,shallot’ ’frozen vegetables,spaghetti,green tea’ ’french fries’ ’eggs,pet food’ ’cookies’ ’turkey,burgers,mineral water,eggs,cooking oil’ You can use the following code to read the le: Listing 2: Read Data File. from pyspark.sql import SparkSession # spark object is the main gateway to your application. # it is used to start a spark session and define different parameters for your job. spark = SparkSession.builder.appName("TestJob").getOrCreate() path = ’/user/
/store_data.csv’ # path to file in hdfs # read the file dataset = spark.sparkContext.textFile(path) After reading the data set, perform the following operations to make sure you have read the le correctly: 1 Listing 3: Data Check. dataset.count() # 7501 dataset.take(1) ’’’ [’shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil’] ’’’ Once you have the data set available in an RDD, you can start implementing the A Priori algorithm. 2 A Priori. Following the instructions of the A Priori algorithm, you start by building the frequency of items as singletons, i.e, the frequency of every unique item over the entire list of transactions. Use the map and reduce functions to build this table. Then set the minimum support value at 100 and lter any item that has a support value less than that: Listing 4: Singleton Results. (’mineral water’, 1788), (’eggs’, 1348), (’spaghetti’, 1306), (’french fries’, 1282), (’chocolate’, 1230) .... In the next iteration, for every item in the singleton table, you create its pair with other single- tons in the data set, i.e., (singleton_1, singleton_2), (singleton_1, singleton_3), ...., (singleton_2, singleton_3), ...., etc. Also, (singleton_1, singleton_2) and (singleton_2, singleton_1) are the same and you can remove replicas (this is discussed in the class, you may use either of the two methods to generate the candidates). For each new pairs, you need to check its support (if there is a transaction in the data set that includes all the items in the pair). For example, if the new pair is (’eggs’, ’chocolate’) but there is no transaction in the data set that has ’eggs’ and ’chocolate’, this pair will be removed. Otherwise, you count the number of transactions which include these items, and that would give you the frequency of pairs. Again, follow map and reduce functions to get the new frequencies and don’t forget to lter the nal counts by the minimum support. Listing 5: Pair Results. ((’chocolate’, ’shrimp’), 135), ((’green tea’, ’spaghetti’), 199), ((’frozen smoothie’, ’milk’), 107), ((’chocolate’, ’olive oil’), 123) .... 2 You will follow the same pattern to generate frequent 3-item set table, i.e., by creating new combinations of pairs with singletons in the item sets. As an example, you create a triple item set with (’chocolate’, ’shrimp’) in the pair list and ’olive oil’ in the singleton list as (’chocolate’, ’shrimp’, ’olive oil’), check the transactions and count frequencies. If you follow this pattern in a loop, you can generate 4th, 5th, etc. But since the minimum support value is 100, there will be no new tables after frequent item sets of size 3. Therefore, the nal frequent item set result you can get is similar to this: Listing 6: Triplet Results. ((’chocolate’, ’milk’, ’mineral water’), 105) ((’ground beef’, ’mineral water’, ’spaghetti’), 128) .... Append all these frequent item set tables and this will be the nal result for A Priori algorithm. Listing 7: Final Results. ([’chocolate’], 1230) ([’light mayo’], 204) ([’frozen smoothie’], 475) ...... ((’chocolate’, ’olive oil’), 123) ((’chocolate’, ’shrimp’), 135) ((’burgers’, ’chocolate’), 128) ...... ((’chocolate’, ’milk’, ’mineral water’), 105) ((’ground beef’, ’mineral water’, ’spaghetti’), 128) .... 3 Association Rules. Once you have the frequent item set result, you can follow the instructions of the Association Rule mining to generate a list of rules in the form of 퐼 → 푗 ∶ 푐표푛푓 푖푑푒푛푐푒_푣푎푙푢푒 in which 퐼 is the set of items already purchased, and 푗 is the new item that is likely to be of interest to the customer to buy with a given 푐표푛푓 푖푑푒푛푐푒_푣푎푙푢푒. You need to create pairs between all values of the frequent item set list and calculate the condence level for each one to have a nal result similar to the following table where the rst list in a row is the purchased items and the second list is the recommended item and the nal value is condence level: Listing 8: Association Rule Results. [[’shrimp’], [’chocolate’], 25.18], [[’chocolate’], [’shrimp’], 10.97] .... [[’ground beef’, ’mineral water’], [’spaghetti’], 41.69] .... 3 4 Submission. Implement both A Priori and Association Rules in one python le and submit them along with the nal result of Association Rules. Note: You should expect that nding frequent item sets to have an expensive computation cost. You can increase the number of spark executors, number of cores and memory with the following options when submitting your spark jobs: Listing 9: Spark Options. spark-submit \ --num-executors 5 \ # total number of executors --executor-cores 5 \ # number of cores for each executor --executor-memory 15G \ # amount of memory for each executor apriori.py In order to maintain resources for everyone, try to keep your options lower than the amount of memory and cores provided above and also because using more memory space leads to higher garbage collection cost and lowers performance. Appendices A Apache Spark API. Python API for Apache Spark: https://spark.apache.org/docs/2.4.0/api/python/index.html 4 欢迎咨询51作业君