Assignment 5
CS 769 - Spring 2021

Resilient Distributed Dataset (RDD) is the fundamental data structure that Apache Spark uses to store data in memory and perform data transformations. Other data structures such as DataFrame and Dataset are built upon it, and all data operations are ultimately performed on RDDs. Having access to RDDs gives you greater flexibility in the types of operations you can perform on a dataset. In this assignment, you will implement a number of queries on an unstructured dataset to become familiar with the most common RDD operations.

1 Running Spark Applications

There are two main ways to run your Spark applications:

1.1 Interactive

Using the pyspark command, you can start an interactive PySpark shell in which you can write Spark code in Python. Simply run the following command in your terminal:

$ pyspark

Once you are done with the interactive shell, you can exit(); all your operations and data in memory will be lost.

1.2 Deployment

You can write your code in a Python file and submit it as a Spark job to the cluster. The master node will distribute the work automatically and generate the final result. You can use the following command to submit jobs:

$ spark-submit test.py

2 Data

The dataset used in this assignment is in the following directory. Copy it to your personal directory.

/CS769/assignment5/books.json

The dataset is in JSON format. Every row has a structure similar to the following, but different rows might have different numbers of attributes:

Listing 1: Dataset.
[{'latest_revision': 2,
  'revision': 2,
  'title': 'The effect of differentiated marking tools and motivational treatment on figural creativity',
  'languages': [{'key': '/languages/eng'}],
  'subjects': ['Creative thinking -- Testing', 'Educational psychology'],
  'publish_country': 'gau',
  'by_statement': 'by Lillian Rose Arnold',
  'type': {'key': '/type/edition'},
  'location': ['NBuC'],
  'other_titles': ['Marking tools', 'Motivational treatment'],
  'publishers': ['University of Georgia'],
  'last_modified': {'type': '/type/datetime', 'value': '2009-12-15T08:04:07.512219'},
  'key': '/books/OL22783906M',
  'authors': [{'key': '/authors/OL6535896A'}],
  'publish_places': ['Athens'],
  'oclc_number': ['3954579'],
  'pagination': 'xi, 161 leaves',
  'created': {'type': '/type/datetime', 'value': '2008-12-30T07:38:13.854568'},
  'notes': {'type': '/type/text', 'value': 'Microfilm of typescript. Ann Arbor, Mich. : University Microfilms, 1975. -- 1 reel ; 35 mm\n\nThesis--University of Georgia\n\nBibliography: leaves 147-161'},
  'number_of_pages': 161,
  'publish_date': '1975',
  'works': [{'key': '/works/OL13681062W'}]}]

You can use the following code to read the file, either in interactive mode or through job submission:

Listing 2: Read Data File.

from pyspark.sql import SparkSession
import json

# The spark object is the main gateway to your application.
# It is used to start a Spark session and define parameters for your job.
spark = SparkSession.builder.appName("TestJob").getOrCreate()

path = '/user/
<username>/books.json'  # path to the file in HDFS (replace <username> with your user name)

# Read the JSON file as an RDD of strings, one line per record.
raw_data = spark.sparkContext.textFile(path)

# json.loads is passed as the mapper so that each row is parsed into a
# dictionary, which is easier to work with than a raw string.
# The result is an RDD that holds the entire dataset.
dataset = raw_data.map(json.loads)

After reading the dataset, perform the following operations to make sure you have read the file:

Listing 3: Data Check.

dataset.count()  # 723
dataset.take(1)  # as in Listing 1

Once you have the dataset available in an RDD, you can implement the following queries.

3 Operations and Queries

For every query described below, you can use a combination and a chain of filter, map, reduce, etc. functions, one or more times, in your implementation. But make sure the function named in the section title is used at least once in your implementation. As an example, suppose we want to generate a list of all book titles in the dataset. We can chain functions in the following way:

Listing 4: Sample Query.

list_book_titles = (
    dataset
    .filter(lambda e: "title" in e)  # keep only rows that include the "title" field
    .map(lambda c: c["title"])
    .collect()
)

It will generate a list similar to this:

Listing 5: Sample Result.

['The effect of differentiated marking tools and motivational treatment on figural creativity',
 'Comparison of the nominal grouping and sequenced brainstorming techniques of creative idea generation',
 'Professional accident investigation',
 ......]

3.1 flatMap

Generate a list of all distinct subjects in the dataset.

Listing 6: Sample Result.

['Creative thinking -- Testing', 'Educational psychology', 'Problem solving', 'Group problem solving', .....]

3.2 sortByKey

Generate a list of book titles sorted by their number of pages.

Listing 7: Sample Result.
('Aesthetic education in social perspective', 10),
('Cómo vivían nuestros abuelos', 10),
('Fine-line developer', 11),
....,
('Life histories of North American cardinals, grosbeaks, buntings, towhees, finches, sparrows, and allies, order Passeriformes, family Fringillidae', 1889)

3.3 reduce

Use the map and reduce functions to find the book with the highest number of pages (don't use any max function).

Listing 8: Sample Result.

(1889, 'Life histories of North American cardinals, grosbeaks, buntings, towhees, finches, sparrows, and allies, order Passeriformes, family Fringillidae')

3.4 groupByKey

Find the number of books published in each year.

Listing 9: Sample Result.

('1977', 14), ('1964', 15), ('1969', 43), ....

3.5 reduceByKey

Find the number of books published in each year, this time with reduceByKey. Make sure you understand its difference from groupByKey.

Listing 10: Sample Result.

('1977', 14), ('1964', 15), ('1969', 43), ....

3.6 aggregateByKey

Find the book with the highest number of pages in each city.

Listing 11: Sample Result.

('Athens', ('The effect of differentiated marking tools and motivational treatment on figural creativity', 161))
('Jerusalem', ('A lebn fun shlies', 447))
('New York, Prentice-Hall', ('Notre Dame football', 244))

4 Submission

Implement all six queries in one Python file and submit it.

Appendices

A Apache Spark API

Python API for Apache Spark: https://spark.apache.org/docs/2.4.0/api/python/index.html
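B A note on groupByKey vs. reduceByKey

Section 3.5 asks you to understand how reduceByKey differs from groupByKey. As a rough sketch (plain Python, not Spark, using hypothetical toy data rather than the books dataset): groupByKey first materializes every value for a key and aggregates afterwards, while reduceByKey merges values pairwise as they are encountered. This pairwise merging is why Spark can combine partial results on each partition before shuffling, which usually makes reduceByKey cheaper for aggregations like counting.

```python
from collections import defaultdict

# Toy (year, 1) pairs, as you might produce with a map step before counting.
pairs = [("1977", 1), ("1964", 1), ("1977", 1), ("1969", 1), ("1977", 1)]

# groupByKey-style: collect ALL values for each key, then aggregate the groups.
groups = defaultdict(list)
for year, one in pairs:
    groups[year].append(one)          # every value is kept in memory
counts_group = {year: sum(ones) for year, ones in groups.items()}

# reduceByKey-style: fold each incoming value into a running result per key,
# so only one partial value per key is ever held.
counts_reduce = {}
for year, one in pairs:
    counts_reduce[year] = counts_reduce.get(year, 0) + one

assert counts_group == counts_reduce == {"1977": 3, "1964": 1, "1969": 1}
```

Both loops produce the same counts; the difference is that the first keeps every value per key before aggregating, which is what makes groupByKey more expensive on large datasets.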