程序代写案例-CS 769-Assignment 5
Assignment 5
CS 769 - Spring 2021
Resilient Distributed Dataset (RDD) is the fundamental data structure that Apache Spark uses
to store data in memory and perform data transformations. Other data structure such as
DataFrmae and DataSet are built upon it and all data operations are performed on RDDs.
Having access to RDDs allow you to have greater exibility on the type of operations you can
perform on the dataset. In this assignment, you will implement a number of queries on an
unstructured dataset to become familiar with most common RDD operations.
1 Running Spark Applications
There are two main ways to run your Spark applications:
1.1 Interactive
Using the pyspark command, you can start an interactive pyspark shell in which you can write
spark code in python. Simply run the following command in your terminal:
$ pyspark
Once you are done with the interactive shell you can exit() and all your operations and data in
memory will be lost.
1.2 Deployment
You can write your code in a python le and submit it as a spark job to the cluster. The master
node will distributed the work automatically and generate the nal result. You can use the
following command to submit jobs:
$ spark-submit test.py
2 Data
The dataset used in this assignment is in the following directory. Copy it to your personal
directory.
/CS769/assignment5/books.json
The dataset is in json format and has a similar structure as the following for every row but
dierent rows might have dierent number of attributes:
1
Listing 1: Dataset.
[{’latest_revision’: 2,
’revision’: 2,
’title’: ’The effect of differentiated marking tools and \ motivational
treatment on figural creativity’,
’languages’: [{’key’: ’/languages/eng’}],
’subjects’: [’Creative thinking -- Testing’, ’Educational psychology’],
’publish_country’: ’gau’,
’by_statement’: ’by Lillian Rose Arnold’,
’type’: {’key’: ’/type/edition’},
’location’: [’NBuC’],
’other_titles’: [’Marking tools’, ’Motivational treatment’], ’publishers’:
[’University of Georgia’], ’last_modified’: {’type’: ’/type/datetime’,
’value’: ’2009-12-15T08:04:07.512219’},
’key’: ’/books/OL22783906M’, ’authors’: [{’key’: ’/authors/OL6535896A’}],
’publish_places’: [’Athens’],
’oclc_number’: [’3954579’],
’pagination’: ’xi, 161 leaves’,
’created’: {’type’: ’/type/datetime’, ’value’:
’2008-12-30T07:38:13.854568’},
’notes’: {’type’: ’/type/text’,
’value’: ’Microfilm of typescript. Ann Arbor, Mich. : University
Microfilms, 1975. -- 1 reel ; 35 mm\n\nThesis--University of
Georgia\n\nBibliography: leaves 147-161’},
’number_of_pages’: 161,
’publish_date’: ’1975’,
’works’: [{’key’: ’/works/OL13681062W’}]}]
You can use the following code to read the le either in interactive mode or job submission:
Listing 2: Read Data File.
from pyspark.sql import SparkSession
import json
# spark object is the main gateway to your application.
# it is used to start a spark session and define different parameters for
your job.
spark = SparkSession.builder.appName("TestJob").getOrCreate()
path = ’/user//books.json’ # path to file in hdfs
# read the json file
raw_data = spark.sparkContext.textFile(path)
# the json.loads function is passed to a mapper for the data file so that
each row is parsed and transformed to a dictionary which is easier to
work with than a string.
# the result is an RDD which holds the entire dataset.
dataset = raw_data.map(json.loads)
2
After reading the dataset, perform the following operations to make sure you have read the
le:
Listing 3: Data Check.
dataset.count() # 723
dataset.take(1) # as Listing 1
Once you have the dataset available in an RDD, you can implement the following queries.
3 Operations and Queries.
For every query described bellow you can use a combination and a chain of lter, map, reduce,
etc. functions one or more times in your implementation. But make sure the function name in
the title is used at least once in your implementation.
As an example, we want to generate a list of all book titles in the dataset. We can chain
functions in the following way:
Listing 4: Sample Result.
list_book_titles = (
dataset
.filter(lambda e: "title" in e) # only bring rows that include the
"title" field
.map(lambda c: c["title"])
.collect()
)
It will generate a list similar to this:
Listing 5: Sample Result.
[’The effect of differentiated marking tools and motivational treatment on
figural creativity’, ’Comparison of the nominal grouping and sequenced
brainstorming techniques of creative idea generation’,
’Professional accident investigation’,
......
3.1 atMap
Generate a list of all distinct subjects in the dataset.
Listing 6: Sample Result.
[’Creative thinking -- Testing’,
’Educational psychology’,
’Problem solving’,
’Group problem solving’,
.....]
3
3.2 sortByKey
Generate a list of book titles sorted by their number of pages.
Listing 7: Sample Result.
(’Aesthetic education in social perspective’, 10),
(’Cmo vivan nuestros abuelos’, 10),
(’Fine-line developer’, 11),
....,
(’Life histories of North American cardinals, grosbeaks, buntings, towhees,
finches, sparrows, and allies, order Passeriformes, family
Fringillidae’, 1889)
3.3 reduce
Use the map and reduce functions to nd the book with the highest number of pages (Don’t
use any max function).
Listing 8: Sample Result.
(1889, ’Life histories of North American cardinals, grosbeaks, buntings,
towhees, finches, sparrows, and allies, order Passeriformes, family
Fringillidae’)
3.4 groupByKey
Find the number of books published in each year.
Listing 9: Sample Result.
(’1977’, 14),
(’1964’, 15),
(’1969’, 43),
....
3.5 reduceByKey
Find the number of books published in each year, this time with reduceByKey. Make sure to
understand its dierence with groupByKey.
Listing 10: Sample Result.
(’1977’, 14),
(’1964’, 15),
(’1969’, 43),
....
3.6 aggregateByKey
Find the book with highest number of pages in each city.
4
Listing 11: Sample Result.
(’Athens’, (’The effect of differentiated marking tools and motivational
treatment on figural creativity’, 161))
(’Jerusalem’, (’A lebn fun shlies’, 447))
(’New York, Prentice-Hall’, (’Notre Dame football’, 244))
4 Submission.
Implement all six queries and in one python le and submit.
Appendices
A Apache Spark API.
Python API for Apache Spark:
https://spark.apache.org/docs/2.4.0/api/python/index.html
5

欢迎咨询51作业君
51作业君 51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: ITCSdaixie