Monash University
FIT5202 - Data processing for Big Data
Assignment 1: Analysing Trip Data

Background
87Drive is an online marketplace, where drivers are the supply and passengers are the demand.
One of our main challenges is to keep this marketplace balanced. If there's too much demand,
prices would increase due to surges and passengers would prefer not to ride. If there's too much
supply, drivers would spend more time idle, impacting their revenue. Here, we want to employ
various operations on the dataset using Spark to answer different queries.
Required Datasets (available in Moodle):
- Three datasets: Trip, Passenger, and City
- These files are available in Moodle under Assessment 1.
Information on Dataset
The data used here is a simulated dataset, which mainly records trips that happened on the
87Drive platform in 2019 across different cities. The data is available on the website:
https://www.kaggle.com/datasets/ivanchvez/99littleorange

The datasets contain various details about trip time and cost information. In this assignment,
only three datasets, i.e., Trip, Passenger, and City, are considered. For more detailed information
on the datasets, please refer to the given website.
Assignment Information
The assignment consists of three parts: RDDs (Part 1), DataFrames (Part 2), and Comparison (Part 3).
In this assignment, you are required to implement various solutions based on RDDs and
DataFrames in PySpark for the given queries related to trip data analysis. In the RDD part,
you will only use the Trip and Passenger datasets. In the DataFrame part, all the datasets will be
considered. In the comparison part, only the Trip and City datasets will be considered.

Getting Started
● Download the datasets from Moodle.
● Create an Assignment_1.ipynb file in Jupyter notebook to write your solution.
● You will be using Python 3+ and PySpark 3.0.0 for this assignment.
Part 1: Working with RDDs (30%)
In this section, you will need to create RDDs from the given datasets, perform partitioning in
these RDDs and use various RDD operations to answer the queries for trip analysis.
1.1 Data Preparation and Loading (5%)
1. Write the code to create a SparkContext object using SparkSession, which tells Spark
how to access a cluster. To create a SparkSession you first need to build a SparkConf
object that contains information about your application. Give an appropriate name for
your application and run Spark locally with as many working processors as logical cores
on your machine.
You should create a folder called “data” and place the three csv files inside it. The
“Assignment_1.ipynb” file should be created outside the data folder, so that the layout is
consistent for every student. See the attached image for how to store your data files and
where to create your Assignment_1 file. (A sketch of steps 1–4 appears after this list.)

2. Import all the “Trip” csv files into a single RDD.
3. Import all the “Passenger” csv files into a single RDD.
4. For both Trip and Passenger, remove the header rows and display the total count
and first 5 records.
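A minimal sketch of steps 1–4, assuming hypothetical file names (trips.csv and passengers.csv) inside the data folder; substitute the actual names of the files from Moodle:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Build a SparkConf with an application name; "local[*]" runs Spark locally
# with as many worker threads as there are logical cores.
conf = SparkConf().setAppName("FIT5202 Assignment 1").setMaster("local[*]")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# File names are assumptions; the glob patterns pick up multiple csv parts if any.
trip_rdd = sc.textFile("data/trips*.csv")
passenger_rdd = sc.textFile("data/passengers*.csv")

# Remove header row(s): every line equal to the first (header) line is dropped.
def drop_header(rdd):
    header = rdd.first()
    return rdd.filter(lambda line: line != header)

trip_rdd = drop_header(trip_rdd)
passenger_rdd = drop_header(passenger_rdd)

for name, rdd in [("Trip", trip_rdd), ("Passenger", passenger_rdd)]:
    print(name, rdd.count())
    print(rdd.take(5))
```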
1.2 Data Partitioning in RDD (10%)
1. How many partitions do the above RDDs have? How is the data in these RDDs
partitioned by default, when we do not explicitly specify any partitioning strategy?
2. In the “Passenger” csv dataset, there is a column called first_call_time, which records
the passenger's first call time.
a. Create a Key Value Pair RDD of passenger data, with the key as 'In 2019' or
'Not In 2019' depending on whether the first call time ('first_call_time' column)
happened in 2019, and the rest of the columns as the value. After that, print the
first 5 records.
b. Assume we want to keep all the data related to 2019 in one partition and keep
the other years' data in another partition. Write the code to implement this
partitioning in RDD using appropriate partitioning functions. (Explain which
partitioning strategy you used. A sketch of parts a–c appears after this list.)
c. Write the code to print the number of records in each partition.
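A sketch of parts a–c, under the assumption that the RDD holds raw CSV lines and that FIRST_CALL_IDX (hypothetical) is the position of the first_call_time column:

```python
FIRST_CALL_IDX = 3  # hypothetical index of first_call_time; adjust to the real schema

# a. Key-value pair RDD: key is 'In 2019' or 'Not In 2019', value is the record,
#    assuming the year appears in the first_call_time string.
def tag_year(line):
    fields = line.split(",")
    key = "In 2019" if "2019" in fields[FIRST_CALL_IDX] else "Not In 2019"
    return (key, fields)

passenger_kv = passenger_rdd.map(tag_year)
print(passenger_kv.take(5))

# b. Two partitions, one per key: partitionBy takes the number of partitions
#    and a function mapping each key to a partition id (a custom hash-style
#    partitioner over the two key values).
kv_partitioned = passenger_kv.partitionBy(
    2, lambda key: 0 if key == "In 2019" else 1)

# c. Number of records in each partition.
print(kv_partitioned.mapPartitionsWithIndex(
    lambda idx, part: [(idx, sum(1 for _ in part))]).collect())
```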

1.3 Query/Analysis (15%)
For the Trip RDD, write relevant RDD operations to answer the following queries (a combined sketch appears after the list).
1. There are 2 columns called "trip_distance" and "surge_rate", which show the
distance and surge rate of each trip. Filter out the records from the trip RDD in which
either of these columns is empty. Show the count before and after filtering.
2. Calculate the average surge rate for each city. (Hint: you can use 'city_id' directly)

3. Find the driver ids with the maximum and minimum trip distance. Also, print out all
the other trips those drivers made. (Hint: filter out negative trip-distance values.)
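A sketch of the three queries, assuming hypothetical column positions in the trip CSV (replace the *_IDX values with the real ones):

```python
# Hypothetical column positions in the trip csv; replace with the real ones.
DIST_IDX, SURGE_IDX, CITY_IDX, DRIVER_IDX = 4, 5, 1, 2

# 1. Drop records where trip_distance or surge_rate is empty.
print("count before:", trip_rdd.count())
trip_clean = trip_rdd.filter(
    lambda line: line.split(",")[DIST_IDX].strip() != ""
    and line.split(",")[SURGE_IDX].strip() != "")
print("count after:", trip_clean.count())

# 2. Average surge rate per city: accumulate (sum, count) per city_id, then divide.
avg_surge = (trip_clean
             .map(lambda line: line.split(","))
             .map(lambda f: (f[CITY_IDX], (float(f[SURGE_IDX]), 1)))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .mapValues(lambda s: s[0] / s[1]))
print(avg_surge.collect())

# 3. Drivers with the maximum and minimum trip distance (negative distances
#    filtered out first), plus every trip made by those drivers.
fields = trip_clean.map(lambda line: line.split(","))
valid = fields.filter(lambda f: float(f[DIST_IDX]) >= 0)
max_row = valid.max(key=lambda f: float(f[DIST_IDX]))
min_row = valid.min(key=lambda f: float(f[DIST_IDX]))
for row in (max_row, min_row):
    driver = row[DRIVER_IDX]
    print("driver", driver, "trips:",
          valid.filter(lambda f, d=driver: f[DRIVER_IDX] == d).collect())
```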


Part 2: Working with DataFrames (55%)
In this section, you will need to load the given datasets into PySpark DataFrames and
use DataFrame functions to answer the queries.
2.1 Data Preparation and Loading (5%)
1. Load the trip, passenger, and city data into three separate dataframes, as in the
sketch after this list. (Hint: you should directly use “inferSchema=True”.)
2. Display the schema of the final three dataframes.
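A sketch, assuming hypothetical file names:

```python
# inferSchema=True asks Spark to infer each column's type from the data.
trip_df = spark.read.csv("data/trips.csv", header=True, inferSchema=True)
passenger_df = spark.read.csv("data/passengers.csv", header=True, inferSchema=True)
city_df = spark.read.csv("data/city.csv", header=True, inferSchema=True)

for df in (trip_df, passenger_df, city_df):
    df.printSchema()
```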
2.2 Query/Analysis (15%)
Implement the following queries using dataframes. You need to be able to perform operations
like filtering, sorting, joining, and group by using the functions provided by the DataFrame API.
In the following, DF means DataFrame. (A combined sketch appears after the list.)
1. Rename ‘id’ in the city DF to ‘city_id’.
2. Join the city DF with the trip DF. Delete the 'city_id' column and rename the 'name'
column to 'city'. (Hint: you should use an “inner join”.)
3. Using the joined DF from 2.2.2, keep the rows in which trip_distance and trip_fare are
both larger than 0. Show 5 records after filtering.
4. Using the filtered DF from 2.2.3, show the top 5 rows in descending order of
trip_distance. (‘id’, ‘driver_id’, ‘passenger_id’, and ‘trip_distance’ should be displayed.)
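A combined sketch of 2.2.1–2.2.4, assuming the column names used in the task descriptions:

```python
# 1. Rename 'id' in the city DF to 'city_id'.
city_df = city_df.withColumnRenamed("id", "city_id")

# 2. Inner join on city_id, drop the join key, rename 'name' to 'city'.
joined_df = (trip_df.join(city_df, on="city_id", how="inner")
             .drop("city_id")
             .withColumnRenamed("name", "city"))

# 3. Keep rows where both trip_distance and trip_fare are positive.
filtered_df = joined_df.filter(
    (joined_df.trip_distance > 0) & (joined_df.trip_fare > 0))
filtered_df.show(5)

# 4. Top 5 rows by descending trip_distance, selected columns only.
(filtered_df
 .select("id", "driver_id", "passenger_id", "trip_distance")
 .orderBy(filtered_df.trip_distance.desc())
 .show(5))
```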
2.3 Trip Analysis (35%)
In this section, we want to analyse whether the trip fare is higher during holidays compared
with normal days, and how different weekdays affect the number of trips.

Using the DataFrame created in 2.2.2, implement the following queries (a combined sketch appears after this list):
1. Create a new Boolean column named 'On Holiday' to identify whether the trip
happened on a holiday. If the 'call_time' column falls on one of the following dates,
then it is a holiday ('On Holiday' should be true). Print out the latest DF result.
(Hint: you can directly use the String type of 'call_time', and a udf will be used in
this task.)
('1/1/2019', '3/5/2019', '4/19/2019', '4/21/2019', '5/1/2019', '6/20/2019',
'9/7/2019', '10/12/2019', '11/2/2019', '11/15/2019', '12/25/2019')
2. Observe whether holidays have any effect on the average trip fare in various cities.
Your DF's output should be like the following image. You need to provide 2
implementations, one using DataFrames and one using Spark SQL, to finish this task.

3. Use the DataFrame created in 2.2.2. Create a new column called 'weekday' which
converts the 'call_time' column to 'MON', 'TUE', 'WED', ... Print out the top 5 rows in
the output. (Try to use “to_date” to get full marks; otherwise, you will lose some
marks. A udf will be used in this task.)
4. Based on the DataFrame created in 2.3.3, compute the total number of trips
(number of rows) for each weekday in different cities, and the percentage for the 7
different weekdays. Your output should be like the following image. (Hint: a udf will
be used in this task.)

5. Draw a bar chart of the ‘city’, ‘weekday’, and ‘Percentage’ columns generated in 2.3.4
using matplotlib. Discuss what effect different weekdays have on the number of trips
in different cities.
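A combined sketch of 2.3.1–2.3.5. It assumes the joined DF from 2.2.2 is named joined_df and that call_time is a string starting with an M/d/yyyy date (adjust the pattern to the real data); note that 2.3.4 is done here with a join rather than the udf the hint suggests:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
import matplotlib.pyplot as plt

HOLIDAYS = {'1/1/2019', '3/5/2019', '4/19/2019', '4/21/2019', '5/1/2019',
            '6/20/2019', '9/7/2019', '10/12/2019', '11/2/2019',
            '11/15/2019', '12/25/2019'}

# 1. Boolean 'On Holiday' column via a udf on the raw call_time string;
#    split(" ")[0] keeps only the date part if a time part is present.
is_holiday = F.udf(lambda ts: ts.split(" ")[0] in HOLIDAYS, BooleanType())
joined_df = joined_df.withColumn("On Holiday", is_holiday(F.col("call_time")))
joined_df.show(5)

# 2. Average trip fare per city, holiday vs non-holiday, with both APIs.
joined_df.groupBy("city").pivot("On Holiday").agg(F.avg("trip_fare")).show()

joined_df.createOrReplaceTempView("trips")
spark.sql("""
    SELECT city, `On Holiday`, AVG(trip_fare) AS avg_trip_fare
    FROM trips
    GROUP BY city, `On Holiday`
""").show()

# 3. Weekday abbreviation with to_date + date_format ('EEE' yields Mon, Tue, ...).
date_part = F.split(F.col("call_time"), " ").getItem(0)
weekday_df = joined_df.withColumn(
    "weekday", F.upper(F.date_format(F.to_date(date_part, "M/d/yyyy"), "EEE")))
weekday_df.show(5)

# 4. Trips per (city, weekday) and each weekday's share of the city total.
counts = weekday_df.groupBy("city", "weekday").count()
totals = counts.groupBy("city").agg(F.sum("count").alias("total"))
pct = (counts.join(totals, "city")
       .withColumn("Percentage",
                   F.round(F.col("count") / F.col("total") * 100, 2)))
pct.show()

# 5. Bar chart of Percentage by weekday, one group of bars per city.
pdf = pct.select("city", "weekday", "Percentage").toPandas()
pdf.pivot(index="weekday", columns="city", values="Percentage").plot(kind="bar")
plt.ylabel("Percentage of trips")
plt.show()
```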
Part 3: RDDs vs DataFrames vs Spark SQL (15%)
Implement the following queries using RDDs, DataFrames and SparkSQL separately. Log the
time taken for each query in each approach using the “%%time” built-in magic command in
Jupyter Notebook and discuss the performance difference between these 3 approaches.
Note: Students could research and/or think of other ways to compare the performance of the 3
approaches rather than rely on the "%%time" command.

Query: Join the trip and city data based on 'city_id', keep only the records for the city 'Minas Tirith',
and show the trip id, city name, and call time in the output. (A sketch of all three approaches follows.)
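A sketch of the query in all three approaches. Each block belongs in its own notebook cell with %%time on the first line; names such as city_rdd and the *_IDX column positions are hypothetical, and an action (take/show) is included in each cell so the work is actually executed:

```python
%%time
# RDD approach. city_rdd is the city csv loaded and header-stripped as in Part 1;
# TRIP_ID_IDX, CITY_IDX, CALL_IDX are hypothetical column positions, and the
# city csv is assumed to start with (id, name, ...).
city_kv = city_rdd.map(lambda l: l.split(",")).map(lambda f: (f[0], f[1]))
trip_kv = trip_rdd.map(lambda l: l.split(",")).map(
    lambda f: (f[CITY_IDX], (f[TRIP_ID_IDX], f[CALL_IDX])))
result = (trip_kv.join(city_kv)  # (city_id, ((trip_id, call_time), name))
          .filter(lambda kv: kv[1][1] == "Minas Tirith")
          .map(lambda kv: (kv[1][0][0], kv[1][1], kv[1][0][1])))
print(result.take(5))
```

```python
%%time
# DataFrame approach, assuming city_df already has 'id' renamed to 'city_id' (2.2.1).
(trip_df.join(city_df, "city_id")
 .filter("name = 'Minas Tirith'")
 .select("id", "name", "call_time")
 .show(5))
```

```python
%%time
# Spark SQL approach over temporary views.
trip_df.createOrReplaceTempView("trip")
city_df.createOrReplaceTempView("city")
spark.sql("""
    SELECT t.id, c.name, t.call_time
    FROM trip t JOIN city c ON t.city_id = c.city_id
    WHERE c.name = 'Minas Tirith'
""").show(5)
```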





Assignment Marking
The marking of this assignment is based on the quality of the work that you have submitted
rather than just the quantity. The marking starts from zero and goes up based on the tasks you
have completed and their quality: for example, how well the submitted code follows programming
standards, code documentation, presentation of the assignment, readability of the code,
organization of code, and so on. Please refer to PEP 8, the Style Guide for Python Code, for
reference: https://peps.python.org/pep-0008/
Submission
You should submit your final version of the assignment solution online via Moodle. You
must submit the following:
• A PDF file (created from the notebook) is to be submitted through the
Turnitin submission link. Use the browser's print function to save the
notebook as a PDF. Please name this pdf file based on your authcate name
(e.g. glii0039.pdf)
• An Assignment_1.ipynb file containing all of your code and outputs. (Please do
not submit the data files.)
Other Information
Where to get help
You can ask questions about the assignment in the Assignments section of the Ed Forum
accessible from the unit's Moodle Forum page. This is the preferred venue for assignment
clarification-type questions. You should check this forum regularly, as the responses of the
teaching staff are "official" and can constitute amendments or additions to the assignment
specification. You can also attend the consultation sessions if your questions are still not
resolved.
Plagiarism and collusion
Plagiarism and collusion are serious academic offenses at Monash University. Students
must not share their work with any other students. Students should consult the policy linked
below for more information.
https://www.monash.edu/students/academic/policies/academic-integrity
See also the video linked on the Moodle page under the Assignment block.
Students involved in collusion or plagiarism will be subject to disciplinary penalties, which
can include:
● The work not being assessed
● A zero grade for the unit
● Suspension from the University
● Exclusion from the University

Late submissions
Late Assignments or extensions will not be accepted unless you submit a special
consideration form. ALL Special Consideration, including within the semester, is now to be
submitted centrally. This means that students MUST submit an online Special Consideration
form via Monash Connect. For more details, please refer to the Unit Information section in
Moodle.
There is a 10% penalty per day including weekends for a late submission.



