代写辅导接单-CSE 6242 / CX 4242: Data and Visual Analytics | Georgia Tech | Fall 2023 HW 3

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Version 0.2

CSE 6242 / CX 4242: Data and Visual Analytics | Georgia Tech | Fall 2023 HW 3: Spark, Docker, DataBricks, AWS and GCP

Important Notes ........................................................................................................................................... 2 Submission Instructions ............................................................................................................................... 2 Grading and Feedback ................................................................................................................................ 2 DO THIS NOW: Create AWS Academy account ...................................................................................... 3 Download Skeleton and Data Before You Begin .................................................................................... 3 Homework Overview.................................................................................................................................... 3 Q1 [15 points] Analyzing trips data with PySpark ........................................................................................ 4

Tasks........................................................................................................................................................ 4 Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks ................................................................... 7 Tutorial ..................................................................................................................................................... 7 Tasks........................................................................................................................................................ 8 Q3 [35 points] Analyzing Large Amount of Data with PySpark on AWS.................................................... 10 Setting Up AWS Environment ................................................................................................................ 10 Datasets ................................................................................................................................................. 10 Goal........................................................................................................................................................ 10 Tasks...................................................................................................................................................... 11 Q4 [10 points] Analyzing a Large Dataset using Spark on GCP................................................................ 13 GCP Guidelines ..................................................................................................................................... 13 Goal........................................................................................................................................................ 13 Tasks...................................................................................................................................................... 14 Q5 [10 points] Regression: Automobile price prediction, using Azure Machine Learning ......................... 16 Goal........................................................................................................................................................ 16 Tasks...................................................................................................................................................... 16

Important Notes

1. Submit your work by the due date on the course schedule.

a. Every assignment has a 48-hour grace period. You may use it without asking us.

b. Before the grace period expires, you may resubmit as many times as you need to.

c. The grace period is a lenient buffer for resolving last minute issues. We do not recommend

starting new work or modifying existing work during the grace period.

d. TA assistance is not guaranteed during the grace period.

e. Submissions during the grace period will display as "late" and will not incur a penalty. f. We will not accept any submissions after the grace period.

2. Always use the most up-to-date assignment (version number at bottom right of this document).

3. You may discuss ideas with other students at the "whiteboard" level (e.g. how cross validation

works, use HashMap instead of array) and review any relevant materials online. However, each

student must write up and submit the student’s own answers.

4. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will

be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.

Submission Instructions

Carefully read and follow the high-level instructions below, and the detailed instructions in each question.

1. Submit ALL deliverables via Gradescope. We will not accept submissions anywhere else.

2. Submit all the required files, as specified at the beginning of each question. We will not grade any

submissions that deviate from the specified format (extra files, misnamed files, etc.).

3. Each submission and its score will be recorded and saved by Gradescope. By default, Gradescope uses your last submission for grading. To use a different submission, you MUST “activate” it

(click “Submission History” button at bottom toolbar, then “Activate”). Grading and Feedback

The maximum possible score for this homework is 100 points. We will auto-grade all questions using the Gradescope platform. Keep the following in mind:

1. 2.

You can access Gradescope through Canvas.

You may upload your code periodically to obtain feedback on your code. Gradescope will auto-grade your submission using the same test cases we use to grade your work.

You must not use Gradescope as the primary way to test your code’s correctness. It provides only a few test cases, and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a "final" check.

Gradescope cannot run code that contains syntax errors. If Gradescope is not running, verify:

a. Your code is free of syntax errors (by running it locally) b. All methods have been implemented

c. You have submitted the correct file with the correct name

Do not share links to your Gradescope submissions in public posts

When many students use Gradescope simultaneously, it may slow down or fail to communicate with the tester. It can become even slower as the submission deadline approaches. You are responsible for submitting your work on time.

Version 0.2

DO THIS NOW: Create AWS Academy account

Complete step 1 (Create an AWS Academy account) of the AWS Setup Guide to create your account. Download Skeleton and Data Before You Begin

• HW3 Skeleton

• Q1 Data, Q2 Data, and Q4 Data

Homework Overview

Many modern-day datasets are huge and truly exemplify “big data”. For example, the Facebook social graph is petabytes large (over 1M GB); every day, Twitter users generate over 12 terabytes of messages; and the NASA Terra and Aqua satellites each produce over 300 GB of MODIS satellite imagery per day. These raw datasets are far too large to even fit on the hard drive of an average computer, let alone to process and analyze. Luckily, there are a variety of modern technologies that allow us to process and analyze such large datasets in a reasonable amount of time. For the bulk of this assignment, you will be working with a dataset of over 1 billion individual taxi trips from the New York City Taxi & Limousine Commission (TLC). Further details on this dataset are available here.

The main goal of this assignment is to help students gain exposure to a variety of tools that will be useful in the future (e.g., future project, research, career). The reasoning behind intentionally including AWS, Azure and GCP, is because we want students to be able to try and compare these platforms as they evolve rapidly. This will help the students in the future. Should they need to select a cloud platform to use, they can make more informed decisions and be able to get started right away.

You will find that many computational tasks in this assignment are not very difficult, and there seems to be quite a bit of “setup” to do before getting to the actual “programming” part of the problem. Being able to set up work environments, start clusters, monitor compute usage, and run large-scale experiments on cloud platforms are important skills. Through this assignment, you will be able to familiarize yourself with using clusters of machines, and the pay-per-use model used by most cloud services. This is a helpful first cloud service experience for many students.

3 Version 0.2

Q1 [15 points] Analyzing trips data with PySpark

Goal Technology Deliverables

Get familiar with PySpark to answer analytical questions. PySpark, Docker

[Gradescope] q1.ipynb: your solution as a Jupyter Notebook file

Follow these instructio ns to download and set up a preconfigured Docker image that you will use for this assignment. that you will use for this assignment.

Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and instructor team) ran into many issues that could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. There is no need to install any additional Java or PySpark dependencies as they are all bundled as part of the Docker container.

Imagine that your boss gives you a large dataset which contains trip information of New York City Taxi and Limousine Commission (TLC). You are asked to provide summaries of the most common trips, as well as information related to fares and traffic. This information might help in positioning taxis depending on the demand at each location.

You are provided with a Jupyter notebook (q1.ipynb) file which you will complete using PySpark using the provided Docker image.

Note:

1. Regular PySpark Dataframe Operations and PySpark SQL operations can be used.

2. If you re-run cells, remember to restart the kernel to clear the Spark context, otherwise an existing

Spark context may cause errors.

3. Be sure to save your work often! If you do not see your notebook in Jupyter, then double check that

the file is present in the folder and that your Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter then you can still move forward by clicking the “upload” button in the Jupyter notebook and uploading the file – however, if you use this approach, then your file will not be saved to disk when you save in Jupyter, so you would need to download your work by going to File > Download as... > Notebook (.ipynb), so be sure to download often to save your work!

Tasks

You will use the yellow_tripdata_2019-01_short.csv dataset, a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values, unless specifically instructed to.

a. passenger_count—integer b. total_amount — float

c. tip_amount — float

d. trip_distance — float

e. fare_amount—float

[1 pt] You will be modifying the function clean_data to clean the data. Cast the following columns into the specified data types:

Version 0.2

f. tpep_pickup_datetime—timestamp g. tpep_dropoff_datetime—timestamp

[4 pts] You will be modifying the function common_pair. Return the top 10 pickup-dropoff location pairs that have the highest number of total passengers who have traveled between them. Sort the location pairs by total passengers. For each location pair, also compute the average amount per passenger over all trips (name this per_person_rate), utilizing total_amount.

For pairs with the same total passengers, sort them in descending order of per_person_rate. Filter out any trips that have the same pick-up and drop-off location. Rename the column for total passengers to passenger_count.

Sample Output Format (values are examples only):

PULocationID DOLocationID passenger_count 1 2 23

3 4 5

per_person_rate 5.242345 6.61345634

[4 pts] You will be modifying the function distance_with_most_tip . Filter the data for trips having fares (fare_amount) greater than $2.00 and a trip distance (trip_distance) greater than 0. Calculate the tip percent (tip_amount * 100 / fare_amount) for each trip. Round all trip distances up to the closest mile and find the average tip_percent for each trip_distance. Sort the result in descending order of tip_percent to obtain the top 15 trip distances which tip the most generously. Rename the column for rounded trip distances to trip_distance, and the column for average tip percents tip_percent .

Output:

trip_distance tip_percent

2 6.2632344561 1 4.42342882

[6 pts] You will be modifying the function time_with_most_traffic to determine which hour of the day has the most traffic. Calculate the traffic for a particular hour using the average speed of all taxi trips which began during that hour. Calculate the average speed as the average trip_distance divided by the average trip time, as distance per hour. A day with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic.

Additionally, you must separate the hours into AM and PM, with hours 0:00-11:59 being AM, and hours 12:00-23:59 being PM. Convert these times to the 12 hour time, so you can match the below output. For example, the row with 1 as time of day, should show the average speed between 1 am and 2 am in the am_avg_speed column, and between 1 pm and 2pm in the pm_avg_speed column.

Use date_format along with the appropriate pattern letters to format the time of day so that it matches the example output below. Your final table should contain values sorted from 0-11 for time_of_day. There may be data missing for a time of day, and it may be null for am_avg_speed or pm_avg_speed. If an hour has no data for am or pm, there may be missing rows. Do not include any additional rows for times of day which are not represented in the data.

Version 0.2

Sample Output Format (values are examples only):

time_of_day am_avg_speed 1 0.953452345

2 5.2424622

4 null

pm_avg_speed 9.23345272

null

2.55421905

IMPORTANT: Strictly follow the requirements below, or your answers may not be graded.

Version 0.2

1. 2.

Do not add any cells or additional imports to the notebook.

Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed.

Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks

Goal Perform further analysis using Spark on DataBricks. Technology Spark/Scala, Databricks

Deliverables

[Gradescope]

• q2.dbc: Your solution as Scala Notebook archive file (.dbc) exported from Databricks (see Databricks Setup Guide below)

• q2.scala: Your solution as a Scala source file exported from Databricks (see Databricks Setup Guide below)

• q2_results.csv: The output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() function into a file titled q2_results.csv under the relevant sections. Please double-check and compare your actual output with the results you copied.

Tutorial

Firstly, go over this Spark on Databricks Tutorial, to learn the basics of creating Spark jobs, loading data, and working with data.

You will analyze nyc-tripdata.csv1 using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table using taxi_zone_lookup.csv that maps the location ID into the actual name of the region in NYC. The nyc-trip data dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver- reported passenger counts.

VERY IMPORTANT

1. Use only Firefox, Safari or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers.

2. Carefully follow the instructions in the Databricks Setup Guide . (You should have already downloaded the data needed for this question using the link provided before Homework Overview.)

a. You must choose the Databricks Runtime (DBR) version as “10.4 (includes Apache Spark 3.2.1, Scala 2.12)”. We will grade your work using this version.

b. Note that you do not need to install Scala or Spark on your local machine. They are provided with the DBR environment.

3. You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for Spark DataSet of rows. You can use the DataSet API in Spark to work on these DataFrames. Here is a Spark document that will help you get started on working with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame.

a. After selecting the default language as SCALA, do not use the language magic %<language> with other languages like %r, %python, %sql etc. The language magics are used to override the default language, which you must not do for this assignment.

b. You must not use full SQL queries in lieu of the Spark DataFrame API. That is, you must not use functions like sql(), which allows you to directly write full SQL queries like spark.sql (“SELECT* FROM col1 WHERE ...”). This should be df.select(“*”) instead.

4. The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads a data file

1 Graph derived from the NYC Taxi and Limousine Commission

7 Version 0.2

nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (Refer to the Databricks Setup Guide above). It also contains code that filters the data to only keep the rows where the pickup location is different from the drop location, and the trip distance is strictly greater than 2.0 (>2.0).

a. All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers.

b. Carefully read the instructions in the notebook, which provides hints for solving the problems.

5. Some tasks in this question have specified data types for the results that are of lower precision (e.g., float). For these tasks, we will accept relevant higher precision formats (e.g., double). Similarly, we will accept results stored in data types that offer “greater range” (e.g., long, bigint) than what we have specified (e.g., int).

6. Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed.

Tasks

List the top-5 most popular locations for:

a. [2 pts] dropoff based on "DOLocationID", sorted in descending order by popularity. If there is a tie, then one with a lower "DOLocationID" gets listed first.

b. [2 pts] pickup based on "PULocationID", sorted in descending order by popularity. If there is a tie, then one with a lower "PULocationID" gets listed first.

[4 pts] List the top-3 locationID’s with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first.

Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.

[4 pts] List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx along with "Unknown" and "EWR") and their total number of activities, in descending order of a total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens). An example output format is shown below.

[5 pts] List the top 2 days of the week with the largest number of daily average pick-ups, along with the average number of pick-ups on each of the 2 days in descending order (no rounding off required). Here, the average pickup is calculated by taking an average of the number of pick-ups on different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021 and 02/15/2021 are

Version 0.2

all Mondays, so the average pick-ups for these is the sum of the pickups on each date divided by 3. An example output is shown below.

Note: The day of week is a string of the day’s full spelling, e.g., "Monday" instead of the number 1 or "Mon". Also, the pickup_datetime is in the format: yyyy-mm-dd

5) [6 pts] For each hour of a day (0 to 23, 0 being midnight) — in the order from 0 to 23 (inclusively), find the zone in the Brooklyn borough with the largest number of total pick-ups.

Note: All dates for each hour should be included.

6) [7 pts] Find which 3 different days in the month of January, in Manhattan, that saw the largest positive percentage increase in pick-ups compared to the previous day, in the order from largest percentage increase to smallest percentage increase. An example output is shown below.

Note: All years need to be aggregated to calculate the pickups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.

List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These pre- formatted sections also show you the required output format from your Scala code with the necessary columns — while column names can be different, their resulting values must be correct.

• You must manually enter the output generated into the corresponding sections of the q2_results.csv file, preferably using some spreadsheet software like MS-Excel (but make sure to keep the csv format). For generating the output in the Scala notebook, refer to show() and display()functions of Scala.

• Note that you can edit this csv file using text editor, but please be mindful about putting the results under designated columns.

• If you encounter a “UnicodeDecodeError”, please save file as “.csv UTF-8" to resolve.

Note: Do NOT modify anything other than filling in those required output values in this csv file. We grade by running the Spark Scala code you write and by looking at your results listed in this file. So, make sure that your output is actually obtained from the Spark Scala code you write.

Hint: You may find some of the following DataFrame operations helpful:

toDF, join, select, groupBy, orderBy, filter, agg, Window(), partitionBy, orderBy, etc.

9 Version 0.2

Q3 [35 points] Analyzing Large Amount of Data with PySpark on AWS

Goal Use PySpark on AWS using Amazon Athena. Technology PySpark, AWS

VERY IMPORTANT: Use Firefox, Safari or Chrome when configuring anything related to AWS.

You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis. You will be completing a task that may be accomplished using a commodity computer (e.g., consumer-grade laptops or desktops). However, we would like you to use this exercise as an opportunity to learn distributed computing on AWS, and to gain experience that will help you tackle more complex problems.

The services you will primarily be using are Amazon S3 storage, Amazon Athena. You will be creating an S3 bucket, running code using Athena and its serverless PySpark engine, and then storing the output into that S3 bucket. Amazon Athena is serverless, meaning that it is pay for what you use. There are no servers to maintain that will accrue costs whether it’s being used or not.

For this question, you will only use up a very small fraction of your AWS credit. If you have any issues with the AWS Academy account, please post in the dedicated AWS Setup Ed Discussion thread.

Setting Up AWS Environment

Go through all the steps in the AWS Setup Guide. You should have already completed Step 1 to create your account) to set up your AWS environment, e.g., setting up billing alert, creating S3 storage bucket, and uploading skeleton file.

Datasets

In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). You will be accessing the dataset directly through AWS via the code outlined in the homework skeleton. Specifically, you will be working with two samples of this dataset, one small, and one much larger. Optionally, if you would like to learn more about the dataset, check out here and here; also optionally, you may explore the structure of the data by referring to [1] [2].

EXTREMELY IMPORTANT: Both the datasets are in the US East (N. Virginia) region. Using machines in other regions for computation will incur data transfer charges. Hence, set your region to US East (N. Virginia) in the beginning (not Oregon, which is the default). This is extremely important, otherwise your code may not work, and you may be charged extra.

Goal

You work at NYC TLC, and since the company bought a few new taxis, your boss has asked you to locate potential places where taxi drivers can pick up more passengers. Of course, the more profitable the locations are, the better. Your boss also tells you not to worry about short trips for any of your analysis, so only analyze trips which are 2.0 miles or longer.

10 Version 0.2

Deliverables

[Gradescope]

• q3.ipynb: PySpark notebook for this question (for the larger dataset).

• q3_output_large.csv: output file (comma-separated) for the larger dataset.

First, find the 20 most popular drop off locations in the Manhattan borough by finding which of these destinations had the greatest passenger count.

Now, analyze all pick-up locations.

• For each pick-up location determine

o theaveragetotalamountpertrip,

o thetotalcountofalltripsthatstartatthatlocation,and

o thecountofalltripsthatstartatthatlocationandendatoneofmostpopulardrop-off

locations.

• Using the above values,

o determinetheproportionoftripsthatendinoneofthepopulardrop-offlocations(#trips that end in drop off location divided by total # of trips) and

o multiplythatproportionbytheaveragetotalamounttogetaweightedprofitvaluebased on the probability of passengers going to one of the popular destinations.

Your boss is not as savvy with the data as you are and is not interested in location IDs. To make it easy for your boss, provide the Borough and Zone for each of the top 20 pick-up locations you determined.

Tasks

You are provided with a python notebook (q3.ipynb) file which you will complete and load into EMR. You are provided with the load_data() function, which loads two PySpark DataFrames. The first DataFrame, trips, contains trip data where each record refers to one (1) trip. The second DataFrame, lookup, maps a LocationID to its trip information. It can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame.

The following functions must be completed for full credit.

VERY IMPORTANT

•

• •

•

Ensure that the parameters for each function remain as defined and the output order and names of the fields in the PySpark DataFrames are maintained.

Do not import any functions which were not already imported within the skeleton.

You must NOT round any numeric values. Rounding numbers can introduce inaccuracies. Our grader will be checking the first 8 decimal places of each value in the DataFrame.

The provided q3.ipynb contains an 'sc' cell. Sometimes, when you run it, you may encounter the 'Attach to a cluster to execute a cell' message. Please be aware that it takes a few minutes for a cluster/notebook to initialize before you can begin working.

[1 pts] user()

i. Returns your GT Username as a string (e.g., gburdell3)

[2 pts] long_trips(trips)

i. This function filters trips to keep only trips 2 miles or longer (e.g., >= 2).

ii. Returns PySpark DataFrame with the same schema as trips iii. Note: Parts c, d and e will use the result of this function

[6 pts] manhattan_trips(trips, lookup)

ii.

iii.

This function determines the top 20 locations with a DOLocationID in Manhattan by sum of passenger count.

Returns a PySpark DataFrame (mtrips) with the schema (DOLocationID, pcount)

Note: If you encounter the error 'Can only compare identically-labeled DataFrame objects,' it is likely due to the use of the RDD API. We recommend avoiding the use of the RDD API since it is not compatible with

Version 0.2

the autograder. Instead, we suggest rewriting the logic using a join clause.

d) [6 pts] weighted_profit(trips, mtrips) i. This function determines

i. the average total_amount, ii. the total count of trips, and

iii. the total count of trips ending in the top 20 destinations and return the weighted_profit as discussed earlier in the homework document.

iv. Returns a PySpark DataFrame with the schema (PULocationID, weighted_profit) for the weighted_profit as discussed earlier in this homework document.

e) [5 pts] final_output(wp, lookup)

i. This function

i. takes the results of weighted_profit,

ii. links it to the borough and zone through the lookup data frame, and

iii. returns the top 20 locations with the highest weighted_profit.

ii. Returns a PySpark DataFrame with the schema (Zone, Borough, weighted_profit)

iii. Note: If you encounter issues with '3.e Test Final Output,' primarily due to the DataFrame returned from 'final_output()' containing incorrect data, it is essential to reformat column data types, particularly when applying 'agg()' operations in previous sections.

Once you have implemented all these functions, run the main() function, which is already implemented, and update the line of code to include the name of your output s3 bucket and a location. This function will fail if the output directory already exists, so make sure to change it each time you run the function.

Example: final.write.csv(‘s3://cse6242-gburdell3/output-large3’)

Your output file will appear in a folder in your s3 bucket as a csv file with a name which is similar to part- 0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to q3_output_large.csv for submission. Do NOT make any other changes to the file.

Hints:

1. Refer to DataFrame commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed and withColumn. Documentation for the DataFrame APIs is located here.

2. Testing on a single, small dataset (i.e. a "test case") is helpful, but is not sufficient for discovering all potential issues, especially if such issues only become apparent when the code is run on larger datasets. It is important for you to develop more ways to review and verify your code logic.

3. Overwriting the DataFrames from the function parameters can cause unintended side effects when it comes to rounding. Be sure to preserve the DataFrames in each function.

4. Precision in data analytics is very important. Keep in mind that precision reduction in an earlier step can accumulate and be magnified, subsequently significantly affecting the final output’s precision (e.g., for a dataset with 1,000,000 data points, a 0.0001 difference for each data point can lead to a total difference of 100 over the whole dataset). This is called precision loss. Check out this post or hints on how to avoid precision loss.

5. Check if you're reducing the precision (or "scale") too aggressively. Can you relax the restriction during intermediate steps?

6. Make sure you return a DataFrame. If you get NoneType errors, you are most likely not returning what you think you are.

7. Some columns may need to be cast to the right data type. Keep that in mind!

IMPORTANT: Strictly follow the guidelines below, or your answer may not be graded.

1. Double check that you are submitting the correct files — we only want the script and output from the

larger dataset. Also, double check that you are writing the right dataset’s output to the right file.

Version 0.2

2. You are welcome to store your script’s output in any bucket you choose, as long as you can download and submit the correct files.

3. Do not make any manual changes to the output files.

4. Please ensure that you do not remove #export from the HW skeleton;

5. Regular Pyspark Dataframe Operations and PySpark SQL operations can be used.

5.1. To use PySpark SQL operations, you must use the SQL Context on the Spark Dataframe. Example: df.sql_ctx.sql(“SELECT * FROM some_data”)

6. Do not import any additional packages, INCLUDING pyspark.sql.functions, as this may cause the autograder to work incorrectly. Everything you need should be imported for you.

7. Using .rdd() can cause issues in the GradeScope environment. You can accomplish this assignment without it. In general, since the RDD API is outdated (though not deprecated), you should be wary of using this API.Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed.

Q4 [10 points] Analyzing a Large Dataset using Spark on GCP

Goal Technology Deliverables

Use Spark on Google Cloud Platform to analyze a dataset. Spark, Google Cloud Platform (GCP)

[Gradescope] q4.ipynb: the PySpark notebook for this question.

VERY IMPORTANT: Use Firefox, Safari or Chrome when configuring anything related to GCP. GCP Guidelines

Instructions to set up GCP Credits, GCP Storage and Dataproc Cluster are provided as video tutorials part 1, part 2, and part 3) and as written instructions

Helpful tips/FAQs for special scenarios:

a) If GCP service is disabled for your google account, try the steps in this google support link

b) If you have any issues with the GCP free credits, please post in the dedicated GCP Setup Ed

Discussion thread.

Goal

The goal of this question is to familiarize you with creating storage buckets/clusters and running Spark programs on Google Cloud Platform. This question asks you to create a new Google Storage Bucket and load the NYC Taxi & Limousine Commission Dataset. You are also provided with a Jupyter Notebook q4.ipynb file, which you will load and complete in a Google Dataproc Cluster. Inside the notebook, you are provided with the skeleton for the load_data() function, which you will complete to load a PySpark DataFrame from the Google Storage Bucket you created as part of this question. Using this PySpark DataFrame, you will complete the following tasks using Spark DataFrame functions.

You will use the data file yellow_tripdata09-08-2021.csv. The preceding link allows you to download the dataset you are required to work with for this question from the course DropBox. Each line represents a single taxi trip consisting of the comma-separated columns bulleted below. All columns are of string data type. You must convert the highlighted columns below into decimal data type (do NOT use float datatype) inside their respective functions when completing this question. Do not convert any datatypes within the load_data function. While casting to a decimal datatype, use a precision of 38 and a scale of 10.

Version 0.2

• vendorid

• tpep_pickup_datetime

• tpep_dropoff_datetime

• passenger_count

• trip_distance (decimal data type)

• ratecodeid

• store_and_fwd_flag

• pulocationid

• dolocationid

• payment_type

• fare_amount (decimal data type)

• extra

• mta_tax •

•

• improvement_surcharge

• total_amount

Tasks

You must first perform the task a BEFORE performing task b, c, d, e and f. No points are allocated to task a, but it is essential that you correctly implement the load_data()function as the remaining graded tasks depend upon this task and its correct implementation. Upload code to Gradescope ONLY after completing all tasks and removing/commenting all the testing code. Anything else will break the autograder

a) [0 pts — required] Function load_data()to load data from a Google Storage Bucket into a Spark DataFrame

b) [2 pts] Function exclude_no_pickuplocations() to exclude trips with no pick-up locations (pick- up location id column is null or is zero) in the original data from a.

c) [2 pts] Function exclude_no_tripdistance() to exclude trips with no distance (i.e., trip distance column is null or zero) in the dataframe output by exclude_no_pickuplocations().

d) [2 pts] Function include_fare_range() to include trips with fare from $20 (inclusively) to $60 (inclusively) in the dataframe output by exclude_no_tripdistance().

e) [2 pts] Function get_highest_tip() to identify the highest tip (rounded to 2 decimal places) in the dataframe output by include_fare_range().

f) [2 pts] Function get_total_toll() to calculate the total toll amount (rounded to 2 decimal places) in the dataframe output by include_fare_range().

IMPORTANT: Strictly follow the guidelines below, or your answer may not be graded.

tip_amount (decimal data type)

tolls_amount (decimal data type)

4. 5.

Regular PySpark Dataframe Operations can be used.

Do NOT use any functions from the RDD API or your code will break the autograder. In general, since the RDD API is outdated (though not deprecated), you should be wary of using this API.

Make sure to download the notebook from your GCP cluster before deleting the GCP cluster (otherwise, you will lose your work).

Do not add new cells to the notebook, as this may break the auto-grader.

Remove all “testing” code that renders output, or Gradescope will crash. For instance, any additional print, display, and show statements used for debugging must be removed.

Do not use any .rdd function in your code. Not only will this break the autograder, but you should be wary of using this function in general .

Version 0.2

7. Ensure that you are only submitting a COMPLETE solution to gradescope. Anything less will break the autograder. Write local unit tests to help test your code.

From experience we have had with the class, we have found that downloading the notebook using vanilla jupyter worked better than jupyter lab.

15 Version 0.2

Q5 [10 points] Regression: Automobile price prediction, using Azure Machine Learning

Goal Technology Deliverables

Use Microsoft Azure to predict automobile prices

Azure Machine Learning

[Gradescope] q5.csv: a csv file containing results for all parts

Note: Create and use a free workspace instance on Azure Machine Learning. Use your Georgia Tech username (e.g., jdoe3) to login.

Goal

The primary purpose of this question is to introduce you to Microsoft Machine Learning Studio by familiarizing you to its basic functionalities and machine learning workflows. Go through the Automobile Price Prediction tutorial and create/run ML experiments to complete the following tasks. You will not incur any cost if you save your experiments on Azure till submission. Once you are sure about the results and have reported them, feel free to delete your experiments.

Tasks

You will manually modify the given file q5.csv by adding the results using a plain text editor from the following tasks.

1. 2.

Update q5.csv by replacing gburdell3 with your GT username

[3 pts] Repeat the experiment described in the tutorial and report values of all metrics as mentioned

in the Evaluate Model section of the tutorial. Make sure the Split Data looks as it does below:

[3 pts] Repeat the experiment mentioned in task 2 with a different value of Fraction of rows in the first output dataset in the Split Data module. Change the value to 0.8 from the originally set value of 0.7. Report corresponding values of the metrics.

[4 pts] Run a new experiment — evaluate the model using 5-fold cross-validation CV. Select parameters in the Partition and Sample component in accordance with the figure below. For Cross Validate model set the column name as “price” for CV and use 0 as a random seed. Report the

Version 0.2

3. 4.

values of Root Mean Squared Error (RMSE) and Coefficient of Determination for each of the five folds (1st fold corresponds to fold number 0 and so on). Do NOT round the results. Report exact values. HINT: to see results, right click Cross Validate Model and select Preview data à Evalution results by fold. Make sure to utilize the same data cleaning/processing steps as you did before.

Figure: Property Tab of Partition and Sample Module

IMPORTANT: Strictly follow the guidelines below, or your answer may not be graded.

Version 0.2

• • •

•

DO NOT change the order of the questions.

Report the exact numerical values that you get in your output, and DO NOT round any of them. When manually entering a value into the csv file, append it immediately after a comma, so

there will be NO space between the comma and your value, and no trailing spaces or commas after your value.

Follow the tutorial and do not change values for L2 regularization. For tasks 2 and 3, select the columns given in the tutorial.