CS236 Class Project (Due Thursday, June 11th) Spring 2020
TA: Pritom Ahmed (
[email protected])

Project Summary

For this project, you will be provided with two real datasets. The first dataset contains station information from weather stations across the world. The second provides individual recordings for those stations over a 4-year period.

The goal of the project is to find out which US states have the most stable rainfall. That is, the result should list the US states in ascending order of D, where D is the difference between the average rainfall of the month with the highest rainfall and the month with the lowest rainfall.

You will do the project individually. The project should be completed on Spark, using a local instance. The TA will upload instructions on how to quickly set up local Spark in a Unix environment.

Suggested Approach

Task 1 (25 points): Since your datasets contain data from stations around the world while the question is about US states, you first need to identify which stations are within the United States, then group those stations by state.

Task 2 (25 points): For each state with readings, find the average precipitation recorded for each month (ignoring year).

Task 3 (25 points): For each state, find the months with the highest and lowest averages.

Task 4 (25 points): Order the states by the difference between the highest and lowest monthly averages, in ascending order. In your result, you should return, for each state:
● The state abbreviation, e.g. “CA”
● The average precipitation and name of the highest month, e.g. “90, July”
● The average precipitation and name of the lowest month, e.g. “50, January”
● The difference between the two, e.g. “40”

Dataset Information

Please find the dataset here. The LOCATIONS dataset is a single .csv file (WeatherStationLocations.csv) containing the metadata for every station across the world. To identify that a station is in the US, look for stations where the “CTRY” field is “US” and the “ST” field is non-empty.
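The four tasks can be sketched in plain Python before translating them into Spark operations. The sketch below assumes the input has already been reduced to (state, month, precipitation) tuples (i.e. Task 1's filter-and-join is done); the function name `rank_states` and the `MONTHS` table are illustrative, not part of the handout. In Spark, the accumulation step would typically become a `reduceByKey` and the final ordering a `sortBy`.

```python
from collections import defaultdict

# Month names for formatting the output (1-indexed by month number).
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def rank_states(readings):
    """readings: iterable of (state, month, precipitation) tuples.
    Returns (state, hi_avg, hi_month, lo_avg, lo_month, D) rows,
    sorted ascending by D."""
    # Task 2: accumulate sum and count per (state, month), then average.
    totals = defaultdict(lambda: [0.0, 0])
    for state, month, prcp in readings:
        acc = totals[(state, month)]
        acc[0] += prcp
        acc[1] += 1
    averages = defaultdict(dict)
    for (state, month), (total, count) in totals.items():
        averages[state][month] = total / count
    # Tasks 3 and 4: highest/lowest month per state, then sort by D.
    result = []
    for state, by_month in averages.items():
        hi = max(by_month, key=by_month.get)
        lo = min(by_month, key=by_month.get)
        result.append((state, by_month[hi], MONTHS[hi - 1],
                       by_month[lo], MONTHS[lo - 1],
                       by_month[hi] - by_month[lo]))
    return sorted(result, key=lambda row: row[-1])
```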
Keep in mind that the first row of this file is a header. Here are the fields for this dataset:

USAF = Air Force station ID. May contain a letter in the first position.
WBAN = NCDC WBAN number
STATION NAME = Name of the station
CTRY = FIPS country ID
ST = State for US stations
LAT = Latitude in thousandths of decimal degrees
LON = Longitude in thousandths of decimal degrees
ELEV = Elevation in meters
BEGIN = Beginning Period Of Record (YYYYMMDD)
END = Ending Period Of Record (YYYYMMDD)

Sample Row:
"724920","23237","STOCKTON METROPOLITAN AIRPORT","US","CA","+37.889","-121.226","+0007.9","20050101","20140403"

The RECORDINGS dataset is contained in four files, one for each year (2006-2009). The “STN---” value will match the “USAF” field in the locations dataset. These files are concatenated from many small files, so keep in mind that there will be header lines throughout the files. Here are the fields for this dataset:

STN--- = The station ID (USAF)
WBAN = NCDC WBAN number
YEARMODA = The datestamp
TEMP = Ignore for this project
DEWP = Ignore for this project
SLP = Ignore for this project
STP = Ignore for this project
VISIB = Ignore for this project (Visibility)
WDSP = Ignore for this project
MXSPD = Ignore for this project
GUST = Ignore for this project
MAX = Ignore for this project (Max Temperature for the day)
MIN = Ignore for this project (Min Temperature for the day)
PRCP = Precipitation
SNDP = Ignore for this project
FRSHTT = Ignore for this project

Sample Row:
997781 99999 20061121 42.4 13 9999.9 0 9999.9 0 9999.9 0 999.9 0 17.5 13 22.0 999.9 46.2* 39.0* 0.00I 999.9 000000

Calculation of Precipitation

A value of 99.99 means that there is no data. (You can either treat these as 0 or exclude them from your calculations, though the latter is the more correct option.)

There is a letter at the end of the recordings. Here is a table of what the letters mean (basically, the letter tells you how long the precipitation was accumulated before recording):

A - 6 hours worth of precipitation
B - 12 hours...
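The two files need different parsers, as Hint 1 below warns. A minimal sketch, in plain Python, of functions you might call inside a Spark `map`: field positions are read off the sample rows above (USAF, WBAN, STATION NAME, CTRY, ST, … for locations; PRCP as the third token from the end for recordings), and the function names are placeholders of my choosing.

```python
import csv

def parse_us_station(line):
    """Parse one LOCATIONS row; return (usaf, state) for US stations
    with a non-empty ST field, else None (including the header row)."""
    fields = next(csv.reader([line]))
    if len(fields) > 4 and fields[3] == "US" and fields[4].strip():
        return fields[0], fields[4]
    return None

def parse_recording(line):
    """Parse one RECORDINGS row; return (usaf, month, prcp_token),
    or None for the repeated header lines (which begin with STN---)."""
    tokens = line.split()
    if not tokens or tokens[0] == "STN---":
        return None
    usaf, yearmoda, prcp = tokens[0], tokens[2], tokens[-3]
    return usaf, int(yearmoda[4:6]), prcp
```

Splitting the recordings line on whitespace and indexing from the end sidesteps the flag columns that follow some fields; whether that holds for every row in the real files is something to verify against the data.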
C - 18 hours...
D - 24 hours...
E - 12 hours... (slightly different from B but the same for this project)
F - 24 hours... (slightly different from D but the same for this project)
G - 24 hours... (slightly different from D but the same for this project)
H - station recorded a 0 for the day (although there was some recorded instance of precipitation)
I - station recorded a 0 for the day (and there was NO recorded instance of precipitation)

A simple solution would be to multiply. For instance, if a station recorded 12 hours worth of precipitation, multiply it by 2 to extrapolate 24 hours worth. How you treat these is up to you (just let the TA know in your README).

Due Date

This project is due Thursday, June 11th, before midnight. Email your submission to the TA.

Deliverables

A zipped file containing:
1. The script to run your job
2. The full source code
3. A README describing your project, including:
   1. An overall description of how you chose to divide the problem into different Spark operations, with reasoning.
   2. A description of each step/job
   3. Total runtime for the whole project
   4. A description of anything extra you did for the project, such as adding a combiner. If there is anything that you feel deserves extra credit, put it here.

Do not include the two original datasets.

The script should take three things as input:
1. A folder containing the Locations file
2. A folder containing the Recordings files
3. An output folder

Hints

1. Make sure you parse correctly. One file is a CSV, the other is not. One file has a single header row, the other has many.
2. Think about how many steps you should use. How much work should each step do?
3. It is a good idea to read about the different types of RDDs and the various operations they allow.
4. You may want to create a subset of the recordings dataset and use that while you are developing, instead of using the whole dataset. This can save some time.
5. Make sure you start early.
Potential Extra Credit (Optional)

Please feel free to beef up your project for extra credit. There are many ways you can do this. Here are a few examples:
● A clever way to achieve a faster execution time (fewer steps); if you do this, show how much improvement you get.
● Bigger bonus: Include the stations with “CTRY” as “US” that do not have a state tag, finding a way to estimate the state using the spatial distance to stations whose state is known. Some of these stations are ocean buoys, so you may want to require a maximum distance for a station to be included in a state, or you could create separate “states” representing the Pacific and Atlantic oceans (checked using coordinates). There is a lot of potential work here, so the extra credit could be large.

Whatever you try, let the TA know in your README. Make sure that you complete the project before going for extra credit. If you aren’t sure whether your idea is worth extra credit, just email the TA.
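For the nearest-station idea, a minimal sketch of the distance test in plain Python: great-circle (haversine) distance plus a cutoff that rejects far-away buoys. The 200 km cutoff and both function names are arbitrary illustrations, not values from the handout; a real solution would also need a strategy to avoid computing all pairwise distances in Spark.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat, dlon = p2 - p1, radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def estimate_state(lat, lon, tagged_stations, max_km=200.0):
    """Assign the state of the nearest state-tagged station, or None if
    everything is farther than max_km (e.g. ocean buoys).
    tagged_stations: iterable of (state, lat, lon) tuples."""
    state, slat, slon = min(
        tagged_stations,
        key=lambda s: haversine_km(lat, lon, s[1], s[2]))
    return state if haversine_km(lat, lon, slat, slon) <= max_km else None
```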