2020/4/8 P7: COVID-19 Growth Trend Clustering
https://canvas.wisc.edu/courses/176717/assignments/804556 1/4
P7: COVID-19 Growth Trend Clustering
Due Thursday by 9:29am. Points: 100. Submitting: a file upload. File types: py.
Available until Apr 9 at 9:29am.
NOTE: A tie-breaking protocol has been added to the HAC description below.
Assignment Goals
Implement hierarchical clustering
Process real-world data of particular contemporary relevance
Summary
There is a lot of analysis happening with the various datasets for COVID-19 right now. One of the goals
of these analyses is to help figure out which countries are "beating" the pandemic.
The Ten Hundred Plot
Using the publicly available Johns Hopkins COVID-19 data, you'll be performing clustering on time series
data for different regions in the world. Each region is defined by a row in the data set, which can be a
country, a province, etc. The time series data represents the number of (cumulative) confirmed cases on
each day. Because different regions have different onset of the outbreak and differ in magnitude, it is
often desirable to convert a raw time series into a shorter feature vector. For this assignment, you will
represent a time series by two numbers: the "x" and "y" values in the above ten-hundred plot video.
After each region becomes that two-dimensional feature vector, you will cluster all regions with HAC.
Program Specification
Download the data in CSV format: time_series_covid19_confirmed_global.csv
This is a snapshot of the data from the morning of April 2, 2020. If you wish to test your code on current
data, the Johns Hopkins University git repo
(https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv)
has constantly-updating data, but we will only be testing your code with the snapshot.
Write the following Python functions:
1. load_data(filepath) — takes in a string with a path to a CSV file formatted as in the link above, and
returns the data (without the lat/long columns but retaining all other columns) in a single structure.
2. calculate_x_y(time_series) — takes in one row from the data loaded from the previous function,
calculates the corresponding x, y values for that region as specified in the video, and returns them in
a single structure.
Notes:
1. The "n/10 day" is the latest day with LESS THAN OR EQUAL TO n/10 cases, similarly for the
"n/100 day".
2. If "n/10 day" is day i, and today is day j, then x = j - i, not j - i + 1.
3. Some x or y can be NaN if the time series doesn't contain enough growth.
4. There is a link to Matlab code in the video description on YouTube, please consult that for the
precise definition of x and y.
3. hac(dataset) — performs single linkage hierarchical agglomerative clustering on the regions with the
(x,y) feature representation, and returns a data structure representing the clustering.
You may implement other helper functions as necessary, but these are the functions we will be testing.
Load Data
Read in the file specified in the argument (the DictReader from Python's csv module
(https://docs.python.org/3/library/csv.html#csv.DictReader) will be of use) and return a list of dictionaries,
where each row in the dataset is a dictionary with the column headers as keys and the row elements as
values. These dictionaries should not include the lat/long columns, as we will not be using them in this
program, but retain the province (possibly empty) and country columns so that data points can be
uniquely identified.
You may assume the file exists and is a properly formatted CSV.
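The steps above can be sketched as follows. This assumes the lat/long columns are headed "Lat" and "Long" as in the Johns Hopkins snapshot; adjust the names if your file differs.

```python
import csv

def load_data(filepath):
    """Read the CSV and return a list of row dictionaries without lat/long.

    Assumes the lat/long column headers are spelled "Lat" and "Long";
    all other columns (including Province/State and Country/Region)
    are kept so each data point can be uniquely identified.
    """
    rows = []
    with open(filepath, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            row.pop('Lat', None)   # drop lat/long; not used in this program
            row.pop('Long', None)
            rows.append(row)
    return rows
```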
Calculate Feature Values
This function takes in the data from a single row of the raw dataset as read in the previous function (i.e. a
single dictionary, without the lat/long values but retaining all other columns). As explained in the video
above, this function should return the x, y values in a tuple, formatted as (x, y) .
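A minimal sketch of one plausible reading of the video's definition: n is the most recent cumulative count, the "n/10 day" is the latest day with at most n/10 cases, the "n/100 day" is the latest day with at most n/100 cases, and then x = today - (n/10 day) while y = (n/10 day) - (n/100 day). This interpretation and the header names 'Province/State' and 'Country/Region' are assumptions; consult the Matlab code linked from the video for the precise definition.

```python
import math

def calculate_x_y(time_series):
    """Compute (x, y) for one region, under the assumptions stated above.

    `time_series` is one dictionary from load_data; date columns are
    assumed to appear in chronological order (DictReader preserves
    column order on Python 3.7+).
    """
    counts = [int(v) for k, v in time_series.items()
              if k not in ('Province/State', 'Country/Region')]
    n = counts[-1]
    j = len(counts) - 1  # index of "today"

    def latest_day_leq(threshold):
        # Latest day with LESS THAN OR EQUAL TO `threshold` cases.
        for day in range(j, -1, -1):
            if counts[day] <= threshold:
                return day
        return None

    i = latest_day_leq(n / 10)
    k = latest_day_leq(n / 100)
    x = j - i if i is not None else math.nan
    y = i - k if (i is not None and k is not None) else math.nan
    return (x, y)
```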
Perform HAC
For this function, we would like you to mimic the behavior of SciPy's HAC function
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) , linkage() . You
may not use this function in your implementation, but we strongly recommend using it to verify your
results!
Input: A collection of m observation vectors in n dimensions may be passed as an m by n array. All
elements of the condensed distance matrix must be finite, i.e. no NaNs or infs. If you follow the Matlab
code from the YouTube video description, you will occasionally have NaN values — such rows should be
filtered out within this function and should not count toward your total number of regions.
In our case, m is the number of regions and n is 2: the x and y features for each region.
Using single linkage, perform the hierarchical agglomerative clustering algorithm as detailed on slide 19
of this presentation (http://pages.cs.wisc.edu/~jerryzhu/cs540/handouts/introML.pdf) . Use a standard
Euclidean distance function for calculating the distance between two points.
Output: An (m-1) by 4 matrix Z . At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are
combined to form cluster m + i. A cluster with an index less than m corresponds to one of the m original
observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2] . The fourth value
Z[i, 3] represents the number of original observations in the newly formed cluster.
That is:
Number each of your starting data points from 0 to m-1. These are their original cluster numbers.
Create an (m-1)x4 array or list. Iterate through the list row by row.
For each row, determine which two clusters you will merge and put their numbers into the first and
second elements of the row. The first point listed should be the smaller of the two cluster indexes.
The single-linkage distance between the two clusters goes into the third element of the row. The total
number of points in the cluster goes into the fourth element.
If you merge a cluster containing more than one data point, its number (for the first or second element
of the row) is given by m+the row index in which the cluster was created.
Before returning the data structure, convert it into a NumPy matrix.
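The steps above can be sketched with a naive O(m^3) implementation; it is a sketch, not a reference solution. Input is a list of (x, y) tuples with NaN rows already filtered out, and ties are broken by comparing (distance, i, j) tuples, matching the tie-breaking protocol described below.

```python
import numpy as np

def hac(dataset):
    """Single-linkage HAC following the (m-1) x 4 output spec.

    `dataset` is a list of finite (x, y) tuples. Cluster indices 0..m-1
    are the original points; the cluster formed at row i gets index m+i.
    """
    m = len(dataset)
    pts = [np.asarray(p, dtype=float) for p in dataset]
    clusters = {i: [i] for i in range(m)}  # cluster index -> point indices
    Z = []
    for step in range(m - 1):
        best = None  # (distance, i, j); tuple comparison breaks ties
        for i in sorted(clusters):
            for j in sorted(clusters):
                if i >= j:
                    continue
                # Single linkage: minimum pairwise Euclidean distance.
                d = min(np.linalg.norm(pts[a] - pts[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or (d, i, j) < best:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters.pop(i) + clusters.pop(j)
        clusters[m + step] = merged
        Z.append([i, j, d, len(merged)])
    return np.asmatrix(Z)  # convert to a NumPy matrix before returning
```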
If you follow these guidelines for input and output, your result should match the result of
scipy.cluster.hierarchy.linkage() and you can use that function to verify your results. Be aware that this
function does not contain code to filter NaN values, so this filtering should be performed before calling
the function.
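A quick cross-check, assuming SciPy is installed, looks like this (the sample data is arbitrary and NaN-free):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Reference result from SciPy for NaN-free 2-D data; compare this
# matrix with np.asarray(hac(data)) from your own implementation.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
Z_ref = linkage(np.asarray(data), method='single', metric='euclidean')
print(Z_ref)
```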
Tie Breaking
In the event that there are multiple pairs of points with equal distance for the next cluster:
Given a set of pairs with equal distance {(x_i, x_j)} where i < j, we prefer the pair with the smallest first
cluster index i. If there are still ties (x_i, x_j), ..., (x_i, x_k) where i is that smallest first index, we prefer the pair
with the smallest second cluster index.
Be aware that this tie breaking strategy may not produce identical results to
scipy.cluster.hierarchy.linkage() .
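In practice, this rule is equivalent to comparing candidate merges as (distance, i, j) tuples lexicographically, which a minimal illustration makes concrete (the candidate values here are made up):

```python
# Candidate merges as (distance, i, j): equal distances are broken
# first by the smaller first index i, then by the smaller second index j.
candidates = [(2.0, 4, 7), (2.0, 1, 9), (2.0, 1, 3)]
best = min(candidates)  # lexicographic tuple comparison does the tie-break
```

Here `best` is (2.0, 1, 3): among the equal-distance pairs, first index 1 beats 4, and second index 3 beats 9.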
Challenge Options
If you wish to continue exploring the data, we challenge you to implement a complete linkage option and
compare the results with single linkage. You may also wish to use the results of your HAC algorithm to
choose an appropriate k for k-means, and implement that clustering as well. Do you get the same
clusters every time? If not, how do they differ? Is there anything meaningful there, do you think?
Submission
Please submit your code in a file called ten_hundred.py . All code should be contained in functions or
under an if __name__ == "__main__": check so that it will not run if your code is imported into another
program.