程序代写案例-MAS1803

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 1
Handout 9
About pandas
Pandas is the most commonly used package for working with datasets
in Python. The name pandas comes from "Panel Data" (a statistical
method for data captured over time) and "Python Data Analysis"
(analysing data). Nothing to do with the animals, though I am a big fan!
We're going to barely scratch the surface this week. You can find full
Pandas documentation here.
Motivation
Consider the following data on local department store Fenwick (I may have mentioned earlier my
love of the department store vibe - seeing the wooden escalators was one of my favourite bits of
New York!)
Year Window Revenue (Millions £)* White Christmas**
2015 Once Upon a Christmas 298.2 True
2016 Beatrix Potter 302.0 True
2017 Paddington Bear 297.9 True
2018 Charlie and the Chocolate Factory 288.0 False
2019 The Snowman 246.32 False
2020 Wind in the Willows 218.0 False
* Financial data from https://craft.co/fenwick-limited
** White Christmas in England according to the Met Office
Looking at the table above, we just don't currently know how to handle this with what we've
learnt so far: it is a table with multiple different data types that would naturally be represented (in
order of columns) as integers, strings, float values and booleans. And we could well have
information not available to us, for example missing data in one particular year.
What we really want is a data type that acts like a table: that can store data of different types,
that can have missing data, that we can query and manipulate and make plots from. That's what
pandas offers.
We are not trying to be statisticians (yet)! With a statistics mindset we might ask, for example,
whether a White Christmas has a statistically significant impact on revenue (probably not!). When
we look at a larger dataset later then it will open up lots of other things we could ask, but right
now we are just interested in handling this data. We'll effectively be laying the foundations for
doing some "proper" statistics, but actually the programming language of choice for statistics in
our degree programme is R, which you will meet next semester.
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 2
Importing pandas
We can import pandas with
import pandas as pd
Series and Dataframes
Pandas has two objects that we are interested in:
A Series you can think of as a column in a table and is very like an array in NumPy.
A DataFrame you can think of as the table in its entirety and is made up of Series.
Creating a series
Let's create a Series for the year's column above. The function pd.Series takes in a list of
values and turns it into a Series:
years = pd.Series([2015,2016,2017,2018,2019,2020])
type(years)
pandas.core.series.Series
Printing years , we get something a bit different from printing a list:
print(years)
0 2015
1 2016
2 2017
3 2018
4 2019
5 2020
dtype: int64
Note the presence of an "index column" which runs from 0 upwards.
We can do some familiar things like this
years[0]
2015
and some other unfamiliar things. Let's create a Series for revenue to demonstrate:
revenue = pd.Series([298.2,302.0,297.9,288.0,246.32,218.0])
Now check this out... we can query rows in a series like this
revenue[revenue > 280]
0 298.2
1 302.0
2 297.9
3 288.0
dtype: float64
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 3
which is really cool, but will be even better once we put this data into a DataFrame instead.
Creating a DataFrame
A pandas DataFrame is much like a table of data.
To create a DataFrame manually (we're more likely to create a DataFrame by importing data from
a CSV as we'll do later in the practical), we first create a dictionary. Recall (week 6) that a
dictionary is created as follows, made up of key:value pairs separated by commas and inside
braces {}
car = {"manufacturer": "Citreon", "model": "Picasso"}
print(car["manufacturer"])
Citreon
Here we go with the Fenwick data... I recommend copy and pasting this, or be very careful with
your commas etc...
# Dictionary - the indenting isn't important, it's just for easy reading
data = {
"year": [2015,2016,2017,2018,2019,2020],
"window": ["Once Upon a Christmas","Beatrix Potter","Paddington
Bear",
"Charlie and the Chocolate Factory","The Snowman","Wind
in the Willows"],
"revenue": [298.2,302.0,297.9,288.0,246.32,218.0],
"white_christmas": [True,True,True,False,False,False]
}

# load data into a DataFrame
df = pd.DataFrame(data)
We've got ourselves a DataFrame! We're going to use df for the variable name, but this could be
anything.
type(df)
pandas.core.frame.DataFrame
Let the fun commence!
Getting to know our DataFrame
We've just got a small bit of data here, but these methods/attributes will be extremely useful
when we look at the larger data set later on. Try these in the Console:
df.head() # show the top of the data frame
df.tail() # show the bottom of the data frame
df.columns # show the column names
df.dtypes # show the data types
Now let me show you something amazing...
df.describe()
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 4
year revenue
count 6.000000 6.000000
mean 2017.500000 275.070000
std 1.870829 34.733739
min 2015.000000 218.000000
25% 2016.250000 256.740000
50% 2017.500000 292.950000
75% 2018.750000 298.125000
max 2020.000000 302.000000
without any effort from us, we've got summary statistics there from the two numeric columns in
our DataFrame. I already know the mean revenue for Fenwick over those 6 years is £275.07 and
we've barely raised a finger!
Here's one more:
df.groupby("white_christmas").size()
white_christmas
False 3
True 3
dtype: int64
df.groupby("white_christmas").mean()
Sorting the DataFrame
We can sort the DataFrame like this:
df.sort_values(by="revenue")
which orders by revenue ascending, or descending with
df.sort_values(by="revenue",ascending=False)
Note this does not modify the DataFrame. If we want to make this permanent we could assign to
a new DataFrame though:
df_by_revenue = df.sort_values(by="revenue",ascending=False)
Obtaining Series from the DataFrame
We can isolate a column in our DataFrame like this
df["revenue"]
This itself is a Series. We can check like this:
revenue = df["revenue"]
type(revenue)
pandas.core.series.Series
If we wanted to then we could change this to a NumPy array using the to_numpy() method:
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 5
revenue = df["revenue"].to_numpy()
type(revenue)
numpy.ndarray
which would be one way to, for example, make some plots
# Make a plot using matplotlib
import matplotlib.pyplot as plt

# Convert Series to numpy arrays
revenue = df["revenue"].to_numpy()
year = df["year"].to_numpy()

# Plot using the usual command
plt.plot(year, revenue, '-o')
There is another, more direct way, which we will see shortly.
We can also isolate a subset of columns, for example just the year and revenue
df[["year","white_christmas"]]
year white_christmas
0 2015 True
1 2016 True
2 2017 True
3 2018 False
4 2019 False
5 2020 False
The output of this is itself a DataFrame.
Querying the DataFrame
We can query the DataFrame like this:
# find rows where revenue was greater than 280
df[df["revenue"] > 280]
year window revenue white_christmas
0 2015 Once Upon a Christmas 298.2 True
1 2016 Beatrix Potter 302.0 True
2 2017 Paddington Bear 297.9 True
3 2018 Charlie and the Chocolate Factory 288.0 False
Note that the part in df[] , df["revenue"] > 280 is a Series of booleans:
df["revenue"] > 280
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 6
0 True
1 True
2 True
3 True
4 False
5 False
Name: revenue, dtype: bool
Here's some more example:
# Find rows where it was a white christmas
# Note: same as df[df["white_christmas"]==True]
df[df["white_christmas"]]
year window revenue white_christmas
0 2015 Once Upon a Christmas 298.2 True
1 2016 Beatrix Potter 302.0 True
2 2017 Paddington Bear 297.9 True
# Find rows where the window was "The Snowman"
df[df["window"] == "The Snowman"]
year window revenue white_christmas
4 2019 The Snowman 246.32 False
and since each of these outputs is itself a DataFrame, we can even do this sort of thing:
# Get summary statistics for years with a white christmas
df[df["white_christmas"]].describe()
year revenue
count 3.0 3.000000
mean 2016.0 299.366667
std 1.0 2.285461
min 2015.0 297.900000
25% 2015.5 298.050000
50% 2016.0 298.200000
75% 2016.5 300.100000
max 2017.0 302.000000
Plotting DataFrame data
I hope you are appreciating how powerful pandas is. It is also is integrated with Matplotlib, so that
there is an alternative to converting to NumPy arrays and then using Matplotlib. Try this:
# Plot revenue versus year with the '-o' marker style
df.plot("year","revenue",style="-o")
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 7
The plot method for a DataFrame is a little less flexible and there are some subtle differences
plotting this way (for example setting line styles), but it's provided for convenience and certainly
ticks that box.
Importing datasets from CSV
We're going to work for most of the remainder of this session on a data set of music albums
(adapted from this data set on Kaggle, which is a good source for finding datasets to work with).
Download the data set
Import the dataset with the following (assumes music.csv is in the same folder as your Python
script):
import pandas as pd
df = pd.read_csv('music.csv')
Note that because of its size, if you do
print(df)
or type df in the console then you may find that you see only part of the DataFrame. If you're
working on a large screen, you may wish to set pandas to show all columns in the console with
pd.set_option('display.max_columns', None) # removes the maximum
or if you're working on a particularly small screen then you might like to set your own value for
this:
pd.set_option('display.max_columns', 4) # show 4 columns
Exercise 9.1
Let's familiarise ourselves with the DataFrame.
Show Exercise
Querying the music dataframe
Let's see if we can find some classic albums...
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 8
# Album exactly matches the word Thriller
df[df["Album"]=="Thriller"]
We also might want to match by a string, which can be done using str.contains . What was the
name of that Dire Straits album? Brother something...
# Album contains the word "Brother"
df[df["Album"].str.contains("Brother")]
Brothers in Arms... that was it... lots of other "brothers" in there too!
Before we go any further, we might get a good feel for some of the ranges of values that there
are in the DataFrame (for example what years are covered, what the maximum rating is etc) by
using describe() :
df.describe()
Year Average_Rating No_Ratings
count 4627.000000 4627.000000 4627.000000
mean 1975.349687 3.770542 4193.570348
std 33.850496 0.099216 6097.028922
min 1905.000000 3.520000 260.000000
25% 1968.000000 3.700000 784.000000
50% 1984.000000 3.750000 1912.000000
75% 1999.000000 3.810000 4914.000000
max 2021.000000 4.340000 70382.000000
That was useful, so the years range from 1905 to 2021, ratings from 3.52 to 4.34 with an average
3.77, the no of ratings ranges from 260 to over 70,000!
Let's use that to look at some other criteria we might want to match:
# albums released in 1990
df[df["Year"]==1990]
We can also easily find the length with len as follows:
len(df[df["Year"]==1990])
For example to find the number of albums with a rating greater than 4.0
len(df[df["Average_Rating"]>4])
About 159 of them. We might consider that the rating is unreliable unless there have been plenty
of people rating. We can join together queries like this:
len(df[(df["Average_Rating"]>4) & (df["No_Ratings"]>10000)])
still 129 of them!
len(df[(df["Average_Rating"]>4) & (df["No_Ratings"]>50000)])
Only 3 now. Let's take a look at them
df[(df["Average_Rating"]>4) & (df["No_Ratings"]>50000)]
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 9
Note that these lines should be read side by side, so put the columns from the 1923 "row"
together with the first part when reading it, i.e. Kid A rating is 4.21, OK Computer 4.23 etc.
I noticed the maximum was 4.34 from describe though, which is none of the above. Let's find
out what it was by ordering by Average_Rating:
df.sort_values(by="Average_Rating", ascending=False).head()
Album Artist Year \
3662 The Black Saint and the Sinner Lady Mingus 1963
26 A Love Supreme John Coltrane 1965
1738 In the Court of the Crimson King King Crimson 1969
4537 Wish You Were Here Pink Floyd 1975
4186 To Pimp a Butterfly Kendrick Lamar 2015

Genres Average_Rating \
3662 Avant-Garde Jazz, Third Stream 4.34
26 Spiritual Jazz 4.30
1738 Progressive Rock, Art Rock 4.30
4537 Progressive Rock, Art Rock 4.29
4186 Conscious Hip Hop, West Coast Hip Hop, Jazz Rap 4.27

No_Ratings
3662 20251
26 25040
1738 44943
4537 48662
4186 44206
Got to admit, didn't expect to see so much jazz up there.
Exercise 9.2
It's your turn... try these out!
Show Exercise
Plotting
Let's use the plot method to create some plots. Here's rating by year (very busy!)
df.plot("Year","Average_Rating",style='o')
Album Artist Year \
1923 Kid A Radiohead 2000
2612 OK Computer Radiohead 1997
3708 The Dark Side of the Moon Pink Floyd 1973

Genres Average_Rating No_Ratings
1923 Art Rock, Experimental Rock, Electronic 4.21 58590
2612 Alternative Rock, Art Rock 4.23 70382
3708 Art Rock, Progressive Rock 4.20 57622
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 10
Perhaps it would be better to group by year, which we can do like this to automatically calculate
mean values for any numeric columns!
# Create a new data frame with means grouped by year
mean_ratings = df.groupby("Year").mean()
Note year is the index of the new data frame. To create a plot is now a little strange: year is
automatically on the x axis, so we just have to set y = and choose the column for the y axis
mean_ratings.plot(y="Average_Rating")
or just get, say, 1960ish onwards
mean_ratings[10:].plot(y="Average_Rating")
Music appears to be going downhill!
Finding datasets
We're pretty much done with the handout, but you might like to explore a different dataset.
Here are some great websites for finding open datasets: just save a csv, use the read_csv
command and away you go!
Kaggle (note account required, e.g. Google account, but is free)
data.gov.uk
Amazon AWS
Google dataset search
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 11
Summary
Hopefully this has given a flavour of working with large data files with pandas. This marks the
end of new content in the module: next week we do something a bit different in the practical,
and begin our revision for the final class test.

欢迎咨询51作业君