Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 1 Handout 9 About pandas Pandas is the most commonly used package for working with datasets in Python. The name pandas comes from "Panel Data" (a statistical method for data captured over time) and "Python Data Analysis" (analysing data). Nothing to do with the animals, though I am a big fan! We're going to barely scratch the surface this week. You can find full Pandas documentation here. Motivation Consider the following data on local department store Fenwick (I may have mentioned earlier my love of the department store vibe - seeing the wooden escalators was one of my favourite bits of New York!) Year Window Revenue (Millions £)* White Christmas** 2015 Once Upon a Christmas 298.2 True 2016 Beatrix Potter 302.0 True 2017 Paddington Bear 297.9 True 2018 Charlie and the Chocolate Factory 288.0 False 2019 The Snowman 246.32 False 2020 Wind in the Willows 218.0 False * Financial data from https://craft.co/fenwick-limited ** White Christmas in England according to the Met Office Looking at the table above, we just don't currently know how to handle this with what we've learnt so far: it is a table with multiple different data types that would naturally be represented (in order of columns) as integers, strings, float values and booleans. And we could well have information not available to us, for example missing data in one particular year. What we really want is a data type that acts like a table: that can store data of different types, that can have missing data, that we can query and manipulate and make plots from. That's what pandas offers. We are not trying to be statisticians (yet)! With a statistics mindset we might ask, for example, whether a White Christmas has a statistically significant impact on revenue (probably not!). When we look at a larger dataset later then it will open up lots of other things we could ask, but right now we are just interested in handling this data. We'll effectively be laying the foundations for doing some "proper" statistics, but actually the programming language of choice for statistics in our degree programme is R, which you will meet next semester. Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 2 Importing pandas We can import pandas with import pandas as pd Series and Dataframes Pandas has two objects that we are interested in: A Series you can think of as a column in a table and is very like an array in NumPy. A DataFrame you can think of as the table in its entirety and is made up of Series. Creating a series Let's create a Series for the year's column above. The function pd.Series takes in a list of values and turns it into a Series: years = pd.Series([2015,2016,2017,2018,2019,2020]) type(years) pandas.core.series.Series Printing years , we get something a bit different from printing a list: print(years) 0 2015 1 2016 2 2017 3 2018 4 2019 5 2020 dtype: int64 Note the presence of an "index column" which runs from 0 upwards. We can do some familiar things like this years[0] 2015 and some other unfamiliar things. Let's create a Series for revenue to demonstrate: revenue = pd.Series([298.2,302.0,297.9,288.0,246.32,218.0]) Now check this out... we can query rows in a series like this revenue[revenue > 280] 0 298.2 1 302.0 2 297.9 3 288.0 dtype: float64 Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 3 which is really cool, but will be even better once we put this data into a DataFrame instead. Creating a DataFrame A pandas DataFrame is much like a table of data. To create a DataFrame manually (we're more likely to create a DataFrame by importing data from a CSV as we'll do later in the practical), we first create a dictionary. Recall (week 6) that a dictionary is created as follows, made up of key:value pairs separated by commas and inside braces {} car = {"manufacturer": "Citreon", "model": "Picasso"} print(car["manufacturer"]) Citreon Here we go with the Fenwick data... I recommend copy and pasting this, or be very careful with your commas etc... # Dictionary - the indenting isn't important, it's just for easy reading data = { "year": [2015,2016,2017,2018,2019,2020], "window": ["Once Upon a Christmas","Beatrix Potter","Paddington Bear", "Charlie and the Chocolate Factory","The Snowman","Wind in the Willows"], "revenue": [298.2,302.0,297.9,288.0,246.32,218.0], "white_christmas": [True,True,True,False,False,False] } # load data into a DataFrame df = pd.DataFrame(data) We've got ourselves a DataFrame! We're going to use df for the variable name, but this could be anything. type(df) pandas.core.frame.DataFrame Let the fun commence! Getting to know our DataFrame We've just got a small bit of data here, but these methods/attributes will be extremely useful when we look at the larger data set later on. Try these in the Console: df.head() # show the top of the data frame df.tail() # show the bottom of the data frame df.columns # show the column names df.dtypes # show the data types Now let me show you something amazing... df.describe() Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 4 year revenue count 6.000000 6.000000 mean 2017.500000 275.070000 std 1.870829 34.733739 min 2015.000000 218.000000 25% 2016.250000 256.740000 50% 2017.500000 292.950000 75% 2018.750000 298.125000 max 2020.000000 302.000000 without any effort from us, we've got summary statistics there from the two numeric columns in our DataFrame. I already know the mean revenue for Fenwick over those 6 years is £275.07 and we've barely raised a finger! Here's one more: df.groupby("white_christmas").size() white_christmas False 3 True 3 dtype: int64 df.groupby("white_christmas").mean() Sorting the DataFrame We can sort the DataFrame like this: df.sort_values(by="revenue") which orders by revenue ascending, or descending with df.sort_values(by="revenue",ascending=False) Note this does not modify the DataFrame. If we want to make this permanent we could assign to a new DataFrame though: df_by_revenue = df.sort_values(by="revenue",ascending=False) Obtaining Series from the DataFrame We can isolate a column in our DataFrame like this df["revenue"] This itself is a Series. We can check like this: revenue = df["revenue"] type(revenue) pandas.core.series.Series If we wanted to then we could change this to a NumPy array using the to_numpy() method: Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 5 revenue = df["revenue"].to_numpy() type(revenue) numpy.ndarray which would be one way to, for example, make some plots # Make a plot using matplotlib import matplotlib.pyplot as plt # Convert Series to numpy arrays revenue = df["revenue"].to_numpy() year = df["year"].to_numpy() # Plot using the usual command plt.plot(year, revenue, '-o') There is another, more direct way, which we will see shortly. We can also isolate a subset of columns, for example just the year and revenue df[["year","white_christmas"]] year white_christmas 0 2015 True 1 2016 True 2 2017 True 3 2018 False 4 2019 False 5 2020 False The output of this is itself a DataFrame. Querying the DataFrame We can query the DataFrame like this: # find rows where revenue was greater than 280 df[df["revenue"] > 280] year window revenue white_christmas 0 2015 Once Upon a Christmas 298.2 True 1 2016 Beatrix Potter 302.0 True 2 2017 Paddington Bear 297.9 True 3 2018 Charlie and the Chocolate Factory 288.0 False Note that the part in df[] , df["revenue"] > 280 is a Series of booleans: df["revenue"] > 280 Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 6 0 True 1 True 2 True 3 True 4 False 5 False Name: revenue, dtype: bool Here's some more example: # Find rows where it was a white christmas # Note: same as df[df["white_christmas"]==True] df[df["white_christmas"]] year window revenue white_christmas 0 2015 Once Upon a Christmas 298.2 True 1 2016 Beatrix Potter 302.0 True 2 2017 Paddington Bear 297.9 True # Find rows where the window was "The Snowman" df[df["window"] == "The Snowman"] year window revenue white_christmas 4 2019 The Snowman 246.32 False and since each of these outputs is itself a DataFrame, we can even do this sort of thing: # Get summary statistics for years with a white christmas df[df["white_christmas"]].describe() year revenue count 3.0 3.000000 mean 2016.0 299.366667 std 1.0 2.285461 min 2015.0 297.900000 25% 2015.5 298.050000 50% 2016.0 298.200000 75% 2016.5 300.100000 max 2017.0 302.000000 Plotting DataFrame data I hope you are appreciating how powerful pandas is. It is also is integrated with Matplotlib, so that there is an alternative to converting to NumPy arrays and then using Matplotlib. Try this: # Plot revenue versus year with the '-o' marker style df.plot("year","revenue",style="-o") Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 7 The plot method for a DataFrame is a little less flexible and there are some subtle differences plotting this way (for example setting line styles), but it's provided for convenience and certainly ticks that box. Importing datasets from CSV We're going to work for most of the remainder of this session on a data set of music albums (adapted from this data set on Kaggle, which is a good source for finding datasets to work with). Download the data set Import the dataset with the following (assumes music.csv is in the same folder as your Python script): import pandas as pd df = pd.read_csv('music.csv') Note that because of its size, if you do print(df) or type df in the console then you may find that you see only part of the DataFrame. If you're working on a large screen, you may wish to set pandas to show all columns in the console with pd.set_option('display.max_columns', None) # removes the maximum or if you're working on a particularly small screen then you might like to set your own value for this: pd.set_option('display.max_columns', 4) # show 4 columns Exercise 9.1 Let's familiarise ourselves with the DataFrame. Show Exercise Querying the music dataframe Let's see if we can find some classic albums... Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 8 # Album exactly matches the word Thriller df[df["Album"]=="Thriller"] We also might want to match by a string, which can be done using str.contains . What was the name of that Dire Straits album? Brother something... # Album contains the word "Brother" df[df["Album"].str.contains("Brother")] Brothers in Arms... that was it... lots of other "brothers" in there too! Before we go any further, we might get a good feel for some of the ranges of values that there are in the DataFrame (for example what years are covered, what the maximum rating is etc) by using describe() : df.describe() Year Average_Rating No_Ratings count 4627.000000 4627.000000 4627.000000 mean 1975.349687 3.770542 4193.570348 std 33.850496 0.099216 6097.028922 min 1905.000000 3.520000 260.000000 25% 1968.000000 3.700000 784.000000 50% 1984.000000 3.750000 1912.000000 75% 1999.000000 3.810000 4914.000000 max 2021.000000 4.340000 70382.000000 That was useful, so the years range from 1905 to 2021, ratings from 3.52 to 4.34 with an average 3.77, the no of ratings ranges from 260 to over 70,000! Let's use that to look at some other criteria we might want to match: # albums released in 1990 df[df["Year"]==1990] We can also easily find the length with len as follows: len(df[df["Year"]==1990]) For example to find the number of albums with a rating greater than 4.0 len(df[df["Average_Rating"]>4]) About 159 of them. We might consider that the rating is unreliable unless there have been plenty of people rating. We can join together queries like this: len(df[(df["Average_Rating"]>4) & (df["No_Ratings"]>10000)]) still 129 of them! len(df[(df["Average_Rating"]>4) & (df["No_Ratings"]>50000)]) Only 3 now. Let's take a look at them df[(df["Average_Rating"]>4) & (df["No_Ratings"]>50000)] Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 9 Note that these lines should be read side by side, so put the columns from the 1923 "row" together with the first part when reading it, i.e. Kid A rating is 4.21, OK Computer 4.23 etc. I noticed the maximum was 4.34 from describe though, which is none of the above. Let's find out what it was by ordering by Average_Rating: df.sort_values(by="Average_Rating", ascending=False).head() Album Artist Year \ 3662 The Black Saint and the Sinner Lady Mingus 1963 26 A Love Supreme John Coltrane 1965 1738 In the Court of the Crimson King King Crimson 1969 4537 Wish You Were Here Pink Floyd 1975 4186 To Pimp a Butterfly Kendrick Lamar 2015 Genres Average_Rating \ 3662 Avant-Garde Jazz, Third Stream 4.34 26 Spiritual Jazz 4.30 1738 Progressive Rock, Art Rock 4.30 4537 Progressive Rock, Art Rock 4.29 4186 Conscious Hip Hop, West Coast Hip Hop, Jazz Rap 4.27 No_Ratings 3662 20251 26 25040 1738 44943 4537 48662 4186 44206 Got to admit, didn't expect to see so much jazz up there. Exercise 9.2 It's your turn... try these out! Show Exercise Plotting Let's use the plot method to create some plots. Here's rating by year (very busy!) df.plot("Year","Average_Rating",style='o') Album Artist Year \ 1923 Kid A Radiohead 2000 2612 OK Computer Radiohead 1997 3708 The Dark Side of the Moon Pink Floyd 1973 Genres Average_Rating No_Ratings 1923 Art Rock, Experimental Rock, Electronic 4.21 58590 2612 Alternative Rock, Art Rock 4.23 70382 3708 Art Rock, Progressive Rock 4.20 57622 Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 10 Perhaps it would be better to group by year, which we can do like this to automatically calculate mean values for any numeric columns! # Create a new data frame with means grouped by year mean_ratings = df.groupby("Year").mean() Note year is the index of the new data frame. To create a plot is now a little strange: year is automatically on the x axis, so we just have to set y = and choose the column for the y axis mean_ratings.plot(y="Average_Rating") or just get, say, 1960ish onwards mean_ratings[10:].plot(y="Average_Rating") Music appears to be going downhill! Finding datasets We're pretty much done with the handout, but you might like to explore a different dataset. Here are some great websites for finding open datasets: just save a csv, use the read_csv command and away you go! Kaggle (note account required, e.g. Google account, but is free) data.gov.uk Amazon AWS Google dataset search Week 9 Handout - MAS1803 Problem Solving with Python 2021-22 Dr Chris Graham 11 Summary Hopefully this has given a flavour of working with large data files with pandas. This marks the end of new content in the module: next week we do something a bit different in the practical, and begin our revision for the final class test.
欢迎咨询51作业君