Pandas is the most commonly used package for working with datasets
in Python. The name pandas comes from "Panel Data" (a statistical
method for data captured over time) and "Python Data Analysis"
(analysing data). Nothing to do with the animals, though I am a big fan!
We're going to barely scratch the surface this week. You can find full
Pandas documentation here.
Motivation
Consider the following data on local department store Fenwick (I may have mentioned earlier my
love of the department store vibe - seeing the wooden escalators was one of my favourite bits of
New York!)
Year  Window                             Revenue (Millions £)*  White Christmas**
2015  Once Upon a Christmas              298.2                  True
2016  Beatrix Potter                     302.0                  True
2017  Paddington Bear                    297.9                  True
2018  Charlie and the Chocolate Factory  288.0                  False
2019  The Snowman                        246.32                 False
2020  Wind in the Willows                218.0                  False
* Financial data from https://craft.co/fenwick-limited
** White Christmas in England according to the Met Office
Looking at the table above, we just don't currently know how to handle this with what we've
learnt so far: it is a table with multiple different data types that would naturally be represented (in
order of columns) as integers, strings, float values and booleans. And we could well have
information not available to us, for example missing data in one particular year.
What we really want is a data type that acts like a table: that can store data of different types,
that can have missing data, that we can query and manipulate and make plots from. That's what
pandas offers.
We are not trying to be statisticians (yet)! With a statistics mindset we might ask, for example,
whether a White Christmas has a statistically significant impact on revenue (probably not!). When
we look at a larger dataset later then it will open up lots of other things we could ask, but right
now we are just interested in handling this data. We'll effectively be laying the foundations for
doing some "proper" statistics, but actually the programming language of choice for statistics in
our degree programme is R, which you will meet next semester.
Week 9 Handout - MAS1803 Problem Solving with Python 2021-22
Dr Chris Graham 2
Importing pandas
We can import pandas with
import pandas as pd
Series and Dataframes
Pandas has two objects that we are interested in:
A Series, which you can think of as a column in a table and which is very like an array in NumPy.
A DataFrame, which you can think of as the table in its entirety and which is made up of Series.
Creating a series
Let's create a Series for the year column above. The function pd.Series takes in a list of
values and turns it into a Series:
years = pd.Series([2015,2016,2017,2018,2019,2020])
type(years)
pandas.core.series.Series
Printing years, we get something a bit different from printing a list:
print(years)
0    2015
1    2016
2    2017
3    2018
4    2019
5    2020
dtype: int64
Note the presence of an "index column" which runs from 0 upwards.
We can do some familiar things like this
years[0]
2015
and some other unfamiliar things. Let's create a Series for revenue to demonstrate:
revenue = pd.Series([298.2,302.0,297.9,288.0,246.32,218.0])
Now check this out... we can query rows in a series like this
revenue[revenue > 280]
0    298.2
1    302.0
2    297.9
3    288.0
dtype: float64
which is really cool, but will be even better once we put this data into a DataFrame instead.
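As a minimal sketch of what is happening in that query (the variable names mask and high_revenue are just illustrative choices): the comparison is evaluated row by row to give a Series of booleans, and indexing with that Series keeps only the rows marked True.

```python
import pandas as pd

revenue = pd.Series([298.2, 302.0, 297.9, 288.0, 246.32, 218.0])

# The comparison is applied to every element and gives a Series of booleans
mask = revenue > 280
print(mask)

# Indexing with the boolean Series keeps only the rows where it is True
high_revenue = revenue[mask]
print(high_revenue)
```

Note that the rows that survive keep their original index labels (0 to 3 here), rather than being renumbered.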
Creating a DataFrame
A pandas DataFrame is much like a table of data.
To create a DataFrame manually (we're more likely to create a DataFrame by importing data from
a CSV as we'll do later in the practical), we first create a dictionary. Recall (week 6) that a
dictionary is created as follows, made up of key:value pairs separated by commas and inside
braces {}
car = {"manufacturer": "Citroen", "model": "Picasso"}
print(car["manufacturer"])
Citroen
Here we go with the Fenwick data... I recommend copying and pasting this, or being very careful with
your commas etc...
# Dictionary - the indenting isn't important, it's just for easy reading
data = {
    "year": [2015,2016,2017,2018,2019,2020],
    "window": ["Once Upon a Christmas","Beatrix Potter","Paddington Bear",
               "Charlie and the Chocolate Factory","The Snowman","Wind in the Willows"],
    "revenue": [298.2,302.0,297.9,288.0,246.32,218.0],
    "white_christmas": [True,True,True,False,False,False]
}
# load data into a DataFrame
df = pd.DataFrame(data)
We've got ourselves a DataFrame! We're going to use df for the variable name, but this could be
anything.
type(df)
pandas.core.frame.DataFrame
Let the fun commence!
Getting to know our DataFrame
We've just got a small bit of data here, but these methods/attributes will be extremely useful
when we look at the larger data set later on. Try these in the Console:
df.head() # show the top of the data frame
df.tail() # show the bottom of the data frame
df.columns # show the column names
df.dtypes # show the data types
Now let me show you something amazing...
df.describe()
year revenue
count 6.000000 6.000000
mean 2017.500000 275.070000
std 1.870829 34.733739
min 2015.000000 218.000000
25% 2016.250000 256.740000
50% 2017.500000 292.950000
75% 2018.750000 298.125000
max 2020.000000 302.000000
Without any effort from us, we've got summary statistics for the two numeric columns in
our DataFrame. I already know the mean revenue for Fenwick over those 6 years is £275.07 million and
we've barely lifted a finger!
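If only one of those numbers is needed, the individual statistics are also available as methods on a column. The following is a small sketch using just the year and revenue columns of the Fenwick data; the variable names are illustrative.

```python
import pandas as pd

# Year and revenue columns from the Fenwick table
df = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019, 2020],
    "revenue": [298.2, 302.0, 297.9, 288.0, 246.32, 218.0],
})

mean_revenue = df["revenue"].mean()  # same value as the "mean" row of describe()
max_revenue = df["revenue"].max()    # same value as the "max" row
std_revenue = df["revenue"].std()    # sample standard deviation, as in the "std" row

print(round(mean_revenue, 2), max_revenue, round(std_revenue, 2))
```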
Here's one more:
df.groupby("white_christmas").size()
white_christmas
False 3
True 3
dtype: int64
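df.groupby("white_christmas").mean(numeric_only=True) # average the numeric columns only (recent pandas versions raise an error on the text column otherwise)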
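A grouped calculation can also be restricted to a single column by selecting it before calling the summary method. Here is a hedged sketch, using only the two columns needed from the Fenwick data, that compares mean revenue in years with and without a white Christmas (the name mean_by_snow is made up for this example).

```python
import pandas as pd

# Revenue and white_christmas columns from the Fenwick table
df = pd.DataFrame({
    "revenue": [298.2, 302.0, 297.9, 288.0, 246.32, 218.0],
    "white_christmas": [True, True, True, False, False, False],
})

# Group the rows by the boolean column, then average revenue within each group
mean_by_snow = df.groupby("white_christmas")["revenue"].mean()
print(mean_by_snow)
```

The result is a Series indexed by the group labels (False and True). With only six rows this says nothing about statistical significance; it just shows the mechanics.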
Sorting the DataFrame
We can sort the DataFrame like this:
df.sort_values(by="revenue")
which orders by revenue ascending, or descending with
df.sort_values(by="revenue",ascending=False)
Note this does not modify the DataFrame. If we want to make this permanent we could assign to
a new DataFrame though:
df_by_revenue = df.sort_values(by="revenue",ascending=False)
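To check that sorting really does leave the original alone, here is a short sketch using the year and revenue columns of the Fenwick data. It compares the order of the rows in df and in the sorted copy.

```python
import pandas as pd

# Year and revenue columns from the Fenwick table
df = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019, 2020],
    "revenue": [298.2, 302.0, 297.9, 288.0, 246.32, 218.0],
})

df_by_revenue = df.sort_values(by="revenue", ascending=False)

# The original DataFrame is still in year order
print(df["year"].tolist())

# The sorted copy is in revenue order, and each row keeps its original index label
print(df_by_revenue["year"].tolist())
print(df_by_revenue.index.tolist())
```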
Obtaining Series from the DataFrame
We can isolate a column in our DataFrame like this
df["revenue"]
This itself is a Series. We can check like this:
revenue = df["revenue"]
type(revenue)
pandas.core.series.Series
If we wanted to then we could change this to a NumPy array using the to_numpy() method:
revenue = df["revenue"].to_numpy()
type(revenue)
numpy.ndarray
which would be one way to, for example, make some plots:
# Make a plot using matplotlib
import matplotlib.pyplot as plt
# Convert Series to numpy arrays
revenue = df["revenue"].to_numpy()
year = df["year"].to_numpy()
# Plot using the usual command
plt.plot(year, revenue, '-o')
There is another, more direct way, which we will see shortly.
We can also isolate a subset of columns, for example just the year and white_christmas columns
df[["year","white_christmas"]]
year white_christmas
0 2015 True
1 2016 True
2 2017 True
3 2018 False
4 2019 False
5 2020 False
The output of this is itself a DataFrame.
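The difference between single and double square brackets is easy to trip over, so here is a small sketch (again with a cut-down copy of the Fenwick data) showing what type each form returns.

```python
import pandas as pd

# Year and white_christmas columns from the Fenwick table
df = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019, 2020],
    "white_christmas": [True, True, True, False, False, False],
})

one_column = df["year"]                           # a column name -> Series
one_column_table = df[["year"]]                   # a list with one name -> DataFrame
two_columns = df[["year", "white_christmas"]]     # a list of names -> DataFrame

print(type(one_column))
print(type(one_column_table))
print(two_columns.shape)
```

So a list of column names always gives back a DataFrame, even if the list only has one name in it.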
Querying the DataFrame
We can query the DataFrame like this:
# find rows where revenue was greater than 280
df[df["revenue"] > 280]
year window revenue white_christmas
0 2015 Once Upon a Christmas 298.2 True
1 2016 Beatrix Potter 302.0 True
2 2017 Paddington Bear 297.9 True
3 2018 Charlie and the Chocolate Factory 288.0 False
Note that the part inside df[], namely df["revenue"] > 280, is a Series of booleans:
df["revenue"] > 280
0 True
1 True
2 True
3 True
4 False
5 False
Name: revenue, dtype: bool
Here are some more examples:
# Find rows where it was a white christmas
# Note: same as df[df["white_christmas"]==True]
df[df["white_christmas"]]
year window revenue white_christmas
0 2015 Once Upon a Christmas 298.2 True
1 2016 Beatrix Potter 302.0 True
2 2017 Paddington Bear 297.9 True
# Find rows where the window was "The Snowman"
df[df["window"] == "The Snowman"]
year window revenue white_christmas
4 2019 The Snowman 246.32 False
and since each of these outputs is itself a DataFrame, we can even do this sort of thing:
# Get summary statistics for years with a white christmas
df[df["white_christmas"]].describe()
year revenue
count 3.0 3.000000
mean 2016.0 299.366667
std 1.0 2.285461
min 2015.0 297.900000
25% 2015.5 298.050000
50% 2016.0 298.200000
75% 2016.5 300.100000
max 2017.0 302.000000
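Because each condition is just a Series of booleans, conditions can also be combined. This goes slightly beyond the examples above, so treat it as a sketch: pandas uses & for "and", | for "or" and ~ for "not", and each comparison needs its own parentheses. The variable names are illustrative.

```python
import pandas as pd

# Year, revenue and white_christmas columns from the Fenwick table
df = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019, 2020],
    "revenue": [298.2, 302.0, 297.9, 288.0, 246.32, 218.0],
    "white_christmas": [True, True, True, False, False, False],
})

# Revenue above 280 AND not a white Christmas
high_without_snow = df[(df["revenue"] > 280) & ~df["white_christmas"]]
print(high_without_snow)

# Revenue below 250 OR a white Christmas
low_or_snow = df[(df["revenue"] < 250) | df["white_christmas"]]
print(low_or_snow)
```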
Plotting DataFrame data
I hope you are appreciating how powerful pandas is. It is also integrated with Matplotlib, so that
there is an alternative to converting to NumPy arrays and then using Matplotlib. Try this:
# Plot revenue versus year with the '-o' marker style
df.plot("year","revenue",style="-o")
The plot method for a DataFrame is a little less flexible and there are some subtle differences
plotting this way (for example setting line styles), but it's provided for convenience and certainly
ticks that box.
Importing datasets from CSV
We're going to work for most of the remainder of this session on a data set of music albums
(adapted from this data set on Kaggle, which is a good source for finding datasets to work with).
Download the data set
Import the dataset with the following (assumes music.csv is in the same folder as your Python
script):
import pandas as pd
df = pd.read_csv('music.csv')
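To get a feel for what read_csv does without needing a file, here is a self-contained sketch that reads CSV text held in a string. The text is the first two rows of the Fenwick table written out as CSV, and the names csv_text and df_small are made up for this example.

```python
import io
import pandas as pd

# The first line gives the column names; each later line is one row
csv_text = """year,window,revenue,white_christmas
2015,Once Upon a Christmas,298.2,True
2016,Beatrix Potter,302.0,True
"""

# io.StringIO lets read_csv treat the string as if it were a file
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small)
print(df_small.dtypes)  # pandas works out a data type for each column
```

The same call with a filename, as in pd.read_csv('music.csv') above, reads the full music data set.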
Note that because of its size, if you do
print(df)
or type df in the console then you may find that you see only part of the DataFrame. If you're
working on a large screen, you may wish to set pandas to show all columns in the console with
pd.set_option('display.max_columns', None) # removes the maximum
or if you're working on a particularly small screen then you might like to set your own value for
this:
pd.set_option('display.max_columns', 4) # show 4 columns
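If you would rather not change the setting for the whole session, one possible alternative is pd.option_context, which applies an option only inside a with block. This is a sketch of that idea rather than something you need for the exercises; the variable names are illustrative.

```python
import pandas as pd

# Whatever the setting currently is on your machine
before = pd.get_option("display.max_columns")

# Inside the with block, at most 4 columns are shown when printing
with pd.option_context("display.max_columns", 4):
    inside = pd.get_option("display.max_columns")

# Outside the block the previous setting is restored automatically
after = pd.get_option("display.max_columns")

print(before, inside, after)
```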
Exercise 9.1
Let's familiarise ourselves with the DataFrame.
Querying the music dataframe
Let's see if we can find some classic albums...
# Album exactly matches the word Thriller
df[df["Album"]=="Thriller"]
We might also want to match part of a string, which can be done using str.contains . What was the
name of that Dire Straits album? Brother something...
# Album contains the word "Brother"
df[df["Album"].str.contains("Brother")]
Brothers in Arms... that was it... lots of other "brothers" in there too!
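Two details of str.contains are worth knowing about: it is case sensitive by default, and a missing value in the column produces a missing value in the result, which cannot be used directly as a query. The sketch below uses a tiny made-up list of titles (it is not the contents of music.csv) to show the case and na arguments.

```python
import pandas as pd

# A made-up handful of titles for illustration only, including one missing value
albums = pd.DataFrame({"Album": ["Thriller", "Brothers in Arms", "Oh brother", None]})

# Case-sensitive match; na=False treats the missing title as "no match"
exact_case = albums[albums["Album"].str.contains("Brother", na=False)]
print(exact_case)

# case=False ignores upper/lower case differences
any_case = albums[albums["Album"].str.contains("Brother", case=False, na=False)]
print(any_case)
```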
Before we go any further, we might get a good feel for some of the ranges of values that there
are in the DataFrame (for example what years are covered, what the maximum rating is etc) by
using describe() :
df.describe()