import pandas as pd
csv_df = pd.read_csv('data/itunes_data.csv')
csv_df.head()
Another common file format is Excel. We have a few more song additions in an Excel file. We can load it as a DataFrame like this:
excel_df = pd.read_excel('data/itunes_data.xlsx', engine='openpyxl')
excel_df.head()
Lastly, let's get the full data from the chinook SQLite database. Pandas has a few methods for loading from SQL, but we'll use the pd.read_sql_query()
function. First, we need to create our engine that will connect to our database with SQLAlchemy, as we did in Chapter 3, SQL and Built-in File Handling Modules in Python:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/chinook.db')
query = """SELECT tracks.name as Track,
tracks.composer,
tracks.milliseconds,
tracks.bytes,
tracks.unitprice,
genres.name as Genre,
albums.title as Album,
artists.name as Artist
FROM tracks
JOIN genres ON tracks.genreid = genres.genreid
JOIN albums ON tracks.albumid = albums.albumid
JOIN artists ON albums.artistid = artists.artistid;
"""
with engine.connect() as connection:
    sql_df = pd.read_sql_query(query, connection)
sql_df.head(2).T
Checking one of our DataFrames with type() displays pandas.core.frame.DataFrame, since we have a DataFrame. Recall from Chapter 2, Getting Started with Python, that the built-in function type()
tells us the type of an object.
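For example, the check is a one-liner (shown here with the sql_df DataFrame we just loaded; any of the three DataFrames would work):
type(sql_df)  # displays pandas.core.frame.DataFrame in a notebook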
To combine our three DataFrames into one, we'll use the pd.concat()
function:
itunes_df = pd.concat([csv_df, excel_df, sql_df])
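Note that pd.concat keeps each original DataFrame's index, so index values will repeat in the combined DataFrame (we will see this again later in the chapter). If we ever wanted a fresh 0-to-n index instead, concat accepts an ignore_index argument – a minimal sketch, not applied to our data here:
# hypothetical alternative: discard the original indexes while concatenating
combined = pd.concat([csv_df, excel_df, sql_df], ignore_index=True)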
Whenever we have some data loaded, it's a good idea to take a look at what we have. In general, we can follow an EDA checklist:
Some of this EDA can provide a starting point for any further analysis that we do.
We already know how to look at the top of the data: itunes_df.head()
. For looking at the bottom of the data, we use tail()
:
Remember that if we have many columns, we can transpose the printout with
itunes_df.tail().T
, which transposes columns and rows. In this case, it's about the same with or without the transpose.
To index by row number, we use iloc
. This is helpful if we want to look at the first or last row. For example, this is how we look at the first row (the 0th index) and the last row (the -1 index):
print(itunes_df.iloc[0])
print(itunes_df.iloc[-1])
With iloc
, we can also select a single value by giving both a row and a column index. For example, the following commands print out the value in the first row and first column (an index of [0, 0]) and the value in the last row and last column (an index of [-1, -1]):
print(itunes_df.iloc[0, 0])
print(itunes_df.iloc[-1, -1])
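iloc also accepts slices, which is handy for quickly eyeballing a corner of the data. A minimal sketch (the row and column positions here are arbitrary):
# first five rows and first three columns, selected by position
itunes_df.iloc[:5, :3]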
While we're looking at indexing, let's look at selecting columns of data. We can select a column of data like so:
itunes_df['Milliseconds']
If we want to select multiple columns, we can use a list of strings:
itunes_df[['Milliseconds', 'Bytes']]
If we are selecting a column, we can type part of the name and then press Tab to autocomplete it. For example, if we were in the middle of typing "Milliseconds" here:
itunes_df['Mill']
With our cursor just after "Mill", we can press Tab and it will fill in the full column name for us. This works in Jupyter Notebook and IPython, and potentially in some text editors and IDEs.
After we've completed some EDA, we can then move on to some other cleaning steps. Some surveys have found data scientists spend anywhere between 25% and 75% of their time cleaning data, as we covered in Chapter 1, Introduction to Data Science, although sometimes data scientists spend upward of 90% of their time cleaning data. Quite often, we can carry out most or all of our data cleaning with pandas. Some common data cleaning steps include:
We might already have an idea of which data we want to remove. This involves dropping columns or rows we don't want. For example, with our iTunes data, we may not really need the Composer
column. We could drop this column like so:
itunes_df.drop('Composer', axis=1, inplace=True)
itunes_df.columns
We use the drop
function of DataFrames, giving the column name as the first argument. We can drop multiple columns at once by supplying a list. The axis=1
argument specifies to drop a column, not a row, and inplace=True
changes the DataFrame itself instead of returning a new, modified DataFrame. Then we examine the remaining columns with the columns
attribute of our DataFrame.
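We could drop several columns, or rows, in one call as well – a quick sketch (not applied to our data, since we still need these columns later):
# dropping multiple columns returns a new DataFrame when inplace is not set
smaller_df = itunes_df.drop(['Bytes', 'UnitPrice'], axis=1)
# axis=0 (the default) drops rows by index label instead of columns
no_first_rows = itunes_df.drop(0, axis=0)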
If we have other irrelevant data we want to remove, say any genres that are not music, we could do so with filtering:
only_music = itunes_df[~itunes_df['Genre'].isin(['Drama', 'TV Shows', 'Sci Fi & Fantasy', 'Science Fiction', 'Comedy'])]
This uses filtering with the isin
method. The isin
method checks if each value is in the list or set provided to the function. In this case, we also negate this condition with the tilde (~
) so that any of the non-music genres are excluded, and our only_music
DataFrame has only genres that are music, just as the variable name suggests.
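To double-check that the filter behaved as expected, we could inspect the genres that remain – a quick sanity check:
# unique genre values left after filtering out the non-music classes
only_music['Genre'].unique()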
Missing values arise all the time in datasets. Often, they are represented as NA
or NaN
values. Other times, missing values may be represented by certain numbers or values, such as None
or -999. It helps to examine the data with EDA and check any documentation on the data to see if missing values are represented in a special way. We can fill in missing values, which is also called imputation. In terms of dealing with missing values, we have some options:
The best option for you depends on the situation and the data itself. For example, we saw that our Composer
column has several missing values. We can use filtering to see what some of these rows look like:
itunes_df[itunes_df['Composer'].isna()].sample(5, random_state=42).head()
Here, we take a random sample of 5 datapoints with sample()
, giving it a random_state
so the results are the same every time we run it. Then we look at a few rows with head()
. In this case, we get results from all sorts of genres – TV shows, Latin, and so on.
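Before deciding how to handle the missing values, it helps to know how many there are in each column – a quick sketch:
# count of missing values per column
itunes_df.isna().sum()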
The most advanced method of replacing missing values is using machine learning. We can use the machine learning techniques that we will learn later in the book to predict missing values for each row and fill them in. Another option is using a pre-built imputer, like the sklearn.impute.KNNImputer
class, which accomplishes the same thing. Usually, replacing with the mean, median, or mode (or even a constant value such as 0) is good enough to start, although KNN (k-nearest neighbors) imputation works well too, at the cost of a little more effort.
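For the simpler options, pandas' fillna method covers most cases. A minimal sketch of what they look like on our data (each line returns a new object rather than modifying itunes_df; we would assign the result back to keep it):
itunes_df['Bytes'].fillna(itunes_df['Bytes'].mean())     # fill with the mean
itunes_df['Bytes'].fillna(itunes_df['Bytes'].median())   # fill with the median
itunes_df['Bytes'].fillna(0)                             # fill with a constant value
itunes_df.dropna(subset=['Composer'])                    # or drop rows missing Composer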
An example where KNN imputation can work is demographic data. KNN works by taking a certain number, n, of the closest datapoints and averaging them to get new values. It finds the nearest points by Euclidean distance, which is the straight-line distance between two points in space. We will cover this algorithm in more detail in a future chapter.
To see KNN imputation in action, let's create missing values in our Bytes
column:
import numpy as np
itunes_df.loc[0, 'Bytes'] = np.nan
We first need to import the NumPy library to be able to create NaN
values, and then we get the location where the row index is 0 and the column is Bytes
, and set the value to np.nan
. Next, we import the KNNImputer
class from the sklearn (scikit-learn) machine learning library and create an imputer object:
from sklearn.impute import KNNImputer
imputer = KNNImputer()
We leave the object with its default of the five nearest neighbors for calculations. Then, we use the fit_transform
method:
imputed = imputer.fit_transform(itunes_df[['Milliseconds', 'Bytes', 'UnitPrice']])
This takes in data with missing values, fits a KNN model, and then fills in the missing values with predictions from the model. Unfortunately, sklearn imputers can only handle numeric data, so we can't pass them any string columns. Then, to replace our missing values, we overwrite the Bytes
column with our new data:
itunes_df['Bytes'] = imputed[:, 1]
The imputed variable is a two-dimensional NumPy array, similar to a DataFrame but without labels. We are indexing it with [:, 1]
, which means retrieving all rows and the second column. If we examine the prediction for the missing value compared with the original value, it's close but not perfect. Our predicted value from KNNImputer
for the first row with an Index value of 0 (there are a few rows with an Index value of 0) is 3.8e8, but the actual original value was 2.1e8 before we set it to np.nan
. However, the KNNImputer
prediction is much closer than the mean value for Bytes
of 3.3e7, which is an order of magnitude smaller than the actual value. So, the KNNImputer
method does a much better job of filling in missing values compared with using the mean for data that has a very wide distribution or is not close to a Gaussian distribution.
The KNNImputer
method of replacing missing values is the most advanced that we covered here. Don't be worried if it's confusing at this time; we will cover machine learning methods like KNN later in the book.
Outliers are data that are not in the usual range of values. For categorical data, such as the genres, these may be some of the minority classes, like TV shows. We could remove these rows with filtering, or we could also group all minority classes into a class we label as Other
. Dealing with categorical outliers can help a little with analysis, but often has a minimal impact.
For numeric data, it's easy to quantify an outlier. Typically, we use interquartile range (IQR) or z-score methods. We will cover the IQR method here, since it relates to boxplots, which we will cover in the next chapter.
Recall that we get quartiles (25th, 50th, 75th percentiles) from the describe()
function in our EDA. These are sometimes called the first, second, and third quartiles, respectively (Q1, Q2, and Q3).
We can exclude outliers from a DataFrame like so:
def remove_outliers(df, column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    upper_boundary = q3 + 1.5 * iqr
    lower_boundary = q1 - 1.5 * iqr
    new_df = df.loc[(df[column] > lower_boundary) &
                    (df[column] < upper_boundary)]
    return new_df
Here, we have created a function that takes a DataFrame and column name as an argument. The first two lines calculate the 25th and 75th percentile levels with the quantile()
method, and store them in the q1
and q3
variables. Then we calculate the IQR as the difference between Q3 and Q1. Next, we get the upper and lower boundaries for outliers using the IQR outlier formulas (1.5 times the IQR above Q3 and below Q1). Then we use DataFrame filtering to keep only the points between the upper and lower boundaries (the non-outlier points). The result is stored in a new DataFrame, new_df
. Finally, we return our DataFrame. We could use this with a numeric column like this:
itunes_df_clean = remove_outliers(itunes_df, 'Milliseconds')
We can then use the shape
attribute, itunes_df_clean.shape
, to check that some rows were actually dropped. In this case, we excluded about 400 rows of data by removing the Milliseconds
outliers.
Other methods for dropping outliers can be found in this helpful Stack Overflow question and its answers: python - Detect and exclude outliers in a pandas DataFrame - Stack Overflow
In fact, the function above was adapted from one of those answers.
Removing outliers can make it easier to visualize data and can improve the performance of machine learning models. Another easy way to remove outliers is to exclude any datapoints that lie outside of extreme percentiles in the data. For example, we could remove any points outside of the 1st and 99th percentiles, meaning we only keep the middle 98% of the data.
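A sketch of that percentile approach, using the same Milliseconds column (the 1st and 99th percentile cutoffs are just one reasonable choice):
# keep only the middle 98% of the Milliseconds values
lower = itunes_df['Milliseconds'].quantile(0.01)
upper = itunes_df['Milliseconds'].quantile(0.99)
trimmed_df = itunes_df[(itunes_df['Milliseconds'] > lower) &
                       (itunes_df['Milliseconds'] < upper)]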
It's always a good idea to check for duplicate values, since they can creep into data in numerous ways. An easy way to check for duplicates is the duplicated()
function:
itunes_df.duplicated().sum()
This prints out the number of rows that are exact duplicates. In this case, we see that 518 rows are duplicated! There must have been some issue with the data when we loaded and combined it at the beginning of the chapter, or somewhere else upstream. We can drop these duplicated rows like so:
itunes_df.drop_duplicates(inplace=True)
Once again, we use inplace=True
to modify the existing DataFrame. There are other options for drop_duplicates
, but the defaults check for exactly duplicated rows.
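If we only care about duplication in certain columns, both functions accept a subset argument – a minimal sketch, assuming the Track and Artist columns from our combined data:
# count rows that share the same Track and Artist values
itunes_df.duplicated(subset=['Track', 'Artist']).sum()
# keep='first' (the default) retains the first occurrence when dropping
itunes_df.drop_duplicates(subset=['Track', 'Artist'], keep='first')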
Sometimes data will be loaded as an object datatype (string) instead of numeric if there are some non-numbers in that column. We want to use the df.info()
function as we did before to check that our columns are the correct datatype, and then convert any columns that need it. For example, we could convert Milliseconds
to an integer datatype like so:
itunes_df['Milliseconds'] = itunes_df['Milliseconds'].astype('int')
Within the astype
function, we can use strings such as 'float'
, 'int'
, or 'object'
, Python datatypes like int
or float
, or NumPy datatypes like np.int64
. For most work, we only need the datatypes object
(for strings), int
, and float
.
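If a numeric column was loaded as object because of a few stray non-numeric entries, pd.to_numeric with errors='coerce' is a handy alternative to astype – it converts what it can and turns the bad values into NaN, which we can then impute or drop. A minimal sketch (assuming Milliseconds had arrived as strings with a few bad values):
itunes_df['Milliseconds'] = pd.to_numeric(itunes_df['Milliseconds'], errors='coerce')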
Sometimes we will have string data in several formats. This tends to happen when data is entered by hand. For example, some people may capitalize gender like "Male," "Female," or "Nonbinary," while others may leave it lowercased. Some people may only use a one-letter abbreviation, like "M." Cleaning this sort of data often means using DataFrame filtering, loc
indexing, and string methods to replace values. We will cover this in the next section.
A handy way to replace several values at once is with the map
and replace
functions. For example, we can replace variations of genres in our iTunes data like so:
genre_dict = {'metal': 'Metal', 'met': 'Metal'}
itunes_df['Genre'].replace(genre_dict)
First, we create a dictionary, where the keys are the existing values in the DataFrame, and the values are what should replace the existing values. In this case, we replace variations of the Metal
genre (metal
and met
) with Metal
. In the second line, we select the Genre
column, then use replace
with our dictionary of replacement values. This returns a new pandas Series.
The replace
function replaces any matching values it finds in the supplied dictionary, and one use case is replacing some (or all) of the values in a Series. Any values that are not in the dictionary passed to replace
will be left alone. If instead we want any values without a match in our conversion dictionary to be replaced with NaN, we can use map
. This makes it easy to check if any values were not converted to something new by checking for NaN values in the Series (for example, this can be useful in unit tests). The performance of map
and replace
is similar.
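A quick sketch of that NaN-checking pattern with map (in practice the conversion dictionary would cover every expected value; the tiny genre_dict above is just for illustration):
mapped = itunes_df['Genre'].map(genre_dict)
# anything not covered by genre_dict becomes NaN with map, so we can count the misses
mapped.isna().sum()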
Another handy tool is the apply
function, which is a Swiss army knife function – it can do anything. For example, to lowercase all values in the Genre
column, we could use apply
after selecting the Genre
column:
itunes_df['Genre'].apply(lambda x: x.lower())
Remember from Chapter 2, Getting Started with Python, that a lambda function is an "anonymous" function, created on the fly. It starts with the lambda
keyword, then is followed by the inputs as variable names, then a colon character, and finally the actual function. The result of putting the inputs through the function is returned. An equivalent way to lowercase the Genre
column is as follows:
def lowercase(x):
    return x.lower()

itunes_df['Genre'].apply(lowercase)
Here, we define a function called lowercase
, which returns the lowercased version of the input. Then we simply give this function to apply
. However, pandas has a built-in method for lowercasing strings, which makes for cleaner and simpler code:
itunes_df['Genre'].str.lower()
Often it's better to stick with the built-in pandas solution for simplicity. Exceptions to this rule occur when we are doing something rather complex that's not built-in.
If we do need to use apply
, an easy way to potentially speed it up is with swifter
. This is a package in Python that attempts to automatically parallelize our apply
code. We can use it like so:
import swifter
itunes_df['Genre'].swifter.apply(lambda x: x.lower())
Another option for parallelization of apply
is to use the Dask package, which swifter
will use if it is the best solution.
The various built-in pandas functions include string methods (such as df['Genre'].str.lower()
), math methods (such as df['Bytes'].mean()
), and datetime methods (such as df['date'].dt.month
).
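These accessor methods can be chained, which keeps cleanup code compact. A small sketch on our data (the exact cleanup steps are illustrative):
# strip whitespace and normalize capitalization in one pass
itunes_df['Genre'].str.strip().str.title()
# boolean mask of genres containing the word "Rock"
itunes_df['Genre'].str.contains('Rock')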
One last useful tool we'll cover is groupby
. This is just like in SQL – group by unique values in a column. For example, we can look at the average length of songs by genre, and sort them from least to greatest:
itunes_df.groupby('Genre').mean()['Seconds'].sort_values().head()
First, we take our DataFrame, then use the groupby
method. We supply the column name we want to group by, and then take the average, or mean.
This returns a pandas Series. We can then use the sort_values()
method to sort from least to greatest. Finally, we use head()
to get only the first five rows:
Genre
Rock And Roll 134.643500
Opera 174.813000
Hip Hop/Rap 178.176286
Easy Listening 189.164208
Bossa Nova 219.590000
Name: Seconds, dtype: float64
We can see Rock And Roll
has the shortest average song length.
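If we want more than one statistic per group, groupby pairs nicely with agg – a minimal sketch, assuming the same Seconds column used above:
# mean, median, and count of song lengths per genre
itunes_df.groupby('Genre')['Seconds'].agg(['mean', 'median', 'count']).head()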
Lastly, we often want to save our data after preprocessing and cleaning. Pandas offers several ways to save data: CSV, Excel, HDF5, and many others (detailed well in the documentation: Input/output — pandas 2.2.0 documentation). All of the major read
functions have a corresponding to
function that saves the data to disk. For example, to save our iTunes data to a CSV:
itunes_df.to_csv('data/saved_itunes_data.csv', index=False)
We first give the filename as an argument for to_csv
, and then tell it to not write the index to the file with index=False
. This filename would save the data as saved_itunes_data.csv
in the directory named "data" within the same folder/directory where the code is being run.
There are many other ways to save data. Some others I like to use are HDF and feather. HDF and Parquet files offer compression, and HDF allows us to append to files and retrieve only parts of the data at a time (via the index). Feather files are nice because they are very fast, compressed, and were designed for passing data between R and Python. However, feather is not considered a good idea for longer-term storage because the format could change. One last option to consider is writing to an Excel file (df.to_excel(filename)
) if the data needs to be shared with less technical colleagues.
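The corresponding writer methods follow the same pattern as to_csv. A sketch of the Parquet and Feather options mentioned above (both need the pyarrow package installed, and the filenames here are just examples):
itunes_df.to_parquet('data/saved_itunes_data.parquet')
# Feather requires a default integer index, so we reset it first
itunes_df.reset_index(drop=True).to_feather('data/saved_itunes_data.feather')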
There is a lot more on the advanced side of the pandas package – aggregations, working with temporal data, and reshaping data. To learn more about advanced pandas usage, consider Packt's Pandas 1.x Cookbook, by Matt Harrison and Theodore Petrou, which is highly rated.
For our second example, we'll use the same bitcoin price data from the Test your knowledge section in Chapter 3, SQL and Built-in File Handling Modules in Python. We can load the data like so:
btc_df = pd.read_csv('data/bitcoin_price.csv')
btc_df.head()
The first five rows look like this:
Figure 4.10: The first five rows of our bitcoin price data
The symbol
column is all btcusd
. You can verify this by examining unique values:
btc_df['symbol'].unique()
Let's drop this column since it does not give us any information:
btc_df.drop('symbol', axis=1, inplace=True)
Next, we are going to convert the time
column to a pandas datetime
datatype:
btc_df['time'] = pd.to_datetime(btc_df['time'], unit='ms')
In this line of code, we use the handy pandas function, to_datetime
, to convert our time
column to a datetime
. Often, this function can auto-detect the format of the datetime data. However, in this case, it fails – by default it treats integer timestamps as nanoseconds since the epoch rather than milliseconds, so we provide the argument unit='ms'
. "Seconds since the epoch" means the number of seconds since 1-1-1970 and is used widely in computer science and programming. If you see a datetime column that is a large integer, it's probably the time since the epoch or epoch time.
We can quickly figure out whether a timestamp is in seconds, milliseconds, or another unit (such as nanoseconds) by pasting it into Epoch Converter - Unix Timestamp Converter or another online "time since the epoch" conversion tool. We can also divide the number by 1e9: if the result is a single digit in the ones place (like 1.6), the timestamp is in seconds; if it comes out in the thousands (like 1600), it is in milliseconds.
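We can also sanity-check the unit directly with to_datetime itself – a quick sketch with a round number (1,600,000,000 seconds since the epoch is mid-September 2020):
pd.to_datetime(1600000000, unit='s')   # a sensible 2020 date
pd.to_datetime(1600000000, unit='ms')  # lands in January 1970, clearly the wrong unit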
We can confirm that the conversion worked by examining btc_df.info()
, which should now show datetime64[ns]
for the time column's datatype.
Next, we set the time
column as the index:
btc_df.set_index('time', inplace=True)
This allows us to easily plot the data:
btc_df['close'].plot(logy=True)
Here, we are plotting the daily closing price as a line plot and using a logarithmic scale on the y-axis with logy=True
. This means the y-axis has major tick marks equally spaced by powers of 10 (for example, 10, 100, 1,000), which makes it easier to visualize data that spans a large range of values. We can see the results in Figure 4.11:
Figure 4.11: A line plot of the daily close prices for bitcoin in USD
That was a few lines of code to get the data preprocessed to a point where we can easily plot it as a time series. If the time
column was a datetime string, like 12-11-2020, we could load it directly as a datetime index like this:
btc_df = pd.read_csv('data/bitcoin_price.csv', index_col='time', parse_dates=['time'], infer_datetime_format=True)
The index_col
argument tells pandas to set that column as the index. The parse_dates
argument will parse the provided list of columns as datetimes. Finally, the infer_datetime_format
argument is a handy trick – it auto-detects the datetime format of the columns that will be parsed to datetimes. Unfortunately, it doesn't work with seconds since the epoch, although we could instead provide a function to the date_parser
argument like so:
date_parser = lambda x: pd.to_datetime(x, unit='ms')
btc_df = pd.read_csv('data/bitcoin_price.csv', index_col='time', parse_dates=['time'], date_parser=date_parser)
Here, we create a function called date_parser
(the same as the argument name, which is a common practice) which parses dates as milliseconds since the epoch.
Now that we have a datetime index, we can do some other handy things, like easily indexing times. Here is an example of getting the data from 2019 using a date range:
btc_df.loc['1-1-2019':'12-31-2019']
It's even simpler to provide the year: btc_df.loc['2019']
.
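A datetime index also unlocks resampling, which aggregates the data to a new time frequency – a small sketch:
# average daily close per month
btc_df['close'].resample('M').mean().head()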
As always, it's best to run through the additional EDA and data cleaning steps we went through in the last section and didn't cover here. There are also a lot of other analytic steps and methods we can apply with pandas to datetimes that we won't cover here; the Pandas 1.x Cookbook mentioned earlier covers much of the rest of the datetime and time series functionality in pandas.
Another library that's useful for dealing with data is NumPy (numpy
). The name stands for "Numeric Python," and it has many tools for advanced mathematical calculations and the representation of numeric data. NumPy is used by other Python packages for computations, such as the scikit-learn machine learning library. In fact, pandas is built on top of NumPy. With NumPy, we'll learn:
The pandas library actually stores its data as NumPy arrays. An array is similar to a list, but has more capabilities and properties. We can extract an array from our DataFrame like so:
close_array = btc_df['close'].values
This gives us a NumPy array:
array([ 93.033 , 103.999 , 118.22935407, ...,
17211.69580098, 17171. , 17686.840768 ])
NumPy arrays can be multidimensional, like DataFrames. They also have similar properties to DataFrames, such as the shape attribute (close_array.shape) and a datatype (close_array.dtype).
Another way to get a NumPy array is by creating it from a list:
import numpy as np
close_list = btc_df['close'].to_list()
close_array = np.array(close_list)
First, we import the NumPy library with the alias np
, which is typical. Then we use the to_list
method of our DataFrame's close
column (which is a pandas Series), and finally convert it to a NumPy array with the function np.array
.
One reason NumPy arrays are useful is we can execute math operations more easily and in less compute time. This speed boost is due to something called vectorization, where operations are applied to a whole array instead of one element at a time. For example, if we want to scale down our closing bitcoin prices by 1,000 (putting the units in kilodollars), we can do this:
kd_close = close_array / 1000
Common math operations, including addition, subtraction, and so on, are available. Of course, we could do this with a list comprehension or for
loop:
kd_close_list = [c / 1000 for c in close_list]
The advantage of NumPy is that it executes much faster, since NumPy is mostly written in C and is vectorized. We can use the magic command %timeit
(or %%timeit
for more than one line of code) in Jupyter Notebook or IPython to measure how long the execution is for the two preceding examples:
%timeit kd_close = close_array / 1000
and
%timeit kd_close_list = [c / 1000 for c in close_list]
For NumPy, this returns something like this (it will differ depending on the machine this is run on):
3.49 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Compare this with the list comprehension:
167 µs ± 5.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That is a massive difference in speed of about 50x! Notice that this is the same way we used simple math operators with pandas – this is because pandas is built on top of NumPy.
NumPy also allows for element-by-element multiplication. If we wanted to get the market cap from our bitcoin data, we would multiply volume and the closing prices:
volume_array = btc_df['volume'].values
close_array * volume_array
Since pandas uses NumPy under the hood, this actually works with pandas too – we could just as easily use our DataFrame:
btc_df['market_cap'] = btc_df['close'] * btc_df['volume']
Lastly, let's take a look at NumPy's mathematical functions. These are well-documented in NumPy's documentation (Mathematical functions — NumPy v1.26 Manual). Many of these functions are already included in pandas, but some are not. For example, if we wanted to logarithmically scale our data, like we did when plotting it, we could do this with NumPy:
np.log(btc_df['close'])
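As a final sketch, NumPy is also convenient for quick element-wise calculations that combine neighboring values, such as day-over-day returns (pandas' pct_change method does the same thing on a Series):
# simple daily returns: today's change divided by yesterday's close
returns = np.diff(close_array) / close_array[:-1]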
Many other mathematical functions and abilities exist within NumPy, but often these are only needed for more advanced work. If you are interested in taking a deep dive with NumPy, a book from Packt's collection that can help you is Mastering Numerical Computing with NumPy, by Umit Mert Cakmak and Mert Cuhadaroglu.
This chapter was rather long, but it makes sense – as we've covered a few times, data scientists can spend anywhere between 25% and 75% (sometimes upwards of 90%) of their time cleaning and preparing data. The pandas package is the main package for loading and cleaning data in Python (which is built on top of NumPy), so it's important we have a basic grasp of how to use pandas for data preparation and cleaning. We've seen the core of pandas from beginning to end:
We also took a look at NumPy, but keep in mind that most NumPy functionality can be used directly from pandas. It's only when you need more advanced math that you might have to turn to NumPy.
In our next chapter, we'll take our EDA and visualization skills to a whole new level.