Pandas (which is a portmanteau of "panel data") is one of the most important packages to grasp when you’re starting to learn Python.
The package is known for a very useful data structure called the pandas DataFrame. Pandas also allows Python developers to easily deal with tabular data (like spreadsheets) within a Python script.
This tutorial will teach you the fundamentals of pandas that you can use to build data-driven Python applications today.
You can skip to a specific section of this pandas tutorial using the table of contents below:
Introduction to Pandas
Pandas Series
Pandas DataFrames
How to Deal With Missing Data in Pandas DataFrames
The Pandas groupby Method
What is the Pandas groupby Feature?
The Pandas concat Method
The Pandas merge Method
The Pandas join Method
Other Common Operations in Pandas
Local Data Input and Output (I/O) in Pandas
Remote Data Input and Output (I/O) in Pandas
Final Thoughts & Special Offer
Pandas is a widely-used Python library built on top of NumPy. Much of the rest of this course will be dedicated to learning about pandas and how it is used in the world of finance.
Pandas is a Python library created by Wes McKinney, who built it to make working with datasets in Python easier for his work in finance.
According to the library’s website, pandas is “a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.”
Pandas stands for ‘panel data’. Note that pandas is typically stylized as an all-lowercase word, although it is considered a best practice to capitalize its first letter at the beginning of sentences.
Pandas is an open source library, which means that anyone can view its source code and make suggestions using pull requests. If you are curious about this, visit the pandas source code repository on GitHub.
Pandas was designed to work with two-dimensional data (similar to Excel spreadsheets). Just as the NumPy library has a built-in data structure called an array with special attributes and methods, the pandas library has a built-in two-dimensional data structure called a DataFrame.
As we mentioned earlier in this course, advanced Python practitioners will spend much more time working with pandas than they spend working with NumPy.
Over the next several sections, we will cover the following information about the pandas library:
In this section, we’ll be exploring pandas Series, which are a core component of the pandas library for Python programming.
Series are a special type of data structure available in the pandas Python library. Pandas Series are similar to NumPy arrays, except that we can give them a named or datetime index instead of just a numerical index.
To work with pandas Series, you’ll need to import both NumPy and pandas, as follows:
import numpy as np
import pandas as pd
For the rest of this section, I will assume that both of those imports have been executed before running any code blocks.
There are a number of different ways to create a pandas Series. We will explore all of them in this section.
First, let’s create a few starter variables - specifically, we’ll create two lists, a NumPy array, and a dictionary.
labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array([10, 20, 30])
d = {'a':10, 'b':20, 'c':30}
The easiest way to create a pandas Series is by passing a vanilla Python list into the pd.Series() method. We do this with the my_list variable below:
pd.Series(my_list)
If you run this in your Jupyter Notebook, you will notice that the output is quite different than it is for a normal Python list:
0 10
1 20
2 30
dtype: int64
The output shown above is clearly designed to present as two columns. The second column is the data from my_list. What is the first column?
One of the key advantages of using pandas Series over NumPy arrays is that they allow for labeling. As you might have guessed, that first column is a column of labels.
We can add labels to a pandas Series using the index argument, like this:
pd.Series(my_list, index=labels)
#Remember - we created the 'labels' list earlier in this section
The output of this code is below:
a 10
b 20
c 30
dtype: int64
Why would you want to use labels in a pandas Series? The main advantage is that it allows you to reference an element of the Series using its label instead of its numerical index. To be clear, once labels have been applied to a pandas Series, you can use either its numerical index or its label.
An example of this is below.
series = pd.Series(my_list, index=labels)
series[0]
#Returns 10
series['a']
#Also returns 10
You might have noticed that the ability to reference an element of a Series using its label is similar to how we can reference the value of a key-value pair in a dictionary. Because of this similarity in how they function, you can also pass in a dictionary to create a pandas Series. We'll use the d = {'a': 10, 'b': 20, 'c': 30} dictionary that we created earlier as an example:
pd.Series(d)
This code’s output is:
a 10
b 20
c 30
dtype: int64
It may not yet be clear why we have explored two new data structures (NumPy arrays and pandas Series) that are so similar. Next, we'll explore the main advantage of pandas Series over NumPy arrays.
While we didn't encounter it at the time, NumPy arrays are highly limited by one characteristic: every element of a NumPy array must be the same type of data structure. Said differently, NumPy array elements must be all strings, or all integers, or all booleans - you get the point.
Pandas Series do not suffer from this limitation. In fact, pandas Series are highly flexible.
As an example, you can pass three of Python’s built-in functions into a pandas Series without getting an error:
pd.Series([sum, print, len])
Here’s the output of that code:
这是该代码的输出:
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
To be clear, the example above is highly impractical and not something we would ever execute in practice. It is, however, an excellent example of the flexibility of the pandas Series data structure.
NumPy allows developers to work with both one-dimensional NumPy arrays (sometimes called vectors) and two-dimensional NumPy arrays (sometimes called matrices). We explored pandas Series in the last section, which are similar to one-dimensional NumPy arrays.
In this section, we will dive into pandas DataFrames, which are similar to two-dimensional NumPy arrays - but with much more functionality. DataFrames are the most important data structure in the pandas library, so pay close attention throughout this section.
A pandas DataFrame is a two-dimensional data structure that has labels for both its rows and columns. For those familiar with Microsoft Excel, Google Sheets, or other spreadsheet software, DataFrames are very similar.
Here is an example of a pandas DataFrame being displayed within a Jupyter Notebook.
We will now go through the process of recreating this DataFrame step-by-step.
First, you’ll need to import both the NumPy and pandas libraries. We have done this before, but in case you’re unsure, here’s another example of how to do that:
import numpy as np
import pandas as pd
We’ll also need to create lists for the row and column names. We can do this using vanilla Python lists:
rows = ['X','Y','Z']
cols = ['A', 'B', 'C', 'D', 'E']
Next, we will need to create a NumPy array that holds the data contained within the cells of the DataFrame. I used NumPy's np.random.randn method for this. I also wrapped that method in the np.round method (with a second argument of 2), which rounds each data point to 2 decimal places and makes the data structure much easier to read.
Here's the final line of code that generates the data.
data = np.round(np.random.randn(3,5),2)
Once this is done, you can wrap all of the constituent variables in the pd.DataFrame method to create your first DataFrame!
pd.DataFrame(data, rows, cols)
There is a lot to unpack here, so let’s discuss this example in a bit more detail.
First, it is not necessary to create each variable outside of the DataFrame itself. You could have created this DataFrame in one line like this:
pd.DataFrame(np.round(np.random.randn(3,5),2), ['X','Y','Z'], ['A', 'B', 'C', 'D', 'E'])
With that said, declaring each variable separately makes the code much easier to read.
Second, you might be wondering if it is necessary to put rows into the pd.DataFrame method before columns. It is indeed necessary. If you tried running pd.DataFrame(data, cols, rows), your Jupyter Notebook would generate the following error message:
ValueError: Shape of passed values is (3, 5), indices imply (5, 3)
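If the positional argument order is hard to remember, you can sidestep it entirely with the index and columns keyword arguments, which the pd.DataFrame constructor accepts. A minimal sketch using the variables we defined above:

df = pd.DataFrame(data, index=rows, columns=cols)
#Keyword arguments make the row/column assignment explicit, so no ValueError is possible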
Next, we will explore the relationship between pandas Series and pandas DataFrames.
Let’s take another look at the pandas DataFrame that we just created:
If you had to verbally describe a pandas DataFrame, one way to do so might be "a set of labeled columns containing data where each column shares the same set of row indices."
Interestingly enough, each of these columns is actually a pandas Series! So we can modify our definition of the pandas DataFrame to match its formal definition:
“A set of pandas Series that shares the same index.”
We can actually call a specific Series from a pandas DataFrame using square brackets, just like how we call an element from a list. A few examples are below:
df = pd.DataFrame(data, rows, cols)
df['A']
"""
Returns:
X -0.66
Y -0.08
Z 0.64
Name: A, dtype: float64
"""
df['E']
"""
Returns:
X -1.46
Y 1.71
Z -0.20
Name: E, dtype: float64
"""
What if you wanted to select multiple columns from a pandas DataFrame? You can pass in a list of columns, either directly in the square brackets - such as df[['A', 'E']] - or by declaring the variable outside of the square brackets like this:
columnsIWant = ['A', 'E']
df[columnsIWant]
#Returns the DataFrame, but only with columns A and E
You can also select a specific element of a specific row using chained square brackets. For example, if you wanted the element contained in column A at index X (which is the element in the top left cell of the DataFrame), you could access it with df['A']['X'].
A few other examples are below.
df['B']['Z']
#Returns 1.34
df['D']['Y']
#Returns -0.64
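Chained brackets work, but pandas' loc attribute also lets you fetch the same cell with the row and column labels in a single lookup. A small equivalent sketch:

df.loc['X', 'A']
#Same cell as df['A']['X'], fetched in one indexing operation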
You can create a new column in a pandas DataFrame by specifying the column as though it already exists, and then assigning it a new pandas Series.
As an example, in the following code block we create a new column called ‘A + B’ which is the sum of columns A and B:
df['A + B'] = df['A'] + df['B']
df
#The last line prints out the new DataFrame
Here’s the output of that code block:
To remove this column from the pandas DataFrame, we need to use the pd.DataFrame.drop method.
Note that this method defaults to dropping rows, not columns. To switch the method to operate on columns, we must pass in the axis=1 argument.
df.drop('A + B', axis = 1)
It is very important to note that this drop method does not actually modify the DataFrame itself. For evidence of this, print out the df variable again, and notice how it still has the A + B column:
df
The reason that drop (and many other DataFrame methods!) do not modify the data structure by default is to prevent you from accidentally deleting data.
There are two ways to make pandas automatically overwrite the current DataFrame.
The first is by passing in the argument inplace=True, like this:
df.drop('A + B', axis=1, inplace=True)
The second is by using an assignment operator that manually overwrites the existing variable, like this:
df = df.drop('A + B', axis=1)
Both options are valid but I find myself using the second option more frequently because it is easier to remember.
The drop method can also be used to drop rows. For example, we can remove the row Z as follows:
df.drop('Z')
We have already seen that we can access a specific column of a pandas DataFrame using square brackets. We will now see how to access a specific row of a pandas DataFrame, with the similar goal of generating a pandas Series from the larger data structure.
DataFrame rows can be accessed by their row label using the loc attribute along with square brackets. An example is below.
df.loc['X']
Here is the output of that code:
A -0.66
B -1.43
C -0.88
D 1.60
E -1.46
Name: X, dtype: float64
DataFrame rows can also be accessed by their numerical index using the iloc attribute along with square brackets. An example is below.
df.iloc[0]
As you would expect, this code has the same output as our last example:
A -0.66
B -1.43
C -0.88
D 1.60
E -1.46
Name: X, dtype: float64
There are many cases where you’ll want to know the shape of a pandas DataFrame. By shape, I am referring to the number of columns and rows in the data structure.
Pandas has a built-in attribute called shape that allows us to easily access this:
df.shape
#Returns (3, 5)
We have already seen how to select rows, columns, and elements from a pandas DataFrame. In this section, we will explore how to select a subset of a DataFrame. Specifically, let's select the elements from columns A and B and rows X and Y.
We can actually approach this in a step-by-step fashion. First, let's select columns A and B:
df[['A', 'B']]
Then, let's select rows X and Y:
df[['A', 'B']].loc[['X', 'Y']]
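If you prefer a single call, the loc attribute also accepts the row and column label lists together. A minimal equivalent sketch:

df.loc[['X', 'Y'], ['A', 'B']]
#Same subset, selected in one loc lookup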
And we’re done!
If you recall from our discussion of NumPy arrays, we were able to select certain elements of the array using conditional operators. For example, if we had a NumPy array called arr and we only wanted the values of the array that were larger than 4, we could use the command arr[arr > 4].
Pandas DataFrames follow a similar syntax. For example, if we wanted to know where our DataFrame has values that were greater than 0.5, we could type df > 0.5 to get the following output:
We can also generate a new pandas DataFrame that contains the normal values where the statement is True, and NaN - which stands for Not a Number - where the statement is False. We do this by passing the statement into the DataFrame using square brackets, like this:
df[df > 0.5]
Here is the output of that code:
You can also use conditional selection to return a subset of the DataFrame where a specific condition is satisfied in a specified column.
To be more specific, let's say that you wanted the subset of the DataFrame where the value in column C was less than 1. This is only true for row X.
You can get an array of the boolean values associated with this statement like this:
df['C'] < 1
Here’s the output:
X True
Y False
Z False
Name: C, dtype: bool
You can also get the DataFrame's actual values relative to this conditional selection command by typing df[df['C'] < 1], which outputs just the first row of the DataFrame (since this is the only row where the statement is true for column C):
You can also chain together multiple conditions while using conditional selection. We do this using pandas' & operator. You cannot use Python's normal and operator, because in this case we are not comparing two boolean values. Instead, we are comparing two pandas Series that contain boolean values, which is why the & character is used instead.
As an example of multiple conditional selection, you can return the DataFrame subset that satisfies df['C'] > 0 and df['A'] > 0 with the following code:
df[(df['C'] > 0) & (df['A']> 0)]
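The same pattern works for "or" logic using pandas' | operator. A small sketch:

df[(df['C'] > 0) | (df['A'] > 0)]
#Returns rows where column C is positive OR column A is positive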
There are a number of ways that you can modify the index of a pandas DataFrame.
The most basic is to reset the index to its default numerical values. We do this using the reset_index method:
df.reset_index()
Note that this creates a new column in the DataFrame called index that contains the previous index labels:
Note that like the other DataFrame operations that we have explored, reset_index does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.
You can also set an existing column as the index of the DataFrame using the set_index method. We can set column A as the index of the DataFrame using the following code:
df.set_index('A')
The values of A are now in the index of the DataFrame:
There are three things worth noting here:
First, set_index does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.
Second, unless you run reset_index first, performing a set_index operation with inplace=True or a forced = assignment operator will permanently overwrite your current index values.
Third, if you want to rename your index to labels that are not currently contained in a column, you can do so by (1) creating a NumPy array with those values, (2) adding those values as a new column of the pandas DataFrame, and (3) running the set_index operation.
The last DataFrame operation we’ll discuss is how to rename their columns.
Columns are an attribute of a pandas DataFrame, which means we can call them and modify them using a simple dot operator. For example:
df.columns
#Returns Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
The assignment operator is the best way to modify this attribute:
df.columns = [1, 2, 3, 4, 5]
df
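If you only want to rename some columns rather than reassigning the whole list, pandas also provides a rename method. A minimal sketch (using the integer column names we just assigned):

df.rename(columns={1: 'first_column'})
#Renames only column 1; like drop, this returns a new DataFrame unless inplace=True is passed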
In an ideal world we will always work with perfect data sets. However, this is never the case in practice. There are many cases when working with quantitative data that you will need to drop or modify missing data. We will explore strategies for handling missing data in Pandas throughout this section.
We will be using the np.nan attribute to generate NaN values throughout this section.
np.nan
#Returns nan
In this section, we will make use of the following DataFrame:
df = pd.DataFrame(np.array([[1, 5, 1],[2, np.nan, 2],[np.nan, np.nan, 3]]))
df.columns = ['A', 'B', 'C']
df
The Pandas dropna Method
Pandas has a built-in method called dropna. When applied against a DataFrame, the dropna method will remove any rows that contain a NaN value.
Let's apply the dropna method to our df DataFrame as an example:
df.dropna()
Note that like the other DataFrame operations that we have explored, dropna does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.
We can also drop any columns that have missing values by passing in the axis=1 argument to the dropna method, like this:
df.dropna(axis=1)
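The dropna method also accepts a thresh argument, which keeps rows that have at least a given number of non-missing values. A small sketch:

df.dropna(thresh=2)
#Keeps only rows containing at least 2 non-NaN values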
The Pandas fillna Method
In many cases, you will want to replace missing values in a pandas DataFrame instead of dropping them completely. The fillna method is designed for this.
As an example, let's fill every missing value in our DataFrame with an emoji:

df.fillna('🔥')
Obviously, there is basically no situation where we would want to replace missing data with an emoji. This was simply an amusing example.
Instead, more commonly we will replace a missing value with the average value of its column, either for every column in the DataFrame at once or for one specific column.
We will demonstrate both below.
To fill the missing values in every column of the DataFrame at once, each with the average value of its own column, use the following code:
df.fillna(df.mean())
To fill the missing values within a particular column with the average value from that column, use the following code (this is for column A):
df['A'].fillna(df['A'].mean())
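As with drop and the other methods above, fillna returns a new object rather than modifying the DataFrame in place, so to persist the fill you would assign the result back. A minimal sketch:

df['A'] = df['A'].fillna(df['A'].mean())
#Overwrites column A with its filled version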
The Pandas groupby Method
In this section, we will be discussing how to use the pandas groupby feature.
What is the Pandas groupby Feature?
Pandas comes with a built-in groupby feature that allows you to group together rows based off of a column and perform an aggregate function on them. For example, you could calculate the sum of all rows that have a value of 1 in the column ID.
For anyone familiar with the SQL language for querying databases, the pandas groupby method is very similar to a SQL groupby statement.
It is easiest to understand the pandas groupby method using an example. We will be using the following DataFrame:
df = pd.DataFrame([['Google', 'Sam', 200],
                   ['Google', 'Charlie', 120],
                   ['Salesforce', 'Ralph', 125],
                   ['Salesforce', 'Emily', 250],
                   ['Adobe', 'Rosalynn', 150],
                   ['Adobe', 'Chelsea', 500]])
df.columns = ['Organization', 'Salesperson Name', 'Sales']
df
This DataFrame contains sales information for three separate organizations: Google, Salesforce, and Adobe. We will use the groupby method to get summary sales data for each specific organization.
To start, we will need to create a groupby object. This is a data structure that tells Python which column you'd like to group the DataFrame by. In our case, it is the Organization column, so we create a groupby object like this:
df.groupby('Organization')
If you see an output that looks like this, you will know that you have created the object successfully:
Once the groupby object has been created, you can call operations on that object to create a DataFrame with summary information on the Organization groups. A few examples are below:
df.groupby('Organization').mean()
#The mean (or average) of the sales column
df.groupby('Organization').sum()
#The sum of the sales column
df.groupby('Organization').std()
#The standard deviation of the sales column
Note that since all of the operations above are numerical, they will automatically ignore the Salesperson Name column, because it only contains strings.
Here are a few other aggregate functions that work well with pandas' groupby method:
df.groupby('Organization').count()
#Counts the number of observations
df.groupby('Organization').max()
#Returns the maximum value
df.groupby('Organization').min()
#Returns the minimum value
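You can also compute several aggregates in a single pass using the agg method. A small sketch, looking only at the Sales column:

df.groupby('Organization')['Sales'].agg(['mean', 'sum', 'count'])
#One row per organization, one column per aggregate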
Using groupby With The describe Method
One very useful tool when working with pandas DataFrames is the describe method, which returns useful information for every category that the groupby function is working with.
This is best learned through an example. I've combined the groupby and describe methods below:
df.groupby('Organization').describe()
Here is what the output looks like:
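If the describe output is too wide to read comfortably, one common trick is to transpose it so that each statistic becomes a row:

df.groupby('Organization').describe().transpose()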
The Pandas concat Method
In this section, we will learn how to concatenate pandas DataFrames. This will be a brief section, but it is an important concept nonetheless. Let's dig in!
To demonstrate how to concatenate pandas DataFrames, I will be using the following 3 example DataFrames:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
Anyone who has taken my Introduction to Python course will remember that string concatenation means adding one string to the end of another string. An example of string concatenation is below.
str1 = "Hello "
str2 = "World!"
str1 + str2
#Returns 'Hello World!'
DataFrame concatenation is quite similar. It means adding one DataFrame to the end of another DataFrame.
In order for us to perform DataFrame concatenation, the DataFrames should have the same columns. An example is below:
pd.concat([df1, df2, df3])
By default, pandas will concatenate along axis=0, which means that it is adding rows, not columns.
If you want to concatenate along columns instead, pass in axis=1 as a new argument to the concat function.
pd.concat([df1,df2,df3],axis=1)
In our case, this creates a very ugly DataFrame with many missing values:
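Relatedly, when concatenating rows whose indices overlap (or when you simply do not care about the original index), you can ask concat to build a fresh one. A small sketch:

pd.concat([df1, df2, df3], ignore_index=True)
#Stacks the rows and generates a brand-new 0..n-1 index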
The Pandas merge Method
In this section, we'll learn how to merge pandas DataFrames.
In this section, we will be using the following two pandas DataFrames:
import pandas as pd

leftDataFrame = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                              'A': ['A0', 'A1', 'A2', 'A3'],
                              'B': ['B0', 'B1', 'B2', 'B3']})

rightDataFrame = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                               'C': ['C0', 'C1', 'C2', 'C3'],
                               'D': ['D0', 'D1', 'D2', 'D3']})
The columns A, B, C, and D have real data in them, while the column key has a key that is common to both DataFrames. To merge two DataFrames means to connect them along one column that they both have in common.
You can merge two pandas DataFrames along a common column using the merge method. For anyone that is familiar with the SQL programming language, this is very similar to performing an inner join in SQL.
Do not worry if you are unfamiliar with SQL, because merge syntax is actually very straightforward. It looks like this:
pd.merge(leftDataFrame, rightDataFrame, how='inner', on='key')
Let's break down the four arguments we passed into the merge method:
leftDataFrame: This is the DataFrame that we'd like to merge on the left.
rightDataFrame: This is the DataFrame that we'd like to merge on the right.
how='inner': This is the type of merge that the operation is performing. There are multiple types of merges, but we will only be covering inner merges in this course.
on='key': This is the column that you'd like to perform the merge on. Since key was the only column in common between the two DataFrames, it was the only option that we could use to perform the merge.
The Pandas join Method
In this section, you will learn how to join pandas DataFrames.
We will be using the following two DataFrames in this section:
leftDataFrame = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                              'B': ['B0', 'B1', 'B2', 'B3']},
                             index=['K0', 'K1', 'K2', 'K3'])

rightDataFrame = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                               'D': ['D0', 'D1', 'D2', 'D3']},
                              index=['K0', 'K1', 'K2', 'K3'])
If these look familiar, it's because they are! These are nearly the same DataFrames as we used when learning how to merge pandas DataFrames. A key difference is that instead of the key column being its own column, it is now the index of the DataFrame. You can think of these DataFrames as those from the last section after executing .set_index('key').
Joining pandas DataFrames is very similar to merging pandas DataFrames except that the keys on which you’d like to combine are in the index instead of contained within a column.
To join these two DataFrames, we can use the following code:
leftDataFrame.join(rightDataFrame)
This section will explore common operations in the pandas Python library. The purpose of this section is to explore important pandas operations that have not fit into any of the sections we’ve discussed so far.
I will be using the following DataFrame in this section:
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D'],
                   'col2': [2, 7, 3, 7],
                   'col3': ['fgh', 'rty', 'asd', 'qwe']})
Pandas has an excellent method called unique that can be used to find unique values within a pandas Series. Note that this method only works on Series and not on DataFrames. If you try to apply this method to a DataFrame, you will encounter an error:
df.unique()
#Returns AttributeError: 'DataFrame' object has no attribute 'unique'
However, since the columns of a pandas DataFrame are each a Series, we can apply the unique method to a specific column, like this:
df['col2'].unique()
#Returns array([2, 7, 3])
Pandas also has a separate nunique method that counts the number of unique values in a Series and returns that value as an integer. For example:
df['col2'].nunique()
#Returns 3
Interestingly, the nunique method is exactly the same as len(unique()), but it is a common enough operation that the pandas community decided to create a specific method for this use case.
Pandas has a method called value_counts that allows you to easily count the number of times each observation occurs. An example is below:
df['col2'].value_counts()
"""
Returns:
7 2
2 1
3 1
Name: col2, dtype: int64
"""
How To Use The Pandas apply Method
The apply method is one of the most powerful methods available in the pandas library. It allows you to apply a custom function to every element of a pandas Series.
As an example, imagine that we had the following function exponentify that takes in an integer and raises it to the power of itself:
def exponentify(x):
    return x**x
The apply method allows you to easily apply the exponentify function to each element of the Series:
df['col2'].apply(exponentify)
"""
Returns:
0 4
1 823543
2 27
3 823543
Name: col2, dtype: int64
"""
The apply method can also be used with built-in functions like len (although it is definitely more powerful when used with custom functions). An example of the len function being used in conjunction with apply is below:
df['col3'].apply(len)
"""
Returns
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
"""
You can sort a pandas DataFrame by the values of a particular column using the sort_values method. As an example, if you wanted to sort by col2 in our DataFrame df, you would run the following command:
df.sort_values('col2')
The output of this command is below:
There are two things to note from this output:
First, the rows keep their original index labels, so the index is now out of order.
Second, as with the other DataFrame methods, this does not actually modify the original DataFrame unless you force it to using the = assignment operator or by passing in inplace=True.
In this section, we will begin exploring data input and output with the pandas Python library.
We will be working with different files containing stock prices for Facebook (FB), Amazon (AMZN), Google (GOOG), and Microsoft (MSFT) in this section. To download these files, download the entire GitHub repository for this course here. The files used in this section can be found in the stock_prices folder of the repository.
You'll want to save these files in the same directory as your Jupyter Notebook for this section. The easiest way to do this is to download the GitHub repository, and then open your Jupyter Notebook in the stock_prices folder of the repository.
How To Import .csv Files Using Pandas
We can import .csv files into a pandas DataFrame using the read_csv method, like this:
import pandas as pd
pd.read_csv('stock_prices.csv')
As you'll see, this creates (and displays) a new pandas DataFrame containing the data from the .csv file.
You can also assign this new DataFrame to a variable to be referenced later using the normal = assignment operator:
new_data_frame = pd.read_csv('stock_prices.csv')
There are a number of read methods included with the pandas programming library. If you are trying to import data from an external document, then it is likely that pandas has a built-in method for this.
A few examples of different read methods are below:
pd.read_json()
pd.read_html()
pd.read_excel()
We will explore some of these methods later in this section.
If we wanted to import a .csv file that was not directly in our working directory, we would need to modify the syntax of the read_csv method slightly.
If the file is in a folder deeper than the one you're working in now, you need to specify the full path of the file in the read_csv method argument. As an example, if the stock_prices.csv file was contained in a folder called new_folder, then we could import it like this:
new_data_frame = pd.read_csv('./new_folder/stock_prices.csv')
For those unfamiliar with working with directory notation, the . at the start of the filepath indicates the current directory. Similarly, a .. indicates the directory one level above the current directory (and ../.. indicates two levels above).
This syntax (using periods) is exactly how we reference (and import) files that are above our current working directory. As an example, open a Jupyter Notebook inside the new_folder folder, and place stock_prices.csv in the parent folder. With this file layout, you could import the stock_prices.csv file using the following command:
new_data_frame = pd.read_csv('../stock_prices.csv')
Note that this directory syntax is the same for all types of file imports, so we will not be revisiting how to import files from different directories when we explore different import methods later in this course.
How To Export .csv Files Using Pandas
To demonstrate how to save a new .csv file, let's first create a new DataFrame. Specifically, let's fill a DataFrame with 3 columns and 50 rows of random data using the np.random.randn method:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(50,3))
Now that we have a DataFrame, we can save it using the to_csv method. This method takes in the name of the new file as its argument.
df.to_csv('my_new_csv.csv')
You will notice that if you run the code above, the new .csv file will begin with an unlabeled column that contains the index of the DataFrame. An example is below (after opening the .csv in Microsoft Excel):
In many cases, this is undesirable. To remove the blank index column, pass in index=False as a second argument to the to_csv method, like this:
df.to_csv('my_new_csv.csv', index=False)
The new .csv file does not have the unlabeled index column:
The read_csv and to_csv methods make it very easy to import and export data from .csv files using pandas. We will see later in this section that for every read method that allows us to import data, there is usually a corresponding to method that allows us to save that data!
How To Import .json Files Using Pandas
If you are not experienced in working with large datasets, then you may not be familiar with the JSON file type.
JSON stands for JavaScript Object Notation. JSON files are very similar to Python Dictionaries.
JSON files are one of the most commonly-used data types among software developers because they can be manipulated using basically every programming language.
Pandas has a method called read_json that makes it very easy to import JSON files as a pandas DataFrame. An example is below.
json_data_frame = pd.read_json('stock_prices.json')
We’ll learn how to export JSON files next.
How To Export .json Files Using Pandas
As I mentioned earlier, there is generally a to method for every read method. This means that we can save a DataFrame to a JSON file using the to_json method.
As an example, let's take the randomly-generated DataFrame df from earlier in this section and save it as a JSON file in our local directory:
df.to_json('my_new_json.json')
We'll learn how to work with Excel files - which have the file extension .xlsx - next.
How To Import .xlsx Files Using Pandas
Pandas' read_excel method makes it very easy to import data from an Excel document into a pandas DataFrame:
new_data_frame = pd.read_excel('stock_prices.xlsx')
Unlike the read_csv and read_json methods that we explored earlier in this section, the read_excel method can accept a second argument. The reason why read_excel accepts multiple arguments is that Excel spreadsheets can contain multiple sheets. The second argument specifies which sheet you are trying to import and is called sheet_name.
As an example, if our stock_prices.xlsx file had a second sheet called Sheet2, you would import that sheet to a pandas DataFrame like this:

new_data_frame = pd.read_excel('stock_prices.xlsx', sheet_name='Sheet2')
If you do not specify any value for sheet_name, then read_excel will import the first sheet of the Excel spreadsheet by default.
While importing Excel documents, it is very important to note that pandas only imports data. It cannot import other Excel capabilities like formatting, formulas, or macros. Trying to import data from an Excel document that has these features may cause pandas to crash.
How To Export .xlsx Files Using Pandas
Exporting Excel files is very similar to importing Excel files, except we use to_excel instead of read_excel. An example is below using our randomly-generated df DataFrame:
df.to_excel('my_new_excel_file.xlsx')
Like read_excel, to_excel accepts a second argument called sheet_name that allows you to specify the name of the sheet that you're saving. For example, we could have named the sheet of the new .xlsx file My New Sheet! by passing it into the to_excel method like this:
df.to_excel('my_new_excel_file.xlsx', sheet_name='My New Sheet!')
If you do not specify a value for sheet_name, then the sheet will be named Sheet1 by default (just like when you create a new Excel document using the actual application).
In the last section of this course, we learned how to import data from .csv, .json, and .xlsx files that were saved on our local computer. We will follow up by showing you how you can import files without actually saving them to your local machine first. This is called remote importing.
Remote importing means bringing a file into your Python script without having that file saved on your computer.
On the surface, it may not seem clear why we might want to engage in remote importing. However, it can be very useful.
The reason why remote importing is useful is because, by definition, it means the Python script will continue to function even if the file being imported is not saved on your computer. This means I can send my code to colleagues or friends and it will still function properly.
Throughout the rest of this section, I will demonstrate how to perform remote imports in pandas for .csv, .json, and .xlsx files.
How To Import Remote .csv Files
First, navigate to this course's GitHub repository. Open up the stock_prices folder. Click on the file stock_prices.csv and then click the button for the Raw file, as shown below.
This will take you to a new page that has the data from the .csv file contained within stock_prices.csv.
To import this remote file into your Python script, you must first copy its URL to your clipboard. You can do this by either (1) highlighting the entire URL, right-clicking the selected text, and clicking copy, or (2) highlighting the entire URL and typing CTRL+C on your keyboard.
The URL will look like this:
https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.csv
You can pass this URL into the read_csv method to import the dataset into a pandas DataFrame without saving the dataset to your computer first:
pd.read_csv('https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.csv')
How To Import Remote .json Files
We can import remote .json files in a similar fashion to .csv files.
First, grab the raw URL from GitHub. It will look like this:
https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.json
Next, pass this URL into the read_json method like this:
pd.read_json('https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.json')
How To Import Remote .xlsx Files
We can import remote .xlsx files in a similar fashion to .csv and .json files. Note that you will need to click in a slightly different place on the GitHub interface. Specifically, you'll need to right-click 'View Raw' and select 'Copy Link Address,' as shown below.
The raw URL will look like this:
https://github.com/nicholasmccullum/advanced-python/blob/master/stock_prices/stock_prices.xlsx?raw=true
Then, pass this URL into the read_excel method, like this:
pd.read_excel('https://github.com/nicholasmccullum/advanced-python/blob/master/stock_prices/stock_prices.xlsx?raw=true')
Remote importing means that you do not need to first save the file being imported onto your local computer, which is an unquestionable benefit.
However, remote importing also has its downsides.
Thanks for reading this article on Pandas, which is one of my favorite Python packages and a must-know library for every Python developer.
This tutorial is an excerpt from my course Python For Finance and Data Science. If you're interested in learning more core Python skills, the course is 50% off for the first 50 freeCodeCamp readers that sign up - click here to get your discounted course now!
Originally published at: https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/