The Ultimate Guide to the Pandas Library for Data Science in Python

Pandas (which is a portmanteau of "panel data") is one of the most important packages to grasp when you're starting to learn Python.

The package is known for a very useful data structure called the pandas DataFrame. Pandas also allows Python developers to easily deal with tabular data (like spreadsheets) within a Python script.

This tutorial will teach you the fundamentals of pandas that you can use to build data-driven Python applications today.

Table of Contents

You can skip to a specific section of this pandas tutorial using the table of contents below:

  • Introduction to Pandas
  • Pandas Series
  • Pandas DataFrames
  • How to Deal With Missing Data in Pandas DataFrames
  • The Pandas groupby Method
  • What is the Pandas groupby Feature?
  • The Pandas concat Method
  • The Pandas merge Method
  • The Pandas join Method
  • Other Common Operations in Pandas
  • Local Data Input and Output (I/O) in Pandas
  • Remote Data Input and Output (I/O) in Pandas
  • Final Thoughts & Special Offer

Introduction to Pandas

Pandas is a widely-used Python library built on top of NumPy. Much of the rest of this course will be dedicated to learning about pandas and how it is used in the world of finance.

What is Pandas?

Pandas is a Python library created by Wes McKinney, who built pandas to make it easier to work with datasets in Python for his finance work at his place of employment.

According to the library's website, pandas is "a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

Pandas stands for 'panel data'. Note that pandas is typically stylized as an all-lowercase word, although it is considered a best practice to capitalize its first letter at the beginning of sentences.

Pandas is an open source library, which means that anyone can view its source code and make suggestions using pull requests. If you are curious about this, visit the pandas source code repository on GitHub.

The Main Benefit of Pandas

Pandas was designed to work with two-dimensional data (similar to Excel spreadsheets). Just as the NumPy library had a built-in data structure called an array with special attributes and methods, the pandas library has a built-in two-dimensional data structure called a DataFrame.

What We Will Learn About Pandas

As we mentioned earlier in this course, advanced Python practitioners will spend much more time working with pandas than they spend working with NumPy.

Over the next several sections, we will cover the following information about the pandas library:

  • Pandas Series
  • Pandas DataFrames
  • How To Deal With Missing Data in Pandas
  • How To Merge DataFrames in Pandas
  • How To Join DataFrames in Pandas
  • How To Concatenate DataFrames in Pandas
  • Common Operations in Pandas
  • Data Input and Output in Pandas
  • How To Save Pandas DataFrames as Excel Files for External Users

Pandas Series

In this section, we'll be exploring pandas Series, which are a core component of the pandas library for Python programming.

What Are Pandas Series?

Series are a special type of data structure available in the pandas Python library. Pandas Series are similar to NumPy arrays, except that we can give them a named or datetime index instead of just a numerical index.

The Imports You'll Require To Work With Pandas Series

To work with pandas Series, you'll need to import both NumPy and pandas, as follows:

import numpy as np
import pandas as pd

For the rest of this section, I will assume that both of those imports have been executed before running any code blocks.

How To Create a Pandas Series

There are a number of different ways to create a pandas Series. We will explore all of them in this section.

First, let's create a few starter variables - specifically, we'll create two lists, a NumPy array, and a dictionary.

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array([10, 20, 30])
d = {'a':10, 'b':20, 'c':30}

The easiest way to create a pandas Series is by passing a vanilla Python list into the pd.Series() method. We do this with the my_list variable below:

pd.Series(my_list)

If you run this in your Jupyter Notebook, you will notice that the output is quite different than it is for a normal Python list:

0    10
1    20
2    30
dtype: int64

The output shown above is clearly designed to present as two columns. The second column is the data from my_list. What is the first column?

One of the key advantages of using pandas Series over NumPy arrays is that they allow for labeling. As you might have guessed, that first column is a column of labels.

We can add labels to a pandas Series using the index argument like this:

pd.Series(my_list, index=labels)
#Remember - we created the 'labels' list earlier in this section

The output of this code is below:

a    10
b    20
c    30
dtype: int64

Why would you want to use labels in a pandas Series? The main advantage is that it allows you to reference an element of the Series using its label instead of its numerical index. To be clear, once labels have been applied to a pandas Series, you can use either its numerical index or its label.

An example of this is below.

series = pd.Series(my_list, index=labels)
series[0]
#Returns 10
series['a']
#Also returns 10

You might have noticed that the ability to reference an element of a Series using its label is similar to how we can reference the value of a key-value pair in a dictionary. Because of this similarity in how they function, you can also pass in a dictionary to create a pandas Series. We'll use the d = {'a': 10, 'b': 20, 'c': 30} dictionary that we created earlier as an example:

pd.Series(d)

This code's output is:

a    10
b    20
c    30
dtype: int64

It may not yet be clear why we have explored two new data structures (NumPy arrays and pandas Series) that are so similar. Next, we'll explore the main advantage of pandas Series over NumPy arrays.

The Main Advantage of Pandas Series Over NumPy Arrays

While we didn't encounter it at the time, NumPy arrays are highly limited by one characteristic: every element of a NumPy array must be the same type of data structure. Said differently, NumPy array elements must be all strings, or all integers, or all booleans - you get the point.

Pandas Series do not suffer from this limitation. In fact, pandas Series are highly flexible.

As an example, you can pass three of Python's built-in functions into a pandas Series without getting an error:

pd.Series([sum, print, len])

Here's the output of that code:

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

To be clear, the example above is highly impractical and not something we would ever execute in practice. It is, however, an excellent example of the flexibility of the pandas Series data structure.

Pandas DataFrames

NumPy allows developers to work with both one-dimensional NumPy arrays (sometimes called vectors) and two-dimensional NumPy arrays (sometimes called matrices). We explored pandas Series in the last section, which are similar to one-dimensional NumPy arrays.

In this section, we will dive into pandas DataFrames, which are similar to two-dimensional NumPy arrays - but with much more functionality. DataFrames are the most important data structure in the pandas library, so pay close attention throughout this section.

What Is A Pandas DataFrame?

A pandas DataFrame is a two-dimensional data structure that has labels for both its rows and columns. For those familiar with Microsoft Excel, Google Sheets, or other spreadsheet software, DataFrames are very similar.

We will now go through the process of creating an example DataFrame step-by-step within a Jupyter Notebook.

First, you'll need to import both the NumPy and pandas libraries. We have done this before, but in case you're unsure, here's another example of how to do that:

import numpy as np
import pandas as pd

We'll also need to create lists for the row and column names. We can do this using vanilla Python lists:

rows = ['X','Y','Z']
cols = ['A', 'B', 'C', 'D', 'E']

Next, we will need to create a NumPy array that holds the data contained within the cells of the DataFrame. I used NumPy's np.random.randn method for this. I also wrapped that method in the np.round method (with a second argument of 2), which rounds each data point to 2 decimal places and makes the data structure much easier to read.

Here's the final statement that generated the data.

data = np.round(np.random.randn(3,5),2)

Once this is done, you can wrap all of the constituent variables in the pd.DataFrame method to create your first DataFrame!

pd.DataFrame(data, rows, cols)

There is a lot to unpack here, so let's discuss this example in a bit more detail.

First, it is not necessary to create each variable outside of the DataFrame itself. You could have created this DataFrame in one line like this:

pd.DataFrame(np.round(np.random.randn(3,5),2), ['X','Y','Z'], ['A', 'B', 'C', 'D', 'E'])

With that said, declaring each variable separately makes the code much easier to read.

Second, you might be wondering if it is necessary to put rows into the DataFrame method before columns. It is indeed necessary. If you tried running pd.DataFrame(data, cols, rows), your Jupyter Notebook would generate the following error message:

ValueError: Shape of passed values is (3, 5), indices imply (5, 3)

Next, we will explore the relationship between pandas Series and pandas DataFrames.

The Relationship Between Pandas Series and Pandas DataFrames

Let's take another look at the pandas DataFrame that we just created.

If you had to verbally describe a pandas DataFrame, one way to do so might be "a set of labeled columns containing data where each column shares the same set of row indices."

Interestingly enough, each of these columns is actually a pandas Series! So we can modify our definition of the pandas DataFrame to match its formal definition:

"A set of pandas Series that shares the same index."

Indexing and Assignment in Pandas DataFrames

We can actually call a specific Series from a pandas DataFrame using square brackets, just like how we call an element from a list. A few examples are below:

df = pd.DataFrame(data, rows, cols)
df['A']
"""
Returns:
X   -0.66
Y   -0.08
Z    0.64
Name: A, dtype: float64
"""

df['E']
"""
Returns:
X   -1.46
Y    1.71
Z   -0.20
Name: E, dtype: float64
"""

What if you wanted to select multiple columns from a pandas DataFrame? You can pass in a list of columns, either directly in the square brackets - such as df[['A', 'E']] - or by declaring the variable outside of the square brackets like this:

columnsIWant = ['A', 'E']
df[columnsIWant]
#Returns the DataFrame, but only with columns A and E

You can also select a specific element of a specific row using chained square brackets. For example, if you wanted the element from column A at row X (which is the element in the top left cell of the DataFrame) you could access it with df['A']['X'].

A few other examples are below.

df['B']['Z']
#Returns 1.34
df['D']['Y']
#Returns -0.64

How To Create and Remove Columns in a Pandas DataFrame

You can create a new column in a pandas DataFrame by specifying the column as though it already exists, and then assigning it a new pandas Series.

As an example, in the following code block we create a new column called 'A + B' which is the element-wise sum of columns A and B:

df['A + B'] = df['A'] + df['B']
df
#The last line prints out the new DataFrame

The printed DataFrame now contains the new A + B column alongside the original five columns.

To remove this column from the pandas DataFrame, we need to use the pd.DataFrame.drop method.

Note that this method defaults to dropping rows, not columns. To switch the method settings to operate on columns, we must pass in the axis=1 argument.

df.drop('A + B', axis=1)

It is very important to note that this drop method does not actually modify the DataFrame itself. For evidence of this, print out the df variable again, and notice how it still has the A + B column:

df

The reason that drop (and many other DataFrame methods!) do not modify the data structure by default is to prevent you from accidentally deleting data.

There are two ways to make pandas automatically overwrite the current DataFrame.

The first is by passing in the argument inplace=True, like this:

df.drop('A + B', axis=1, inplace=True)

The second is by using an assignment operator that manually overwrites the existing variable, like this:

df = df.drop('A + B', axis=1)

Both options are valid, but I find myself using the second option more frequently because it is easier to remember.

The drop method can also be used to drop rows. For example, we can remove the row Z as follows:

df.drop('Z')

How To Select A Row From A Pandas DataFrame

We have already seen that we can access a specific column of a pandas DataFrame using square brackets. We will now see how to access a specific row of a pandas DataFrame, with the similar goal of generating a pandas Series from the larger data structure.

DataFrame rows can be accessed by their row label using the loc attribute along with square brackets. An example is below.

df.loc['X']

Here is the output of that code:

A   -0.66
B   -1.43
C   -0.88
D    1.60
E   -1.46
Name: X, dtype: float64

DataFrame rows can be accessed by their numerical index using the iloc attribute along with square brackets. An example is below.

df.iloc[0]

As you would expect, this code has the same output as our last example:

A   -0.66
B   -1.43
C   -0.88
D    1.60
E   -1.46
Name: X, dtype: float64

How To Determine The Number Of Rows and Columns in a Pandas DataFrame

There are many cases where you'll want to know the shape of a pandas DataFrame. By shape, I am referring to the number of rows and columns in the data structure.

Pandas has a built-in attribute called shape that allows us to easily access this:

df.shape
#Returns (3, 5)

Slicing Pandas DataFrames

We have already seen how to select rows, columns, and elements from a pandas DataFrame. In this section, we will explore how to select a subset of a DataFrame. Specifically, let's select the elements from columns A and B and rows X and Y.

We can actually approach this in a step-by-step fashion. First, let's select columns A and B:

df[['A', 'B']]

Then, let's select rows X and Y:

df[['A', 'B']].loc[['X', 'Y']]

And we're done!
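As an aside, the loc attribute can also do this in a single step, since it accepts a list of rows and a list of columns at once:

df.loc[['X', 'Y'], ['A', 'B']]
#Returns the same 2x2 subset as the chained selection above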

Conditional Selection Using Pandas DataFrames

If you recall from our discussion of NumPy arrays, we were able to select certain elements of the array using conditional operators. For example, if we had a NumPy array called arr and we only wanted the values of the array that were larger than 4, we could use the command arr[arr > 4].

Pandas DataFrames follow a similar syntax. For example, if we wanted to know where our DataFrame has values that were greater than 0.5, we could type df > 0.5 to get back a DataFrame of True and False values with the same shape.

We can also generate a new pandas DataFrame that keeps the original values where the statement is True and holds NaN - which stands for Not a Number - where the statement is False. We do this by passing the statement into the DataFrame using square brackets, like this:

df[df > 0.5]

In the output, every cell that fails the condition now contains NaN.

You can also use conditional selection to return a subset of the DataFrame where a specific condition is satisfied in a specified column.

To be more specific, let's say that you wanted the subset of the DataFrame where the value in column C was less than 1. This is only true for row X.

You can get an array of the boolean values associated with this statement like this:

df['C'] < 1

Here's the output:

X     True
Y    False
Z    False
Name: C, dtype: bool

You can also get the DataFrame's actual values relative to this conditional selection command by typing df[df['C'] < 1], which outputs just the first row of the DataFrame (since this is the only row where the statement is true for column C).

You can also chain together multiple conditions while using conditional selection. We do this using pandas' & operator. You cannot use Python's normal and operator, because in this case we are not comparing two boolean values. Instead, we are comparing two pandas Series that contain boolean values, which is why the & character is used instead.

As an example of multiple conditional selection, you can return the DataFrame subset that satisfies df['C'] > 0 and df['A'] > 0 with the following code:

df[(df['C'] > 0) & (df['A'] > 0)]
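Although only & appears in this tutorial, pandas uses the | character in the same way for "or" conditions, with the same parenthesized syntax; a minimal sketch:

df[(df['C'] > 0) | (df['A'] > 0)]
#Returns rows where either condition holds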

How To Modify The Index of a Pandas DataFrame

There are a number of ways that you can modify the index of a pandas DataFrame.

The most basic is to reset the index to its default numerical values. We do this using the reset_index method:

df.reset_index()

Note that this creates a new column in the DataFrame called index that contains the previous index labels.

Note that like the other DataFrame operations that we have explored, reset_index does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.
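If you would rather discard the old index than keep it as a column, reset_index also accepts a drop parameter; a minimal sketch:

df.reset_index(drop=True)
#Resets to the default numerical index without adding an 'index' column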

You can also set an existing column as the index of the DataFrame using the set_index method. We can set column A as the index of the DataFrame using the following code:

df.set_index('A')

The values of A are now in the index of the DataFrame.

There are three things worth noting here:

  • set_index does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.

  • Unless you run reset_index first, performing a set_index operation with inplace=True or a forced = assignment operator will permanently overwrite your current index values.

  • If you want to rename your index to labels that are not currently contained in a column, you can do so by (1) creating a NumPy array with those values, (2) adding those values as a new column of the pandas DataFrame, and (3) running the set_index operation, as shown in the sketch below.
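Here is a minimal sketch of that third approach (the labels and column name are hypothetical):

new_labels = np.array(['P', 'Q', 'R'])   #(1) hypothetical new labels
df['new_index'] = new_labels             #(2) add them as a new column
df = df.set_index('new_index')           #(3) set that column as the index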

How To Rename Columns in a Pandas DataFrame

The last DataFrame operation we'll discuss is how to rename columns.

Columns are an attribute of a pandas DataFrame, which means we can call them and modify them using a simple dot operator. For example:

df.columns
#Returns Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

The assignment operator is the best way to modify this attribute:

df.columns = [1, 2, 3, 4, 5]
df

How to Deal With Missing Data in Pandas DataFrames

In an ideal world we would always work with perfect data sets. However, this is never the case in practice. There are many cases when working with quantitative data that you will need to drop or modify missing data. We will explore strategies for handling missing data in pandas throughout this section.

The DataFrame We'll Be Using In This Section

We will be using the np.nan attribute to generate NaN values throughout this section.

np.nan
#Returns nan

In this section, we will make use of the following DataFrame:

df = pd.DataFrame(np.array([[1, 5, 1],[2, np.nan, 2],[np.nan, np.nan, 3]]))
df.columns = ['A', 'B', 'C']
df

The Pandas dropna Method

Pandas has a built-in method called dropna. When applied to a DataFrame, the dropna method will remove any rows that contain a NaN value.

Let's apply the dropna method to our df DataFrame as an example:

df.dropna()

Note that like the other DataFrame operations that we have explored, dropna does not modify the original DataFrame unless you either (1) force it to using the = assignment operator or (2) specify inplace=True.

We can also drop any columns that have missing values by passing in the axis=1 argument to the dropna method, like this:

df.dropna(axis=1)

The Pandas fillna Method

In many cases, you will want to replace missing values in a pandas DataFrame instead of dropping them completely. The fillna method is designed for this.

As an example, let's fill every missing value in our DataFrame with the 🔥 emoji:

df.fillna('🔥')

Obviously, there is basically no situation where we would want to replace missing data with an emoji. This was simply an amusing example.

Instead, more commonly we will replace a missing value with either:

  • The average value of each column of the DataFrame

  • The average value of a specific column of the DataFrame

We will demonstrate both below.

To fill the missing values in every column with that column's average value, use the following code:

df.fillna(df.mean())

To fill the missing values within a particular column with the average value from that column, use the following code (this is for column A):

df['A'].fillna(df['A'].mean())

The Pandas groupby Method

In this section, we will be discussing how to use the pandas groupby feature.

What is the Pandas groupby Feature?

Pandas comes with a built-in groupby feature that allows you to group together rows based off of a column and perform an aggregate function on them. For example, you could calculate the sum of all rows that have a value of 1 in the column ID.

For anyone familiar with the SQL language for querying databases, the pandas groupby method is very similar to a SQL GROUP BY statement.

It is easiest to understand the pandas groupby method using an example. We will be using the following DataFrame:

df = pd.DataFrame([ ['Google', 'Sam', 200],
                    ['Google', 'Charlie', 120],
                    ['Salesforce','Ralph', 125],
                    ['Salesforce','Emily', 250],
                    ['Adobe','Rosalynn', 150],
                    ['Adobe','Chelsea', 500]])
df.columns = ['Organization', 'Salesperson Name', 'Sales']
df

This DataFrame contains sales information for three separate organizations: Google, Salesforce, and Adobe. We will use the groupby method to get summary sales data for each specific organization.

To start, we will need to create a groupby object. This is a data structure that tells Python which column you'd like to group the DataFrame by. In our case, it is the Organization column, so we create a groupby object like this:

df.groupby('Organization')

If you see an output that looks like this, you will know that you have created the object successfully:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>

Once the groupby object has been created, you can call operations on that object to create a DataFrame with summary information on the Organization groups. A few examples are below:

df.groupby('Organization').mean()
#The mean (or average) of the sales column
df.groupby('Organization').sum()
#The sum of the sales column
df.groupby('Organization').std()
#The standard deviation of the sales column

Note that since all of the operations above are numerical, they will automatically ignore the Salesperson Name column, because it only contains strings.
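Note that this silent-ignore behavior depends on your pandas version: newer releases (pandas 2.0 and later) instead raise an error for non-numeric columns unless you pass numeric_only=True. A version-proof sketch is to select the numeric column explicitly before aggregating:

df.groupby('Organization')['Sales'].mean()
#Computes the mean of the Sales column only, on any pandas version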

Here are a few other aggregate functions that work well with pandas' groupby method:

df.groupby('Organization').count()
#Counts the number of observations
df.groupby('Organization').max()
#Returns the maximum value
df.groupby('Organization').min()
#Returns the minimum value

Using groupby With The describe Method

One very useful tool when working with pandas DataFrames is the describe method, which returns useful information for every category that the groupby function is working with.

This is best learned through an example. I've combined the groupby and describe methods below:

df.groupby('Organization').describe()

The output contains summary statistics (count, mean, standard deviation, min, quartiles, and max) of the Sales column for each organization.

The Pandas concat Method

In this section, we will learn how to concatenate pandas DataFrames. This will be a brief section, but it is an important concept nonetheless. Let's dig in!

The DataFrames We'll Use In This Section

To demonstrate how to concatenate pandas DataFrames, I will be using the following 3 example DataFrames:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])

How To Concatenate Pandas DataFrames

Anyone who has taken my Introduction to Python course will remember that string concatenation means adding one string to the end of another string. An example of string concatenation is below.

str1 = "Hello "
str2 = "World!"
str1 + str2
#Returns 'Hello World!'

DataFrame concatenation is quite similar. It means adding one DataFrame to the end of another DataFrame.

In order for us to perform DataFrame concatenation, we should have two DataFrames with the same columns. An example is below:

pd.concat([df1, df2, df3])

By default, pandas will concatenate along axis=0, which means that it's adding rows, not columns.

If you want to add columns instead, pass in axis=1 as a keyword argument to the concat function.

pd.concat([df1, df2, df3], axis=1)

In our case, this creates a very ugly DataFrame with many missing values, because the three DataFrames have non-overlapping indices (0-3, 4-7, and 8-11).

The Pandas merge Method

In this section, we'll learn how to merge pandas DataFrames.

The DataFrames We Will Be Using In This Section

In this section, we will be using the following two pandas DataFrames:

import pandas as pd

leftDataFrame = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

rightDataFrame = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

The columns A, B, C, and D have real data in them, while the column key has a key that is common between both DataFrames. To merge two DataFrames means to connect them along one column that they both have in common.

How To Merge Pandas DataFrames

You can merge two pandas DataFrames along a common column using the merge method. For anyone that is familiar with the SQL programming language, this is very similar to performing an inner join in SQL.

Do not worry if you are unfamiliar with SQL, because merge syntax is actually very straightforward. It looks like this:

pd.merge(leftDataFrame, rightDataFrame, how='inner', on='key')

Let's break down the four arguments we passed into the merge method:

  1. leftDataFrame: This is the DataFrame that we'd like to merge on the left.

  2. rightDataFrame: This is the DataFrame that we'd like to merge on the right.

  3. how='inner': This is the type of merge that the operation is performing. There are multiple types of merges, but we will only be covering inner merges in this course (see the brief sketch after this list for how the other types behave).

  4. on='key': This is the column that you'd like to perform the merge on. Since key was the only column in common between the two DataFrames, it was the only option that we could use to perform the merge.
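For reference - a hedged sketch of behavior beyond what this course covers - the other common values of how mirror their SQL counterparts:

pd.merge(leftDataFrame, rightDataFrame, how='outer', on='key')
#Keeps keys found in either DataFrame, filling gaps with NaN
pd.merge(leftDataFrame, rightDataFrame, how='left', on='key')
#Keeps every key from leftDataFrame, whether or not it appears on the right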

The Pandas join Method

In this section, you will learn how to join pandas DataFrames.

The DataFrames We Will Be Using In This Section

We will be using the following two DataFrames in this section:

leftDataFrame = pd.DataFrame({  'A': ['A0', 'A1', 'A2', 'A3'],
                                'B': ['B0', 'B1', 'B2', 'B3']},
                                index=['K0', 'K1', 'K2', 'K3'])

rightDataFrame = pd.DataFrame({ 'C': ['C0', 'C1', 'C2', 'C3'],
                                'D': ['D0', 'D1', 'D2', 'D3']},
                                index=['K0', 'K1', 'K2', 'K3'])

If these look familiar, it's because they are! These are nearly the same DataFrames as we used when learning how to merge pandas DataFrames. A key difference is that instead of key being its own column, it is now the index of the DataFrame. You can think of these DataFrames as being those from the last section after executing .set_index('key').

How To Join Pandas DataFrames

Joining pandas DataFrames is very similar to merging pandas DataFrames except that the keys on which you'd like to combine are in the index instead of contained within a column.

To join these two DataFrames, we can use the following code:

leftDataFrame.join(rightDataFrame)

Other Common Operations in Pandas

This section will explore common operations in the pandas Python library. The purpose of this section is to explore important pandas operations that have not fit into any of the sections we've discussed so far.

The DataFrame We Will Use In This Section

I will be using the following DataFrame in this section:

df = pd.DataFrame({'col1':['A','B','C','D'],
                   'col2':[2,7,3,7],
                   'col3':['fgh','rty','asd','qwe']})

How To Find Unique Values in a Pandas Series

Pandas has an excellent method called unique that can be used to find unique values within a pandas Series. Note that this method only works on Series and not on DataFrames. If you try to apply this method to a DataFrame, you will encounter an error:

df.unique()
#Returns AttributeError: 'DataFrame' object has no attribute 'unique'

However, since the columns of a pandas DataFrame are each a Series, we can apply the unique method to a specific column, like this:

df['col2'].unique()
#Returns array([2, 7, 3])

Pandas also has a separate nunique method that counts the number of unique values in a Series and returns that value as an integer. For example:

df['col2'].nunique()
#Returns 3

Interestingly, the nunique method returns exactly the same value as len(unique()), but it is a common enough operation that the pandas community decided to create a specific method for this use case.
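You can verify the equivalence with our example DataFrame:

len(df['col2'].unique())
#Returns 3, the same value as df['col2'].nunique()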

How To Count The Occurrence of Each Value in a Pandas Series

Pandas has a method called value_counts that allows you to easily count the number of times each observation occurs. An example is below:

df['col2'].value_counts()
"""
Returns:
7    2
2    1
3    1
Name: col2, dtype: int64
"""

How To Use The Pandas apply Method

The apply method is one of the most powerful methods available in the pandas library. It allows you to apply a custom function to every element of a pandas Series.

As an example, imagine that we had the following function exponentify that takes in an integer and raises it to the power of itself:

def exponentify(x):
    return x**x

The apply method allows you to easily apply the exponentify function to each element of the Series:

df['col2'].apply(exponentify)
"""
Returns:
0         4
1    823543
2        27
3    823543
Name: col2, dtype: int64
"""

The apply method can also be used with built-in functions like len (although it is definitely more powerful when used with custom functions). An example of the len function being used in conjunction with apply is below:

df['col3'].apply(len)
"""
Returns:
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
"""

How To Sort A Pandas DataFrame

You can sort a pandas DataFrame by the values of a particular column using the sort_values method. As an example, if you wanted to sort by col2 in our DataFrame df, you would run the following command:

df.sort_values('col2')

This reorders the rows so that col2 ascends (2, 3, 7, 7), which means the original row indices now appear in the order 0, 2, 1, 3.

There are two things to note from this output:

  1. As you can see, each row preserves its index, which means the index is now out of order.

  2. As with the other DataFrame methods, this does not actually modify the original DataFrame unless you force it to using the = assignment operator or by passing in inplace=True.
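As an aside, sort_values also accepts an ascending parameter if you want the largest values first; a minimal sketch:

df.sort_values('col2', ascending=False)
#Sorts the DataFrame by col2 from largest to smallest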

Local Data Input and Output (I/O) in Pandas

In this section, we will begin exploring data input and output with the pandas Python library.

The Files We Will Be Working With In This Section

We will be working with different files containing stock prices for Facebook (FB), Amazon (AMZN), Google (GOOG), and Microsoft (MSFT) in this section. To download these files, download the entire GitHub repository for this course. The files used in this section can be found in the stock_prices folder of the repository.

You'll want to save these files in the same directory as your Jupyter Notebook for this section. The easiest way to do this is to download the GitHub repository, and then open your Jupyter Notebook in the stock_prices folder of the repository.

How To Import .csv Files Using Pandas

We can import .csv files into a pandas DataFrame using the read_csv method, like this:

import pandas as pd
pd.read_csv('stock_prices.csv')

As you'll see, this creates (and displays) a new pandas DataFrame containing the data from the .csv file.

You can also assign this new DataFrame to a variable to be referenced later using the normal = assignment operator:

new_data_frame = pd.read_csv('stock_prices.csv')

There are a number of read methods included with the pandas programming library. If you are trying to import data from an external document, then it is likely that pandas has a built-in method for this.

A few examples of different read methods are below:

pd.read_json()
pd.read_html()
pd.read_excel()

We will explore some of these methods later in this section.

If we wanted to import a .csv file that was not directly in our working directory, we need to modify the syntax of the read_csv method slightly.

If the file is in a folder deeper than the one you're working in now, you need to specify the full path of the file in the read_csv method argument. As an example, if the stock_prices.csv file was contained in a folder called new_folder, then we could import it like this:

new_data_frame = pd.read_csv('./new_folder/stock_prices.csv')

For those unfamiliar with directory notation, the . at the start of the filepath indicates the current directory. Similarly, .. indicates one directory above the current directory, and ../.. indicates two directories above the current directory.

This syntax (using periods) is exactly how we reference (and import) files that are above our current working directory. As an example, open a Jupyter Notebook inside the new_folder folder, and place stock_prices.csv in the parent folder. With this file layout, you could import the stock_prices.csv file using the following command:

new_data_frame = pd.read_csv('../stock_prices.csv')

Note that this directory syntax is the same for all types of file imports, so we will not be revisiting how to import files from different directories when we explore different import methods later in this course.

How To Export .csv Files Using Pandas

To demonstrate how to save a new .csv file, let's first create a new DataFrame. Specifically, let's fill a DataFrame with 3 columns and 50 rows of random data using the np.random.randn method:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50,3))

Now that we have a DataFrame, we can save it using the to_csv method. This method takes in the name of the new file as its argument.

df.to_csv('my_new_csv.csv')

You will notice that if you run the code above, the new .csv file will begin with an unlabeled column that contains the index of the DataFrame (you can see this by opening the .csv in Microsoft Excel).

In many cases, this is undesirable. To remove the blank index column, pass in index=False as a second argument to the to_csv method, like this:

df.to_csv('my_new_csv.csv', index=False)

The new .csv file does not have the unlabeled index column.

The read_csv and to_csv methods make it very easy to import and export data from .csv files using pandas. We will see later in this section that for every read method that allows us to import data, there is usually a corresponding to method that allows us to save that data!

How To Import .json Files Using Pandas

If you are not experienced in working with large datasets, then you may not be familiar with the JSON file type.

JSON stands for JavaScript Object Notation. JSON files are very similar to Python dictionaries.

JSON files are one of the most commonly-used data types among software developers because they can be manipulated using basically every programming language.
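To make the dictionary analogy concrete, here is a hedged, hypothetical miniature of what a DataFrame-shaped JSON file can look like (the tickers and prices below are made up, not the actual contents of stock_prices.json):

from io import StringIO
json_text = '{"FB": {"0": 150.0, "1": 152.3}, "AMZN": {"0": 1800.0, "1": 1795.5}}'
pd.read_json(StringIO(json_text))
#Parses the JSON into a 2-row DataFrame with columns FB and AMZN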

Pandas has a method called read_json that makes it very easy to import JSON files as a pandas DataFrame. An example is below.

json_data_frame = pd.read_json('stock_prices.json')

We'll learn how to export JSON files next.

How To Export .json Files Using Pandas

As I mentioned earlier, there is generally a to method for every read method. This means that we can save a DataFrame to a JSON file using the to_json method.

As an example, let's take the randomly-generated DataFrame df from earlier in this section and save it as a JSON file in our local directory:

df.to_json('my_new_json.json')

We'll learn how to work with Excel files - which have the file extension .xlsx - next.

How To Import .xlsx Files Using Pandas

Pandas' read_excel method makes it very easy to import data from an Excel document into a pandas DataFrame:

new_data_frame = pd.read_excel('stock_prices.xlsx')

Unlike the read_csv and read_json methods that we explored earlier in this section, the read_excel method can accept a second argument. The reason why read_excel accepts multiple arguments is that Excel spreadsheets can contain multiple sheets. The second argument specifies which sheet you are trying to import and is called sheet_name.

As an example, if our stock_prices.xlsx file had a second sheet called Sheet2, you would import that sheet to a pandas DataFrame like this:

new_data_frame = pd.read_excel('stock_prices.xlsx', sheet_name='Sheet2')

If you do not specify any value for sheet_name, then read_excel will import the first sheet of the Excel spreadsheet by default.

While importing Excel documents, it is very important to note that pandas only imports data. It cannot import other Excel capabilities like formatting, formulas, or macros. Trying to import data from an Excel document that has these features may cause pandas to crash.

How To Export .xlsx Files Using Pandas

Exporting Excel files is very similar to importing Excel files, except we use to_excel instead of read_excel. An example is below using our randomly-generated df DataFrame:

df.to_excel('my_new_excel_file.xlsx')

Like read_excel, to_excel accepts a second argument called sheet_name that allows you to specify the name of the sheet that you're saving. For example, we could have named the sheet of the new .xlsx file My New Sheet! by passing it into the to_excel method like this:

df.to_excel('my_new_excel_file.xlsx', sheet_name='My New Sheet!')

If you do not specify a value for sheet_name, then the sheet will be named Sheet1 by default (just like when you create a new Excel document using the actual application).

Remote Data Input and Output (I/O) in Pandas

In the last section of this course, we learned how to import data from .csv, .json, and .xlsx files that were saved on our local computer. We will follow up by showing you how you can import files without actually saving them to your local machine first. This is called remote importing.

What Is Remote Importing and Why Is It Useful?

Remote importing means bringing a file into your Python script without having that file saved on your computer.

On the surface, it may not seem clear why we might want to engage in remote importing. However, it can be very useful.

The reason why remote importing is useful is because, by definition, it means the Python script will continue to function even if the file being imported is not saved on your computer. This means I can send my code to colleagues or friends and it will still function properly.

Throughout the rest of this section, I will demonstrate how to perform remote imports in pandas for .csv, .json, and .xlsx files.

How To Import Remote .csv Files

First, navigate to this course's GitHub Repository. Open up the stock_prices folder. Click on the file stock_prices.csv and then click the button for the Raw file.

This will take you to a new page that has the data from the .csv file contained within stock_prices.csv.

To import this remote file into your Python script, you must first copy its URL to your clipboard. You can do this by either (1) highlighting the entire URL, right-clicking the selected text, and clicking copy, or (2) highlighting the entire URL and typing CTRL+C on your keyboard.

The URL will look like this:

https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.csv

You can pass this URL into the read_csv method to import the dataset into a pandas DataFrame without saving the dataset to your computer first:

pd.read_csv('https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.csv')

How To Import Remote .json Files

We can import remote .json files in a similar fashion to .csv files.

First, grab the raw URL from GitHub. It will look like this:

https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.json

Next, pass this URL into the read_json method like this:

pd.read_json('https://raw.githubusercontent.com/nicholasmccullum/advanced-python/master/stock_prices/stock_prices.json')

How To Import Remote .xlsx Files

We can import remote .xlsx files in a similar fashion to .csv and .json files. Note that you will need to click in a slightly different place on the GitHub interface. Specifically, you'll need to right-click 'View Raw' and select 'Copy Link Address.'

The raw URL will look like this:

https://github.com/nicholasmccullum/advanced-python/blob/master/stock_prices/stock_prices.xlsx?raw=true

Then, pass this URL into the read_excel method, like this:

pd.read_excel('https://github.com/nicholasmccullum/advanced-python/blob/master/stock_prices/stock_prices.xlsx?raw=true')

The Downsides to Remote Importing

Remote importing means that you do not need to first save the file being imported onto your local computer, which is an unquestionable benefit.

However, remote importing also has two downsides:

  1. You must have an Internet connection to perform remote imports

  2. Pinging the URL to retrieve the dataset is fairly time-consuming, which means that performing remote imports will slow the speed of your Python code

Final Thoughts & Special Offer

Thanks for reading this article on pandas, which is one of my favorite Python packages and a must-know library for every Python developer.

This tutorial is an excerpt from my course Python For Finance and Data Science. If you're interested in learning more core Python skills, the course is 50% off for the first 50 freeCodeCamp readers that sign up - click here to get your discounted course now!

Source: https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/