Pythonic Data Cleaning With NumPy and Pandas

Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. In fact, a lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job.

Therefore, if you are just stepping into this field or planning to step into this field, it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

In this tutorial, we’ll leverage Python’s Pandas and NumPy libraries to clean data.

We’ll cover the following:

  • Dropping unnecessary columns in a DataFrame
  • Changing the index of a DataFrame
  • Using .str() methods to clean columns
  • Using the DataFrame.applymap() function to clean the entire dataset, element-wise
  • Renaming columns to a more recognizable set of labels
  • Skipping unnecessary rows in a CSV file

Here are the datasets that we will be using:

  • BL-Flickr-Images-Book.csv – A CSV file containing information about books from the British Library
  • university_towns.txt – A text file containing names of college towns in every US state
  • olympics.csv – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

You can download the datasets from Real Python’s GitHub repository in order to follow the examples here.

Note: I recommend using Jupyter Notebooks to follow along.

This tutorial assumes a basic understanding of the Pandas and NumPy libraries, including Pandas' workhorse Series and DataFrame objects, common methods that can be applied to these objects, and familiarity with NumPy's NaN values.

Let’s import the required modules and get started!

>>> import pandas as pd
>>> import numpy as np

Dropping Columns in a DataFrame

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.

In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the drop() function. Let’s look at a simple example where we drop a number of columns from a DataFrame.

First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:

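>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')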

When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.

We can drop these columns in the following way:

>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']

>>> df.drop(to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:

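>>> df.columns
Index(['Identifier', 'Place of Publication', 'Date of Publication',
       'Publisher', 'Title', 'Author', 'Flickr URL'],
      dtype='object')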

Alternatively, we could also remove the columns by passing them to the columns parameter directly instead of separately specifying the labels to be removed and the axis where Pandas should look for the labels:

>>> df.drop(columns=to_drop, inplace=True)

This syntax is more intuitive and readable. What we’re trying to do here is directly apparent.

If you know in advance which columns you’d like to retain, another option is to pass them to the usecols argument of pd.read_csv.

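For instance, a read that keeps only a handful of fields might look like this (the particular selection here is just illustrative):

>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv',
...                  usecols=['Identifier', 'Title', 'Author'])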

Changing the Index of a DataFrame

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

For example, in the dataset used in the previous section, it can be expected that when a librarian searches for a record, they may input the unique identifier (values in the Identifier column) for a book:

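>>> df['Identifier'].is_unique
True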

Let’s replace the existing index with this column using set_index:

>>> df = df.set_index('Identifier')
>>> df.head()
                Place of Publication Date of Publication  \
206                           London         1879 [1878]
216         London; Virtue & Yorston                1868
218                           London                1869
472                           London                1851
480                           London                1857

                        Publisher  \
206              S. Tinsley & Co.
216                  Virtue & Co.
218         Bradbury, Evans & Co.
472                 James Darling
480          Wertheim & Macintosh

                                                        Title     Author  \
206                         Walter Forbes. [A novel.] By A. A      A. A.
216         All for Greed. [A novel. The dedication signed...  A., A. A.
218         Love the Avenger. By the author of “All for Gr...  A., A. A.
472         Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.
480         [The World in which I live, and my place in it...  A., E. S.

                                                   Flickr URL
206         http://www.flickr.com/photos/britishlibrary/ta...
216         http://www.flickr.com/photos/britishlibrary/ta...
218         http://www.flickr.com/photos/britishlibrary/ta...
472         http://www.flickr.com/photos/britishlibrary/ta...
480         http://www.flickr.com/photos/britishlibrary/ta...

Technical Detail: Unlike primary keys in SQL, a Pandas Index doesn’t make any guarantee of being unique, although many indexing and merging operations will notice a speedup in runtime if it is.

We can access each record in a straightforward way with loc[]. Although loc[] may not have all that intuitive of a name, it allows us to do label-based indexing, which is the labeling of a row or record without regard to its position:

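>>> df.loc[206]
Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object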

In other words, 206 is the first label of the index. To access it by position, we could use df.iloc[0], which does position-based indexing.

Technical Detail: .loc[] is technically a class instance and has some special syntax that doesn’t conform exactly to most plain-vanilla Python instance methods.

Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we have changed the index to the values in Identifier.

You may have noticed that we reassigned the variable to the object returned by the method with df = df.set_index(...). This is because, by default, the method returns a modified copy of our object and does not make the changes directly to the object. We can avoid this by setting the inplace parameter:

>>> df.set_index('Identifier', inplace=True)

Tidying up Fields in the Data

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

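A quick check of the dtypes confirms this:

>>> df.dtypes
Place of Publication    object
Date of Publication     object
Publisher               object
Title                   object
Author                  object
Flickr URL              object
dtype: object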

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

>>> df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905           1888
1929    1839, 38-54
2836        [1897?]
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

  • Remove the extra dates in square brackets, wherever present: 1879 [1878]
  • Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
  • Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
  • Convert the string nan to NumPy’s NaN value

Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:

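regex = r'^(\d{4})'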

The regular expression above is meant to find any four digits at the beginning of a string, which suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)

Let’s see what happens when we run this regex across our dataset:

>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

Not familiar with regex? You can inspect the expression above at regex101.com and read more at the Python Regular Expressions HOWTO.

Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:

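>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')

The dtype is float64 rather than an integer type because the entries that didn't match the pattern became NaN, and NaN forces a float dtype.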

This results in about one in every ten values being missing, which is a small price to pay for now being able to do computations on the remaining valid values:

>>> df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986

Great! That’s done!

Combining str Methods with NumPy to Clean Columns

Above, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to access speedy string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().

To clean the Place of Publication field, we can combine Pandas str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() macro. It has the following syntax:

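>>> np.where(condition, then, else)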

Here, condition is either an array-like object or a boolean mask. then is the value to be used if condition evaluates to True, and else is the value to be used otherwise.

Essentially, .where() takes each element in the object used for condition, checks whether that particular element evaluates to True in the context of the condition, and returns an ndarray containing then or else, depending on which applies.

It can be nested into a compound if-then statement, allowing us to compute values based on multiple conditions:

>>> np.where(condition1, x1,
        np.where(condition2, x2,
            np.where(condition3, x3, ...)))

We’ll be making use of these two functions to clean Place of Publication since this column has string objects. Here are the contents of the column:

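>>> df['Place of Publication'].head()
Identifier
206                      London
216    London; Virtue & Yorston
218                      London
472                      London
480                      London
Name: Place of Publication, dtype: object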

We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’.

Let’s take a look at two specific entries:

>>> df.loc[4157862]
Place of Publication                                  Newcastle-upon-Tyne
Date of Publication                                                  1867
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                                                        T.  Fordyce
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

>>> df.loc[4159587]
Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                  1834
Publisher                                                Mackenzie & Dent
Title                   An historical, topographical and descriptive v...
Author                                               E. (Eneas) Mackenzie
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place while the other does not.

To clean this column in one sweep, we can use str.contains() to get a boolean mask.

We clean the column as follows:

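>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> oxford = pub.str.contains('Oxford')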

We combine them with np.where:

>>> df['Place of Publication'] = np.where(london, 'London',
...                                       np.where(oxford, 'Oxford',
...                                                pub.str.replace('-', ' ')))

>>> df['Place of Publication'].head()
Identifier
206    London
216    London
218    London
472    London
480    London
Name: Place of Publication, dtype: object

Here, the np.where function is called in a nested structure, with condition being a Series of booleans obtained with str.contains(). The contains() method works similarly to the built-in in keyword used to find the occurrence of an entity in an iterable (or substring in a string).

The replacement to be used is a string representing our desired place of publication. We also replace hyphens with a space with str.replace() and reassign to the column in our DataFrame.

Although there is more dirty data in this dataset, we will discuss only these two columns for now.

Let’s have a look at the first five entries, which look a lot crisper than when we started out:

Note: At this point, Place of Publication would be a good candidate for conversion to a Categorical dtype, because we can encode the fairly small unique set of cities with integers. (The memory usage of a Categorical is proportional to the number of categories plus the length of the data; an object dtype is a constant times the length of the data.)

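That conversion is a one-liner, sketched here:

>>> df['Place of Publication'] = df['Place of Publication'].astype('category')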

Cleaning the Entire Dataset Using the applymap Function

In certain situations, you will see that the “dirt” is not localized to one column but is more spread out.

There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. Pandas' .applymap() method is similar to the built-in map() function and simply applies a function to all the elements in a DataFrame.

Let’s look at an example. We will create a DataFrame out of the “university_towns.txt” file:

$ head Datasets/university_towns.txt
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]

We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.

We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

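>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
...     for line in file:
...         if '[edit]' in line:
...             # Remember the most recent state line
...             state = line
...         else:
...             # Pair each town with the state it falls under
...             university_towns.append((state, line))

Note that each raw line is appended untouched, which is why the '[edit]' markers and trailing newlines still show up in the DataFrame below.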

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.

The resulting DataFrame looks like this:

>>> towns_df = pd.DataFrame(university_towns,
...                         columns=['State', 'RegionName'])

>>> towns_df.head()
              State                                         RegionName
0  Alabama[edit]\n                    Auburn (Auburn University)[1]\n
1  Alabama[edit]\n           Florence (University of North Alabama)\n
2  Alabama[edit]\n  Jacksonville (Jacksonville State University)[2]\n
3  Alabama[edit]\n       Livingston (University of West Alabama)[2]\n
4  Alabama[edit]\n         Montevallo (University of Montevallo)[2]\n

While we could have cleaned these strings in the for loop above, Pandas makes it easy. We only need the state name and the town name and can remove everything else. While we could use Pandas’ .str() methods again here, we could also use applymap() to map a Python callable to each element of the DataFrame.

We have been using the term element, but what exactly do we mean by it? Consider the following “toy” DataFrame:

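A two-by-two frame is enough to make the point:

>>> toy = pd.DataFrame({0: ['Mock', 'Python'], 1: ['Dataset', 'Pandas']})
>>> toy
        0        1
0    Mock  Dataset
1  Python   Pandas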

In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element. Therefore, applymap() will apply a function to each of these independently. Let’s define that function:

>>> def get_citystate(item):
...     if ' (' in item:
...         return item[:item.find(' (')]
...     elif '[' in item:
...         return item[:item.find('[')]
...     else:
...         return item

Pandas’ .applymap() only takes one parameter, which is the function (callable) that should be applied to each element:

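>>> towns_df = towns_df.applymap(get_citystate)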

First, we define a Python function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether there’s a ( or [ in the element or not.

Depending on the check, values are returned accordingly by the function. Finally, the applymap() function is called on our object. Now the DataFrame is much neater:

>>> towns_df.head()
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo

The applymap() method took each element from the DataFrame, passed it to the function, and the original value was replaced by the returned value. It’s that simple!

Technical Detail: While it is a convenient and versatile method, .applymap can have significant runtime for larger datasets, because it maps a Python callable to each individual element. In some cases, it can be more efficient to do vectorized operations that utilize Cython or NumPy (which, in turn, makes calls in C) under the hood.

Renaming Columns and Skipping Rows

Often, the datasets you’ll work with will have either column names that are not easy to understand, or unimportant information in the first few and/or last rows, such as definitions of the terms in the dataset, or footnotes.

In that case, we’d want to rename columns and skip certain rows so that we can drill down to necessary information with correct and sensible labels.

To demonstrate how we can go about doing this, let’s first take a glance at the initial five rows of the “olympics.csv” dataset:

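$ head -n 5 Datasets/olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15

(The listing above is an approximation based on the DataFrame that follows; the key point is the bare 0-15 row sitting on top of the real header.)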

Now, we’ll read it into a Pandas DataFrame:

>>> olympics_df = pd.read_csv('Datasets/olympics.csv')
>>> olympics_df.head()
                   0         1     2     3     4      5         6     7     8  \
0                NaN  ? Summer  01 !  02 !  03 !  Total  ? Winter  01 !  02 !
1  Afghanistan (AFG)        13     0     0     2      2         0     0     0
2      Algeria (ALG)        12     5     2     8     15         3     0     0
3    Argentina (ARG)        23    18    24    28     70        18     0     0
4      Armenia (ARM)         5     1     2     9     12         6     0     0

      9     10       11    12    13    14              15
0  03 !  Total  ? Games  01 !  02 !  03 !  Combined total
1     0      0       13     0     0     2               2
2     0      0       15     5     2     8              15
3     0      0       41    18    24    28              70
4     0      0       11     1     2     9              12

This is messy indeed! The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.

Also, if we were to go to the source of this dataset, we’d see that NaN above should really be something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be “Gold”, and so on.

Therefore, we need to do two things:

  • Skip one row and set the header as the first (0-indexed) row
  • Rename the columns

We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function.

This function takes a lot of optional parameters, but in this case we only need one (header) to remove the 0th row:

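>>> olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)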

We now have the correct row set as the header and all unnecessary rows removed. Take note of how Pandas has changed the name of the column containing the name of the countries from NaN to Unnamed: 0.

To rename the columns, we will make use of a DataFrame’s rename() method, which allows you to relabel an axis based on a mapping (in this case, a dict).

Let’s start by defining a dictionary that maps current column names (as keys) to more usable ones (the dictionary’s values):

>>> new_names = {'Unnamed: 0': 'Country',
...              '? Summer': 'Summer Olympics',
...              '01 !': 'Gold',
...              '02 !': 'Silver',
...              '03 !': 'Bronze',
...              '? Winter': 'Winter Olympics',
...              '01 !.1': 'Gold.1',
...              '02 !.1': 'Silver.1',
...              '03 !.1': 'Bronze.1',
...              '? Games': '# Games',
...              '01 !.2': 'Gold.2',
...              '02 !.2': 'Silver.2',
...              '03 !.2': 'Bronze.2'}

We call the rename() function on our object:

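>>> olympics_df.rename(columns=new_names, inplace=True)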

Setting inplace to True specifies that our changes be made directly to the object. Let’s see if this checks out:

>>> olympics_df.head()
                   Country  Summer Olympics  Gold  Silver  Bronze  Total  \
0        Afghanistan (AFG)               13     0       0       2      2
1            Algeria (ALG)               12     5       2       8     15
2          Argentina (ARG)               23    18      24      28     70
3            Armenia (ARM)                5     1       2       9     12
4  Australasia (ANZ) [ANZ]                2     3       4       5     12

   Winter Olympics  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  \
0                0       0         0         0        0       13       0
1                3       0         0         0        0       15       5
2               18       0         0         0        0       41      18
3                6       0         0         0        0       11       1
4                0       0         0         0        0        2       3

   Silver.2  Bronze.2  Combined total
0         0         2               2
1         2         8              15
2        24        28              70
3         2         9              12
4         4         5              12

Python Data Cleaning: Recap and Resources

In this tutorial, you learned how you can drop unnecessary information from a dataset using the drop() function, as well as how to set an index for your dataset so that items in it can be referenced easily.

Moreover, you learned how to clean object fields with the .str() accessor and how to clean the entire dataset using the applymap() method. Lastly, we explored how to skip rows in a CSV file and rename columns using the rename() method.

Knowing about data cleaning is very important, because it is a big part of data science. You now have a basic understanding of how Pandas and NumPy can be leveraged to clean datasets!

Check out the links below to find additional resources that will help you on your Python data science journey:

  • The Pandas documentation
  • The NumPy documentation
  • Python for Data Analysis by Wes McKinney, the creator of Pandas
  • Pandas Cookbook by Ted Petrou, a data science trainer and consultant

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are: Malay Agarwal (Author) and Brad Solomon (Editor).

Translated from: https://www.pybloggers.com/2018/03/pythonic-data-cleaning-with-numpy-and-pandas/
