Pandas Basics for Data Analysis: Data Cleaning -- Handling Missing Data, Handling Duplicate Data, and Replacing Values

Theory:

Clarify the problem:

Does the data need to be modified?

Is there anything that needs changing?

How should the data be adjusted so that it suits the analysis and mining that follow?

Characteristics of data cleaning:

It is an iterative process; in real projects these cleaning operations may need to be performed more than once.

Handling missing data:

Check for missing values with ser_obj.isnull() or df_obj.isnull(); combine with any() to test whether a row/column contains any missing value.

1. Drop missing data: dropna(); note the inplace parameter.

2. Fill missing data: fillna(value), which fills missing entries with value.

ffill() and bfill(): when using ffill() or bfill() in a project, pay attention to how the data is ordered.

df.ffill(), or df.fillna(method="ffill"), fills forward using the previous valid value.

df.bfill(), or df.fillna(method="bfill"), fills backward using the next valid value.

3. axis: defaults to axis=0. 0 operates vertically, i.e., down each column of the column index; 1 operates horizontally, i.e., across each row of the row index.
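
These options can be sketched on a toy DataFrame (the column names `a` and `b` are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 5.0, 6.0]})

col_has_nan = df.isnull().any()   # per column: both a and b contain NaN
row_drop = df.dropna()            # axis=0 (default): drops rows 0 and 1
col_drop = df.dropna(axis=1)      # drops both columns, since each holds a NaN
filled = df.fillna(0)             # every NaN replaced by 0
```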

Handling duplicate data:

duplicated(subset) returns a boolean Series indicating whether each row is a duplicate.

drop_duplicates(subset, keep) filters out duplicate rows, judging duplicates on the columns given in subset.

By default all columns are compared; specific columns can be selected via the subset parameter.

keep: the default ('first') keeps the first occurrence; 'last' keeps the last row of each duplicate group.

Replacing values:

df.replace(to_replace, value) -- to_replace can be:

A number or string: the value to be replaced, with the new value as the second argument; e.g. df.replace(23, 45) replaces every 23 in df with 45.

A list: the elements of the first list are the values to be replaced, and the elements of the second list are the new values; the two lists must correspond one to one.

A dict: keys are the values to be replaced, values are the new values.

Experiment:

Lesson 5: Pandas Basics for Data Analysis

Data cleaning -- handling missing data

In [1]:

 

import pandas as pd

In [2]:

 

 
# Read the file
filepath = r'C:\Users\ML Learning\Projects\第四章-数据分析预习内容\第四章-数据分析预习内容\第一节-数据分析工具pandas基础\lesson_05\lesson_05\examples\datasets\log.csv'
log_data = pd.read_csv(filepath)
log_data

Out[2]:

  time user video playback position paused volume
0 1469974424 cheryl intro.html 5 False 10.0
1 1469974454 cheryl intro.html 6 NaN NaN
2 1469974544 cheryl intro.html 9 NaN NaN
3 1469974574 cheryl intro.html 10 NaN NaN
4 1469977514 bob intro.html 1 NaN NaN
5 1469977544 bob intro.html 1 NaN NaN
6 1469977574 bob intro.html 1 NaN NaN
7 1469977604 bob intro.html 1 NaN NaN
8 1469974604 cheryl intro.html 11 NaN NaN
9 1469974694 cheryl intro.html 14 NaN NaN
10 1469974724 cheryl intro.html 15 NaN NaN
11 1469974454 sue advanced.html 24 NaN NaN
12 1469974524 sue advanced.html 25 NaN NaN
13 1469974424 sue advanced.html 23 False 10.0
14 1469974554 sue advanced.html 26 NaN NaN
15 1469974624 sue advanced.html 27 NaN NaN
16 1469974654 sue advanced.html 28 NaN 5.0
17 1469974724 sue advanced.html 29 NaN NaN
18 1469974484 cheryl intro.html 7 NaN NaN
19 1469974514 cheryl intro.html 8 NaN NaN
20 1469974754 sue advanced.html 30 NaN NaN
21 1469974824 sue advanced.html 31 NaN NaN
22 1469974854 sue advanced.html 32 NaN NaN
23 1469974924 sue advanced.html 33 NaN NaN
24 1469977424 bob intro.html 1 True 10.0
25 1469977454 bob intro.html 1 NaN NaN
26 1469977484 bob intro.html 1 NaN NaN
27 1469977634 bob intro.html 1 NaN NaN
28 1469977664 bob intro.html 1 NaN NaN
29 1469974634 cheryl intro.html 12 NaN NaN
30 1469974664 cheryl intro.html 13 NaN NaN
31 1469977694 bob intro.html 1 NaN NaN
32 1469977724 bob intro.html 1 NaN NaN

Check for missing values

In [3]:

 

 
log_data.isnull()

Out[3]:

  time user video playback position paused volume
0 False False False False False False
1 False False False False True True
2 False False False False True True
3 False False False False True True
4 False False False False True True
5 False False False False True True
6 False False False False True True
7 False False False False True True
8 False False False False True True
9 False False False False True True
10 False False False False True True
11 False False False False True True
12 False False False False True True
13 False False False False False False
14 False False False False True True
15 False False False False True True
16 False False False False True False
17 False False False False True True
18 False False False False True True
19 False False False False True True
20 False False False False True True
21 False False False False True True
22 False False False False True True
23 False False False False True True
24 False False False False False False
25 False False False False True True
26 False False False False True True
27 False False False False True True
28 False False False False True True
29 False False False False True True
30 False False False False True True
31 False False False False True True
32 False False False False True True

In [4]:

 

log_data.isnull().any()  # per column

Out[4]:

time                 False
user                 False
video                False
playback position    False
paused                True
volume                True
dtype: bool

In [5]:

 

log_data.isnull().any(axis=1)

Out[5]:

0     False
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13    False
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21     True
22     True
23     True
24    False
25     True
26     True
27     True
28     True
29     True
30     True
31     True
32     True
dtype: bool
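
A common next step (not shown in the lesson) is to use this boolean Series as a mask to select exactly the rows that contain missing values; a minimal sketch on a small stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['cheryl', 'bob', 'sue'],
                   'volume': [10.0, np.nan, 5.0]})

incomplete = df[df.isnull().any(axis=1)]   # rows with at least one NaN
complete = df[df.notnull().all(axis=1)]    # rows with no NaN at all
```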

Drop missing values

In [11]:

 

log_data.dropna(subset=['volume'])

Out[11]:

  time user video playback position paused volume
0 1469974424 cheryl intro.html 5 False 10.0
13 1469974424 sue advanced.html 23 False 10.0
16 1469974654 sue advanced.html 28 NaN 5.0
24 1469977424 bob intro.html 1 True 10.0
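
Besides subset, dropna accepts how and thresh for finer control (these parameters are not covered in the lesson above); a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 2.0],
                   'b': [np.nan, np.nan, 3.0]})

any_kept = df.dropna(how='all')  # drops only rows that are entirely NaN (row 1)
two_kept = df.dropna(thresh=2)   # keeps rows with >= 2 non-NaN values (row 2)
```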

Fill missing values

In [8]:

 

log_data.fillna(-1)

Out[8]:

  time user video playback position paused volume
0 1469974424 cheryl intro.html 5 False 10.0
1 1469974454 cheryl intro.html 6 -1 -1.0
2 1469974544 cheryl intro.html 9 -1 -1.0
3 1469974574 cheryl intro.html 10 -1 -1.0
4 1469977514 bob intro.html 1 -1 -1.0
5 1469977544 bob intro.html 1 -1 -1.0
6 1469977574 bob intro.html 1 -1 -1.0
7 1469977604 bob intro.html 1 -1 -1.0
8 1469974604 cheryl intro.html 11 -1 -1.0
9 1469974694 cheryl intro.html 14 -1 -1.0
10 1469974724 cheryl intro.html 15 -1 -1.0
11 1469974454 sue advanced.html 24 -1 -1.0
12 1469974524 sue advanced.html 25 -1 -1.0
13 1469974424 sue advanced.html 23 False 10.0
14 1469974554 sue advanced.html 26 -1 -1.0
15 1469974624 sue advanced.html 27 -1 -1.0
16 1469974654 sue advanced.html 28 -1 5.0
17 1469974724 sue advanced.html 29 -1 -1.0
18 1469974484 cheryl intro.html 7 -1 -1.0
19 1469974514 cheryl intro.html 8 -1 -1.0
20 1469974754 sue advanced.html 30 -1 -1.0
21 1469974824 sue advanced.html 31 -1 -1.0
22 1469974854 sue advanced.html 32 -1 -1.0
23 1469974924 sue advanced.html 33 -1 -1.0
24 1469977424 bob intro.html 1 True 10.0
25 1469977454 bob intro.html 1 -1 -1.0
26 1469977484 bob intro.html 1 -1 -1.0
27 1469977634 bob intro.html 1 -1 -1.0
28 1469977664 bob intro.html 1 -1 -1.0
29 1469974634 cheryl intro.html 12 -1 -1.0
30 1469974664 cheryl intro.html 13 -1 -1.0
31 1469977694 bob intro.html 1 -1 -1.0
32 1469977724 bob intro.html 1 -1 -1.0
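
fillna also accepts a dict mapping column names to fill values, which avoids forcing a single sentinel such as -1 onto columns of different types; a sketch (column names chosen to mirror the log data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'paused': [False, np.nan, True],
                   'volume': [10.0, np.nan, np.nan]})

# Per-column defaults: paused gets False, volume gets 0.0
filled = df.fillna({'paused': False, 'volume': 0.0})
```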

ffill() fills forward from earlier rows; bfill() fills backward from later rows. Sort the data first so that "earlier" and "later" are meaningful.

In [12]:

 

 
# Sort the data
sorted_log_data = log_data.sort_values(['time','user'])
sorted_log_data

Out[12]:

  time user video playback position paused volume
0 1469974424 cheryl intro.html 5 False 10.0
13 1469974424 sue advanced.html 23 False 10.0
1 1469974454 cheryl intro.html 6 NaN NaN
11 1469974454 sue advanced.html 24 NaN NaN
18 1469974484 cheryl intro.html 7 NaN NaN
19 1469974514 cheryl intro.html 8 NaN NaN
12 1469974524 sue advanced.html 25 NaN NaN
2 1469974544 cheryl intro.html 9 NaN NaN
14 1469974554 sue advanced.html 26 NaN NaN
3 1469974574 cheryl intro.html 10 NaN NaN
8 1469974604 cheryl intro.html 11 NaN NaN
15 1469974624 sue advanced.html 27 NaN NaN
29 1469974634 cheryl intro.html 12 NaN NaN
16 1469974654 sue advanced.html 28 NaN 5.0
30 1469974664 cheryl intro.html 13 NaN NaN
9 1469974694 cheryl intro.html 14 NaN NaN
10 1469974724 cheryl intro.html 15 NaN NaN
17 1469974724 sue advanced.html 29 NaN NaN
20 1469974754 sue advanced.html 30 NaN NaN
21 1469974824 sue advanced.html 31 NaN NaN
22 1469974854 sue advanced.html 32 NaN NaN
23 1469974924 sue advanced.html 33 NaN NaN
24 1469977424 bob intro.html 1 True 10.0
25 1469977454 bob intro.html 1 NaN NaN
26 1469977484 bob intro.html 1 NaN NaN
4 1469977514 bob intro.html 1 NaN NaN
5 1469977544 bob intro.html 1 NaN NaN
6 1469977574 bob intro.html 1 NaN NaN
7 1469977604 bob intro.html 1 NaN NaN
27 1469977634 bob intro.html 1 NaN NaN
28 1469977664 bob intro.html 1 NaN NaN
31 1469977694 bob intro.html 1 NaN NaN
32 1469977724 bob intro.html 1 NaN NaN

In [14]:

 

 
sorted_log_data.ffill()

Out[14]:

  time user video playback position paused volume
0 1469974424 cheryl intro.html 5 False 10.0
13 1469974424 sue advanced.html 23 False 10.0
1 1469974454 cheryl intro.html 6 False 10.0
11 1469974454 sue advanced.html 24 False 10.0
18 1469974484 cheryl intro.html 7 False 10.0
19 1469974514 cheryl intro.html 8 False 10.0
12 1469974524 sue advanced.html 25 False 10.0
2 1469974544 cheryl intro.html 9 False 10.0
14 1469974554 sue advanced.html 26 False 10.0
3 1469974574 cheryl intro.html 10 False 10.0
8 1469974604 cheryl intro.html 11 False 10.0
15 1469974624 sue advanced.html 27 False 10.0
29 1469974634 cheryl intro.html 12 False 10.0
16 1469974654 sue advanced.html 28 False 5.0
30 1469974664 cheryl intro.html 13 False 5.0
9 1469974694 cheryl intro.html 14 False 5.0
10 1469974724 cheryl intro.html 15 False 5.0
17 1469974724 sue advanced.html 29 False 5.0
20 1469974754 sue advanced.html 30 False 5.0
21 1469974824 sue advanced.html 31 False 5.0
22 1469974854 sue advanced.html 32 False 5.0
23 1469974924 sue advanced.html 33 False 5.0
24 1469977424 bob intro.html 1 True 10.0
25 1469977454 bob intro.html 1 True 10.0
26 1469977484 bob intro.html 1 True 10.0
4 1469977514 bob intro.html 1 True 10.0
5 1469977544 bob intro.html 1 True 10.0
6 1469977574 bob intro.html 1 True 10.0
7 1469977604 bob intro.html 1 True 10.0
27 1469977634 bob intro.html 1 True 10.0
28 1469977664 bob intro.html 1 True 10.0
31 1469977694 bob intro.html 1 True 10.0
32 1469977724 bob intro.html 1 True 10.0

In [15]:

 

 
sorted_log_data.bfill()

Out[15]:

  time user video playback position paused volume
0 1469974424 cheryl intro.html 5 False 10.0
13 1469974424 sue advanced.html 23 False 10.0
1 1469974454 cheryl intro.html 6 True 5.0
11 1469974454 sue advanced.html 24 True 5.0
18 1469974484 cheryl intro.html 7 True 5.0
19 1469974514 cheryl intro.html 8 True 5.0
12 1469974524 sue advanced.html 25 True 5.0
2 1469974544 cheryl intro.html 9 True 5.0
14 1469974554 sue advanced.html 26 True 5.0
3 1469974574 cheryl intro.html 10 True 5.0
8 1469974604 cheryl intro.html 11 True 5.0
15 1469974624 sue advanced.html 27 True 5.0
29 1469974634 cheryl intro.html 12 True 5.0
16 1469974654 sue advanced.html 28 True 5.0
30 1469974664 cheryl intro.html 13 True 10.0
9 1469974694 cheryl intro.html 14 True 10.0
10 1469974724 cheryl intro.html 15 True 10.0
17 1469974724 sue advanced.html 29 True 10.0
20 1469974754 sue advanced.html 30 True 10.0
21 1469974824 sue advanced.html 31 True 10.0
22 1469974854 sue advanced.html 32 True 10.0
23 1469974924 sue advanced.html 33 True 10.0
24 1469977424 bob intro.html 1 True 10.0
25 1469977454 bob intro.html 1 NaN NaN
26 1469977484 bob intro.html 1 NaN NaN
4 1469977514 bob intro.html 1 NaN NaN
5 1469977544 bob intro.html 1 NaN NaN
6 1469977574 bob intro.html 1 NaN NaN
7 1469977604 bob intro.html 1 NaN NaN
27 1469977634 bob intro.html 1 NaN NaN
28 1469977664 bob intro.html 1 NaN NaN
31 1469977694 bob intro.html 1 NaN NaN
32 1469977724 bob intro.html 1 NaN NaN
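
Note that bfill() leaves the trailing rows above (index 25 onward in time order) as NaN, because there is no later value to copy from. One common pattern, sketched on a small Series, is to chain the two directions:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 2.0, np.nan])

back = s.bfill()          # [1.0, 1.0, 2.0, 2.0, NaN] -- trailing NaN survives
both = s.bfill().ffill()  # [1.0, 1.0, 2.0, 2.0, 2.0]
```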

Handling duplicate data

In [16]:

 

data = pd.DataFrame(
    {'age':[28, 31, 27, 28],
     'gender':['M','M','M','F'],
     'surname':['Liu','Li','Chen','Liu']}
)
data

Out[16]:

  age gender surname
0 28 M Liu
1 31 M Li
2 27 M Chen
3 28 F Liu

Check for duplicate rows

In [17]:

 

 
data.duplicated()

Out[17]:

0    False
1    False
2    False
3    False
dtype: bool

In [18]:

 

 
data.duplicated(subset=['age','surname'])

Out[18]:

0    False
1    False
2    False
3     True
dtype: bool

In [19]:

 

 
data.drop_duplicates(subset=['age','surname'])

Out[19]:

  age gender surname
0 28 M Liu
1 31 M Li
2 27 M Chen

In [23]:

 

 
data.drop_duplicates(subset=['age','surname'],keep='last')

Out[23]:

  age gender surname
1 31 M Li
2 27 M Chen
3 28 F Liu
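
Besides 'first' and 'last', keep=False drops every member of a duplicate group; a sketch with the same toy data:

```python
import pandas as pd

data = pd.DataFrame({'age': [28, 31, 27, 28],
                     'gender': ['M', 'M', 'M', 'F'],
                     'surname': ['Liu', 'Li', 'Chen', 'Liu']})

# Rows 0 and 3 share (28, 'Liu'), so keep=False removes both
unique_only = data.drop_duplicates(subset=['age', 'surname'], keep=False)
```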

Data cleaning -- replacing values

In [24]:

 

 
data = pd.DataFrame(
    {'A':[0, 1, 2, 3, 4],
     'B':[5, 6, 7, 8, 9],
     'C':['a', 'b', 'c', 'd', 'e']}
)
data

Out[24]:

  A B C
0 0 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e

Replace

In [25]:

 

# Replace a single value
data.replace(0,100)

Out[25]:

  A B C
0 100 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e

In [26]:

 

 
# Replace a list of values with one value
data.replace([0, 1, 2, 3],4)

Out[26]:

  A B C
0 4 5 a
1 4 6 b
2 4 7 c
3 4 8 d
4 4 9 e

In [27]:

 

 
# Replace values list-to-list, element-wise
data.replace([0, 1, 2, 3],[4, 3, 2, 1])

Out[27]:

  A B C
0 4 5 a
1 3 6 b
2 2 7 c
3 1 8 d
4 4 9 e

In [29]:

 

 
# Replace according to a dict
data.replace({0:10,1:100})

Out[29]:

  A B C
0 10 5 a
1 100 6 b
2 2 7 c
3 3 8 d
4 4 9 e
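
replace also accepts a nested dict to restrict a replacement to particular columns, and regex=True for pattern-based string replacement (neither form is covered above); a minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 1, 2],
                     'B': [0, 1, 2],
                     'C': ['aa', 'ab', 'bb']})

col_only = data.replace({'A': {0: 100}})         # only column A's 0 becomes 100
by_regex = data.replace(r'^a', 'x', regex=True)  # leading 'a' in strings -> 'x'
```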
