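Does the data need to be modified?
Is there anything that needs to be changed?
How should the data be adjusted so that it is suitable for the analysis and mining that follow?
Data cleaning is an iterative process; in a real project these cleaning operations may need to be run more than once.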
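Check for missing values with ser_obj.isnull() or df_obj.isnull(); combine the result with any() to check whether a row or column contains any missing value.
1. Drop missing data: dropna(). Note the inplace parameter: by default a new object is returned and the original is left unchanged.
2. Fill missing data: fillna(value) fills the missing entries with value.
ffill() and bfill(): when using ffill() or bfill() in a project, pay attention to the order of the rows, because each gap is filled from a neighbouring row.
df.ffill() or df.fillna(method="ffill") fills each gap with the previous valid value. The method argument is deprecated in recent pandas releases, so prefer df.ffill().
df.bfill() or df.fillna(method="bfill") fills each gap with the next valid value. Likewise, prefer df.bfill().
3. axis: the default is axis=0. axis=0 operates vertically, that is, down the rows within each column; axis=1 operates horizontally, that is, across the columns within each row. For dropna(), axis=0 drops rows and axis=1 drops columns.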
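The notes above on inplace and axis can be sketched as follows. The small DataFrame and the variable names are made up for illustration and are not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one missing value, column "b" has none.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# axis=0 (default): drop every row that contains a missing value.
rows_kept = df.dropna()
# axis=1: drop every column that contains a missing value.
cols_kept = df.dropna(axis=1)

# fillna also accepts a dict that maps column names to fill values.
filled = df.fillna({"a": 0.0})

# With inplace=True the object itself is modified and None is returned.
working = df.copy()
result = working.dropna(inplace=True)

print(rows_kept.index.tolist())    # [0, 2]
print(cols_kept.columns.tolist())  # ['b']
print(filled["a"].tolist())        # [1.0, 0.0, 3.0]
print(result, working.shape)       # None (2, 2)
print(df.shape)                    # (3, 2): the original is unchanged
```

Returning a new object (the default) is usually easier to reason about in a notebook, because re-running a cell does not change its input.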
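duplicated(subset) returns a boolean Series indicating whether each row is a duplicate of another row.
drop_duplicates(subset, keep) removes duplicate rows, comparing only the columns listed in subset.
By default all columns are compared; use the subset parameter to restrict the comparison to specific columns.
keep: the default, 'first', keeps the first occurrence of each duplicated row; 'last' keeps the last occurrence.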
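A minimal sketch of how keep changes which rows are flagged. The data is hypothetical, and keep=False (flag every copy of a duplicated row) is a pandas option that these notes do not mention.

```python
import pandas as pd

# Hypothetical data: rows 0 and 3 are identical.
people = pd.DataFrame({
    "age": [28, 31, 27, 28],
    "surname": ["Liu", "Li", "Chen", "Liu"],
})

first = people.duplicated(keep="first")   # later copies are flagged
last = people.duplicated(keep="last")     # earlier copies are flagged
every = people.duplicated(keep=False)     # all copies are flagged

# With keep=False, drop_duplicates keeps only rows that occur exactly once.
unique_only = people.drop_duplicates(keep=False)

print(first.tolist())              # [False, False, False, True]
print(last.tolist())               # [True, False, False, False]
print(every.tolist())              # [True, False, False, True]
print(unique_only.index.tolist())  # [1, 2]
```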
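df.replace(to_replace, value): the to_replace parameter can be:
A number or string: the value to be replaced, followed by the new value. For example, df.replace(23, 45) replaces every 23 in df with 45.
A list: the elements of the first list are the values to be replaced and the elements of the second list are the new values; the two lists must correspond one to one. A single new value may be given instead, in which case every listed value is replaced by it.
A dict: each key is a value to be replaced and the corresponding dict value is the new value.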
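The forms above replace matching values in every column. As a sketch that goes slightly beyond these notes, pandas also accepts a nested dict that limits the replacement to one column. The DataFrame here is hypothetical.

```python
import pandas as pd

# Hypothetical data: the value 0 appears in both columns.
df = pd.DataFrame({
    "A": [0, 1, 2],
    "B": [0, 5, 6],
})

# Scalar form: 0 is replaced wherever it occurs.
everywhere = df.replace(0, 100)

# Nested dict form: {column: {old: new}} replaces only inside column "A".
only_in_a = df.replace({"A": {0: 100}})

print(everywhere["B"].tolist())  # [100, 5, 6]
print(only_in_a["A"].tolist())   # [100, 1, 2]
print(only_in_a["B"].tolist())   # [0, 5, 6]
print(df["A"].tolist())          # [0, 1, 2]: replace returns a new object
```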
In [1]:
import pandas as pd
In [2]:
# Read the file
filepath = r'C:\Users\ML Learning\Projects\第四章-数据分析预习内容\第四章-数据分析预习内容\第一节-数据分析工具pandas基础\lesson_05\lesson_05\examples\datasets\log.csv'
log_data = pd.read_csv(filepath)
log_data
Out[2]:
| | time | user | video | playback position | paused | volume |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
1 | 1469974454 | cheryl | intro.html | 6 | NaN | NaN |
2 | 1469974544 | cheryl | intro.html | 9 | NaN | NaN |
3 | 1469974574 | cheryl | intro.html | 10 | NaN | NaN |
4 | 1469977514 | bob | intro.html | 1 | NaN | NaN |
5 | 1469977544 | bob | intro.html | 1 | NaN | NaN |
6 | 1469977574 | bob | intro.html | 1 | NaN | NaN |
7 | 1469977604 | bob | intro.html | 1 | NaN | NaN |
8 | 1469974604 | cheryl | intro.html | 11 | NaN | NaN |
9 | 1469974694 | cheryl | intro.html | 14 | NaN | NaN |
10 | 1469974724 | cheryl | intro.html | 15 | NaN | NaN |
11 | 1469974454 | sue | advanced.html | 24 | NaN | NaN |
12 | 1469974524 | sue | advanced.html | 25 | NaN | NaN |
13 | 1469974424 | sue | advanced.html | 23 | False | 10.0 |
14 | 1469974554 | sue | advanced.html | 26 | NaN | NaN |
15 | 1469974624 | sue | advanced.html | 27 | NaN | NaN |
16 | 1469974654 | sue | advanced.html | 28 | NaN | 5.0 |
17 | 1469974724 | sue | advanced.html | 29 | NaN | NaN |
18 | 1469974484 | cheryl | intro.html | 7 | NaN | NaN |
19 | 1469974514 | cheryl | intro.html | 8 | NaN | NaN |
20 | 1469974754 | sue | advanced.html | 30 | NaN | NaN |
21 | 1469974824 | sue | advanced.html | 31 | NaN | NaN |
22 | 1469974854 | sue | advanced.html | 32 | NaN | NaN |
23 | 1469974924 | sue | advanced.html | 33 | NaN | NaN |
24 | 1469977424 | bob | intro.html | 1 | True | 10.0 |
25 | 1469977454 | bob | intro.html | 1 | NaN | NaN |
26 | 1469977484 | bob | intro.html | 1 | NaN | NaN |
27 | 1469977634 | bob | intro.html | 1 | NaN | NaN |
28 | 1469977664 | bob | intro.html | 1 | NaN | NaN |
29 | 1469974634 | cheryl | intro.html | 12 | NaN | NaN |
30 | 1469974664 | cheryl | intro.html | 13 | NaN | NaN |
31 | 1469977694 | bob | intro.html | 1 | NaN | NaN |
32 | 1469977724 | bob | intro.html | 1 | NaN | NaN |
Check for missing values
In [3]:
log_data.isnull()
Out[3]:
| | time | user | video | playback position | paused | volume |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | False | False | True | True |
2 | False | False | False | False | True | True |
3 | False | False | False | False | True | True |
4 | False | False | False | False | True | True |
5 | False | False | False | False | True | True |
6 | False | False | False | False | True | True |
7 | False | False | False | False | True | True |
8 | False | False | False | False | True | True |
9 | False | False | False | False | True | True |
10 | False | False | False | False | True | True |
11 | False | False | False | False | True | True |
12 | False | False | False | False | True | True |
13 | False | False | False | False | False | False |
14 | False | False | False | False | True | True |
15 | False | False | False | False | True | True |
16 | False | False | False | False | True | False |
17 | False | False | False | False | True | True |
18 | False | False | False | False | True | True |
19 | False | False | False | False | True | True |
20 | False | False | False | False | True | True |
21 | False | False | False | False | True | True |
22 | False | False | False | False | True | True |
23 | False | False | False | False | True | True |
24 | False | False | False | False | False | False |
25 | False | False | False | False | True | True |
26 | False | False | False | False | True | True |
27 | False | False | False | False | True | True |
28 | False | False | False | False | True | True |
29 | False | False | False | False | True | True |
30 | False | False | False | False | True | True |
31 | False | False | False | False | True | True |
32 | False | False | False | False | True | True |
In [4]:
log_data.isnull().any() # per column
Out[4]:
time False
user False
video False
playback position False
paused True
volume True
dtype: bool
In [5]:
log_data.isnull().any(axis=1) # per row
Out[5]:
0 False
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 False
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 True
22 True
23 True
24 False
25 True
26 True
27 True
28 True
29 True
30 True
31 True
32 True
dtype: bool
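Besides any(), summing the boolean mask gives a count of missing values, which is often more informative than a yes/no answer. The sketch below uses a hypothetical four-row frame shaped like the log data rather than the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the log data.
log = pd.DataFrame({
    "user": ["cheryl", "cheryl", "sue", "bob"],
    "paused": [False, np.nan, np.nan, True],
    "volume": [10.0, np.nan, 5.0, 10.0],
})

# True counts as 1, so sum() counts the missing values.
missing_per_column = log.isnull().sum()
missing_per_row = log.isnull().sum(axis=1)

# Boolean indexing with any(axis=1) selects the rows that have at least one gap.
incomplete_rows = log[log.isnull().any(axis=1)]

print(missing_per_column.to_dict())    # {'user': 0, 'paused': 2, 'volume': 1}
print(missing_per_row.tolist())        # [0, 2, 1, 0]
print(incomplete_rows.index.tolist())  # [1, 2]
```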
Drop missing values
In [11]:
log_data.dropna(subset=['volume'])
Out[11]:
| | time | user | video | playback position | paused | volume |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
13 | 1469974424 | sue | advanced.html | 23 | False | 10.0 |
16 | 1469974654 | sue | advanced.html | 28 | NaN | 5.0 |
24 | 1469977424 | bob | intro.html | 1 | True | 10.0 |
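Note that row 16 survives the call above even though paused is missing, because only volume was listed in subset. The how parameter of dropna (not covered in these notes) gives further control; a minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the log data.
log = pd.DataFrame({
    "user": ["cheryl", "cheryl", "sue", "bob"],
    "paused": [False, np.nan, np.nan, True],
    "volume": [10.0, np.nan, 5.0, 10.0],
})

# how="any" (default): drop a row if any column is missing.
complete = log.dropna(how="any")

# how="all" with subset: drop a row only if every listed column is missing.
partly = log.dropna(how="all", subset=["paused", "volume"])

print(complete.index.tolist())  # [0, 3]
print(partly.index.tolist())    # [0, 2, 3]
```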
Fill missing values
In [8]:
log_data.fillna(-1)
Out[8]:
| | time | user | video | playback position | paused | volume |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
1 | 1469974454 | cheryl | intro.html | 6 | -1 | -1.0 |
2 | 1469974544 | cheryl | intro.html | 9 | -1 | -1.0 |
3 | 1469974574 | cheryl | intro.html | 10 | -1 | -1.0 |
4 | 1469977514 | bob | intro.html | 1 | -1 | -1.0 |
5 | 1469977544 | bob | intro.html | 1 | -1 | -1.0 |
6 | 1469977574 | bob | intro.html | 1 | -1 | -1.0 |
7 | 1469977604 | bob | intro.html | 1 | -1 | -1.0 |
8 | 1469974604 | cheryl | intro.html | 11 | -1 | -1.0 |
9 | 1469974694 | cheryl | intro.html | 14 | -1 | -1.0 |
10 | 1469974724 | cheryl | intro.html | 15 | -1 | -1.0 |
11 | 1469974454 | sue | advanced.html | 24 | -1 | -1.0 |
12 | 1469974524 | sue | advanced.html | 25 | -1 | -1.0 |
13 | 1469974424 | sue | advanced.html | 23 | False | 10.0 |
14 | 1469974554 | sue | advanced.html | 26 | -1 | -1.0 |
15 | 1469974624 | sue | advanced.html | 27 | -1 | -1.0 |
16 | 1469974654 | sue | advanced.html | 28 | -1 | 5.0 |
17 | 1469974724 | sue | advanced.html | 29 | -1 | -1.0 |
18 | 1469974484 | cheryl | intro.html | 7 | -1 | -1.0 |
19 | 1469974514 | cheryl | intro.html | 8 | -1 | -1.0 |
20 | 1469974754 | sue | advanced.html | 30 | -1 | -1.0 |
21 | 1469974824 | sue | advanced.html | 31 | -1 | -1.0 |
22 | 1469974854 | sue | advanced.html | 32 | -1 | -1.0 |
23 | 1469974924 | sue | advanced.html | 33 | -1 | -1.0 |
24 | 1469977424 | bob | intro.html | 1 | True | 10.0 |
25 | 1469977454 | bob | intro.html | 1 | -1 | -1.0 |
26 | 1469977484 | bob | intro.html | 1 | -1 | -1.0 |
27 | 1469977634 | bob | intro.html | 1 | -1 | -1.0 |
28 | 1469977664 | bob | intro.html | 1 | -1 | -1.0 |
29 | 1469974634 | cheryl | intro.html | 12 | -1 | -1.0 |
30 | 1469974664 | cheryl | intro.html | 13 | -1 | -1.0 |
31 | 1469977694 | bob | intro.html | 1 | -1 | -1.0 |
32 | 1469977724 | bob | intro.html | 1 | -1 | -1.0 |
In [12]:
# Sort the data
sorted_log_data = log_data.sort_values(['time','user'])
sorted_log_data
Out[12]:
| | time | user | video | playback position | paused | volume |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
13 | 1469974424 | sue | advanced.html | 23 | False | 10.0 |
1 | 1469974454 | cheryl | intro.html | 6 | NaN | NaN |
11 | 1469974454 | sue | advanced.html | 24 | NaN | NaN |
18 | 1469974484 | cheryl | intro.html | 7 | NaN | NaN |
19 | 1469974514 | cheryl | intro.html | 8 | NaN | NaN |
12 | 1469974524 | sue | advanced.html | 25 | NaN | NaN |
2 | 1469974544 | cheryl | intro.html | 9 | NaN | NaN |
14 | 1469974554 | sue | advanced.html | 26 | NaN | NaN |
3 | 1469974574 | cheryl | intro.html | 10 | NaN | NaN |
8 | 1469974604 | cheryl | intro.html | 11 | NaN | NaN |
15 | 1469974624 | sue | advanced.html | 27 | NaN | NaN |
29 | 1469974634 | cheryl | intro.html | 12 | NaN | NaN |
16 | 1469974654 | sue | advanced.html | 28 | NaN | 5.0 |
30 | 1469974664 | cheryl | intro.html | 13 | NaN | NaN |
9 | 1469974694 | cheryl | intro.html | 14 | NaN | NaN |
10 | 1469974724 | cheryl | intro.html | 15 | NaN | NaN |
17 | 1469974724 | sue | advanced.html | 29 | NaN | NaN |
20 | 1469974754 | sue | advanced.html | 30 | NaN | NaN |
21 | 1469974824 | sue | advanced.html | 31 | NaN | NaN |
22 | 1469974854 | sue | advanced.html | 32 | NaN | NaN |
23 | 1469974924 | sue | advanced.html | 33 | NaN | NaN |
24 | 1469977424 | bob | intro.html | 1 | True | 10.0 |
25 | 1469977454 | bob | intro.html | 1 | NaN | NaN |
26 | 1469977484 | bob | intro.html | 1 | NaN | NaN |
4 | 1469977514 | bob | intro.html | 1 | NaN | NaN |
5 | 1469977544 | bob | intro.html | 1 | NaN | NaN |
6 | 1469977574 | bob | intro.html | 1 | NaN | NaN |
7 | 1469977604 | bob | intro.html | 1 | NaN | NaN |
27 | 1469977634 | bob | intro.html | 1 | NaN | NaN |
28 | 1469977664 | bob | intro.html | 1 | NaN | NaN |
31 | 1469977694 | bob | intro.html | 1 | NaN | NaN |
32 | 1469977724 | bob | intro.html | 1 | NaN | NaN |
In [14]:
sorted_log_data.ffill()
Out[14]:
| | time | user | video | playback position | paused | volume |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
13 | 1469974424 | sue | advanced.html | 23 | False | 10.0 |
1 | 1469974454 | cheryl | intro.html | 6 | False | 10.0 |
11 | 1469974454 | sue | advanced.html | 24 | False | 10.0 |
18 | 1469974484 | cheryl | intro.html | 7 | False | 10.0 |
19 | 1469974514 | cheryl | intro.html | 8 | False | 10.0 |
12 | 1469974524 | sue | advanced.html | 25 | False | 10.0 |
2 | 1469974544 | cheryl | intro.html | 9 | False | 10.0 |
14 | 1469974554 | sue | advanced.html | 26 | False | 10.0 |
3 | 1469974574 | cheryl | intro.html | 10 | False | 10.0 |
8 | 1469974604 | cheryl | intro.html | 11 | False | 10.0 |
15 | 1469974624 | sue | advanced.html | 27 | False | 10.0 |
29 | 1469974634 | cheryl | intro.html | 12 | False | 10.0 |
16 | 1469974654 | sue | advanced.html | 28 | False | 5.0 |
30 | 1469974664 | cheryl | intro.html | 13 | False | 5.0 |
9 | 1469974694 | cheryl | intro.html | 14 | False | 5.0 |
10 | 1469974724 | cheryl | intro.html | 15 | False | 5.0 |
17 | 1469974724 | sue | advanced.html | 29 | False | 5.0 |
20 | 1469974754 | sue | advanced.html | 30 | False | 5.0 |
21 | 1469974824 | sue | advanced.html | 31 | False | 5.0 |
22 | 1469974854 | sue | advanced.html | 32 | False | 5.0 |
23 | 1469974924 | sue | advanced.html | 33 | False | 5.0 |
24 | 1469977424 | bob | intro.html | 1 | True | 10.0 |
25 | 1469977454 | bob | intro.html | 1 | True | 10.0 |
26 | 1469977484 | bob | intro.html | 1 | True | 10.0 |
4 | 1469977514 | bob | intro.html | 1 | True | 10.0 |
5 | 1469977544 | bob | intro.html | 1 | True | 10.0 |
6 | 1469977574 | bob | intro.html | 1 | True | 10.0 |
7 | 1469977604 | bob | intro.html | 1 | True | 10.0 |
27 | 1469977634 | bob | intro.html | 1 | True | 10.0 |
28 | 1469977664 | bob | intro.html | 1 | True | 10.0 |
31 | 1469977694 | bob | intro.html | 1 | True | 10.0 |
32 | 1469977724 | bob | intro.html | 1 | True | 10.0 |
In [15]:
sorted_log_data.bfill()
Out[15]:
| | time | user | video | playback position | paused | volume |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
13 | 1469974424 | sue | advanced.html | 23 | False | 10.0 |
1 | 1469974454 | cheryl | intro.html | 6 | True | 5.0 |
11 | 1469974454 | sue | advanced.html | 24 | True | 5.0 |
18 | 1469974484 | cheryl | intro.html | 7 | True | 5.0 |
19 | 1469974514 | cheryl | intro.html | 8 | True | 5.0 |
12 | 1469974524 | sue | advanced.html | 25 | True | 5.0 |
2 | 1469974544 | cheryl | intro.html | 9 | True | 5.0 |
14 | 1469974554 | sue | advanced.html | 26 | True | 5.0 |
3 | 1469974574 | cheryl | intro.html | 10 | True | 5.0 |
8 | 1469974604 | cheryl | intro.html | 11 | True | 5.0 |
15 | 1469974624 | sue | advanced.html | 27 | True | 5.0 |
29 | 1469974634 | cheryl | intro.html | 12 | True | 5.0 |
16 | 1469974654 | sue | advanced.html | 28 | True | 5.0 |
30 | 1469974664 | cheryl | intro.html | 13 | True | 10.0 |
9 | 1469974694 | cheryl | intro.html | 14 | True | 10.0 |
10 | 1469974724 | cheryl | intro.html | 15 | True | 10.0 |
17 | 1469974724 | sue | advanced.html | 29 | True | 10.0 |
20 | 1469974754 | sue | advanced.html | 30 | True | 10.0 |
21 | 1469974824 | sue | advanced.html | 31 | True | 10.0 |
22 | 1469974854 | sue | advanced.html | 32 | True | 10.0 |
23 | 1469974924 | sue | advanced.html | 33 | True | 10.0 |
24 | 1469977424 | bob | intro.html | 1 | True | 10.0 |
25 | 1469977454 | bob | intro.html | 1 | NaN | NaN |
26 | 1469977484 | bob | intro.html | 1 | NaN | NaN |
4 | 1469977514 | bob | intro.html | 1 | NaN | NaN |
5 | 1469977544 | bob | intro.html | 1 | NaN | NaN |
6 | 1469977574 | bob | intro.html | 1 | NaN | NaN |
7 | 1469977604 | bob | intro.html | 1 | NaN | NaN |
27 | 1469977634 | bob | intro.html | 1 | NaN | NaN |
28 | 1469977664 | bob | intro.html | 1 | NaN | NaN |
31 | 1469977694 | bob | intro.html | 1 | NaN | NaN |
32 | 1469977724 | bob | intro.html | 1 | NaN | NaN |
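In the outputs above, rows from different users are interleaved after sorting by time, so a plain ffill() can copy one user's value into another user's row. One possible remedy, offered here as an assumption about what the analysis needs rather than as part of the lesson, is to fill within each user via groupby. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical log: two users whose rows alternate in time.
log = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 6],
    "user": ["cheryl", "sue", "cheryl", "sue", "cheryl", "sue"],
    "volume": [10.0, 5.0, np.nan, np.nan, np.nan, np.nan],
})
log = log.sort_values(["time", "user"])

# Plain forward fill: each gap takes the previous row's value, whoever wrote it.
plain = log["volume"].ffill()

# Grouped forward fill: each gap takes the previous value of the same user.
per_user = log.groupby("user")["volume"].ffill()

print(plain.tolist())     # [10.0, 5.0, 5.0, 5.0, 5.0, 5.0]
print(per_user.tolist())  # [10.0, 5.0, 10.0, 5.0, 10.0, 5.0]
```

The grouped result keeps the original index, so it can be assigned straight back to a column of the sorted frame.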
In [16]:
data = pd.DataFrame(
{'age':[28, 31, 27, 28],
'gender':['M','M','M','F'],
'surname':['Liu','Li','Chen','Liu']}
)
data
Out[16]:
| | age | gender | surname |
---|---|---|---|
0 | 28 | M | Liu |
1 | 31 | M | Li |
2 | 27 | M | Chen |
3 | 28 | F | Liu |
In [17]:
data.duplicated()
Out[17]:
0 False
1 False
2 False
3 False
dtype: bool
In [18]:
data.duplicated(subset=['age','surname'])
Out[18]:
0 False
1 False
2 False
3 True
dtype: bool
In [19]:
data.drop_duplicates(subset=['age','surname'])
Out[19]:
| | age | gender | surname |
---|---|---|---|
0 | 28 | M | Liu |
1 | 31 | M | Li |
2 | 27 | M | Chen |
In [23]:
data.drop_duplicates(subset=['age','surname'],keep='last')
Out[23]:
| | age | gender | surname |
---|---|---|---|
1 | 31 | M | Li |
2 | 27 | M | Chen |
3 | 28 | F | Liu |
In [24]:
data = pd.DataFrame(
{'A':[0, 1, 2, 3, 4],
'B':[5, 6, 7, 8, 9],
'C':['a', 'b', 'c', 'd', 'e']}
)
data
Out[24]:
| | A | B | C |
---|---|---|---|
0 | 0 | 5 | a |
1 | 1 | 6 | b |
2 | 2 | 7 | c |
3 | 3 | 8 | d |
4 | 4 | 9 | e |
In [25]:
# Replace a single value
data.replace(0,100)
Out[25]:
| | A | B | C |
---|---|---|---|
0 | 100 | 5 | a |
1 | 1 | 6 | b |
2 | 2 | 7 | c |
3 | 3 | 8 | d |
4 | 4 | 9 | e |
In [26]:
# Replace every value in a list with a single value
data.replace([0, 1, 2, 3],4)
Out[26]:
| | A | B | C |
---|---|---|---|
0 | 4 | 5 | a |
1 | 4 | 6 | b |
2 | 4 | 7 | c |
3 | 4 | 8 | d |
4 | 4 | 9 | e |
In [27]:
# Replace each value in a list with the corresponding value in a second list
data.replace([0, 1, 2, 3],[4, 3, 2, 1])
Out[27]:
| | A | B | C |
---|---|---|---|
0 | 4 | 5 | a |
1 | 3 | 6 | b |
2 | 2 | 7 | c |
3 | 1 | 8 | d |
4 | 4 | 9 | e |
In [29]:
# Replace using a dict
data.replace({0:10,1:100})
Out[29]:
| | A | B | C |
---|---|---|---|
0 | 10 | 5 | a |
1 | 100 | 6 | b |
2 | 2 | 7 | c |
3 | 3 | 8 | d |
4 | 4 | 9 | e |