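Data Loading and Exploratory Analysis
- Data loading and preliminary analysis
- Loading the data
- First look
- Saving the data
- Data types
- Filtering logic
- Sorting
Data loading and preliminary analysis
Loading the data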
import pandas as pd
import numpy as np
df = pd.read_csv('train.csv')
df.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
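Rename the column headers to Chinese. Passing names supplies the new labels, header=0 tells pandas that the first line of the file is the old header and should be replaced rather than read as data, and index_col='乘客ID' uses the passenger ID as the row index.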
df = pd.read_csv('train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)
df.head()
| 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
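The effect of header=0 in the call above can be checked on a tiny in-memory file. This is a minimal sketch: the three-column CSV text and the variable names are made up for illustration and only stand in for train.csv.

```python
import io
import pandas as pd

# Hypothetical three-column CSV standing in for train.csv.
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n"
new_names = ['乘客ID', '是否幸存', '年龄']

# header=0: the first line is the old header and is replaced by new_names.
renamed = pd.read_csv(io.StringIO(csv_text), names=new_names,
                      header=0, index_col='乘客ID')

# Without header=0, the old header line is kept as an ordinary data row.
not_skipped = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(renamed.shape)      # (2, 2): two passengers, ID moved into the index
print(not_skipped.shape)  # (3, 3): the old header became the first data row
```

If the second form is used by mistake, every column is read as text, because the English header strings are mixed in with the numbers.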
First look
df.info()
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 是否幸存 891 non-null int64
1 仓位等级 891 non-null int64
2 姓名 891 non-null object
3 性别 891 non-null object
4 年龄 714 non-null float64
5 兄弟姐妹个数 891 non-null int64
6 父母子女个数 891 non-null int64
7 船票信息 891 non-null object
8 票价 891 non-null float64
9 客舱 204 non-null object
10 登船港口 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
df.isnull().head()
| 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | False | False | False | False | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | True | False |
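The boolean table from isnull() is often summed column by column to get a count of missing values, since True counts as 1. The sketch below uses a small made-up frame (toy is not part of the original notebook) rather than the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({
    '年龄': [22.0, np.nan, 26.0],
    '客舱': [np.nan, 'C85', np.nan],
})

# isnull() gives a True/False table; sum() adds up the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Applied to df, the same call should agree with the non-null counts reported by df.info().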
df.describe()
| | 是否幸存 | 仓位等级 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 票价 |
| --- | --- | --- | --- | --- | --- | --- |
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
df.nunique()
是否幸存 2
仓位等级 3
姓名 891
性别 2
年龄 88
兄弟姐妹个数 7
父母子女个数 7
船票信息 681
票价 248
客舱 147
登船港口 3
dtype: int64
df['年龄'].unique()
array([22. , 38. , 26. , 35. , nan, 54. , 2. , 27. , 14. ,
4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
df['年龄'].count()
714
df['年龄'].value_counts()
24.00 30
22.00 27
18.00 26
19.00 25
30.00 25
..
55.50 1
70.50 1
66.00 1
23.50 1
0.42 1
Name: 年龄, Length: 88, dtype: int64
Saving the data
df.to_csv('train_chinese.csv')
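Because to_csv writes the index as the first column by default, the saved file can be read back with index_col to recover the same layout. The following is a minimal round-trip sketch on a made-up two-row frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Hypothetical frame with a named index, similar in shape to df.
toy = pd.DataFrame({'是否幸存': [0, 1]},
                   index=pd.Index([1, 2], name='乘客ID'))

buffer = io.StringIO()
toy.to_csv(buffer)   # the index is written as the first column, '乘客ID'
buffer.seek(0)

# Reading back with index_col restores the passenger ID as the index.
restored = pd.read_csv(buffer, index_col='乘客ID')
print(restored)
```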
Data types
Series
The most commonly used attributes of a Series are its values (values), its index (index), its name (name), and its data type (dtype).
s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'],name='这是一个Series',dtype='float64')
s
a -1.037213
b 0.102598
c 0.470972
d -0.497304
e -1.753896
Name: 这是一个Series, dtype: float64
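Each of the four attributes can be read directly from the object. The sketch below uses fixed numbers instead of random ones so the result is reproducible; s_demo is a made-up name for illustration.

```python
import pandas as pd

# Hypothetical Series with fixed values so the output is predictable.
s_demo = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'],
                   name='demo', dtype='float64')

print(s_demo.values)  # the underlying NumPy array of values
print(s_demo.index)   # the row labels
print(s_demo.name)    # the name given at construction
print(s_demo.dtype)   # the element type
```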
DataFrame
df1 = pd.DataFrame(np.arange(12).reshape(3,4),columns = ['col1','col2','col3','col4'],index = list('123'))
df1
| | col1 | col2 | col3 | col4 |
| --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 2 | 3 |
| 2 | 4 | 5 | 6 | 7 |
| 3 | 8 | 9 | 10 | 11 |
df1.rename(index={'1':'one'},columns={'col1':'new_col1'})
| | new_col1 | col2 | col3 | col4 |
| --- | --- | --- | --- | --- |
| one | 0 | 1 | 2 | 3 |
| 2 | 4 | 5 | 6 | 7 |
| 3 | 8 | 9 | 10 | 11 |
df1.drop(columns = ['col2','col3'])
| | col1 | col4 |
| --- | --- | --- |
| 1 | 0 | 3 |
| 2 | 4 | 7 |
| 3 | 8 | 11 |
df1
| | col1 | col2 | col3 | col4 |
| --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 2 | 3 |
| 2 | 4 | 5 | 6 | 7 |
| 3 | 8 | 9 | 10 | 11 |
df1.drop(columns = ['col2','col3'], inplace = True)
df1
| | col1 | col4 |
| --- | --- | --- |
| 1 | 0 | 3 |
| 2 | 4 | 7 |
| 3 | 8 | 11 |
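Filtering logic
Filtering works by evaluating a boolean condition for every row and keeping only the rows where the result is True.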
df[df["年龄"]<10].head(3)
| 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.700 | G6 | S |
| 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.125 | NaN | Q |
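The boolean condition inside the brackets is itself a Series of True/False values, one per row, and it can be inspected on its own. The sketch below uses a made-up four-row frame; note that a missing age compares as False, so those rows are dropped by the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
toy = pd.DataFrame({'年龄': [2.0, 35.0, np.nan, 8.0]},
                   index=[8, 9, 10, 11])

# The comparison produces one boolean per row; NaN < 10 evaluates to False.
mask = toy['年龄'] < 10
print(mask)

# Indexing with the mask keeps only the rows where it is True.
children = toy[mask]
print(children)
```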
midage = df[(df["年龄"]>10)& (df["年龄"]<50)]
midage.head(3)
| 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
midage.loc[[100],['仓位等级','性别']]
midage.loc[[100,105,109],['仓位等级','姓名','性别']]
| 乘客ID | 仓位等级 | 姓名 | 性别 |
| --- | --- | --- | --- |
| 100 | 2 | Kantor, Mr. Sinai | male |
| 105 | 3 | Gustafsson, Mr. Anders Vilhelm | male |
| 109 | 3 | Rekic, Mr. Tido | male |
midage.iloc[[100,105,108],[1,2,3]]
| 乘客ID | 仓位等级 | 姓名 | 性别 |
| --- | --- | --- | --- |
| 150 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 161 | 3 | Cribb, Mr. John Hatfield | male |
| 164 | 3 | Calic, Mr. Jovo | male |
midage = midage.reset_index(drop = True)
midage
| | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 571 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
| 572 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 573 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 574 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 575 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
576 rows × 11 columns
midage.loc[[100,105,108],['仓位等级','姓名','性别']]
| | 仓位等级 | 姓名 | 性别 |
| --- | --- | --- | --- |
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
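The different results above come from the difference between label-based and position-based selection: loc looks up index labels, while iloc counts rows from zero. After a filter, the surviving labels keep their original values, so the two disagree until reset_index(drop=True) renumbers the rows. A minimal sketch on a made-up frame (all names here are hypothetical):

```python
import pandas as pd

# Hypothetical frame whose labels are not 0, 1, 2, ...
toy = pd.DataFrame({'姓名': ['A', 'B', 'C', 'D']}, index=[10, 20, 30, 40])

# Filtering keeps the original labels: 20, 30, 40.
subset = toy[toy.index > 10]

by_label = subset.loc[30, '姓名']    # row whose label is 30
by_position = subset.iloc[1, 0]      # second row, first column

# After renumbering, label 1 and position 1 refer to the same row.
renumbered = subset.reset_index(drop=True)
after_reset = renumbered.loc[1, '姓名']

print(by_label, by_position, after_reset)
```

In this sketch subset.loc[1, '姓名'] would raise a KeyError before the reset, because no row carries the label 1.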
Sorting
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['2', '1'],
columns=['d', 'a', 'b', 'c'])
frame
| | d | a | b | c |
| --- | --- | --- | --- | --- |
| 2 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
frame.sort_index(ascending = False)
| | d | a | b | c |
| --- | --- | --- | --- | --- |
| 2 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
frame.sort_index(axis = 1)
| | a | b | c | d |
| --- | --- | --- | --- | --- |
| 2 | 1 | 2 | 3 | 0 |
| 1 | 5 | 6 | 7 | 4 |
frame.sort_values(by = ['b','d'])
| | d | a | b | c |
| --- | --- | --- | --- | --- |
| 2 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
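With only two rows and no repeated values, the second sort key never comes into play in the frame above. The sketch below uses a made-up frame with ties in column b to show how the second key d breaks them, and how ascending can take one flag per key.

```python
import pandas as pd

# Hypothetical frame with repeated values in 'b'.
toy = pd.DataFrame({'b': [2, 1, 2, 1], 'd': [0, 9, 5, 3]})

# Sort by 'b' first; rows with equal 'b' are then ordered by 'd'.
ordered = toy.sort_values(by=['b', 'd'])
print(ordered.index.tolist())  # [3, 1, 0, 2]

# One flag per key: 'b' ascending, 'd' descending within each 'b'.
mixed = toy.sort_values(by=['b', 'd'], ascending=[True, False])
print(mixed.index.tolist())    # [1, 3, 2, 0]
```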