>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('E:/python/titanic/train.csv')
>>> df.head(3)
Out[3]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
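[Think] How do pd.read_csv() and pd.read_table() differ? What is the difference between '.tsv' and '.csv' files, and how do you load each kind?
Reading and writing CSV files.
# TSV-style parsing: read_table splits on tabs by default, so a comma-separated file collapses into a single column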
>>> tf = pd.read_table('titanic/train.csv')
>>> tf
Out[22]:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/...
1 2,1,1,"Cumings, Mrs. John Bradley (Florence Br...
2 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S...
3 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ...
4 5,0,3,"Allen, Mr. William Henry",male,35,0,0,3...
.. ...
886 887,0,2,"Montvila, Rev. Juozas",male,27,0,0,21...
887 888,1,1,"Graham, Miss. Margaret Edith",female,...
888 889,0,3,"Johnston, Miss. Catherine Helen ""Car...
889 890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,11...
890 891,0,3,"Dooley, Mr. Patrick",male,32,0,0,3703...
[891 rows x 1 columns]
# CSV-style parsing: read_csv splits on commas by default, so the same file yields 12 separate columns
>>> df.shape
(891, 12)
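The contrast above comes down to the default separator: read_csv splits on commas, read_table on tabs. A minimal sketch using made-up in-memory data (the id/name sample and the variable names are illustrative assumptions, not taken from the Titanic files):

```python
import io

import pandas as pd

# Made-up sample data: the same two records, comma- and tab-separated.
csv_text = "id,name\n1,Ann\n2,Bob\n"
tsv_text = "id\tname\n1\tAnn\n2\tBob\n"

# Each reader parses its own format into two columns.
csv_frame = pd.read_csv(io.StringIO(csv_text))
tsv_frame = pd.read_table(io.StringIO(tsv_text))

# read_table on comma-separated text finds no tabs, so every line becomes one column.
csv_as_table = pd.read_table(io.StringIO(csv_text))

# Passing sep="," makes read_table behave like read_csv.
csv_with_sep = pd.read_table(io.StringIO(csv_text), sep=",")

print(csv_frame.shape, tsv_frame.shape, csv_as_table.shape)
```

The two functions share the same parser; only the default value of sep differs.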
>>> chunker = pd.read_csv('titanic/train.csv', chunksize=1000)
>>> print(chunker)
<pandas.io.parsers.TextFileReader object at 0x000002E5A8697648>
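[Think] What is chunked reading, and why use it?
Large files come up often when working with pandas, and sometimes we only want to read part of a file or process it chunk by chunk.
1. Read the first rows of a file
The nrows parameter sets how many rows to read from the start of the file; it must be an integer greater than or equal to 0.
data = pd.read_csv('titanic/train.csv', nrows=5)
2. Read a file chunk by chunk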
chunker = pd.read_csv('titanic/train.csv', chunksize=1000)
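The object returned with chunksize is an iterator of DataFrames, so the usual pattern is to loop over it and aggregate as you go. A minimal sketch, assuming a made-up single-column file held in memory (the column name value and the chunk size of 4 are illustrative):

```python
import io

import pandas as pd

# Made-up data: one column holding the integers 0..9.
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

# nrows reads only the first rows of the file.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=3)

# chunksize returns an iterator; each chunk is a DataFrame of at most 4 rows.
chunk_sizes = []
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total += chunk["value"].sum()

print(len(first_rows), chunk_sizes, total)
```

Only one chunk is held in memory at a time, which is the point of reading a large file this way.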
>>> df = pd.read_csv('titanic/train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)
>>> df.head()
Out[13]:
是否幸存 仓位等级 ... 客舱 登船港口
乘客ID ...
1 0 3 ... NaN S
2 1 1 ... C85 C
3 1 3 ... NaN S
4 1 1 ... C123 S
5 0 3 ... NaN S
[5 rows x 11 columns]
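In the call above, header=0 tells pandas that the first line of the file is a header to be replaced by names. Without it, passing names makes pandas treat the old header line as data. A small sketch with made-up data (the names pid and survived are illustrative assumptions):

```python
import io

import pandas as pd

# Made-up two-row file with an English header line.
csv_text = "PassengerId,Survived\n1,0\n2,1\n"
new_names = ["pid", "survived"]

# header=0: the first line is discarded and replaced by new_names.
renamed = pd.read_csv(
    io.StringIO(csv_text), names=new_names, header=0, index_col="pid"
)

# No header argument: the old header line is kept as the first data row.
header_as_data = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(renamed.shape, header_as_data.shape)
```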
Inspect the size of the data, the number of columns, the dtype of each column, and whether any column contains nulls.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 是否幸存 891 non-null int64
1 仓位等级 891 non-null int64
2 姓名 891 non-null object
3 性别 891 non-null object
4 年龄 714 non-null float64
5 兄弟姐妹个数 891 non-null int64
6 父母子女个数 891 non-null int64
7 船票信息 891 non-null object
8 票价 891 non-null float64
9 客舱 204 non-null object
10 登船港口 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
>>> df.head(10) # first 10 rows of the table
Out[15]:
是否幸存 仓位等级 ... 客舱 登船港口
乘客ID ...
1 0 3 ... NaN S
2 1 1 ... C85 C
3 1 3 ... NaN S
4 1 1 ... C123 S
5 0 3 ... NaN S
6 0 3 ... NaN Q
7 0 1 ... E46 S
8 0 3 ... NaN S
9 1 3 ... NaN S
10 1 2 ... NaN C
[10 rows x 11 columns]
>>> df.tail(15) # last 15 rows of the table
Out[16]:
是否幸存 仓位等级 ... 客舱 登船港口
乘客ID ...
877 0 3 ... NaN S
878 0 3 ... NaN S
879 0 3 ... NaN S
880 1 1 ... C50 C
881 1 2 ... NaN S
882 0 3 ... NaN S
883 0 3 ... NaN S
884 0 2 ... NaN S
885 0 3 ... NaN S
886 0 3 ... NaN Q
887 0 2 ... NaN S
888 1 1 ... B42 S
889 0 3 ... NaN S
890 1 1 ... C148 C
891 0 3 ... NaN Q
[15 rows x 11 columns]
df.isnull().head()
Out[17]:
是否幸存 仓位等级 姓名 性别 年龄 ... 父母子女个数 船票信息 票价 客舱 登船港口
乘客ID ...
1 False False False False False ... False False False True False
2 False False False False False ... False False False False False
3 False False False False False ... False False False True False
4 False False False False False ... False False False False False
5 False False False False False ... False False False True False
[5 rows x 11 columns]
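[Think] From what other angles can a dataset be examined? Look for answers; they will help a great deal with the analysis that follows.
df.notnull().head() # True where a value is present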
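A few other quick checks that may help answer the question above: the shape, the number of missing values per column, the number of distinct values, and the frequency of each category. This is only a sketch on a made-up three-row frame (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up sample with some missing values.
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "cabin": [None, "C85", None],
    "sex": ["male", "female", "female"],
})

n_rows, n_cols = sample.shape                  # table size
missing_per_column = sample.isnull().sum()     # count of nulls in each column
distinct_per_column = sample.nunique()         # distinct non-null values per column
sex_counts = sample["sex"].value_counts()      # frequency of each category

print(missing_per_column)
```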
Save the data you loaded and modified to a new file, titanic/train_chinese.csv, relative to the working directory.
df.to_csv('titanic/train_chinese.csv')
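Note that to_csv writes the index as the first column by default, so reading the file back without index_col turns the old index into an ordinary column. A minimal sketch using an in-memory buffer instead of a file (the names pid and survived are made up for illustration):

```python
import io

import pandas as pd

# Made-up frame whose index carries a name, like the passenger ID above.
original = pd.DataFrame({"survived": [0, 1]}, index=pd.Index([1, 2], name="pid"))

# Write to an in-memory buffer; the index is written as the first column.
buffer = io.StringIO()
original.to_csv(buffer)
csv_text = buffer.getvalue()

# Read back without index_col: "pid" is now a regular column.
plain = pd.read_csv(io.StringIO(csv_text))

# Read back with index_col: the original layout is restored.
restored = pd.read_csv(io.StringIO(csv_text), index_col="pid")

print(list(plain.columns))
```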
# Series
>>> sdata = {'Beijing': 35000, 'Tianjin': 71000, 'Shanghai': 16000, 'Chongqin': 5000}
>>> example_1 = pd.Series(sdata)
>>> example_1
Out[3]:
Beijing 35000
Tianjin 71000
Shanghai 16000
Chongqin 5000
dtype: int64
# DataFrame
>>> data = {'state': ['Beijing', 'Beijing', 'Beijing', 'Tianjin', 'Tianjin', 'Tianjin'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
>>> example_2 = pd.DataFrame(data)
>>> example_2
Out[3]:
state year pop
0 Beijing 2000 1.5
1 Beijing 2001 1.7
2 Beijing 2002 3.6
3 Tianjin 2001 2.4
4 Tianjin 2002 2.9
5 Tianjin 2003 3.2
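The two types are closely related: a Series is a labelled one-dimensional array, and each column of a DataFrame is itself a Series. A short sketch with a trimmed-down version of the data above (the variable names are illustrative):

```python
import pandas as pd

# A Series built from a dict: keys become the index labels.
populations = pd.Series({"Beijing": 35000, "Tianjin": 71000})

# A DataFrame built from a dict of equal-length lists: keys become the column names.
table = pd.DataFrame({"state": ["Beijing", "Tianjin"], "year": [2000, 2001]})

# Selecting one column returns a Series that shares the DataFrame's row index.
year_column = table["year"]

print(type(year_column).__name__, populations["Tianjin"])
```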
>>> df = pd.read_csv('titanic/train.csv')
>>> df.head(3)
Out[4]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
[3 rows x 12 columns]
>>> df.columns
Out[5]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
>>> df['Cabin'].head(3)
Out[6]:
0 NaN
1 C85
2 NaN
Name: Cabin, dtype: object
>>> df.Cabin.head(3)
Out[7]:
0 NaN
1 C85
2 NaN
Name: Cabin, dtype: object
>>> test = pd.read_csv('titanic/test.csv')
>>> test.head(3)
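# Remove the extra column (assumed here to be named 'a')
del test['a']
[Think] Are there other ways to remove extra columns?
DataFrame.drop(labels) removes the given row or column labels; pass axis=1 to drop columns.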
df.drop(['PassengerId','Name','Age','Ticket'],axis=1).head(3)
Out[11]:
Survived Pclass Sex SibSp Parch Fare Cabin Embarked
0 0 3 male 1 0 7.2500 NaN S
1 1 1 female 1 0 71.2833 C85 C
2 1 3 female 0 0 7.9250 NaN S
# To remove the columns from df itself, pass inplace=True. That overwrites the original data, so it is not used here.
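The difference between the copying and the in-place forms can be seen on a small made-up frame. This is a sketch only; the column names name, age and city are illustrative:

```python
import pandas as pd

people = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 40], "city": ["X", "Y"]})

# drop returns a new DataFrame; the original is left untouched.
without_age = people.drop(columns=["age"])
columns_after_copy_drop = list(people.columns)

# inplace=True modifies the original and returns None.
people.drop(["city"], axis=1, inplace=True)

# pop removes a column from the original and returns it as a Series.
popped = people.pop("age")

print(list(without_age.columns), list(people.columns))
```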
>>> df[df["Age"]<10].head(3)
Out[12]:
PassengerId Survived Pclass ... Fare Cabin Embarked
7 8 0 3 ... 21.075 NaN S
10 11 1 3 ... 16.700 G6 S
16 17 0 3 ... 29.125 NaN Q
>>> midage = df[(df["Age"]>10)& (df["Age"]<50)]
>>> midage.head(3)
Out[13]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
[3 rows x 12 columns]
>>> midage = midage.reset_index(drop=True)
>>> midage.head(3)
Out[14]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
[3 rows x 12 columns]
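In the Titanic output the first three rows happen to survive the filter, so the effect of reset_index is not visible. A small made-up example shows it more clearly (the data and names are illustrative assumptions):

```python
import pandas as pd

ages = pd.DataFrame({"name": ["a", "b", "c", "d"], "age": [5, 25, 60, 40]})

# Each condition needs its own parentheses because & binds tighter than > and <.
mask = (ages["age"] > 10) & (ages["age"] < 50)
middle = ages[mask]

# The filtered frame keeps the original row labels (1 and 3 here).
labels_before = list(middle.index)

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
middle_reset = middle.reset_index(drop=True)
labels_after = list(middle_reset.index)

print(labels_before, labels_after)
```

Without the reset, a later lookup such as loc[[100]] refers to the label 100 from the original table, which may not exist in the filtered frame.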
>>> midage.loc[[100],['Pclass','Sex']]
Out[17]:
Pclass Sex
100 2 male
>>> midage.loc[[100,105,108],['Pclass','Name','Sex']] # lists of row and column labels return a DataFrame
Out[15]:
Pclass Name Sex
100 2 Byles, Rev. Thomas Roussel Davids male
105 3 Cribb, Mr. John Hatfield male
108 3 Calic, Mr. Jovo male
>>> midage.iloc[[100,105,108],[2,3,4]]
Out[16]:
Pclass Name Sex
100 2 Byles, Rev. Thomas Roussel Davids male
105 3 Cribb, Mr. John Hatfield male
108 3 Calic, Mr. Jovo male
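loc and iloc give the same result above only because the index was reset, so labels and positions coincide. A sketch on a made-up frame with a non-default index (labels 10, 20, 30 are illustrative) shows the difference:

```python
import pandas as pd

frame = pd.DataFrame(
    {"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]},
    index=[10, 20, 30],
)

# loc selects by label: the row labelled 20, the columns named Pclass and Sex.
by_label = frame.loc[[20], ["Pclass", "Sex"]]

# iloc selects by integer position: the second row, the first two columns.
by_position = frame.iloc[[1], [0, 1]]

# Scalar label arguments return a single value instead of a DataFrame.
single_value = frame.loc[20, "Sex"]

print(by_label.equals(by_position), single_value)
```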
text = pd.read_csv('titanic/train_chinese.csv')
text.head()
# Build a DataFrame
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['2', '1'],
                     columns=['d', 'a', 'b', 'c'])
# A 2x4 table with row labels (index) and column labels (columns)
frame
# Sort by column 'c', descending
frame.sort_values(by='c', ascending=False)
# Sort by row index, ascending
frame.sort_index()
# Sort by column labels, ascending
frame.sort_index(axis=1)
# Sort by column labels, descending
frame.sort_index(axis=1, ascending=False)
# Sort by two columns, a and c, both descending
frame.sort_values(by=['a', 'c'], ascending=False)
text.sort_values(by=['票价', '年龄'], ascending=False).head(50) # top 50
text.sort_values(by=['票价', '年龄'], ascending=False).tail(50) # bottom 50
Among the 50 passengers who paid the highest fares, the survival rate is high.
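When sorting by several columns, the first key takes priority and later keys only break ties; ascending can also be a list with one flag per key. A minimal sketch on made-up data (the fare and age values are illustrative):

```python
import pandas as pd

tickets = pd.DataFrame({
    "fare": [10.0, 50.0, 50.0, 7.5],
    "age": [30, 40, 20, 25],
})

# Both keys descending: highest fare first, ties broken by highest age.
both_desc = tickets.sort_values(by=["fare", "age"], ascending=False)

# Mixed order: highest fare first, ties broken by lowest age.
mixed = tickets.sort_values(by=["fare", "age"], ascending=[False, True])

print(list(both_desc.index), list(mixed.index))
```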
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
columns=['a', 'b', 'c'],
index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
columns=['a', 'e', 'c'],
index=['first', 'one', 'two', 'second'])
frame1_a:
a b c
one 0.0 1.0 2.0
two 3.0 4.0 5.0
three 6.0 7.0 8.0
frame1_b:
a e c
first 0.0 1.0 2.0
one 3.0 4.0 5.0
two 6.0 7.0 8.0
second 9.0 10.0 11.0
frame1_a + frame1_b:
a b c e
first NaN NaN NaN NaN
one 3.0 NaN 7.0 NaN
second NaN NaN NaN NaN
three NaN NaN NaN NaN
two 9.0 NaN 13.0 NaN
'''Values at matching row and column labels are added; positions missing from either DataFrame come back as NaN.'''
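If the NaN results are not wanted, one option is the add method with fill_value, which substitutes a value when only one of the two frames has a given cell. This is a sketch of that alternative using the same two frames (the names left and right are illustrative); cells missing from both frames still come back as NaN:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(9.).reshape(3, 3),
                    columns=["a", "b", "c"],
                    index=["one", "two", "three"])
right = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=["a", "e", "c"],
                     index=["first", "one", "two", "second"])

# Plain addition aligns on labels and yields NaN wherever either side is missing.
plain = left + right

# add with fill_value=0 treats a cell missing on one side as 0.
filled = left.add(right, fill_value=0)

print(filled)
```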
max(text['兄弟姐妹个数'] + text['父母子女个数'])
frame2 = pd.DataFrame([[1.4, np.nan],
[7.1, -4.5],
[np.nan, np.nan],
[0.75, -1.3]
], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
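# Call describe() to see summary statistics for frame2
frame2.describe()
'''
count : number of non-null values
mean : mean of the values
std : standard deviation of the values
min : minimum value
25% : 25th percentile
50% : 50th percentile (median)
75% : 75th percentile
max : maximum value
'''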
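Because frame2 contains NaN, it is worth checking how describe treats missing values: they are skipped, so count differs per column and the other statistics are computed from the non-null values only. A short sketch that rebuilds frame2 and checks a few entries:

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]],
                      index=["a", "b", "c", "d"],
                      columns=["one", "two"])

summary = frame2.describe()

# count reports non-null values only: 3 in column one, 2 in column two.
count_one = summary.loc["count", "one"]
count_two = summary.loc["count", "two"]

# The mean of column one is taken over its three non-null values.
mean_one = summary.loc["mean", "one"]

print(summary)
```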
text['票价'].describe()
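The mean (about 32) and the maximum (about 512) are far apart, and the standard deviation of 49.69 shows that fares vary widely; the gap between rich and poor on board was very large.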
text['父母子女个数'].describe()