Preface
This course originated at Datawhale, and it is best studied alongside the other resources Datawhale provides. Project repository: https://github.com/datawhalechina/hands-on-data-analysis
Dataset download: https://www.kaggle.com/c/titanic/overview
import os
os.getcwd()
Tip: if loading a file by relative path raises an error, use os.getcwd() to check the current working directory.
'C:\\Users\\lyj\\Desktop\\pyproject\\动手学数据分析_datawhale\\第一章_数据加载'
1. Loading data
1.1 Loading the data
1.1.1 Import the required libraries
import numpy as np
import pandas as pd
df = pd.read_csv('../titanic/train.csv')
df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
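Questions to think about:
- Now that you know how to load data, how do pd.read_csv() and pd.read_table() differ, and what would make them behave the same?
- How do '.tsv' and '.csv' files differ, and how do you load each?
- How are different data formats, e.g. csv, tsv, xlsx, read?
Both read_csv and read_table load delimited text; they differ only in the default separator. read_csv splits each line on commas, while read_table splits on the tab character \t.
Because train.csv contains no tabs, read_table reads every line into a single column, as the first output below shows. Passing sep=',' makes read_table behave like read_csv.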
table_df = pd.read_table('../titanic/train.csv')
table_df.head()
| | PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked |
|---|---|
| 0 | 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/... |
| 1 | 2,1,1,"Cumings, Mrs. John Bradley (Florence Br... |
| 2 | 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S... |
| 3 | 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ... |
| 4 | 5,0,3,"Allen, Mr. William Henry",male,35,0,0,3... |
table_df2 = pd.read_table('../titanic/train.csv',sep = ',')
table_df2.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
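Differences between TSV and CSV:
- (1) As the names suggest, TSV uses the tab character ('\t') to separate field values, while CSV uses the ASCII comma (',').
- (2) The standard TSV format registered with IANA does not allow tab characters inside field values.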
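The separator behaviour described above can be checked without any files. The following is a minimal sketch that assumes two small in-memory strings (toy data, not the Titanic file) and shows that read_table with sep=',' and read_csv with sep='\t' give the same DataFrame as a plain read_csv call.

```python
import io
import pandas as pd

# Toy data (an assumption for illustration): the same two records as CSV and as TSV.
csv_text = 'id,name,fare\n1,"Braund, Mr. Owen",7.25\n2,"Heikkinen, Miss. Laina",7.925\n'
tsv_text = 'id\tname\tfare\n1\tBraund, Mr. Owen\t7.25\n2\tHeikkinen, Miss. Laina\t7.925\n'

from_csv = pd.read_csv(io.StringIO(csv_text))                # default separator is ','
from_table = pd.read_table(io.StringIO(csv_text), sep=',')   # override the '\t' default
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')      # read_csv can read TSV too

print(from_csv.equals(from_table))
print(from_csv.equals(from_tsv))
```

Note that the CSV version has to quote the name field because it contains a comma, while the TSV version does not.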
For reading xlsx files, see this blog post: https://blog.csdn.net/sinat_28576553/article/details/81275650
1.1.2 Read the data in chunks of 1,000 rows
chunker = pd.read_csv('../titanic/train.csv',chunksize = 1000)
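- Q: What is chunked reading, and why use it?
- A: The chunksize parameter enables chunked loading. In testing, the text is split into blocks that are processed chunksize rows at a time, so memory use stays bounded for large files. Instead of a DataFrame, read_csv then returns an iterator (a TextFileReader object in current pandas). train.csv has only 891 rows, so chunksize=1000 yields a single chunk here.
Iterating over that object lets you compute statistics chunk by chunk and merge the results.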
chunker
for piece in chunker:
    print(piece.head())
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
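To make the "merge per-chunk statistics" idea concrete, here is a small sketch. It assumes a toy 10-row CSV built in memory (not train.csv) and a chunk size of 4, and accumulates a row count and a sum so that a mean can be computed without ever holding the whole file.

```python
import io
import pandas as pd

# Toy CSV with 10 data rows (assumed data standing in for a large file).
csv_text = "id,fare\n" + "\n".join(f"{i},{i * 1.5}" for i in range(10))

total_rows = 0
total_fare = 0.0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each.
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(piece)
    total_fare += piece["fare"].sum()

mean_fare = total_fare / total_rows
print(total_rows, mean_fare)
```

The reader is consumed by the loop, so iterating over the same object a second time yields nothing; call read_csv again to start over.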
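1.1.3 Change the header to Chinese and use the passenger ID as the index
PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 仓位等级 (ticket class: 1st/2nd/3rd)
Name => 姓名 (name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 兄弟姐妹个数 (number of siblings/spouses aboard)
Parch => 父母子女个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket number)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)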
data_path = '../titanic/train.csv'
df = pd.read_csv(data_path,names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)
df.head()
| 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
- Above, the columns were renamed while reading the file. What other methods are there?
- ① Rename all columns: df.columns = ['a1', 'b1', 'c1', 'd1']
- ② Rename selected columns: df.rename(columns={'a': 'A'})
old_df = pd.read_csv(data_path)
old_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
old_df.columns = ['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口']
old_df.head()
| | 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Renaming selected columns with rename returns a new DataFrame. To change the original df, assign the result back or pass inplace=True.
df1 = pd.read_csv(data_path)
df1.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df1.rename( columns={'PassengerId': '乘客id'}).head()
| | 乘客id | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
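Because rename returns a new object, it is easy to lose the result by forgetting to store it. A short sketch on a two-row toy frame (an assumption, not the Titanic data) shows that the original keeps its column names until the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})  # toy frame for illustration

renamed = toy.rename(columns={"PassengerId": "乘客id"})  # returns a new DataFrame
original_columns = list(toy.columns)                     # the original is unchanged
print(original_columns)
print(list(renamed.columns))

# Assigning back keeps the change; this is often easier to follow than inplace=True.
toy = toy.rename(columns={"PassengerId": "乘客id"})
print(list(toy.columns))
```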
1.2 Inspecting the data
1.2.1 View basic information about the data
df = pd.read_csv(data_path,names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],header = 0)
df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
乘客ID 891 non-null int64
是否幸存 891 non-null int64
仓位等级 891 non-null int64
姓名 891 non-null object
性别 891 non-null object
年龄 714 non-null float64
兄弟姐妹个数 891 non-null int64
父母子女个数 891 non-null int64
船票信息 891 non-null object
票价 891 non-null float64
客舱 204 non-null object
登船港口 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
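The output lists each column's dtype and non-null count; columns with fewer than 891 non-null entries (年龄, 客舱, 登船港口) contain missing values.
1.2.2 View the first 10 rows and the last 15 rows of the table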
df.head(10)
| | 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
df.tail(15)
| | 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
| 877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
| 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
| 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
| 881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
| 882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
| 883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
| 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
1.2.3 Check for missing values: True where a value is missing, False elsewhere
df.isnull().head()
| | 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | True | False |
df.isnull().sum()
乘客ID 0
是否幸存 0
仓位等级 0
姓名 0
性别 0
年龄 177
兄弟姐妹个数 0
父母子女个数 0
船票信息 0
票价 0
客舱 687
登船港口 2
dtype: int64
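As shown above, 客舱 (cabin) has by far the most missing values: 687 of 891 rows, more than three quarters of the data.
1.2.4 Save the modified data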
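The missing share per column can be computed directly instead of dividing by hand. This is a minimal sketch on a four-row toy frame (assumed values, not the real data): the mean of a boolean mask is the fraction of True entries. On the real data, the counts above give 687 / 891 ≈ 0.77 for 客舱.

```python
import numpy as np
import pandas as pd

# Toy frame with assumed values, using two of the renamed columns.
toy = pd.DataFrame({
    "年龄": [22.0, np.nan, 26.0, 35.0],
    "客舱": [np.nan, "C85", np.nan, np.nan],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_ratio = toy.isnull().mean()   # fraction of missing values per column
print(missing_ratio)
```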
df.to_csv('train_chinese.csv')
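One detail worth knowing when saving: by default to_csv also writes the row index as an unnamed first column, which comes back as an extra column when the file is re-read. The sketch below uses a toy frame and writes to a string rather than to train_chinese.csv (both are assumptions for illustration) to compare the default with index=False.

```python
import io
import pandas as pd

toy = pd.DataFrame({"乘客ID": [1, 2], "票价": [7.25, 71.2833]})  # toy frame

with_index = toy.to_csv()                # default index=True: row labels become a first column
without_index = toy.to_csv(index=False)  # only the data columns are written

header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
print(header_with_index)
print(header_without_index)

# Reading the index-free text back reproduces the original frame.
reloaded = pd.read_csv(io.StringIO(without_index))
```

Whether to keep the index depends on whether it carries information; for a default RangeIndex it usually does not.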
2. pandas basics
2.1 Getting to know Series and DataFrame
sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
example = pd.Series(sdata)
example
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
data = {'state':list(range(10)),'year':list(range(2010,2020)),'pop':[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,0.0]}
df = pd.DataFrame(data)
df
| | pop | state | year |
|---|---|---|---|
| 0 | 1.1 | 0 | 2010 |
| 1 | 2.2 | 1 | 2011 |
| 2 | 3.3 | 2 | 2012 |
| 3 | 4.4 | 3 | 2013 |
| 4 | 5.5 | 4 | 2014 |
| 5 | 6.6 | 5 | 2015 |
| 6 | 7.7 | 6 | 2016 |
| 7 | 8.8 | 7 | 2017 |
| 8 | 9.9 | 8 | 2018 |
| 9 | 0.0 | 9 | 2019 |
df['pop']
0 1.1
1 2.2
2 3.3
3 4.4
4 5.5
5 6.6
6 7.7
7 8.8
8 9.9
9 0.0
Name: pop, dtype: float64
In fact, every column of a DataFrame is a Series.
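This can be verified with type(). A small sketch on a toy frame (assumed values) also shows the related point that selecting with a list of labels keeps the result a DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2010, 2011], "pop": [1.1, 2.2]})  # toy frame

col = toy["pop"]     # single label: a Series
sub = toy[["pop"]]   # list of labels: a one-column DataFrame
print(type(col).__name__, type(sub).__name__)
```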
2.2 Viewing the data
df = pd.read_csv(data_path)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
test_data_path = '../titanic/test.csv'
test_df = pd.read_csv(test_data_path)
test_df.head()
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
test_df.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
2.3 Deleting columns
- ①del df[]
- ②df.drop([],axis = 1)
del df['Name']
df.head()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df.drop(['Age'],axis = 1).head(2)
| | PassengerId | Survived | Pclass | Sex | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
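The two deletion methods differ in one important way: del changes the DataFrame in place, while drop returns a modified copy and leaves the original alone. A minimal sketch on a toy frame (assumed values):

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["a", "b"], "Age": [22.0, 38.0], "Fare": [7.25, 71.28]})  # toy frame

dropped = toy.drop(columns=["Age"])        # same effect as toy.drop(["Age"], axis=1); returns a copy
columns_after_drop = list(toy.columns)     # 'Age' is still present in toy
print(columns_after_drop)

del toy["Name"]                            # removes the column from toy itself
columns_after_del = list(toy.columns)
print(columns_after_del)
```

This is why df.drop(['Age'], axis=1).head(2) above hides Age in the displayed result only, and Age is still available for the filtering in the next section.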
2.4 Filtering logic
2.4.1 Using 'Age' as the filter, show passengers younger than 10
df[df['Age']<10].head(2)
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 8 | 0 | 3 | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 10 | 11 | 1 | 3 | female | 4.0 | 1 | 1 | PP 9549 | 16.700 | G6 | S |
2.4.2 Using 'Age' as the condition, show passengers older than 10 and younger than 50, and name the result midage
midage = df[(df['Age']>10)&(df['Age']<50)]
midage.head(2)
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
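Two details of the combined condition are easy to get wrong: each comparison needs its own parentheses because & binds more tightly than > and <, and rows with a missing Age never satisfy either comparison, so they are dropped. The sketch below uses a toy Age column (assumed values) and also shows Series.between as an alternative; the string value inclusive="neither" assumes pandas 1.3 or later.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [2.0, 10.0, 22.0, np.nan, 50.0, 49.0]})  # toy ages, one missing

# Parentheses around each comparison are required when combining with &.
mask = (toy["Age"] > 10) & (toy["Age"] < 50)
by_mask = toy[mask]

# Equivalent filter with between; both bounds excluded (assumes pandas >= 1.3).
by_between = toy[toy["Age"].between(10, 50, inclusive="neither")]

print(by_mask["Age"].tolist())
```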
2.4.3 Display 'Pclass' and 'Sex' for row 100 of midage
midage
midage = midage.reset_index(drop = True)
midage.head()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
midage.loc[[100],['Pclass','Sex']]
Wrapping the row label in [] returns a DataFrame; with a bare label the result is a Series.
midage.loc[100,['Pclass','Sex']]
Pclass 2
Sex male
Name: 100, dtype: object
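The difference between the two calls is only the type of the row selector. A minimal sketch on a three-row toy frame (assumed values) makes the resulting types explicit.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]})  # toy frame

as_series = toy.loc[2, ["Pclass", "Sex"]]    # scalar row label: a Series indexed by column name
as_frame = toy.loc[[2], ["Pclass", "Sex"]]   # list of row labels: a one-row DataFrame

print(type(as_series).__name__, type(as_frame).__name__, as_frame.shape)
```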
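2.4.4 Display 'Pclass', 'Sex', and 'Name' for rows 100, 105, and 108 of midage
Note: 'Name' shows NaN below because the column was removed earlier with del df['Name']. Recent pandas versions raise a KeyError for a missing label in .loc instead of returning NaN.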
midage.loc[[100,105,108],['Pclass','Sex','Name']]
| | Pclass | Sex | Name |
|---|---|---|---|
| 100 | 2 | male | NaN |
| 105 | 3 | male | NaN |
| 108 | 3 | male | NaN |
2.4.5 Display 'Pclass' and 'Sex' for rows 100, 105, and 108 of midage using iloc
midage.head()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
midage.iloc[[100,105,108],[2,3]]
| | Pclass | Sex |
|---|---|---|
| 100 | 2 | male |
| 105 | 3 | male |
| 108 | 3 | male |
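loc and iloc agree here only because midage was given a fresh 0-based index with reset_index(drop=True). The following sketch, on a toy frame with assumed values, shows how the two diverge on a filtered frame that still carries its original labels.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [5, 30, 8, 40, 45]})  # toy frame
adults = toy[toy["Age"] > 10]                    # keeps the original labels 1, 3, 4

by_position = adults.iloc[1]["Age"]   # second row by position
by_label = adults.loc[1, "Age"]       # row whose label is 1
print(by_position, by_label)

reset = adults.reset_index(drop=True)  # labels become 0, 1, 2
after_reset = reset.loc[1, "Age"]      # now label 1 is also position 1
print(after_reset)
```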
3. Exploratory data analysis
3.1 Sort sample data with pandas in ascending order
df = pd.DataFrame(np.arange(8).reshape(2,4),index = ['a','b'],columns=['A','B','C','D'])
df
| | A | B | C | D |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
df.sort_values(by = ['B'],ascending=True)
| | A | B | C | D |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
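Sorting summary
- Sort the row index in ascending order: df.sort_index()
- Sort the column index in ascending order: df.sort_index(axis=1)
- Sort the column index in descending order: df.sort_index(axis=1, ascending=False)
- Sort by any two columns: df.sort_values(by = [])
3.2 Sort train_data by fare and age in descending order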
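The sample frame in 3.1 is already in order, so sorting it changes nothing visibly. The sketch below applies the calls from the summary to a deliberately shuffled toy frame (an assumption for illustration) and also passes a list to ascending so that each sort column gets its own direction.

```python
import numpy as np
import pandas as pd

# Toy frame with row and column labels out of order.
toy = pd.DataFrame(np.arange(8).reshape(2, 4), index=["b", "a"], columns=["D", "A", "C", "B"])

rows_sorted = toy.sort_index()                        # row labels ascending: a, b
cols_desc = toy.sort_index(axis=1, ascending=False)   # column labels descending: D, C, B, A
# Sort by A descending, then by B ascending to break ties.
by_values = toy.sort_values(by=["A", "B"], ascending=[False, True])

print(list(rows_sorted.index), list(cols_desc.columns), list(by_values.index))
```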
train_df = pd.read_csv(data_path)
train_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
sorted_df = train_df.sort_values(by=['Fare', 'Age'], ascending=False)
pd.concat([sorted_df.head(), sorted_df.tail()])
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
| 258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
| 737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
| 438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 481 | 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0000 | NaN | S |
| 633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0000 | NaN | S |
| 674 | 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | NaN | 0 | 0 | 239856 | 0.0000 | NaN | S |
| 732 | 733 | 0 | 2 | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0000 | NaN | S |
| 815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S |
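In this sample, four of the five passengers who paid the highest fares survived, while none of the five passengers with a fare of 0 did. Ten rows only hint at a link between fare and survival; they are not enough to estimate a survival rate.
3.3 Adding two DataFrames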
frame1 = pd.DataFrame(np.arange(100).reshape(10,10),index= list(range(10)),columns=list(range(2010,2020)))
frame2 = pd.DataFrame(np.arange(100,200).reshape(10,10),index= list(range(10)),columns=list(range(2010,2020)))
frame1+frame2
| | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 102 | 104 | 106 | 108 | 110 | 112 | 114 | 116 | 118 |
| 1 | 120 | 122 | 124 | 126 | 128 | 130 | 132 | 134 | 136 | 138 |
| 2 | 140 | 142 | 144 | 146 | 148 | 150 | 152 | 154 | 156 | 158 |
| 3 | 160 | 162 | 164 | 166 | 168 | 170 | 172 | 174 | 176 | 178 |
| 4 | 180 | 182 | 184 | 186 | 188 | 190 | 192 | 194 | 196 | 198 |
| 5 | 200 | 202 | 204 | 206 | 208 | 210 | 212 | 214 | 216 | 218 |
| 6 | 220 | 222 | 224 | 226 | 228 | 230 | 232 | 234 | 236 | 238 |
| 7 | 240 | 242 | 244 | 246 | 248 | 250 | 252 | 254 | 256 | 258 |
| 8 | 260 | 262 | 264 | 266 | 268 | 270 | 272 | 274 | 276 | 278 |
| 9 | 280 | 282 | 284 | 286 | 288 | 290 | 292 | 294 | 296 | 298 |
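The two frames above share exactly the same row and column labels, so the sum is purely elementwise. pandas aligns on labels before adding, which matters once the labels differ. The sketch below uses two small toy frames with partially overlapping labels (assumed values) to show the NaN that results, and DataFrame.add with fill_value as one way to treat a value missing from one side as 0.

```python
import pandas as pd

# Toy frames that overlap only in row 1 and column "2011".
f1 = pd.DataFrame({"2010": [1, 2], "2011": [3, 4]}, index=[0, 1])
f2 = pd.DataFrame({"2011": [10, 20], "2012": [30, 40]}, index=[1, 2])

plain = f1 + f2                     # a cell missing from either frame becomes NaN
filled = f1.add(f2, fill_value=0)   # a cell missing from one frame is treated as 0

print(plain)
print(filled)
```

Cells that are missing from both frames stay NaN even with fill_value.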
3.4 How many people are in the largest family on board?
train_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train_df['family'] = train_df['SibSp']+train_df['Parch']
max(train_df['family'])
10
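One caveat about the result: SibSp + Parch counts a passenger's relatives on board, not the passenger, so a value of 10 corresponds to a family of 11 if the passenger is included. A minimal sketch on a toy frame (assumed values, chosen so the maximum matches the output above):

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 8], "Parch": [0, 0, 2]})  # toy values

relatives = toy["SibSp"] + toy["Parch"]   # relatives travelling with each passenger
family_size = relatives + 1               # include the passenger
print(relatives.max(), family_size.max())
```

Using the Series method .max() rather than the built-in max() is the more idiomatic choice in pandas.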
3.5 View basic statistics of train_data
del train_df['family']
train_df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |