Hands-On Data Analysis: Task01, Data Loading

Preface

This course originated at Datawhale, and it works best alongside the other resources Datawhale provides. Project repository: https://github.com/datawhalechina/hands-on-data-analysis
Dataset download: https://www.kaggle.com/c/titanic/overview

import os
os.getcwd()

Tip: if loading from a relative path fails, call os.getcwd() to check the current working directory.

'C:\\Users\\lyj\\Desktop\\pyproject\\动手学数据分析_datawhale\\第一章_数据加载'
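One way to act on this tip is to print where a relative path actually resolves. The sketch below assumes the folder layout implied by the '../titanic/train.csv' path used later in this post; the variable names are only illustrative.

```python
from pathlib import Path

# Assumed layout: the notebook sits in a chapter folder and the data
# lives in a sibling "titanic" folder one level up.
data_path = Path.cwd() / ".." / "titanic" / "train.csv"

# resolve() collapses the ".." so the absolute path is easy to read.
resolved = data_path.resolve()

print("working directory:", Path.cwd())
print("looking for data at:", resolved)
print("file exists:", resolved.exists())
```

If the last line prints False, either the working directory or the relative path is not what you expected.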

1. Loading data

1.1 Loading the data

1.1.1 Importing the libraries

import numpy as np
import pandas as pd
df = pd.read_csv('../titanic/train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

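Questions to think about:

  • Now that you know how to load data, how do pd.read_csv() and pd.read_table() differ, and what would you change to make them behave the same?
  • What is the difference between '.tsv' and '.csv' files, and how do you load each?
  • How do you read different data formats, e.g. csv, tsv and xlsx?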


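Both read_csv and read_table load delimited text, splitting each line into fields at the delimiter. They differ only in the default delimiter: read_csv uses a comma, while read_table uses a tab (\t).
Reading a comma-separated file with read_table's default therefore finds nothing to split on, and each line ends up in a single column.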

  • Try read_table and see the result
# the default separator is '\t'
table_df = pd.read_table('../titanic/train.csv')
table_df.head()
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/...
1 2,1,1,"Cumings, Mrs. John Bradley (Florence Br...
2 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,S...
3 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May ...
4 5,0,3,"Allen, Mr. William Henry",male,35,0,0,3...
table_df2 = pd.read_table('../titanic/train.csv',sep = ',')
table_df2.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
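The same comparison can be reproduced without the data file. This is a small sketch using an in-memory string as a stand-in for train.csv (only three of the real columns, two of the real rows); the variable names are made up for the example.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv; the quoted names contain commas on purpose.
csv_text = (
    "PassengerId,Survived,Name\n"
    '1,0,"Braund, Mr. Owen Harris"\n'
    '3,1,"Heikkinen, Miss. Laina"\n'
)

by_csv = pd.read_csv(io.StringIO(csv_text))                      # sep=','
by_table_default = pd.read_table(io.StringIO(csv_text))          # sep='\t'
by_table_comma = pd.read_table(io.StringIO(csv_text), sep=",")   # override

# With the tab default nothing is split, so there is a single column.
print(by_table_default.shape)
# With sep=',' read_table gives the same frame as read_csv.
print(by_csv.equals(by_table_comma))
```

So the two functions can be made to behave the same simply by passing the same sep to both.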

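Differences between TSV and CSV:

  • (1) As the names suggest, TSV uses a tab ('\t') as the field separator, while CSV uses an ASCII comma (',').
  • (2) The standard TSV format registered with IANA does not allow tab characters inside field values.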
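To make the difference concrete, here is a minimal sketch that reads a small TSV string and writes the same data back out as CSV. The two-column sample is a cut-down stand-in for the real file, not the dataset itself.

```python
import io
import pandas as pd

# Tab-separated text: the commas inside the names need no quoting here.
tsv_text = (
    "Name\tEmbarked\n"
    "Braund, Mr. Owen Harris\tS\n"
    "Heikkinen, Miss. Laina\tS\n"
)

from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Written as CSV, the names must be quoted because they contain commas.
csv_text = from_tsv.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(from_tsv.equals(from_csv))
```

For .xlsx files pandas provides pd.read_excel, which needs an Excel engine such as openpyxl installed; it is left out of the sketch so that the example runs with pandas alone.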

For reading xlsx files, see this blog post: https://blog.csdn.net/sinat_28576553/article/details/81275650

1.1.2 Reading the data chunk by chunk, 1000 rows per chunk

chunker = pd.read_csv('../titanic/train.csv',chunksize = 1000)
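  • Q: What is chunked reading, and why use it?
  • A: Passing chunksize makes read_csv load the file in pieces of chunksize rows each instead of all at once, which keeps memory use bounded for files too large to load whole. The call returns an iterator object (TextFileReader in current pandas; older material calls it TextParser) rather than a DataFrame.

    Iterating over that object yields one DataFrame per chunk, so per-chunk statistics can be computed and then merged. train.csv has only 891 rows, so chunksize = 1000 produces a single chunk here.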
chunker

for piece in chunker:
    print(piece.head())
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
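Since the real file fits in a single chunk, the merging step is easier to see on a toy input. The sketch below uses a made-up five-row sample and a deliberately tiny chunksize; the column names follow the dataset, everything else is illustrative.

```python
import io
from collections import Counter

import pandas as pd

# Made-up sample; with the real file you would pass the path instead.
csv_text = "Sex,Survived\nmale,0\nfemale,1\nfemale,1\nmale,1\nmale,0\n"

counts = Counter()
n_chunks = 0

# chunksize=2 yields DataFrames of 2, 2 and 1 rows.
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    n_chunks += 1
    # Compute a per-chunk statistic, then merge it into the running total.
    counts.update(piece["Sex"].value_counts().to_dict())

print(n_chunks)
print(dict(counts))
```

Only one chunk is held in memory at a time, which is the point of reading this way.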

1.1.3 Renaming the headers to Chinese and using the passenger ID as the index

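PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 仓位等级 (ticket class: 1st/2nd/3rd)
Name => 姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 兄弟姐妹个数 (number of siblings/spouses aboard)
Parch => 父母子女个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)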

data_path = '../titanic/train.csv'
df = pd.read_csv(data_path,names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)
df.head()
是否幸存 仓位等级 姓名 性别 年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 客舱 登船港口
乘客ID
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
  • The above renames the columns while reading. Are there other ways?
  • ① Rename all columns: df.columns = ['a1', 'b1', 'c1', 'd1']
  • ② Rename some columns: df.rename(columns={'a': 'A'})
old_df = pd.read_csv(data_path)
old_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
old_df.columns = ['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口']
old_df.head()
乘客ID 是否幸存 仓位等级 姓名 性别 年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 客舱 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Renaming some columns with rename returns a new DataFrame. To change the original df, pass inplace = True or assign the result back.

df1 = pd.read_csv(data_path)
df1.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df1.rename( columns={'PassengerId': '乘客id'}).head()
乘客id Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
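The difference between "returns a new DataFrame" and "changes the original" is easy to check on a two-row stand-in frame. This is only a sketch; the frame and the variable names are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# rename returns a new DataFrame and leaves df1 alone.
renamed = df1.rename(columns={"PassengerId": "乘客id"})
unchanged = list(df1.columns)
print(unchanged)               # still the original names
print(list(renamed.columns))

# Two ways to make the change stick: reassign, or rename in place.
df1 = df1.rename(columns={"PassengerId": "乘客id"})
df1.rename(columns={"Survived": "是否幸存"}, inplace=True)
print(list(df1.columns))
```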

1.2 Inspecting the data

1.2.1 Viewing basic information

df = pd.read_csv(data_path,names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],header = 0)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
乘客ID      891 non-null int64
是否幸存      891 non-null int64
仓位等级      891 non-null int64
姓名        891 non-null object
性别        891 non-null object
年龄        714 non-null float64
兄弟姐妹个数    891 non-null int64
父母子女个数    891 non-null int64
船票信息      891 non-null object
票价        891 non-null float64
客舱        204 non-null object
登船港口      889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

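info() shows the dtype and non-null count of each column, which also reveals the missing values: 年龄, 客舱 and 登船港口 have fewer than 891 non-null entries.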

1.2.2 Viewing the first 10 and the last 15 rows

df.head(10)
乘客ID 是否幸存 仓位等级 姓名 性别 年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 客舱 登船港口
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
df.tail(15)
乘客ID 是否幸存 仓位等级 姓名 性别 年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 客舱 登船港口
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

1.2.3 Checking for missing values: True where a value is missing, False elsewhere

df.isnull().head()
乘客ID 是否幸存 仓位等级 姓名 性别 年龄 兄弟姐妹个数 父母子女个数 船票信息 票价 客舱 登船港口
0 False False False False False False False False False False True False
1 False False False False False False False False False False False False
2 False False False False False False False False False False True False
3 False False False False False False False False False False False False
4 False False False False False False False False False False True False
df.isnull().sum()
乘客ID        0
是否幸存        0
仓位等级        0
姓名          0
性别          0
年龄        177
兄弟姐妹个数      0
父母子女个数      0
船票信息        0
票价          0
客舱        687
登船港口        2
dtype: int64

As shown above, the 客舱 (Cabin) column has by far the most missing values: 687 of 891, more than 3/4 of the rows.
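The share of missing values can also be read off directly, because the mean of a True/False column is the fraction of True. A small sketch on a made-up four-row frame, followed by the arithmetic for the counts reported above:

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing age, three missing cabins.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

missing_count = df.isnull().sum()    # number of missing values per column
missing_ratio = df.isnull().mean()   # fraction of missing values per column

print(missing_count)
print(missing_ratio)

# The counts from the output above: 687 missing cabins out of 891 rows.
cabin_share = round(687 / 891, 3)
print(cabin_share)
```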

1.2.4 Saving the modified data

df.to_csv('train_chinese.csv')
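One thing worth knowing about to_csv is that it writes the row index by default, which shows up as an extra unnamed column when the file is read back. The sketch below demonstrates this on a made-up two-row frame and writes to strings rather than to disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"乘客ID": [1, 2], "是否幸存": [0, 1]})

with_index = df.to_csv()                 # default: the 0..n-1 index is written too
without_index = df.to_csv(index=False)   # only the real columns

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

Whether to keep the index depends on the data: if the index is the passenger ID it is worth saving, if it is just the default 0, 1, 2, ... then index=False keeps the file cleaner. If the Chinese headers look garbled when the file is opened in Excel, passing encoding='utf-8-sig' to to_csv may help.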

2. pandas basics

2.1 Getting to know Series and DataFrame

sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
example = pd.Series(sdata)
example
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
data = {'state':list(range(10)),'year':list(range(2010,2020)),'pop':[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,0.0]}
df = pd.DataFrame(data)
df
pop state year
0 1.1 0 2010
1 2.2 1 2011
2 3.3 2 2012
3 4.4 3 2013
4 5.5 4 2014
5 6.6 5 2015
6 7.7 6 2016
7 8.8 7 2017
8 9.9 8 2018
9 0.0 9 2019
df['pop']
0    1.1
1    2.2
2    3.3
3    4.4
4    5.5
5    6.6
6    7.7
7    8.8
8    9.9
9    0.0
Name: pop, dtype: float64

Each column of a DataFrame is in fact a Series.

2.2 Viewing the data

df = pd.read_csv(data_path)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
test_data_path = '../titanic/test.csv'
test_df = pd.read_csv(test_data_path)
test_df.head()
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
test_df.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

2.3 Deleting columns

  • ①del df[]
  • ②df.drop([],axis = 1)
del df['Name']
df.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
df.drop(['Age'],axis = 1).head(2)  # returns a copy; the column is only really removed with inplace = True or by reassigning
PassengerId Survived Pclass Sex SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 male 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 female 1 0 PC 17599 71.2833 C85 C
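The two deletion styles behave differently: del changes the frame immediately, while drop hands back a new frame. A minimal sketch on a made-up three-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund", "Cumings"],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.2833],
})

# drop returns a new frame; df itself still has the Age column.
without_age = df.drop(["Age"], axis=1)
age_still_there = "Age" in df.columns
print(age_still_there)

# del removes the column from df right away.
del df["Name"]

# To make drop stick, assign the result back.
df = df.drop(columns=["Age"])
print(list(df.columns))
```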

2.4 Filtering logic

2.4.1 Using 'Age' as the filter, show the passengers younger than 10

df[df['Age']<10].head(2)
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
7 8 0 3 male 2.0 3 1 349909 21.075 NaN S
10 11 1 3 female 4.0 1 1 PP 9549 16.700 G6 S

2.4.2 Using 'Age' as the condition, show the passengers older than 10 and younger than 50, and name the result midage

midage = df[(df['Age']>10)&(df['Age']<50)]
midage.head(2)
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
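Two details of this filter are easy to miss: each comparison needs its own parentheses before combining with &, and rows with a missing age fail both comparisons, so they are dropped. A sketch on a made-up Age column that includes the boundary values and one NaN:

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"Age": [2.0, 10.0, 22.0, np.nan, 50.0, 64.0]})

# Parentheses are required: & binds more tightly than > and <.
mask = (ages["Age"] > 10) & (ages["Age"] < 50)

midage = ages[mask]    # strictly between 10 and 50
rest = ages[~mask]     # everything else, including the NaN row

print(midage)
print(rest)
```

Only the 22.0 row passes: 10 and 50 are excluded because the comparisons are strict, and NaN compares as False either way.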

2.4.3 Show the 'Pclass' and 'Sex' values of row 100 in midage

midage
# filtering kept the original row labels, so reset the index first
midage = midage.reset_index(drop = True)
midage.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
midage.loc[[100],['Pclass','Sex']]
Pclass Sex
100 2 male

Wrapping the row label in [] returns a DataFrame; a bare label returns a Series.

midage.loc[100,['Pclass','Sex']]
Pclass       2
Sex       male
Name: 100, dtype: object

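2.4.4 Show the 'Pclass', 'Sex' and 'Name' values of rows 100, 105 and 108 in midage

midage.loc[[100,105,108],['Pclass','Sex','Name']]  # Name was deleted above, so it comes back as NaN here; recent pandas versions raise KeyError instead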
Pclass Sex Name
100 2 male NaN
105 3 male NaN
108 3 male NaN

2.4.5 Show the 'Pclass' and 'Sex' values of rows 100, 105 and 108 in midage, using iloc

midage.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
midage.iloc[[100,105,108],[2,3]]
Pclass Sex
100 2 male
105 3 male
108 3 male
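loc and iloc give the same rows above only because the index was reset. The sketch below, on a made-up four-row frame, shows how they diverge on a filtered frame: loc looks up row labels, iloc counts positions.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 5.0, 26.0, 35.0],
})

# Filtering drops the row labelled 1, leaving labels 0, 2, 3.
midage = df[df["Age"] > 10]

by_label = midage.loc[2, "Pclass"]   # the row whose label is 2
by_position = midage.iloc[2, 0]      # the third remaining row (label 3), column 0

# After resetting, labels are 0, 1, 2 and agree with positions again.
midage = midage.reset_index(drop=True)
after_reset = midage.loc[2, "Pclass"]

print(by_label, by_position, after_reset)
```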

3. Exploratory data analysis

3.1 Sorting the sample data with pandas in ascending order

df = pd.DataFrame(np.arange(8).reshape(2,4),index = ['a','b'],columns=['A','B','C','D'])
df
A B C D
a 0 1 2 3
b 4 5 6 7
df.sort_values(by = ['B'],ascending=True)
A B C D
a 0 1 2 3
b 4 5 6 7

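Sorting summary

  • Sort the row index in ascending order: df.sort_index()
  • Sort the column index in ascending order: df.sort_index(axis = 1)
  • Sort the column index in descending order: df.sort_index(axis = 1,ascending = False)
  • Sort by any two columns: df.sort_values(by = []), with the column names listed in by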
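The sample frame in 3.1 is already sorted, so the calls appear to do nothing. Here is a sketch of the same operations on a deliberately shuffled frame; the labels are arbitrary.

```python
import numpy as np
import pandas as pd

# Rows and columns are out of order on purpose.
df = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=["b", "a"],
    columns=["D", "B", "A", "C"],
)

rows_sorted = df.sort_index()                        # row labels ascending
cols_asc = df.sort_index(axis=1)                     # column labels ascending
cols_desc = df.sort_index(axis=1, ascending=False)   # column labels descending

# Sort by two columns; ascending can be given per column.
by_values = df.sort_values(by=["B", "A"], ascending=[False, True])

print(list(rows_sorted.index))
print(list(cols_asc.columns))
print(list(cols_desc.columns))
print(list(by_values.index))
```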

3.2 Sorting train_data by fare and age in descending order

train_df = pd.read_csv(data_path)
train_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# DataFrame.append was removed in pandas 2.0, so join the head and tail with pd.concat
sorted_df = train_df.sort_values(by = ['Fare','Age'],ascending = False)
pd.concat([sorted_df.head(),sorted_df.tail()])
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
679 680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C
258 259 1 1 Ward, Miss. Anna female 35.0 0 0 PC 17755 512.3292 NaN C
737 738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C
438 439 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S
341 342 1 1 Fortune, Miss. Alice Elizabeth female 24.0 3 2 19950 263.0000 C23 C25 C27 S
481 482 0 2 Frost, Mr. Anthony Wood "Archie" male NaN 0 0 239854 0.0000 NaN S
633 634 0 1 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0000 NaN S
674 675 0 2 Watson, Mr. Ennis Hastings male NaN 0 0 239856 0.0000 NaN S
732 733 0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0000 NaN S
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S

In this small sample, four of the five passengers with the highest fares survived, while none of the five with a fare of 0 did.

3.3 Adding two DataFrames

frame1 = pd.DataFrame(np.arange(100).reshape(10,10),index= list(range(10)),columns=list(range(2010,2020)))
frame2 = pd.DataFrame(np.arange(100,200).reshape(10,10),index= list(range(10)),columns=list(range(2010,2020)))
frame1+frame2
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 100 102 104 106 108 110 112 114 116 118
1 120 122 124 126 128 130 132 134 136 138
2 140 142 144 146 148 150 152 154 156 158
3 160 162 164 166 168 170 172 174 176 178
4 180 182 184 186 188 190 192 194 196 198
5 200 202 204 206 208 210 212 214 216 218
6 220 222 224 226 228 230 232 234 236 238
7 240 242 244 246 248 250 252 254 256 258
8 260 262 264 266 268 270 272 274 276 278
9 280 282 284 286 288 290 292 294 296 298
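The two frames above share exactly the same row and column labels, so the sum is element by element. When the labels only partly overlap, pandas aligns on them first. A sketch with two small made-up frames:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                      index=[0, 1], columns=[2010, 2011])
frame2 = pd.DataFrame(np.arange(10, 14).reshape(2, 2),
                      index=[1, 2], columns=[2011, 2012])

# + aligns on labels; cells present in only one frame become NaN.
plain = frame1 + frame2

# add(..., fill_value=0) treats a cell missing from one frame as 0;
# cells missing from both frames stay NaN.
filled = frame1.add(frame2, fill_value=0)

print(plain)
print(filled)
```

Only the cell at row 1, column 2011 exists in both frames, so it is the only non-NaN value in the plain sum.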

3.4 How many people are in the largest family on board?

train_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train_df['family'] = train_df['SibSp']+train_df['Parch']
max(train_df['family'])
10
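Note that SibSp + Parch counts a passenger's relatives, not the passenger. Whether the answer should be 10 or 11 depends on how "family size" is read; the sketch below shows both on a made-up three-row frame, which is an interpretation rather than part of the original exercise.

```python
import pandas as pd

# Made-up rows; the last one mirrors the largest values in the dataset.
df = pd.DataFrame({"SibSp": [1, 0, 8], "Parch": [0, 0, 2]})

df["family"] = df["SibSp"] + df["Parch"]

# Series.max() does the same job as the built-in max() used above.
most_relatives = df["family"].max()

# Counting the passenger as well gives the head count of the family.
largest_family = most_relatives + 1

print(most_relatives, largest_family)
```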

3.5 Viewing basic statistics of train_data

del train_df['family']
train_df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
