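Import the Titanic dataset from Kaggle and work through the full data analysis workflow hands-on.
Project: 动手学数据分析 (Hands-on Data Analysis)
Directory and path operations in Python live in the os module; use os.getcwd()
to check the current working directory.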
import pandas as pd
# Read using a relative path
df = pd.read_csv("train.csv")
# Read using an absolute path
df = pd.read_csv(r"C:/Users/user/Desktop/动手学数据分析-组队学习版/动手学数据分析-组队学习版/第一单元项目集合/train.csv")
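What is the difference between pd.read_csv() and pd.read_table()?
Both parse delimited text files, but read_csv() splits on commas by default while read_table() splits on tabs.
Reading a comma-separated file with read_table() therefore puts each whole line into a single column: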
df = pd.read_csv("train.csv")
print(df.values)
df.values.shape
[[1 0 3 ... 7.25 nan 'S']
[2 1 1 ... 71.2833 'C85' 'C']
[3 1 3 ... 7.925 nan 'S']
...
[889 0 3 ... 23.45 nan 'S']
[890 1 1 ... 30.0 'C148' 'C']
[891 0 3 ... 7.75 nan 'Q']]
(891, 12)
df = pd.read_table("train.csv")
# print(df.values)
df.values.shape
(891, 1)
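The shapes above follow from the default separators. A minimal sketch, assuming a tiny in-memory sample (csv_text is made-up data standing in for train.csv), shows that passing sep="," to read_table gives the same result as read_csv:

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for train.csv.
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# read_csv splits on commas by default.
by_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default, so each line becomes one field.
by_table = pd.read_table(io.StringIO(csv_text))

# With sep="," read_table parses the text the same way read_csv does.
by_table_sep = pd.read_table(io.StringIO(csv_text), sep=",")

print(by_csv.shape)        # (2, 3)
print(by_table.shape)      # (2, 1)
print(by_table_sep.shape)  # (2, 3)
```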
Use the chunksize parameter of read_csv to set the number of rows in each chunk.
df = pd.read_csv("train.csv", chunksize=1000)
# write code here
df = pd.read_csv("train.csv", chunksize=500)
for temp in df:
    print(temp)
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
495 496 0 3
496 497 1 1
497 498 0 3
498 499 0 1
499 500 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. ... ... ... ...
495 Yousseff, Mr. Gerious male NaN 0
496 Eustis, Miss. Elizabeth Mussey female 54.0 1
497 Shellard, Mr. Frederick William male NaN 0
498 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 1
499 Svensson, Mr. Olof male 24.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
495 0 2627 14.4583 NaN C
496 0 36947 78.2667 D20 C
497 0 C.A. 6212 15.1000 NaN S
498 2 113781 151.5500 C22 C26 S
499 0 350035 7.7958 NaN S
[500 rows x 12 columns]
PassengerId Survived Pclass Name \
500 501 0 3 Calic, Mr. Petar
501 502 0 3 Canavan, Miss. Mary
502 503 0 3 O'Sullivan, Miss. Bridget Mary
503 504 0 3 Laitinen, Miss. Kristina Sofia
504 505 1 1 Maioni, Miss. Roberta
.. ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas
887 888 1 1 Graham, Miss. Margaret Edith
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
889 890 1 1 Behr, Mr. Karl Howell
890 891 0 3 Dooley, Mr. Patrick
Sex Age SibSp Parch Ticket Fare Cabin Embarked
500 male 17.0 0 0 315086 8.6625 NaN S
501 female 21.0 0 0 364846 7.7500 NaN Q
502 female NaN 0 0 330909 7.6292 NaN Q
503 female 37.0 0 0 4135 9.5875 NaN S
504 female 16.0 0 0 110152 86.5000 B79 S
.. ... ... ... ... ... ... ... ...
886 male 27.0 0 0 211536 13.0000 NaN S
887 female 19.0 0 0 112053 30.0000 B42 S
888 female NaN 1 2 W./C. 6607 23.4500 NaN S
889 male 26.0 0 0 111369 30.0000 C148 C
890 male 32.0 0 0 370376 7.7500 NaN Q
[391 rows x 12 columns]
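Splitting the dataset lets us read it chunk by chunk.
The chunksize parameter of read_csv sets the chunk size; the call then returns an iterable object instead of a single DataFrame.
Why read in chunks:
A DataFrame is a heavyweight data structure. When a DataFrame is large enough to take up a lot of memory and we also need to run complex operations on it (anything beyond O(1)), the memory pressure can slow processing down dramatically.
Reference: pandas性能提升之利用chunksize参数对大数据分块处理 (improving pandas performance by processing large data in chunks with the chunksize parameter)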
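Chunked reading pays off when each chunk is reduced to a small result before the next one is loaded. The sketch below assumes a made-up ten-row CSV held in memory (csv_text and its id and fare columns are illustrative only) and accumulates a mean without ever holding all rows at once:

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 10 rows of (id, fare).
csv_text = "id,fare\n" + "\n".join(f"{i},{i * 1.5}" for i in range(1, 11))

chunk_sizes = []
total_rows = 0
total_fare = 0.0

# With chunksize set, read_csv yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)
    total_fare += chunk["fare"].sum()

mean_fare = total_fare / total_rows

print(chunk_sizes)  # [4, 4, 2]
print(mean_fare)    # 8.25
```

Only the running totals survive between iterations, so peak memory depends on the chunk size rather than on the file size.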
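Changing the column headers to Chinese
Change the headers by assigning to the DataFrame's columns and index attributes.
Alternatively, use the rename function: its columns parameter renames column headers and its index parameter renames index labels. Both take a dict.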
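Original column name | Chinese column name | Meaning |
---|---|---|
PassengerId | 乘客ID | Passenger ID |
Survived | 是否幸存 | Survived (0 = no, 1 = yes) |
Pclass | 乘客等级(1/2/3等舱位) | Passenger class (1st/2nd/3rd) |
Name | 乘客姓名 | Passenger name |
Sex | 性别 | Sex |
Age | 年龄 | Age |
SibSp | 堂兄弟/妹个数 | Number of siblings/spouses aboard |
Parch | 父母与小孩个数 | Number of parents/children aboard |
Ticket | 船票信息 | Ticket number |
Fare | 票价 | Fare |
Cabin | 客舱 | Cabin |
Embarked | 登船港口 | Port of embarkation |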
Modifying columns and resetting the index
# write code here
titles = ["乘客ID", "是否幸存", "乘客等级(1/2/3等舱位)", "乘客姓名", "性别", "年龄", "堂兄弟/妹个数", "父母与小孩个数", "船票信息", "票价", "客舱", "登船港口"]
index_name = "乘客ID"
df = pd.read_csv("train.csv")
df.columns = titles
df = df.set_index(index_name)
df.head()
是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|
乘客ID | |||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Renaming columns with the rename() function
titles = {"PassengerId":"乘客ID", "Survived":"是否幸存", "Pclass":"乘客等级(1/2/3等舱位)", "Name":"姓名", "Sex":"性别", "Age":"年龄",
          "SibSp":"堂兄弟/妹个数", "Parch":"父母与小孩个数", "Ticket":"船票信息", "Fare":"票价", "Cabin":"客舱", "Embarked":"登船港口"}
df = pd.read_csv("train.csv")
df.rename(columns=titles, inplace=True)
df.set_index("乘客ID").head()
是否幸存 | 乘客等级(1/2/3等舱位) | 姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|
乘客ID | |||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
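To make the dict-based behavior of rename() concrete, here is a minimal sketch on a made-up two-row frame (small and the label "first" are illustrative, not part of the dataset). Labels missing from the dict are left alone, and without inplace=True the original frame is not modified:

```python
import pandas as pd

# Hypothetical two-row frame reusing two Titanic column names.
small = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Each dict maps an old label to a new one; unmapped labels stay as they are.
renamed = small.rename(
    columns={"Survived": "是否幸存"},
    index={0: "first"},
)

print(list(renamed.columns))  # ['PassengerId', '是否幸存']
print(list(renamed.index))    # ['first', 1]
print(list(small.columns))    # ['PassengerId', 'Survived']
```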
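Setting headers while reading
The header parameter sets the position of the header row, and index_col chooses the index column. With header=0, the list passed as names replaces the header row found in the file.
# header=0: the first line of the file is the header, replaced here by names
df = pd.read_csv('train.csv',
                 names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],
                 index_col='乘客ID', header=0)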
df.head()
是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|
乘客ID | |||||||||||
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
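The role of header=0 is easiest to see by leaving it out. In this sketch (csv_text and the short names id and survived are made up for illustration), passing names without header makes pandas treat the file as having no header row, so the original header line ends up as a data row:

```python
import io
import pandas as pd

# Hypothetical two-column sample with a header line.
csv_text = "PassengerId,Survived\n1,0\n2,1\n"

# header=0 marks the first line as a header; names then replaces it.
replaced = pd.read_csv(io.StringIO(csv_text), names=["id", "survived"],
                       header=0, index_col="id")

# Without header=0, the first line is read as ordinary data.
kept = pd.read_csv(io.StringIO(csv_text), names=["id", "survived"])

print(replaced.shape)         # (2, 1)
print(kept.shape)             # (3, 2)
print(kept.iloc[0].tolist())  # ['PassengerId', 'Survived']
```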
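Use the info() function to inspect the overall structure of the data.
df.info(): # print a summary
df.describe(): # descriptive statistics
df.values: # the data as an array
df.shape: # shape (rows, columns)
df.columns: # column labels (Index)
df.columns.values: # column labels (array)
df.index: # row labels (Index)
df.index.values: # row labels (array)
df.head(n): # first n rows
df.tail(n): # last n rows
pd.options.display.max_columns=n: # display at most n columns
pd.options.display.max_rows=n: # display at most n rows
df.memory_usage(): # memory usage in bytes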
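# write code here
# df.isnull()
df.isna()
# A clearer view: count of missing values per column
df.isna().sum()
# Another way to see missing values
# df.info()
# Select the rows that have no missing values
df[df.notna().all(axis=1)]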
是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|
乘客ID | |||||||||||
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S |
873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
183 rows × 11 columns
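The last expression keeps complete rows; its complement finds the rows that do contain a missing value. A small sketch on made-up data (toy is a three-row frame invented for this example) shows both selections next to the per-column counts:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 0 and 1 each miss one value, row 2 is complete.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", "C123"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows with at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]

# Rows with no missing values; dropna() selects the same rows.
complete_rows = toy[toy.notna().all(axis=1)]

print(missing_per_column.to_dict())      # {'Age': 1, 'Cabin': 1}
print(rows_with_missing.index.tolist())  # [0, 1]
print(complete_rows.index.tolist())      # [2]
```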
Use the to_csv() function to save the DataFrame as a CSV file.
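A minimal round-trip sketch, assuming a temporary directory and a made-up file name (train_chinese.csv is hypothetical). The utf-8-sig encoding is one common choice for files with Chinese headers, since the byte-order mark helps some spreadsheet tools detect the encoding:

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with Chinese headers and a named index.
toy = pd.DataFrame({"乘客ID": [1, 2], "是否幸存": [0, 1]}).set_index("乘客ID")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "train_chinese.csv")  # made-up name
    # The index is written as the first column by default.
    toy.to_csv(out_path, encoding="utf-8-sig")
    restored = pd.read_csv(out_path, encoding="utf-8-sig", index_col="乘客ID")

print(restored.equals(toy))
```

Passing index=False to to_csv() would leave the index out of the file, which is useful when the index is just the default row numbers.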
pandas is a module built on NumPy. Its two main data structures are the Series
(one-dimensional data) and the DataFrame
(two-dimensional data).
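As a quick illustration of how the two structures relate, the sketch below builds a made-up Series and DataFrame (the values are invented) and shows that a single DataFrame column is itself a Series, while the underlying data is a NumPy array:

```python
import pandas as pd

# Hypothetical one-dimensional data with a name.
ages = pd.Series([22.0, 38.0, 26.0], name="Age")

# A DataFrame is a table whose columns are Series sharing one index.
people = pd.DataFrame({"Age": ages, "Sex": ["male", "female", "female"]})

print(ages.ndim, people.ndim)            # 1 2
print(type(people["Age"]).__name__)      # Series
print(type(people.to_numpy()).__name__)  # ndarray
```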
# write code here
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
Lookup with [] indexing
# write code here
df['Cabin'].head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
Label-based lookup with loc
# write code here
df.loc[:, "Cabin"].head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
Attribute lookup with the . operator
# Not recommended: this approach is error-prone
df.Cabin.head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
Position-based lookup with iloc
df.iloc[:, -2].head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
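The four lookups return the same column, but attribute access breaks down when a column name collides with an existing DataFrame attribute or method. The following sketch uses a made-up frame with a column deliberately named count to show the pitfall:

```python
import pandas as pd

# Hypothetical frame; "count" clashes with the DataFrame.count method.
toy = pd.DataFrame({"Fare": [7.25, 71.28], "Cabin": [None, "C85"], "count": [1, 2]})

by_brackets = toy["Cabin"]     # [] indexing
by_loc = toy.loc[:, "Cabin"]   # label-based
by_iloc = toy.iloc[:, 1]       # position-based
by_attr = toy.Cabin            # attribute access

# toy.count is the method, not the column, so attribute access fails silently here.
print(callable(toy.count))     # True
print(toy["count"].tolist())   # [1, 2]
```

Attribute access also cannot reach column names that are not valid Python identifiers, such as names containing spaces.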
On inspection, the test set test_1.csv has two extra columns (['Unnamed: 0', ' a']) that need to be removed.
Read the data
# write code here
df_test = pd.read_csv("test_1.csv")
df_test.head()
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 100 |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 100 |
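Method 1: use the drop() function to delete the columns that the two DataFrames do not have in common.
Method 2: use append to add the original dataset's columns to the test set's columns, then use the keep parameter of drop_duplicates() to remove every label that appears more than once, which leaves only the extra column names.
Both methods rest on the same idea.
# write code here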
df_test.drop(columns = df_test.columns[~df_test.columns.isin(df.columns)])
# df_test.columns.append(df.columns).drop_duplicates(keep=False)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
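Both ways of finding the extra columns can be checked on a small example. This sketch assumes two made-up one-row frames (reference and with_extras are illustrative names) and confirms that the mask and the append plus drop_duplicates(keep=False) approach agree:

```python
import pandas as pd

# Hypothetical frames: with_extras has two columns that reference lacks.
reference = pd.DataFrame({"PassengerId": [1], "Survived": [0]})
with_extras = pd.DataFrame(
    {"Unnamed: 0": [0], "PassengerId": [1], "Survived": [0], "a": [100]}
)

# Approach 1: boolean mask of columns that are not in the reference.
extra_by_mask = with_extras.columns[~with_extras.columns.isin(reference.columns)]

# Approach 2: concatenate both label sets and drop every repeated label.
combined = with_extras.columns.append(reference.columns)
extra_by_dedup = combined.drop_duplicates(keep=False)

cleaned = with_extras.drop(columns=extra_by_mask)

print(list(extra_by_mask))    # ['Unnamed: 0', 'a']
print(list(extra_by_dedup))   # ['Unnamed: 0', 'a']
print(list(cleaned.columns))  # ['PassengerId', 'Survived']
```

The second approach would also return labels that exist only in the reference frame, so the two agree only when the reference columns are a subset of the other frame's columns, as they are here.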
[Think] Are there other ways to remove the extra columns?
Select labels with [] to get a new DataFrame
# Answer
# Other indexing methods work the same way
df_test[df_test.columns[df_test.columns.isin(df.columns)]]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
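Use loc to select the columns that are not in the given list
Use a list comprehension to select the columns that are not in the given list
# write code here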
# df.loc[:, ~df.columns.isin(['PassengerId','Name','Age','Ticket'])]
df[[x for x in df.columns if x not in ['PassengerId','Name','Age','Ticket']]]
Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |
3 | 1 | 1 | female | 1 | 0 | 53.1000 | C123 | S |
4 | 0 | 3 | male | 0 | 0 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | male | 0 | 0 | 13.0000 | NaN | S |
887 | 1 | 1 | female | 0 | 0 | 30.0000 | B42 | S |
888 | 0 | 3 | female | 1 | 2 | 23.4500 | NaN | S |
889 | 1 | 1 | male | 0 | 0 | 30.0000 | C148 | C |
890 | 0 | 3 | male | 0 | 0 | 7.7500 | NaN | Q |
891 rows × 8 columns
df.drop(columns=['PassengerId','Name','Age','Ticket'])
Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |
3 | 1 | 1 | female | 1 | 0 | 53.1000 | C123 | S |
4 | 0 | 3 | male | 0 | 0 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | male | 0 | 0 | 13.0000 | NaN | S |
887 | 1 | 1 | female | 0 | 0 | 30.0000 | B42 | S |
888 | 0 | 3 | female | 1 | 2 | 23.4500 | NaN | S |
889 | 1 | 1 | male | 0 | 0 | 30.0000 | C148 | C |
890 | 0 | 3 | male | 0 | 0 | 7.7500 | NaN | Q |
891 rows × 8 columns
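One of the most important capabilities when working with tabular data is filtering: keeping the information we need and discarding what we do not.
# write code here
df[df['Age']<10]
# write code here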
midage = df[(df['Age']>10)&(df['Age']<50)]
midage.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
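Two details of the combined condition are worth spelling out: each comparison needs its own parentheses because & binds more tightly than < and >, and rows with a missing Age fail both comparisons and are dropped. The sketch below uses made-up ages; the between() call with inclusive="neither" assumes pandas 1.3 or later:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
toy = pd.DataFrame({"Age": [4.0, 22.0, np.nan, 35.0, 62.0]})

# Parenthesize each comparison before combining with &.
mask = (toy["Age"] > 10) & (toy["Age"] < 50)
midage_toy = toy[mask]

# Equivalent filter with exclusive bounds (assumes pandas >= 1.3).
same_with_between = toy[toy["Age"].between(10, 50, inclusive="neither")]

print(midage_toy.index.tolist())  # [1, 3]
print(mask.tolist())              # [False, True, False, True, False]
```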
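The reference answer seems slightly off here: loc indexes by label, but the task asks for the row at position 100, which calls for a positional index.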
midage.iloc[100:101, [2,4]]
Pclass | Sex | |
---|---|---|
149 | 2 | male |
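reset_index() resets the index to the default 0, 1, 2, ... sequence.
loc uses label-based indexing.
iloc uses position-based indexing.
Without a reset, the labels that loc looks up no longer line up with row positions after filtering.
# write code here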
midage.iloc[[100, 105, 108], 2:5]
Pclass | Name | Sex | |
---|---|---|---|
149 | 2 | Byles, Rev. Thomas Roussel Davids | male |
160 | 3 | Cribb, Mr. John Hatfield | male |
163 | 3 | Calic, Mr. Jovo | male |
midage.reset_index().loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]
Pclass | Name | Sex | |
---|---|---|---|
100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
105 | 3 | Cribb, Mr. John Hatfield | male |
108 | 3 | Calic, Mr. Jovo | male |
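The mismatch between labels and positions after filtering can be reproduced on a few rows. This sketch uses a made-up five-row frame; reset_index(drop=True) is used so that the old labels are discarded instead of being kept as an extra column:

```python
import pandas as pd

# Hypothetical ages with the default labels 0..4.
toy = pd.DataFrame({"Age": [5, 30, 8, 40, 25]})

# Filtering keeps the original labels: 1, 3, 4.
filtered = toy[toy["Age"] > 10]

second_by_position = filtered.iloc[1, 0]   # second remaining row
row_labeled_one = filtered.loc[1, "Age"]   # row whose label is 1

# After the reset, labels and positions agree again.
renumbered = filtered.reset_index(drop=True)
second_after_reset = renumbered.loc[1, "Age"]

print(filtered.index.tolist())  # [1, 3, 4]
print(second_by_position)       # 40
print(row_labeled_one)          # 30
print(second_after_reset)       # 40
```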
# Answer code
frame.sort_values(by='d', ascending=False)
d | a | b | c | |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |
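Another way to sort: by index
Purpose: sort_index() sorts all rows by row label by default, or sorts all columns by column label when axis=1.
Note: older pandas releases let df.sort_index() sort rows by column values through a by parameter, which duplicated df.sort_values(); that parameter has since been removed. Use df.sort_index() to sort by row labels or column labels, and df.sort_values() for everything else.
Signature (main parameters):
sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)
Sorting by multiple columns
[Summary] The different ways to sort are summarized below.
1. Sort the row index in ascending order
# code
frame.sort_index()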
d | a | b | c | |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |
2. Sort the column index in ascending order
# code
frame.sort_index(axis=1)
a | b | c | d | |
---|---|---|---|---|
2 | 1 | 2 | 3 | 0 |
1 | 5 | 6 | 7 | 4 |
3. Sort the column index in descending order
# code
frame.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
2 | 0 | 3 | 2 | 1 |
1 | 4 | 7 | 6 | 5 |
4. Sort by any two columns together in descending order
# code
frame.sort_values(by=['a', 'c'], ascending=False)
d | a | b | c | |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |
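The definition of frame does not appear in this write-up. The sketch below assumes a definition reconstructed from the printed outputs (values 0 to 7, row labels 2 and 1, columns d, a, b, c); it is a guess that reproduces those tables, not necessarily the original notebook cell:

```python
import numpy as np
import pandas as pd

# Assumed definition, inferred from the outputs shown in this section.
frame = pd.DataFrame(
    np.arange(8).reshape((2, 4)),
    index=[2, 1],
    columns=["d", "a", "b", "c"],
)

rows_sorted = frame.sort_index()                        # row labels ascending
cols_sorted = frame.sort_index(axis=1)                  # column labels ascending
cols_desc = frame.sort_index(axis=1, ascending=False)   # column labels descending
by_values = frame.sort_values(by=["a", "c"], ascending=False)

print(rows_sorted.index.tolist())  # [1, 2]
print(list(cols_sorted.columns))   # ['a', 'b', 'c', 'd']
print(list(cols_desc.columns))     # ['d', 'c', 'b', 'a']
print(by_values.index.tolist())    # [1, 2]
```

When sorting by several columns, ascending also accepts a list such as [False, True] to give each column its own direction.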
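# code
df.sort_values(by=['是否幸存', '年龄','票价',"父母子女个数"], ascending=False)
After sorting, passengers with fewer parents/children aboard, higher fares, and higher ages appear to have survived more often.
This may relate to personal circumstances, such as a stronger awareness of risk.
# code
df['家族人数'] = df['兄弟姐妹个数'] + df['父母子女个数']
df['家族人数'].max()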
10
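# code
df.describe()
Conclusions
From the max row: the oldest passenger was 80 years old and the highest fare was 512.3292.
From the percentile rows: most passengers traveled alone.
From the mean row: survivors were a minority, and most passengers traveled in third class and were around thirty years old.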