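Chapter 1 Preliminaries
What this book covers
What kinds of data
- Tabular data
- Multidimensional arrays (matrices)
- Multiple tables of data related by key columns (what SQL users know as primary or foreign keys)
- Evenly or unevenly spaced time series
Why use Python for data analysis
Python as glue
- Part of Python's success in scientific computing comes from how easily it integrates C, C++, and FORTRAN code. Most modern computing environments share a similar set of legacy FORTRAN and C libraries for linear algebra, optimization, integration, fast Fourier transforms, and other such algorithms.
Solving the "two-language" problem
Why not Python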
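Essential Python libraries
NumPy
- NumPy (http://numpy.org), short for Numerical Python, is the cornerstone of numerical computing in Python. It provides the data structures, algorithms, and interfaces needed for most numerical computing work in Python. Among other things, NumPy contains:
- A fast and efficient multidimensional array object, ndarray
- Functions for element-wise computations on arrays and mathematical operations between arrays
- Tools for reading and writing array-based datasets to disk
- Linear algebra operations, Fourier transforms, and random number generation
pandas
- pandas (http://pandas.pydata.org) provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive.
matplotlib
- matplotlib (http://matplotlib.org) is the most popular Python library for producing plots and other two-dimensional data visualizations.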
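IPython and Jupyter
SciPy
- SciPy (http://scipy.org) is a collection of packages addressing a number of standard problem domains in scientific computing. Some of the packages it includes:
- scipy.integrate: numerical integration routines and differential equation solvers
- scipy.linalg: linear algebra routines and matrix decompositions building on numpy.linalg
- scipy.optimize: function optimizers (minimizers) and root-finding algorithms
- scipy.signal: signal processing tools
- scipy.sparse: sparse matrices and sparse linear system solvers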
scikit-learn
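- scikit-learn (http://scikit-learn.org), started in 2010, has become the go-to machine learning toolkit for Python programmers. In just seven years it attracted 1,500 contributors from around the world. It includes submodules for:
- Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
- Regression: Lasso, ridge regression, etc.
- Clustering: k-means, spectral clustering, etc.
- Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
- Model selection: grid search, cross-validation, metrics
- Preprocessing: feature extraction, normalization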
statsmodels
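- statsmodels (http://statsmodels.org) is a statistical analysis package.
- Compared with scikit-learn, statsmodels contains algorithms for classical (primarily frequentist) statistics and econometrics. Its models include:
- Regression models: linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
- Analysis of variance (ANOVA)
- Time series analysis: AR, ARMA, ARIMA, VAR, and other models
- Nonparametric methods: kernel density estimation, kernel regression
- Visualization of statistical model results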
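Installation and setup
Community and conferences
Navigating this book
Jargon
- Munge/munging/wrangling: the overall process of manipulating unstructured or messy data into a structured, clean form
- Pseudocode: a description of an algorithm or process in a code-like form that is not actually valid source code
- Syntactic sugar: programming syntax that adds no new features but makes code more convenient to write
Chapter 2 Python Language Basics, IPython, and Jupyter Notebooks
2.1 The Python interpreter
- Python is an interpreted language. The Python interpreter runs a program by executing one statement at a time.
2.2 IPython basics
2.2.1 Running the IPython shell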
import numpy as np
data = {i : np.random.randn() for i in range(7)}
data
{0: -1.2991373778220445,
1: -0.32334379274285147,
2: -1.5074083105244436,
3: 1.4752955371102277,
4: 0.20873618343396627,
5: -1.2957335373240664,
6: -1.4404604846243687}
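- Unlike the output of a plain print call, most Python objects in IPython are formatted to be more readable (pretty-printed). Printing the data variable with print in the standard Python interpreter is less readable: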
print(data)
{0: -1.2991373778220445, 1: -0.32334379274285147, 2: -1.5074083105244436, 3: 1.4752955371102277, 4: 0.20873618343396627, 5: -1.2957335373240664, 6: -1.4404604846243687}
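2.2.2 Running the Jupyter notebook
- The main component of the Jupyter project is the notebook, an interactive document type for code, text (with or without markup), data visualizations, and other output.
- Jupyter automatically opens your default web browser (unless you start it with the --no-browser option). You can also browse to the notebook at the HTTP address http://localhost:8888/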
print("hello world")
hello world
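2.2.3 Tab completion
- While entering an expression in the shell, pressing Tab searches the namespace for any variables (objects, functions, etc.) matching the characters you have typed so far.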
an_apple = 27
an_example = 42
an<Tab>
b = [1,2,3]
b.<Tab>
import datetime
datetime.<Tab>
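- Note that by default IPython hides methods and attributes starting with underscores, such as magic methods and internal "private" methods and attributes, to avoid cluttering the display (and confusing novice users).
- These can be tab-completed too, but you must first type an underscore to see them.
- If you prefer to always see them in tab completion, you can change this setting in the IPython configuration.
- Besides searching the interactive namespace and completing object or module attributes, tab completion works in many other contexts.
- Another place tab completion saves time is in completing function keyword arguments (including the = sign).
def func_with_keywords(abra=1, abbra=2, abbbra=3):
    return abra, abbra, abbbra
func_with_keywords(abbbra=1)
(1, 2, 1)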
2.2.4 Introspection
- Using a question mark (?) before or after a variable name displays some general information about the object.
b = [1,2,3]
?b
def add_numbers(a, b):
    """Add two numbers together"""
    return a + b
add_numbers?
add_numbers??
np.*load*?
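2.2.5 The %run command
- All variables defined in the file (imports, functions, and globals) are accessible in the IPython shell after it runs (unless an exception is raised).
- If you want a script to have access to variables already defined in the interactive IPython namespace, use %run -i instead of plain %run.
- In the Jupyter notebook, you can use the %load magic function to import a script into a code cell.
2.2.5.1 Interrupting running code
- Pressing Ctrl-C while any code is running, whether a script run through %run or a long-running command, raises a KeyboardInterrupt. Except in certain unusual cases, this causes nearly all Python programs to stop immediately.
2.2.6 Executing code from the clipboard
- %paste takes whatever text is in the clipboard and executes it as a single block in the shell.
- %cpaste is similar, except that it gives you a special prompt for pasting code into.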
x = 7
y = 5
if x > 5:
    x += 1
    y = 8
%paste
UsageError: Line magic function `%paste` not found.
%cpaste
UsageError: Line magic function `%cpaste` not found.
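2.2.7 Terminal keyboard shortcuts
- Ctrl-P or up-arrow: search backward in command history for commands starting with the currently entered text
- Ctrl-N or down-arrow: search forward in command history for commands starting with the currently entered text
- Ctrl-R: readline-style reverse history search (partial matching)
- Ctrl-Shift-V: paste text from the clipboard
- Ctrl-C: interrupt currently executing code
- Ctrl-A: move the cursor to the beginning of the line
- Ctrl-E: move the cursor to the end of the line
- Ctrl-K: delete text from the cursor to the end of the line
- Ctrl-U: discard all text on the current line
- Ctrl-F: move the cursor forward one character
- Ctrl-B: move the cursor back one character
- Ctrl-L: clear the screen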
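2.2.8 About magic commands
- IPython's special commands (which are not built into Python itself) are known as "magic" commands. They are designed to simplify common tasks and make it easier to control the behavior of the IPython system.
- A magic command is any command prefixed by the percent symbol %.
- Magic commands can be viewed as command-line programs run within the IPython system.
- Most magic commands have additional "command-line" options, which can be viewed using ?.
- Magic functions can be used without the percent sign, as long as no variable is defined with the same name as the magic function. This feature is called automagic and can be enabled or disabled with %automagic.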
a = np.random.randn(100,100)
%timeit np.dot(a,a)
20.5 µs ± 803 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%debug?
%pwd
'D:\\PythonFlie\\python\\利用python进行数据分析(书籍笔记)'
%quickref
%magic
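- %quickref: display the IPython Quick Reference Card
- %magic: display detailed documentation for all available magic commands
- %debug: enter the interactive debugger at the bottom of the last exception traceback
- %hist: print command input (and optionally output) history
- %pdb: automatically enter the debugger after any exception
- %paste: execute preformatted Python code from the clipboard
- %cpaste: open a special prompt for manually pasting Python code to be executed
- %reset: delete all variables/names defined in the interactive namespace
- %page OBJECT: pretty-print the object and display it through a pager
- %run script.py: run a Python script inside IPython
- %prun statement: execute statement with cProfile and report the profiler output
- %time statement: report the execution time of a single statement
- %timeit statement: run a statement multiple times to compute an average execution time; useful for timing code with very short execution time
- %who, %who_ls, %whos: display variables defined in the interactive namespace, with varying levels of information/verbosity
- %xdel variable: delete a variable and clear any references to the object in the IPython internals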
2.2.9 matplotlib integration
- The %matplotlib magic function configures matplotlib's integration with the IPython shell or Jupyter notebook.
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(np.random.randn(50).cumsum())
plt.show()
[Figure: output_35_0.png, the line plot produced by the code above]
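2.3 Python language basics
2.3.1 Language semantics
- The Python language design is distinguished by its emphasis on readability, simplicity, and explicitness. Some people go so far as to liken it to "executable pseudocode."
2.3.1.1 Indentation, not braces
- Python uses whitespace (tabs or spaces) to structure code instead of braces, as in many other languages such as R, C++, Java, and Perl.
- A colon denotes the start of an indented code block, after which all of the code must be indented by the same amount until the end of the block.
- I strongly recommend using four spaces as your default indentation rather than tabs.
2.3.1.2 Everything is an object
- An important characteristic of the Python language is the consistency of its object model. Every number, string, data structure, function, class, module, and so on exists in the Python interpreter as a Python object.
2.3.1.3 Comments
- Any text preceded by the hash mark # is ignored by the Python interpreter, so # is commonly used to add comments to code.
- At times you may want to exclude certain blocks of code without deleting them; an easy solution is to comment out the code.
- Comments can also follow a line of executed code on the same line. Some programmers prefer this placement, and it is often handy.
2.3.1.4 Function and object method calls
- Almost every object in Python has attached functions, known as methods, that have access to the object's internal contents.
- Functions can take both positional and keyword arguments.
2.3.1.5 Variables and argument passing
- When assigning a variable (or name) in Python, you are creating a reference to the object on the righthand side of the equals sign.
- Assignment is also referred to as binding, because we are binding a name to an object. Variable names that have been assigned are occasionally referred to as bound variables.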
a = [1,2,3]
b = a
a.append(4)
b
[1, 2, 3, 4]
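2.3.1.6 Dynamic references, strong types
- In contrast with many compiled languages, such as Java and C++, object references in Python have no type associated with them.
- Variables are names for objects within a particular namespace; the type information is stored in the object itself.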
a = 5
type(a)
int
a = "foo"
type(a)
str
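- Knowing the type of an object is important, and it is useful to be able to write functions that can handle many different kinds of input. You can check that an object is an instance of a particular type using the isinstance function.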
a = 5
isinstance(a,int)
True
a = 5
b = 4.5
isinstance(a,(int,float))
True
isinstance(b,(int,float))
True
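2.3.1.7 Attributes and methods
- Objects in Python typically have both attributes (other Python objects stored "inside" the object) and methods (functions associated with an object that can have access to the object's internal data).
- Both are accessed via the syntax obj.attribute_name.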
a = "foo"
getattr(a,'split')
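2.3.1.8 Duck typing
- A common case is writing functions that accept any kind of sequence (list, tuple, ndarray) or even an iterator. You can first check whether the object is a list (or a NumPy array) and, if it is not, convert it to one:
def isiterable(obj):
    try:
        iter(obj)
        return True
    except TypeError:
        return False
x = 2.34
if not isinstance(x, list) and isiterable(x):
    x = list(x)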
x
2.34
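2.3.1.9 Imports
- By using the as keyword, you can give imports different variable names.
2.3.1.10 Binary operators and comparisons
- To check whether two references refer to the same object, use the is keyword. is not is also perfectly valid if you want to check that two objects are not the same.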
a = [1,2,3]
b = [1,2,3]
a is b
False
b = a
a is b
True
c = list(a)
c is not a
True
a == c
True
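2.3.1.11 Mutable and immutable objects
- Most objects in Python, such as lists, dicts, NumPy arrays, and most user-defined types (classes), are mutable. This means that the object or values they contain can be modified.
- Others, like strings and tuples, are immutable.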
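As a minimal sketch of the difference (the variable names here are illustrative assumptions, not taken from the notes), a list element can be reassigned in place while the same operation on a tuple raises TypeError:

```python
# Lists are mutable: an element can be reassigned in place.
a_list = ["foo", 2, [4, 5]]
a_list[2] = (3, 4)

# Tuples are immutable: item assignment raises TypeError.
a_tuple = (3, 5, (4, 5))
try:
    a_tuple[1] = "four"
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False
```

Being able to mutate an object does not mean you always should; side effects are easier to follow when they are kept explicit.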
2.3.2 Scalar types
2.3.2.1 Numeric types
- The primary Python types for numbers are int and float.
ival = 457854
ival**6
9212172810807855033141451548434496
fval = 7.24
fval
7.24
fval = 7.24e-5
fval
7.24e-05
3/2
1.5
3//2
1
2.3.2.2 Strings
a = '这是一个字符串'
b = "这是一个字符串"
c = '''
这是一个字符串
这是一个字符串
这是一个字符串
'''
c.count('\n')
4
a = "这是一个字符串"
a[1] = "f"
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in
1 a = "这是一个字符串"
----> 2 a[1] = "f"
TypeError: 'str' object does not support item assignment
a = 5.6
b = str(a)
b
'5.6'
s = "python"
list(s)
['p', 'y', 't', 'h', 'o', 'n']
s[:3]
'pyt'
a = "12\\45"
print(a)
a
12\45
'12\\45'
s = r"c\数据分析\python"
print(s)
c\数据分析\python
a = "这是一个"
b = "小可爱"
a+b
'这是一个小可爱'
template = "{0:.4f}"
template.format(34.34232423)
'34.3423'
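2.3.2.3 Bytes and Unicode
- Use the encode method to convert a Unicode string to its UTF-8 bytes representation.
- If you know the encoding of a bytes object, you can convert it back to a string with the decode method.
- You can define a bytes literal by prefixing a string with b.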
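The round trip between str and bytes can be sketched as follows; the sample string is an assumption chosen because it contains one non-ASCII character:

```python
val = "español"

# str -> bytes: "ñ" takes two bytes in UTF-8, so 7 characters become 8 bytes.
val_utf8 = val.encode("utf-8")

# bytes -> str: decoding with the same encoding restores the original text.
decoded = val_utf8.decode("utf-8")

# A bytes literal is written with the b prefix.
bytes_val = b"this is bytes"
```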
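2.3.2.4 Booleans
- The two Boolean values in Python are written as True and False.
- Comparisons and other conditional expressions evaluate to either True or False. Boolean values are combined with the and and or keywords.
2.3.2.5 Type casting
- The str, bool, int, and float types are also functions that can be used to cast values to those types.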
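A short sketch of these casts, using an arbitrary sample value:

```python
s = "3.14159"

fval = float(s)     # parse the string as a float
ival = int(fval)    # int() truncates toward zero
flag = bool(ival)   # any non-zero number is True
empty = bool(0)     # zero is False
text = str(ival)    # back to a string
```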
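2.3.2.6 None
- None is the Python null value type. If a function does not explicitly return a value, it implicitly returns None.
- Technically, None is not only a reserved keyword but also the unique instance of NoneType.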
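One common use of None is as a default for an optional argument. The sketch below uses hypothetical function names and tests against None with is not rather than ==:

```python
def add_and_maybe_multiply(a, b, c=None):
    """Return a + b, multiplied by c when c is given."""
    result = a + b
    if c is not None:
        result = result * c
    return result


def no_return():
    # No return statement, so calling this yields None.
    pass
```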
2.3.2.7 Dates and times
- The built-in Python datetime module provides the datetime, date, and time types.
from datetime import datetime,date,time
dt = datetime(2021,12,6,23,33,6)
dt
datetime.datetime(2021, 12, 6, 23, 33, 6)
dt.day
6
dt.time()
datetime.time(23, 33, 6)
dt.minute
33
dt.date()
datetime.date(2021, 12, 6)
dt.time()
datetime.time(23, 33, 6)
dt.strftime("%Y%m%d %H:%M:%S")
'20211206 23:33:06'
datetime.strptime("20211206","%Y%m%d")
datetime.datetime(2021, 12, 6, 0, 0)
dt.replace(year = 2020,month = 1,day = 8)
datetime.datetime(2020, 1, 8, 23, 33, 6)
dt1 = datetime(2021,11,6,22,7,0)
dt2 = datetime(2020,11,6,22,0,0)
dt1 - dt2
datetime.timedelta(days=365, seconds=420)
type(dt1-dt2)
datetime.timedelta
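2.3.3 Control flow
- Python has several built-in keywords for conditional logic, loops, and other standard control flow concepts found in other programming languages.
2.3.3.1 if, elif, and else
- An if statement can be followed by one or more elif blocks and a catch-all else block, which runs if all of the conditions are False.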
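A small sketch of an if/elif/else chain; the function name and messages are made up for illustration. Only the first branch whose condition is True runs:

```python
def describe(x):
    """Classify a number using an if/elif/else chain."""
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    elif 0 < x < 5:
        return "positive but smaller than 5"
    else:
        return "positive and larger than or equal to 5"
```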
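2.3.3.2 for loops
- for loops are for iterating over a collection (like a list or tuple) or an iterator.
- The continue keyword skips the remainder of the block and advances the for loop to the next iteration.
- The break keyword exits a for loop altogether.
- break only terminates the innermost for loop; any outer for loops continue to run.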
4 > 3 > 2 >1
True
for i in range(4):
    for j in range(4):
        if j > i:
            break
        print((i, j))
(0, 0)
(1, 0)
(1, 1)
(2, 0)
(2, 1)
(2, 2)
(3, 0)
(3, 1)
(3, 2)
(3, 3)
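2.3.3.3 while loops
- A while loop specifies a condition and a block of code that is executed until the condition evaluates to False or the loop is explicitly ended with break.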
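The two exit routes (condition becoming False, or an explicit break) can be sketched with a simple halving loop; the starting values are arbitrary:

```python
x = 256
total = 0
while x > 0:
    if total > 500:
        break          # leave the loop early once the running total is large enough
    total += x
    x = x // 2         # halve x each iteration
```

Here the loop ends through break, before x reaches 0.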
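2.3.3.4 pass
2.3.3.5 range
- The range function returns a lazy range object that yields a sequence of evenly spaced integers when iterated.
- A start, end, and step (which may be negative) can be given.
- range produces integers up to but not including the endpoint.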
range(10)
range(0, 10)
list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
range(5,0,-1)
range(5, 0, -1)
list(range(5,0,-1))
[5, 4, 3, 2, 1]
seq = [1,2,3,4]
for i in range(len(seq)):
    print(i)
0
1
2
3
for i in range(20):
    if i % 3 == 0 or i % 5 == 0:
        print(i)
0
3
5
6
9
10
12
15
18
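2.3.3.6 Ternary expressions
- A ternary expression in Python allows you to combine an if-else block that produces a value into a single line or expression.
- While it may be tempting to always use ternary expressions to condense your code, you may sacrifice readability if the condition and the true and false expressions are very complex.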
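The syntax is `value = true_expr if condition else false_expr`. A minimal sketch with an arbitrary value, alongside the equivalent if-else block:

```python
x = 5

# Ternary form: only one of the two expressions is evaluated.
label = "Non-negative" if x >= 0 else "Negative"

# Equivalent if-else block.
if x >= 0:
    label_verbose = "Non-negative"
else:
    label_verbose = "Negative"
```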
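Chapter 3 Built-in Data Structures, Functions, and Files
- Python's workhorse data structures: tuples, lists, dicts, and sets
3.1 Data structures and sequences
3.1.1 Tuples
- A tuple is a fixed-length, immutable sequence of Python objects. The easiest way to create one is with a comma-separated sequence of values.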
tup = 4,5,6
tup
(4, 5, 6)
nested_tup = (4,5,6),(7,8)
nested_tup
((4, 5, 6), (7, 8))
tuple([1,2,3])
(1, 2, 3)
tup = tuple("string")
tup
('s', 't', 'r', 'i', 'n', 'g')
tup[2]
'r'
tup = tuple(["foo",[1,2,3],True])
tup
('foo', [1, 2, 3], True)
tup[2] = False
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in
----> 1 tup[2] = False
TypeError: 'tuple' object does not support item assignment
tup[1].append(5)
tup
('foo', [1, 2, 3, 5], True)
("foo",True,[1,2,3])+(6,0)+("bar",)
('foo', True, [1, 2, 3], 6, 0, 'bar')
tup*2
('foo', [1, 2, 3, 5], True, 'foo', [1, 2, 3, 5], True)
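3.1.1.1 Unpacking tuples
- If you assign to a tuple-like expression of variables, Python will attempt to unpack the value on the righthand side of the equals sign.
- Common uses
- Swapping variable names
- Iterating over sequences of tuples or lists
- Returning multiple values from a function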
tup = (4,5,6)
a,b,c = tup
c
6
tup = (4,5,[False,56])
a,b,c = tup
c
[False, 56]
a,b = 1,2
a
1
b
2
b,a = a,b
a
2
b
1
seq = [(1,2,3),(4,5,6),(7,8,9)]
for a, b, c in seq:
    print(a, b, c)
1 2 3
4 5 6
7 8 9
seq = [(1,2,3),(4,5,6),(7,8,9)]
for a, b, c in seq:
    print('a={0},b={1},c={2}'.format(a, b, c))
a=1,b=2,c=3
a=4,b=5,c=6
a=7,b=8,c=9
values = 1,2,3,4,5
a,b,*rest = values
a,b
(1, 2)
rest
[3, 4, 5]
a,b,*_ = values
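3.1.1.2 Tuple methods
- Since the size and contents of a tuple cannot be modified, it is very light on instance methods. A particularly useful one (also available on lists) is count, which counts the number of occurrences of a value.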
a = (1,2,3,4,5,5,7)
a.count(5)
2
3.1.2 Lists
a = [2,3,None,'false']
tup = (1,23,None,"false")
b = list(tup)
b
[1, 23, None, 'false']
b[1]
23
b[1] = "hello"
b
[1, 'hello', None, 'false']
gen = range(10)
gen
range(0, 10)
list(gen)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
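3.1.2.1 Adding and removing elements
- The append method adds an element to the end of the list.
- The insert method inserts an element at a specific location in the list.
- See also this detailed guide to deque() in the Python collections module: https://blog.csdn.net/chl183/article/details/106958004
- The inverse operation to insert is pop, which removes and returns the element at a particular index.
- Elements can be removed by value with remove, which locates the first such value and removes it from the list.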
b.append("world")
b
[1, 'hello', None, 'false', 'world']
b.insert(0,"country")
b
['country', 1, 'hello', None, 'false', 'world']
b.pop(2)
'hello'
b
['country', 1, None, 'false', 'world']
b.remove('world')
b
['country', 1, None, 'false']
'country' in b
True
'country' not in b
False
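- Checking whether a list contains a value is a lot slower than doing so with dicts and sets (introduced later), because Python makes a linear scan across the values of the list, whereas dicts and sets are based on hash tables and can be checked in constant time.
3.1.2.2 Concatenating and combining lists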
[1,2,3]+["false",None]
[1, 2, 3, 'false', None]
x = [1,2,3]+["false",None]
x.extend([67,78])
x
[1, 2, 3, 'false', None, 67, 78]
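- Note that list concatenation by addition is a comparatively expensive operation, since a new list must be created and the objects copied over. Using extend to append elements to an existing list, especially if you are building up a large list, is usually preferable to the + form.
3.1.2.3 Sorting
- You can sort a list in place (without creating a new object) by calling its sort method.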
a = [6,3,242,32,89,231]
a.sort()
a
[3, 6, 32, 89, 231, 242]
b = ["sa","hello","hannan","qw"]
b.sort(key=len)
b
['sa', 'qw', 'hello', 'hannan']
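3.1.2.4 Binary search and maintaining a sorted list
- bisect.bisect finds the location where an element should be inserted to keep the list sorted, while bisect.insort actually inserts the element into that location.
- The bisect module functions do not check whether the list is sorted, as doing so would be computationally expensive. Thus, using them with an unsorted list will succeed without error but may lead to incorrect results.
import bisect
c = [1,2,2,3,6,7,12]
bisect.bisect(c,2)
3
bisect.bisect(c,18)
7
bisect.insort(c,6)
c
[1, 2, 2, 3, 6, 6, 7, 12]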
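3.1.2.5 Slicing
- You can select sections of most sequence types by using slice notation, which in its basic form consists of start:stop passed to the indexing operator [].
- The element at the start index is included while the stop index is not, so the number of elements in the result is stop - start.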
seq = [7,2,3,7,5,6,0,1]
seq[1:5]
[2, 3, 7, 5]
seq[3:4] = [6,33]
seq
[7, 2, 3, 6, 33, 5, 6, 0, 1]
seq[:4]
[7, 2, 3, 6]
seq[2:]
[3, 6, 33, 5, 6, 0, 1]
seq[-4:]
[5, 6, 0, 1]
seq[-4::2]
[5, 0]
seq[::-1]
[1, 0, 6, 5, 33, 6, 3, 2, 7]
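3.1.3 Built-in sequence functions
3.1.3.1 enumerate
- Python has a built-in function, enumerate, which returns a sequence of (i, value) tuples, where value is the element and i is its index.
- This is handy, for example, when adding value labels to a plot.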
some_list = ["foo","bar","baz"]
mapping = {}
for i, v in enumerate(some_list):
    mapping[i] = v
mapping
{0: 'foo', 1: 'bar', 2: 'baz'}
3.1.3.2 sorted
- The sorted function returns a new sorted list from the elements of any sequence.
a = [6,3,242,32,89,231]
sorted(a)
[3, 6, 32, 89, 231, 242]
b = "horse race"
sorted(b)
[' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']
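3.1.3.3 zip
- zip pairs up the elements of a number of lists, tuples, or other sequences into tuples. In Python 3 it returns a lazy iterator, so wrap it in list to get a list of tuples.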
seq1 = ["foo","bar","baz"]
seq2 = ["one","two","three"]
zipped = zip(seq1,seq2)
list(zipped)
[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]
seq1 = ["foo","bar","baz","hello"]
seq2 = ["one","two","three"]
zipped = zip(seq1,seq2)
list(zipped)
[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print(i, a, b)
0 foo one
1 bar two
2 baz three
a = [(1,"h"),(2,"r"),(3,"u")]
first_name,last_name = zip(*a)
first_name
(1, 2, 3)
last_name
('h', 'r', 'u')
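3.1.3.4 reversed
- reversed iterates over the elements of a sequence in reverse order.
- reversed returns an iterator (iterators and generators are discussed later), so it does not create the reversed sequence until it is materialized (for example, with list or a for loop).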
a = list(range(10))
list(reversed(a))
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
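3.1.4 Dicts
- dict is likely the most important built-in Python data structure.
- A more common name for it is hash map or associative array. It is a flexibly sized collection of key-value pairs, where key and value are Python objects.
- One way to create a dict is with curly braces {}, using colons to separate keys and values and commas to separate the pairs.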
d1 = {"a":"some values","b":[1,2,3,4]}
d1
{'a': 'some values', 'b': [1, 2, 3, 4]}
d1[7] = "an inter"
d1
{'a': 'some values', 'b': [1, 2, 3, 4], 7: 'an inter'}
d1["b"].append(5)
d1
{'a': 'some values', 'b': [1, 2, 3, 4, 5], 7: 'an inter'}
"b" in d1
True
del d1[7]
d1
{'a': 'some values', 'b': [1, 2, 3, 4, 5]}
d1.pop("a")
'some values'
d1
{'b': [1, 2, 3, 4, 5]}
d1.keys()
dict_keys(['b'])
d1.values()
dict_values([[1, 2, 3, 4, 5]])
d1.update({"hoo":"foo","c":"bar"})
d1
{'b': [1, 2, 3, 4, 5], 'hoo': 'foo', 'c': 'bar'}
3.1.4.1 Creating dicts from sequences
- Since a dict is essentially a collection of 2-tuples, the dict function accepts a list of 2-tuples.
mapping = dict(zip(range(5),reversed(range(5))))
mapping
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
3.1.4.2 Default values
words = ["apple","boo","bar","foo"]
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = word
    else:
        break
by_letter
{'a': 'apple', 'b': 'boo'}
words = ["apple","boo","bar","atom","foo"]
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = word
    else:
        break
by_letter
{'a': 'apple', 'b': 'boo'}
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
by_letter
{'a': ['apple', 'atom'], 'b': ['boo', 'bar'], 'f': ['foo']}
3.1.4.3 Valid dict key types
hash("string")
2343056889896338451
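3.1.5 Sets
- A set is an unordered collection of unique elements.
- A set can be created in two ways: via the set function or via a set literal with curly braces.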
set([2,2,2,1,2,3,5])
{1, 2, 3, 5}
{2,2,2,1,2,3,5}
{1, 2, 3, 5}
a = {1,2,3,4,5}
b = {3,4,5,6,7,8}
a.union(b)
{1, 2, 3, 4, 5, 6, 7, 8}
a|b
{1, 2, 3, 4, 5, 6, 7, 8}
a.intersection(b)
{3, 4, 5}
a&b
{3, 4, 5}
c = a.copy()
c |= b
c
{1, 2, 3, 4, 5, 6, 7, 8}
d = a.copy()
d &= b
d
{3, 4, 5}
| Function | Alternative syntax | Description |
| -------- | ------------------ | ----------- |
| a.add(x) | N/A | Add element x to the set a |
| a.clear() | N/A | Reset the set a to an empty state, discarding all of its elements |
| a.remove(x) | N/A | Remove element x from the set a |
| a.pop() | N/A | Remove an arbitrary element from the set a, raising KeyError if the set is empty |
| a.union(b) | a \| b | All of the unique elements in a and b |
| a.update(b) | a \|= b | Set the contents of a to be the union of the elements in a and b |
| a.intersection(b) | a & b | All of the elements in both a and b |
| a.intersection_update(b) | a &= b | Set the contents of a to be the intersection of the elements in a and b |
| a.difference(b) | a - b | The elements in a that are not in b |
| a.difference_update(b) | a -= b | Set a to the elements in a that are not in b |
| a.symmetric_difference(b) | a ^ b | All of the elements in either a or b but not both |
| a.symmetric_difference_update(b) | a ^= b | Set a to contain the elements in either a or b but not both |
| a.issubset(b) | a <= b | True if the elements of a are all contained in b |
my_data = [1,2,3,4]
my_set = tuple(my_data)
my_set
(1, 2, 3, 4)
a_set = {1,2,3,4,5}
{1,2,7}.issubset(a_set)
False
{1,2,3} == {3,2,1}
True
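3.1.6 List, set, and dict comprehensions
- A comprehension lets you form a new list concisely by filtering the elements of a collection and transforming the elements that pass the filter, all in one expression.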
strings = ["a","as","bat","car","python"]
[x.upper() for x in strings if len(x) >2 ]
['BAT', 'CAR', 'PYTHON']
unique_lengths = {len(x) for x in strings}
unique_lengths
{1, 2, 3, 6}
set(map(len,strings))
{1, 2, 3, 6}
3.1.6.1 Nested list comprehensions
all_data = [['John','Emily','Michael','Mary','Steven'],['Maria','Juan','Javier','Natalia','Pilar']]
all_data[1][2].count("e")
1
names_of_interest = []
for names in all_data:
    enough_es = [name for name in names if name.count("e") >= 2]
    names_of_interest.append(enough_es)
names_of_interest
[['Steven'], []]
names_of_interest = []
for names in all_data:
    enough_es = [name for name in names if name.count("e") >= 2]
    names_of_interest.extend(enough_es)
names_of_interest
['Steven']
result = [name for names in all_data for name in names if name.count("e")>=2]
result
['Steven']
some_tups = [(1,2,3),(4,5,6),(7,8,9)]
flattened = [x for tups in some_tups for x in tups ]
flattened
[1, 2, 3, 4, 5, 6, 7, 8, 9]
flattened = []
for tups in some_tups:
    for x in tups:
        flattened.append(x)
flattened
[1, 2, 3, 4, 5, 6, 7, 8, 9]
[[x for x in tups] for tups in some_tups]
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
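3.2 Functions
- Functions are declared with the def keyword and return values with the return keyword.
- If Python reaches the end of a function without encountering a return statement, None is returned automatically.
- Each function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments.
- The main restriction on function arguments is that the keyword arguments must follow the positional arguments (if any).
- You can specify keyword arguments in any order; this frees you from having to remember the order in which the function arguments were specified, only what their names are.
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)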
my_function(1,2,3)
9
my_function(1,2)
4.5
my_function(1,2,z = 0.7)
0.2333333333333333
my_function(x = 1,y = 2,z = 0.7)
0.2333333333333333
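3.2.1 Namespaces, scope, and local functions
- Functions can access variables in two different scopes: global and local.
- An alternative and more descriptive name for a variable scope in Python is a namespace.
def func():
    a = []
    for i in range(5):
        a.append(i)
a = []
def func():
    for i in range(5):
        a.append(i)
a = None
def func():
    global a
    a = []
func()
a
[]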
3.2.2 Returning multiple values
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c
a, b, c = f()
b
6
return_values = f()
return_values
(5, 6, 7)
def f():
    a = 5
    b = 6
    c = 7
    return {"a": a, "b": b, "c": c}
f()
{'a': 5, 'b': 6, 'c': 7}
3.2.3 Functions are objects
states = ['Alabama','Georgia!','georgia','Florida','south','carolina##','West virginia?']
import re
def clean_strings(strings):
    result = []
    for value in strings:
        value = re.sub('[!#?]', '', value)  # remove punctuation
        value = value.title()               # capitalize each word
        result.append(value)
    return result
clean_strings(states)
['Alabama',
 'Georgia',
 'Georgia',
 'Florida',
 'South',
 'Carolina',
 'West Virginia']
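3.2.4 Anonymous (lambda) functions
- Python has support for so-called anonymous or lambda functions.
- They are a way of writing functions consisting of a single expression, the result of which is the return value.
- They are defined with the lambda keyword, which has no meaning other than "we are declaring an anonymous function."
- Unlike functions declared with the def keyword, a lambda function object is never given a name of its own (its __name__ is just the generic '<lambda>'); this is one reason lambda functions are called anonymous.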
strings = ['foo','card','bar','aaaa','abab']
strings.sort(key=lambda x :len(set(list(x))))
strings
['aaaa', 'foo', 'abab', 'bar', 'card']
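3.2.5 Currying: partial argument application
- Currying is computer science jargon (named after the mathematician Haskell Curry) that means deriving new functions from existing ones by partial argument application.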
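A minimal sketch of partial argument application, assuming a simple two-argument helper named add_numbers (an illustrative name). It shows both a hand-written lambda and the standard library's functools.partial:

```python
from functools import partial


def add_numbers(x, y):
    return x + y


# Fix the first argument by hand with a lambda ...
add_five = lambda y: add_numbers(5, y)

# ... or let functools.partial do the same thing.
add_five_partial = partial(add_numbers, 5)
```

Both add_five(3) and add_five_partial(3) call add_numbers(5, 3).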
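3.2.6 Generators
- Having a consistent way to iterate over sequences is an important Python feature. It is accomplished by means of the iterator protocol, a generic way to make objects iterable.
- An iterator is any object that will yield objects to the Python interpreter when used in a context like a for loop.
- Most methods expecting a list or list-like object will also accept any iterable object.
- This includes built-in methods such as min, max, and sum, and type constructors like list and tuple.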
some_dict = {"a":1,"b":2,"c":3}
for key in some_dict:
    print(key)
a
b
c
for i in some_dict:
    print(i)
a
b
c
dict_iterator = iter(some_dict)
dict_iterator
list(dict_iterator)
['a', 'b', 'c']
def squares(n=10):
    print("from 1 to {0}".format(n**2))
    for i in range(1, n+1):
        yield i**2
gen = squares()
gen
for x in gen:
print(x)
from 1 to 100
1
4
9
16
25
36
49
64
81
100
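3.2.6.1 Generator expressions
- An even more concise way to make a generator is a generator expression.
- Generator expressions are the generator analogue of list, dict, and set comprehensions; to create one, enclose what would otherwise be a list comprehension in parentheses instead of brackets.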
gen = (x**2 for x in range(10))
gen
<generator object <genexpr> at 0x0000026F2C2AEBA0>
list(gen)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
def _make_gen():
    for i in range(10):
        yield i**2
gen = _make_gen()
gen
list(gen)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
sum(x**2 for x in range(10))
285
dict((i,i**2) for i in range(5))
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
3.2.6.2 The itertools module
- The standard library itertools module has a collection of generators for many common data algorithms.
import itertools
first_letter = lambda x: x[0]
names = ['Alan','Adam','Wes','Will','Albert','Steven']
for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))  # group is an iterator
A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
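3.2.7 Errors and exception handling
- Handling Python errors or exceptions gracefully is an important part of building robust programs. In data analysis applications, many functions work only on certain kinds of input.
- In some cases, you may want some code to run regardless of whether the code in the try block raises an error. To do this, use finally.
- Similarly, you can have code that runs only if the try block succeeds by using else.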
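The sketch below illustrates these pieces with two hypothetical helpers: attempt_float falls back to returning its input when conversion fails, and parse records which of the except, else, and finally branches ran:

```python
def attempt_float(x):
    """Return float(x), or x unchanged if it cannot be converted."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return x


def parse(x):
    """Convert x to float and report which branches executed."""
    steps = []
    try:
        value = float(x)
    except ValueError:
        steps.append("except")    # runs only when float(x) fails
        value = None
    else:
        steps.append("else")      # runs only when the try block succeeds
    finally:
        steps.append("finally")   # runs in both cases
    return value, steps
```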
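3.2.7.1 Exceptions in IPython
3.3 Files and the operating system
- To open a file for reading or writing, use the built-in open function with either a relative or absolute file path.
- The tell method gives the current position of the file handle.
- The sys module can be used to check the default encoding.
- The seek method changes the file position to the indicated byte in the file.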
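A small sketch of open, tell, and seek on a throwaway file. The temporary path and file contents are assumptions for illustration; the file is read in binary mode so that positions are plain byte offsets:

```python
import os
import sys
import tempfile

# Write a small sample file; newline="" keeps "\n" as a single byte on every platform.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("hello\nworld\n")

with open(path, "rb") as f:
    start = f.tell()        # position before reading anything
    first = f.read(5)       # read 5 bytes
    after_read = f.tell()   # position has advanced by 5
    f.seek(6)               # jump to byte 6, just past the first newline
    rest = f.read()

default_encoding = sys.getdefaultencoding()
```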
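3.3.1 Bytes and Unicode with files
- If you regularly do data analysis on non-ASCII text data, mastering Python's Unicode functionality will prove valuable.
3.4 Conclusion
Chapter 4 NumPy Basics: Arrays and Vectorized Computation
4.1 The NumPy ndarray: a multidimensional array object
- One of the key features of NumPy is its N-dimensional array object, or ndarray.
- An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type.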
import numpy as np
data = np.random.randn(2,3)
data
array([[0.29331975, 0.9468983 , 0.77589914],
[0.47945117, 0.32126585, 0.01654545]])
data*10
array([[2.93319746, 9.46898301, 7.75899144],
[4.7945117 , 3.21265851, 0.16545452]])
data+data
array([[0.58663949, 1.8937966 , 1.55179829],
[0.95890234, 0.6425317 , 0.0330909 ]])
data.shape
(2, 3)
data.dtype
dtype('float64')
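4.1.1 Creating ndarrays
- The easiest way to create an array is to use the array function. It accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data.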
data1 = [2.1,4,5,9.3]
arr1 = np.array(data1)
arr1
array([2.1, 4. , 5. , 9.3])
data2 = [[2.1,4,5,9.3],[2,3,1,1]]
arr2 = np.array(data2)
arr2
array([[2.1, 4. , 5. , 9.3],
[2. , 3. , 1. , 1. ]])
arr2.ndim
2
arr2.shape
(2, 4)
arr1.dtype
dtype('float64')
np.zeros(2)
array([0., 0.])
np.zeros(shape = (2,5))
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
np.empty((2,3,2))
array([[[7.93765247e-312, 3.16202013e-322],
[0.00000000e+000, 0.00000000e+000],
[0.00000000e+000, 6.00774234e-067]],
[[3.14455224e+179, 6.01760737e-067],
[7.45958567e-038, 6.89467970e-042],
[2.36040853e+184, 4.25503767e+174]]])
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
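4.1.2 Data types for ndarrays
- The data type, or dtype, is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data.
- Calling astype always creates a new array (a copy of the data), even if the new dtype is the same as the old dtype.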
arr1 = np.array([1,2,3],dtype = np.float64)
arr1.dtype
dtype('float64')
arr2 = np.array([1,2,3],dtype = float)
arr2.dtype
dtype('float64')
arr2 = np.array([1,2,3],dtype = np.int32)
arr2.dtype
dtype('int32')
arr = np.array([1,2,3,4,5])
arr.dtype
dtype('int32')
float_arr = arr.astype(np.float64)
float_arr.dtype
dtype('float64')
arr = np.array([1.3,2.5,-3.6,4.9,-5.3])
arr.dtype
dtype('float64')
float_arr = arr.astype(np.int32)
float_arr.dtype
dtype('int32')
float_arr
array([ 1, 2, -3, 4, -5])
numeric_arr = np.array(["2.3","4.5","7","9.0"],dtype = np.bytes_)  # np.string_ was removed in NumPy 2.0
float_arr = numeric_arr.astype(np.float64)
float_arr
array([2.3, 4.5, 7. , 9. ])
int_arr = np.arange(10)
calibers = np.array([1,2,3],dtype = np.float64)
int_arr.astype(calibers.dtype)
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
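4.1.3 Arithmetic with NumPy arrays
- Arrays are important because they enable you to express batch operations on data without writing any for loops.
- NumPy users call this vectorization.
- Any arithmetic operation between equal-size arrays is applied element-wise.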
arr = np.array([[1,2,3],[3,4,5]])
arr
array([[1, 2, 3],
[3, 4, 5]])
arr*arr
array([[ 1, 4, 9],
[ 9, 16, 25]])
arr - arr
array([[0, 0, 0],
[0, 0, 0]])
1/arr
array([[1. , 0.5 , 0.33333333],
[0.33333333, 0.25 , 0.2 ]])
arr*0.5
array([[0.5, 1. , 1.5],
[1.5, 2. , 2.5]])
arr2 = np.array([[0,12,23],[2,67,90]])
arr2 > arr
array([[False, True, True],
[False, True, True]])
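4.1.4 Basic indexing and slicing
- An important distinction from Python's built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.
- (As NumPy has been designed to work with very large arrays, you can imagine the memory problems if NumPy insisted on always copying data.)
- The "bare" slice [:] refers to all values in an array.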
arr1 = np.arange(10)
arr1
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr1[5]
5
arr1[2:6]
array([2, 3, 4, 5])
arr1[2:6] = 12
arr1
array([ 0, 1, 12, 12, 12, 12, 6, 7, 8, 9])
arr1[2:6].copy()
array([12, 12, 12, 12])
arr2d = np.array([[1,2,3],[11,12,13],[21,22,23]])
arr2d[2]
array([21, 22, 23])
arr2d[2][2]
23
arr2d[2,2]
23
arr3d = np.array([[[1,2,3],[11,12,13]],[[21,22,23],[41,42,43]]])
arr3d[0]
array([[ 1, 2, 3],
[11, 12, 13]])
arr3d[0,1]
array([11, 12, 13])
4.1.4.1 Indexing with slices
arr = np.array([0,1,2,3,64,64,64,8,9])
arr
array([ 0, 1, 2, 3, 64, 64, 64, 8, 9])
arr[1:6]
array([ 1, 2, 3, 64, 64])
arr2d = np.array([[1,2,3],[11,12,13],[21,22,23]])
arr2d
array([[ 1, 2, 3],
[11, 12, 13],
[21, 22, 23]])
arr2d[:2]
array([[ 1, 2, 3],
[11, 12, 13]])
arr2d[:2,1:]
array([[ 2, 3],
[12, 13]])
arr2d[1,1:]
array([12, 13])
arr2d[:,1:]
array([[ 2, 3],
[12, 13],
[22, 23]])
arr2d[:2,1:] = 0
arr2d
array([[ 1, 0, 0],
[11, 0, 0],
[21, 22, 23]])
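4.1.5 Boolean indexing
- Selecting data from an array by Boolean indexing always creates a copy of the data, even if the returned array is unchanged.
- The Python keywords and and or do not work with Boolean arrays. Use & (and) and | (or) instead.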
names = np.array(["Bob","Joe","Will","Bob","Will","Bob","Bob"])
data = np.random.randn(7,4)
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Bob', 'Bob'], dtype='<U4')
data
array([[-0.31567413, 0.21075985, 1.35192139, 0.29466683],
[ 0.09323436, -0.22595805, 0.48826832, -1.11022322],
[-0.14077013, -0.2255605 , 2.54537756, -0.75138372],
[ 1.0264757 , 0.98484041, -0.64920039, -1.58000295],
[-0.76545971, -1.39263939, -0.50921365, 1.52282598],
[-1.04048087, 0.0652562 , 0.77910129, -1.19046351],
[-0.26920253, -0.94964351, -1.61560275, -0.26496221]])
names == "Will"
array([False, False,  True, False,  True, False, False])
data[names == "Will"]
array([[-0.14077013, -0.2255605 , 2.54537756, -0.75138372],
[-0.76545971, -1.39263939, -0.50921365, 1.52282598]])
data[names == "Will",2:]
array([[ 2.54537756, -0.75138372],
[-0.50921365, 1.52282598]])
data[names != "Will"]
array([[-0.31567413, 0.21075985, 1.35192139, 0.29466683],
[ 0.09323436, -0.22595805, 0.48826832, -1.11022322],
[ 1.0264757 , 0.98484041, -0.64920039, -1.58000295],
[-1.04048087, 0.0652562 , 0.77910129, -1.19046351],
[-0.26920253, -0.94964351, -1.61560275, -0.26496221]])
data[~(names == "Will")]
array([[-0.31567413, 0.21075985, 1.35192139, 0.29466683],
[ 0.09323436, -0.22595805, 0.48826832, -1.11022322],
[ 1.0264757 , 0.98484041, -0.64920039, -1.58000295],
[-1.04048087, 0.0652562 , 0.77910129, -1.19046351],
[-0.26920253, -0.94964351, -1.61560275, -0.26496221]])
cond = names == "Will"
data[~cond]
array([[-0.31567413, 0.21075985, 1.35192139, 0.29466683],
[ 0.09323436, -0.22595805, 0.48826832, -1.11022322],
[ 1.0264757 , 0.98484041, -0.64920039, -1.58000295],
[-1.04048087, 0.0652562 , 0.77910129, -1.19046351],
[-0.26920253, -0.94964351, -1.61560275, -0.26496221]])
mask = (names == "Will") | (names == "Joe")
mask
array([False, True, True, False, True, False, False])
data[mask]
array([[ 0.09323436, -0.22595805, 0.48826832, -1.11022322],
[-0.14077013, -0.2255605 , 2.54537756, -0.75138372],
[-0.76545971, -1.39263939, -0.50921365, 1.52282598]])
data[data < 0] = 0
data[data > 0] = 1
data
array([[0., 1., 1., 1.],
[1., 0., 1., 0.],
[0., 0., 1., 0.],
[1., 1., 0., 0.],
[0., 0., 0., 1.],
[0., 1., 1., 0.],
[0., 0., 0., 0.]])
data[names == "Joe"] = 7
data
array([[0., 1., 1., 1.],
[7., 7., 7., 7.],
[0., 0., 1., 0.],
[1., 1., 0., 0.],
[0., 0., 0., 1.],
[0., 1., 1., 0.],
[0., 0., 0., 0.]])
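4.1.6 Fancy indexing
- Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
- Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.
arr = np.empty((8,4))
for i in range(8):
    arr[i] = i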
arr
array([[0., 0., 0., 0.],
[1., 1., 1., 1.],
[2., 2., 2., 2.],
[3., 3., 3., 3.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.],
[7., 7., 7., 7.]])
arr[[4,3,0,6]]
array([[4., 4., 4., 4.],
[3., 3., 3., 3.],
[0., 0., 0., 0.],
[6., 6., 6., 6.]])
arr[[-3,-5,-1]]
array([[5., 5., 5., 5.],
[3., 3., 3., 3.],
[7., 7., 7., 7.]])
arr = np.arange(32).reshape(8,4)
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]])
arr[[1,6,2],[1,2,3]]
array([ 5, 26, 11])
arr[[1,6,2]][:,[1,2,1]]
array([[ 5, 6, 5],
[25, 26, 25],
[ 9, 10, 9]])
4.1.7 Transposing arrays and swapping axes
arr = np.arange(15).reshape(3,5)
arr
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
arr.T
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])
np.dot(arr.T,arr)
array([[125, 140, 155, 170, 185],
[140, 158, 176, 194, 212],
[155, 176, 197, 218, 239],
[170, 194, 218, 242, 266],
[185, 212, 239, 266, 293]])
arr = np.arange(16).reshape((2,2,4))
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
arr.transpose((1,0,2))
array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])
arr.swapaxes(1,2)
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
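4.2 Universal functions: fast element-wise array functions
- A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays.
- "Unary" and "binary" refer to the number of input arrays: a unary ufunc takes one array, and a binary ufunc takes two.
- Unary ufuncs
| Function | Description |
| --------- | ---------------------------------------- |
| abs, fabs | Compute the absolute value element-wise for integer, floating-point, or complex values |
| sqrt | Compute the square root of each element (equivalent to arr ** 0.5) |
| square | Compute the square of each element (equivalent to arr ** 2) |
- Binary ufuncs
| Function | Description |
| ----------------------------------------------------- | ------------------------------------------------------------ |
| add | Add corresponding elements in arrays |
| subtract | Subtract elements in the second array from the first array |
| multiply | Multiply corresponding array elements |
| divide, floor_divide | Divide or floor divide (discarding the remainder) |
| power | Raise elements in the first array to the powers indicated in the second array |
| maximum, fmax | Element-wise maximum; fmax ignores NaN |
| minimum, fmin | Element-wise minimum; fmin ignores NaN |
| mod | Element-wise modulus (remainder of division) |
| copysign | Copy the sign of values in the second array to values in the first array |
| greater, greater_equal, less, less_equal, equal, not_equal | Perform element-wise comparison, yielding a Boolean array (equivalent to the operators >, >=, <, <=, ==, !=) |
| logical_and, logical_or, logical_xor | Compute the element-wise truth value of the logical operation (equivalent to the operators &, \|, ^) |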
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.sqrt(arr)
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
x = np.random.randn(8)
y = np.random.randn(8)
x
array([ 0.47558794, 1.76289222, -0.0145971 , -1.11324245, -1.36569064,
0.4139789 , 0.63763297, -0.49070227])
y
array([ 0.42249673, 0.4340302 , -0.81489642, 0.03024444, 0.74262733,
-1.55340271, -1.30271516, -1.38870596])
np.maximum(x,y)
array([ 0.47558794, 1.76289222, -0.0145971 , 0.03024444, 0.74262733,
0.4139789 , 0.63763297, -0.49070227])
4.3 Array-oriented programming with arrays
- This practice of replacing explicit loops with array expressions is referred to as vectorization.
points = np.arange(-5,5,0.01)
points
array([-5.0000000e+00, -4.9900000e+00, -4.9800000e+00, -4.9700000e+00,
-4.9600000e+00, -4.9500000e+00, -4.9400000e+00, -4.9300000e+00,
-4.9200000e+00, -4.9100000e+00, -4.9000000e+00, -4.8900000e+00,
-4.8800000e+00, -4.8700000e+00, -4.8600000e+00, -4.8500000e+00,
-4.8400000e+00, -4.8300000e+00, -4.8200000e+00, -4.8100000e+00,
-4.8000000e+00, -4.7900000e+00, -4.7800000e+00, -4.7700000e+00,
-4.7600000e+00, -4.7500000e+00, -4.7400000e+00, -4.7300000e+00,
-4.7200000e+00, -4.7100000e+00, -4.7000000e+00, -4.6900000e+00,
-4.6800000e+00, -4.6700000e+00, -4.6600000e+00, -4.6500000e+00,
-4.6400000e+00, -4.6300000e+00, -4.6200000e+00, -4.6100000e+00,
-4.6000000e+00, -4.5900000e+00, -4.5800000e+00, -4.5700000e+00,
-4.5600000e+00, -4.5500000e+00, -4.5400000e+00, -4.5300000e+00,
-4.5200000e+00, -4.5100000e+00, -4.5000000e+00, -4.4900000e+00,
-4.4800000e+00, -4.4700000e+00, -4.4600000e+00, -4.4500000e+00,
-4.4400000e+00, -4.4300000e+00, -4.4200000e+00, -4.4100000e+00,
-4.4000000e+00, -4.3900000e+00, -4.3800000e+00, -4.3700000e+00,
-4.3600000e+00, -4.3500000e+00, -4.3400000e+00, -4.3300000e+00,
-4.3200000e+00, -4.3100000e+00, -4.3000000e+00, -4.2900000e+00,
-4.2800000e+00, -4.2700000e+00, -4.2600000e+00, -4.2500000e+00,
-4.2400000e+00, -4.2300000e+00, -4.2200000e+00, -4.2100000e+00,
-4.2000000e+00, -4.1900000e+00, -4.1800000e+00, -4.1700000e+00,
-4.1600000e+00, -4.1500000e+00, -4.1400000e+00, -4.1300000e+00,
-4.1200000e+00, -4.1100000e+00, -4.1000000e+00, -4.0900000e+00,
-4.0800000e+00, -4.0700000e+00, -4.0600000e+00, -4.0500000e+00,
-4.0400000e+00, -4.0300000e+00, -4.0200000e+00, -4.0100000e+00,
-4.0000000e+00, -3.9900000e+00, -3.9800000e+00, -3.9700000e+00,
-3.9600000e+00, -3.9500000e+00, -3.9400000e+00, -3.9300000e+00,
-3.9200000e+00, -3.9100000e+00, -3.9000000e+00, -3.8900000e+00,
-3.8800000e+00, -3.8700000e+00, -3.8600000e+00, -3.8500000e+00,
-3.8400000e+00, -3.8300000e+00, -3.8200000e+00, -3.8100000e+00,
-3.8000000e+00, -3.7900000e+00, -3.7800000e+00, -3.7700000e+00,
-3.7600000e+00, -3.7500000e+00, -3.7400000e+00, -3.7300000e+00,
-3.7200000e+00, -3.7100000e+00, -3.7000000e+00, -3.6900000e+00,
-3.6800000e+00, -3.6700000e+00, -3.6600000e+00, -3.6500000e+00,
-3.6400000e+00, -3.6300000e+00, -3.6200000e+00, -3.6100000e+00,
-3.6000000e+00, -3.5900000e+00, -3.5800000e+00, -3.5700000e+00,
-3.5600000e+00, -3.5500000e+00, -3.5400000e+00, -3.5300000e+00,
-3.5200000e+00, -3.5100000e+00, -3.5000000e+00, -3.4900000e+00,
-3.4800000e+00, -3.4700000e+00, -3.4600000e+00, -3.4500000e+00,
-3.4400000e+00, -3.4300000e+00, -3.4200000e+00, -3.4100000e+00,
-3.4000000e+00, -3.3900000e+00, -3.3800000e+00, -3.3700000e+00,
-3.3600000e+00, -3.3500000e+00, -3.3400000e+00, -3.3300000e+00,
-3.3200000e+00, -3.3100000e+00, -3.3000000e+00, -3.2900000e+00,
-3.2800000e+00, -3.2700000e+00, -3.2600000e+00, -3.2500000e+00,
-3.2400000e+00, -3.2300000e+00, -3.2200000e+00, -3.2100000e+00,
-3.2000000e+00, -3.1900000e+00, -3.1800000e+00, -3.1700000e+00,
-3.1600000e+00, -3.1500000e+00, -3.1400000e+00, -3.1300000e+00,
-3.1200000e+00, -3.1100000e+00, -3.1000000e+00, -3.0900000e+00,
-3.0800000e+00, -3.0700000e+00, -3.0600000e+00, -3.0500000e+00,
-3.0400000e+00, -3.0300000e+00, -3.0200000e+00, -3.0100000e+00,
-3.0000000e+00, -2.9900000e+00, -2.9800000e+00, -2.9700000e+00,
-2.9600000e+00, -2.9500000e+00, -2.9400000e+00, -2.9300000e+00,
-2.9200000e+00, -2.9100000e+00, -2.9000000e+00, -2.8900000e+00,
-2.8800000e+00, -2.8700000e+00, -2.8600000e+00, -2.8500000e+00,
-2.8400000e+00, -2.8300000e+00, -2.8200000e+00, -2.8100000e+00,
-2.8000000e+00, -2.7900000e+00, -2.7800000e+00, -2.7700000e+00,
-2.7600000e+00, -2.7500000e+00, -2.7400000e+00, -2.7300000e+00,
-2.7200000e+00, -2.7100000e+00, -2.7000000e+00, -2.6900000e+00,
-2.6800000e+00, -2.6700000e+00, -2.6600000e+00, -2.6500000e+00,
-2.6400000e+00, -2.6300000e+00, -2.6200000e+00, -2.6100000e+00,
-2.6000000e+00, -2.5900000e+00, -2.5800000e+00, -2.5700000e+00,
-2.5600000e+00, -2.5500000e+00, -2.5400000e+00, -2.5300000e+00,
-2.5200000e+00, -2.5100000e+00, -2.5000000e+00, -2.4900000e+00,
-2.4800000e+00, -2.4700000e+00, -2.4600000e+00, -2.4500000e+00,
-2.4400000e+00, -2.4300000e+00, -2.4200000e+00, -2.4100000e+00,
-2.4000000e+00, -2.3900000e+00, -2.3800000e+00, -2.3700000e+00,
-2.3600000e+00, -2.3500000e+00, -2.3400000e+00, -2.3300000e+00,
-2.3200000e+00, -2.3100000e+00, -2.3000000e+00, -2.2900000e+00,
-2.2800000e+00, -2.2700000e+00, -2.2600000e+00, -2.2500000e+00,
-2.2400000e+00, -2.2300000e+00, -2.2200000e+00, -2.2100000e+00,
-2.2000000e+00, -2.1900000e+00, -2.1800000e+00, -2.1700000e+00,
-2.1600000e+00, -2.1500000e+00, -2.1400000e+00, -2.1300000e+00,
-2.1200000e+00, -2.1100000e+00, -2.1000000e+00, -2.0900000e+00,
-2.0800000e+00, -2.0700000e+00, -2.0600000e+00, -2.0500000e+00,
-2.0400000e+00, -2.0300000e+00, -2.0200000e+00, -2.0100000e+00,
-2.0000000e+00, -1.9900000e+00, -1.9800000e+00, -1.9700000e+00,
-1.9600000e+00, -1.9500000e+00, -1.9400000e+00, -1.9300000e+00,
-1.9200000e+00, -1.9100000e+00, -1.9000000e+00, -1.8900000e+00,
-1.8800000e+00, -1.8700000e+00, -1.8600000e+00, -1.8500000e+00,
-1.8400000e+00, -1.8300000e+00, -1.8200000e+00, -1.8100000e+00,
-1.8000000e+00, -1.7900000e+00, -1.7800000e+00, -1.7700000e+00,
-1.7600000e+00, -1.7500000e+00, -1.7400000e+00, -1.7300000e+00,
-1.7200000e+00, -1.7100000e+00, -1.7000000e+00, -1.6900000e+00,
-1.6800000e+00, -1.6700000e+00, -1.6600000e+00, -1.6500000e+00,
-1.6400000e+00, -1.6300000e+00, -1.6200000e+00, -1.6100000e+00,
-1.6000000e+00, -1.5900000e+00, -1.5800000e+00, -1.5700000e+00,
-1.5600000e+00, -1.5500000e+00, -1.5400000e+00, -1.5300000e+00,
-1.5200000e+00, -1.5100000e+00, -1.5000000e+00, -1.4900000e+00,
-1.4800000e+00, -1.4700000e+00, -1.4600000e+00, -1.4500000e+00,
-1.4400000e+00, -1.4300000e+00, -1.4200000e+00, -1.4100000e+00,
-1.4000000e+00, -1.3900000e+00, -1.3800000e+00, -1.3700000e+00,
-1.3600000e+00, -1.3500000e+00, -1.3400000e+00, -1.3300000e+00,
-1.3200000e+00, -1.3100000e+00, -1.3000000e+00, -1.2900000e+00,
-1.2800000e+00, -1.2700000e+00, -1.2600000e+00, -1.2500000e+00,
-1.2400000e+00, -1.2300000e+00, -1.2200000e+00, -1.2100000e+00,
-1.2000000e+00, -1.1900000e+00, -1.1800000e+00, -1.1700000e+00,
-1.1600000e+00, -1.1500000e+00, -1.1400000e+00, -1.1300000e+00,
-1.1200000e+00, -1.1100000e+00, -1.1000000e+00, -1.0900000e+00,
-1.0800000e+00, -1.0700000e+00, -1.0600000e+00, -1.0500000e+00,
-1.0400000e+00, -1.0300000e+00, -1.0200000e+00, -1.0100000e+00,
-1.0000000e+00, -9.9000000e-01, -9.8000000e-01, -9.7000000e-01,
-9.6000000e-01, -9.5000000e-01, -9.4000000e-01, -9.3000000e-01,
-9.2000000e-01, -9.1000000e-01, -9.0000000e-01, -8.9000000e-01,
-8.8000000e-01, -8.7000000e-01, -8.6000000e-01, -8.5000000e-01,
-8.4000000e-01, -8.3000000e-01, -8.2000000e-01, -8.1000000e-01,
-8.0000000e-01, -7.9000000e-01, -7.8000000e-01, -7.7000000e-01,
-7.6000000e-01, -7.5000000e-01, -7.4000000e-01, -7.3000000e-01,
-7.2000000e-01, -7.1000000e-01, -7.0000000e-01, -6.9000000e-01,
-6.8000000e-01, -6.7000000e-01, -6.6000000e-01, -6.5000000e-01,
-6.4000000e-01, -6.3000000e-01, -6.2000000e-01, -6.1000000e-01,
-6.0000000e-01, -5.9000000e-01, -5.8000000e-01, -5.7000000e-01,
-5.6000000e-01, -5.5000000e-01, -5.4000000e-01, -5.3000000e-01,
-5.2000000e-01, -5.1000000e-01, -5.0000000e-01, -4.9000000e-01,
-4.8000000e-01, -4.7000000e-01, -4.6000000e-01, -4.5000000e-01,
-4.4000000e-01, -4.3000000e-01, -4.2000000e-01, -4.1000000e-01,
-4.0000000e-01, -3.9000000e-01, -3.8000000e-01, -3.7000000e-01,
-3.6000000e-01, -3.5000000e-01, -3.4000000e-01, -3.3000000e-01,
-3.2000000e-01, -3.1000000e-01, -3.0000000e-01, -2.9000000e-01,
-2.8000000e-01, -2.7000000e-01, -2.6000000e-01, -2.5000000e-01,
-2.4000000e-01, -2.3000000e-01, -2.2000000e-01, -2.1000000e-01,
-2.0000000e-01, -1.9000000e-01, -1.8000000e-01, -1.7000000e-01,
-1.6000000e-01, -1.5000000e-01, -1.4000000e-01, -1.3000000e-01,
-1.2000000e-01, -1.1000000e-01, -1.0000000e-01, -9.0000000e-02,
-8.0000000e-02, -7.0000000e-02, -6.0000000e-02, -5.0000000e-02,
-4.0000000e-02, -3.0000000e-02, -2.0000000e-02, -1.0000000e-02,
-1.0658141e-13, 1.0000000e-02, 2.0000000e-02, 3.0000000e-02,
4.0000000e-02, 5.0000000e-02, 6.0000000e-02, 7.0000000e-02,
8.0000000e-02, 9.0000000e-02, 1.0000000e-01, 1.1000000e-01,
1.2000000e-01, 1.3000000e-01, 1.4000000e-01, 1.5000000e-01,
1.6000000e-01, 1.7000000e-01, 1.8000000e-01, 1.9000000e-01,
2.0000000e-01, 2.1000000e-01, 2.2000000e-01, 2.3000000e-01,
2.4000000e-01, 2.5000000e-01, 2.6000000e-01, 2.7000000e-01,
2.8000000e-01, 2.9000000e-01, 3.0000000e-01, 3.1000000e-01,
3.2000000e-01, 3.3000000e-01, 3.4000000e-01, 3.5000000e-01,
3.6000000e-01, 3.7000000e-01, 3.8000000e-01, 3.9000000e-01,
4.0000000e-01, 4.1000000e-01, 4.2000000e-01, 4.3000000e-01,
4.4000000e-01, 4.5000000e-01, 4.6000000e-01, 4.7000000e-01,
4.8000000e-01, 4.9000000e-01, 5.0000000e-01, 5.1000000e-01,
5.2000000e-01, 5.3000000e-01, 5.4000000e-01, 5.5000000e-01,
5.6000000e-01, 5.7000000e-01, 5.8000000e-01, 5.9000000e-01,
6.0000000e-01, 6.1000000e-01, 6.2000000e-01, 6.3000000e-01,
6.4000000e-01, 6.5000000e-01, 6.6000000e-01, 6.7000000e-01,
6.8000000e-01, 6.9000000e-01, 7.0000000e-01, 7.1000000e-01,
7.2000000e-01, 7.3000000e-01, 7.4000000e-01, 7.5000000e-01,
7.6000000e-01, 7.7000000e-01, 7.8000000e-01, 7.9000000e-01,
8.0000000e-01, 8.1000000e-01, 8.2000000e-01, 8.3000000e-01,
8.4000000e-01, 8.5000000e-01, 8.6000000e-01, 8.7000000e-01,
8.8000000e-01, 8.9000000e-01, 9.0000000e-01, 9.1000000e-01,
9.2000000e-01, 9.3000000e-01, 9.4000000e-01, 9.5000000e-01,
9.6000000e-01, 9.7000000e-01, 9.8000000e-01, 9.9000000e-01,
1.0000000e+00, 1.0100000e+00, 1.0200000e+00, 1.0300000e+00,
1.0400000e+00, 1.0500000e+00, 1.0600000e+00, 1.0700000e+00,
1.0800000e+00, 1.0900000e+00, 1.1000000e+00, 1.1100000e+00,
1.1200000e+00, 1.1300000e+00, 1.1400000e+00, 1.1500000e+00,
1.1600000e+00, 1.1700000e+00, 1.1800000e+00, 1.1900000e+00,
1.2000000e+00, 1.2100000e+00, 1.2200000e+00, 1.2300000e+00,
1.2400000e+00, 1.2500000e+00, 1.2600000e+00, 1.2700000e+00,
1.2800000e+00, 1.2900000e+00, 1.3000000e+00, 1.3100000e+00,
1.3200000e+00, 1.3300000e+00, 1.3400000e+00, 1.3500000e+00,
1.3600000e+00, 1.3700000e+00, 1.3800000e+00, 1.3900000e+00,
1.4000000e+00, 1.4100000e+00, 1.4200000e+00, 1.4300000e+00,
1.4400000e+00, 1.4500000e+00, 1.4600000e+00, 1.4700000e+00,
1.4800000e+00, 1.4900000e+00, 1.5000000e+00, 1.5100000e+00,
1.5200000e+00, 1.5300000e+00, 1.5400000e+00, 1.5500000e+00,
1.5600000e+00, 1.5700000e+00, 1.5800000e+00, 1.5900000e+00,
1.6000000e+00, 1.6100000e+00, 1.6200000e+00, 1.6300000e+00,
1.6400000e+00, 1.6500000e+00, 1.6600000e+00, 1.6700000e+00,
1.6800000e+00, 1.6900000e+00, 1.7000000e+00, 1.7100000e+00,
1.7200000e+00, 1.7300000e+00, 1.7400000e+00, 1.7500000e+00,
1.7600000e+00, 1.7700000e+00, 1.7800000e+00, 1.7900000e+00,
1.8000000e+00, 1.8100000e+00, 1.8200000e+00, 1.8300000e+00,
1.8400000e+00, 1.8500000e+00, 1.8600000e+00, 1.8700000e+00,
1.8800000e+00, 1.8900000e+00, 1.9000000e+00, 1.9100000e+00,
1.9200000e+00, 1.9300000e+00, 1.9400000e+00, 1.9500000e+00,
1.9600000e+00, 1.9700000e+00, 1.9800000e+00, 1.9900000e+00,
2.0000000e+00, 2.0100000e+00, 2.0200000e+00, 2.0300000e+00,
2.0400000e+00, 2.0500000e+00, 2.0600000e+00, 2.0700000e+00,
2.0800000e+00, 2.0900000e+00, 2.1000000e+00, 2.1100000e+00,
2.1200000e+00, 2.1300000e+00, 2.1400000e+00, 2.1500000e+00,
2.1600000e+00, 2.1700000e+00, 2.1800000e+00, 2.1900000e+00,
2.2000000e+00, 2.2100000e+00, 2.2200000e+00, 2.2300000e+00,
2.2400000e+00, 2.2500000e+00, 2.2600000e+00, 2.2700000e+00,
2.2800000e+00, 2.2900000e+00, 2.3000000e+00, 2.3100000e+00,
2.3200000e+00, 2.3300000e+00, 2.3400000e+00, 2.3500000e+00,
2.3600000e+00, 2.3700000e+00, 2.3800000e+00, 2.3900000e+00,
2.4000000e+00, 2.4100000e+00, 2.4200000e+00, 2.4300000e+00,
2.4400000e+00, 2.4500000e+00, 2.4600000e+00, 2.4700000e+00,
2.4800000e+00, 2.4900000e+00, 2.5000000e+00, 2.5100000e+00,
2.5200000e+00, 2.5300000e+00, 2.5400000e+00, 2.5500000e+00,
2.5600000e+00, 2.5700000e+00, 2.5800000e+00, 2.5900000e+00,
2.6000000e+00, 2.6100000e+00, 2.6200000e+00, 2.6300000e+00,
2.6400000e+00, 2.6500000e+00, 2.6600000e+00, 2.6700000e+00,
2.6800000e+00, 2.6900000e+00, 2.7000000e+00, 2.7100000e+00,
2.7200000e+00, 2.7300000e+00, 2.7400000e+00, 2.7500000e+00,
2.7600000e+00, 2.7700000e+00, 2.7800000e+00, 2.7900000e+00,
2.8000000e+00, 2.8100000e+00, 2.8200000e+00, 2.8300000e+00,
2.8400000e+00, 2.8500000e+00, 2.8600000e+00, 2.8700000e+00,
2.8800000e+00, 2.8900000e+00, 2.9000000e+00, 2.9100000e+00,
2.9200000e+00, 2.9300000e+00, 2.9400000e+00, 2.9500000e+00,
2.9600000e+00, 2.9700000e+00, 2.9800000e+00, 2.9900000e+00,
3.0000000e+00, 3.0100000e+00, 3.0200000e+00, 3.0300000e+00,
3.0400000e+00, 3.0500000e+00, 3.0600000e+00, 3.0700000e+00,
3.0800000e+00, 3.0900000e+00, 3.1000000e+00, 3.1100000e+00,
3.1200000e+00, 3.1300000e+00, 3.1400000e+00, 3.1500000e+00,
3.1600000e+00, 3.1700000e+00, 3.1800000e+00, 3.1900000e+00,
3.2000000e+00, 3.2100000e+00, 3.2200000e+00, 3.2300000e+00,
3.2400000e+00, 3.2500000e+00, 3.2600000e+00, 3.2700000e+00,
3.2800000e+00, 3.2900000e+00, 3.3000000e+00, 3.3100000e+00,
3.3200000e+00, 3.3300000e+00, 3.3400000e+00, 3.3500000e+00,
3.3600000e+00, 3.3700000e+00, 3.3800000e+00, 3.3900000e+00,
3.4000000e+00, 3.4100000e+00, 3.4200000e+00, 3.4300000e+00,
3.4400000e+00, 3.4500000e+00, 3.4600000e+00, 3.4700000e+00,
3.4800000e+00, 3.4900000e+00, 3.5000000e+00, 3.5100000e+00,
3.5200000e+00, 3.5300000e+00, 3.5400000e+00, 3.5500000e+00,
3.5600000e+00, 3.5700000e+00, 3.5800000e+00, 3.5900000e+00,
3.6000000e+00, 3.6100000e+00, 3.6200000e+00, 3.6300000e+00,
3.6400000e+00, 3.6500000e+00, 3.6600000e+00, 3.6700000e+00,
3.6800000e+00, 3.6900000e+00, 3.7000000e+00, 3.7100000e+00,
3.7200000e+00, 3.7300000e+00, 3.7400000e+00, 3.7500000e+00,
3.7600000e+00, 3.7700000e+00, 3.7800000e+00, 3.7900000e+00,
3.8000000e+00, 3.8100000e+00, 3.8200000e+00, 3.8300000e+00,
3.8400000e+00, 3.8500000e+00, 3.8600000e+00, 3.8700000e+00,
3.8800000e+00, 3.8900000e+00, 3.9000000e+00, 3.9100000e+00,
3.9200000e+00, 3.9300000e+00, 3.9400000e+00, 3.9500000e+00,
3.9600000e+00, 3.9700000e+00, 3.9800000e+00, 3.9900000e+00,
4.0000000e+00, 4.0100000e+00, 4.0200000e+00, 4.0300000e+00,
4.0400000e+00, 4.0500000e+00, 4.0600000e+00, 4.0700000e+00,
4.0800000e+00, 4.0900000e+00, 4.1000000e+00, 4.1100000e+00,
4.1200000e+00, 4.1300000e+00, 4.1400000e+00, 4.1500000e+00,
4.1600000e+00, 4.1700000e+00, 4.1800000e+00, 4.1900000e+00,
4.2000000e+00, 4.2100000e+00, 4.2200000e+00, 4.2300000e+00,
4.2400000e+00, 4.2500000e+00, 4.2600000e+00, 4.2700000e+00,
4.2800000e+00, 4.2900000e+00, 4.3000000e+00, 4.3100000e+00,
4.3200000e+00, 4.3300000e+00, 4.3400000e+00, 4.3500000e+00,
4.3600000e+00, 4.3700000e+00, 4.3800000e+00, 4.3900000e+00,
4.4000000e+00, 4.4100000e+00, 4.4200000e+00, 4.4300000e+00,
4.4400000e+00, 4.4500000e+00, 4.4600000e+00, 4.4700000e+00,
4.4800000e+00, 4.4900000e+00, 4.5000000e+00, 4.5100000e+00,
4.5200000e+00, 4.5300000e+00, 4.5400000e+00, 4.5500000e+00,
4.5600000e+00, 4.5700000e+00, 4.5800000e+00, 4.5900000e+00,
4.6000000e+00, 4.6100000e+00, 4.6200000e+00, 4.6300000e+00,
4.6400000e+00, 4.6500000e+00, 4.6600000e+00, 4.6700000e+00,
4.6800000e+00, 4.6900000e+00, 4.7000000e+00, 4.7100000e+00,
4.7200000e+00, 4.7300000e+00, 4.7400000e+00, 4.7500000e+00,
4.7600000e+00, 4.7700000e+00, 4.7800000e+00, 4.7900000e+00,
4.8000000e+00, 4.8100000e+00, 4.8200000e+00, 4.8300000e+00,
4.8400000e+00, 4.8500000e+00, 4.8600000e+00, 4.8700000e+00,
4.8800000e+00, 4.8900000e+00, 4.9000000e+00, 4.9100000e+00,
4.9200000e+00, 4.9300000e+00, 4.9400000e+00, 4.9500000e+00,
4.9600000e+00, 4.9700000e+00, 4.9800000e+00, 4.9900000e+00])
xs,ys = np.meshgrid(points,points)
xs
array([[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
...,
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99]])
ys
array([[-5. , -5. , -5. , ..., -5. , -5. , -5. ],
[-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
[-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
...,
[ 4.97, 4.97, 4.97, ..., 4.97, 4.97, 4.97],
[ 4.98, 4.98, 4.98, ..., 4.98, 4.98, 4.98],
[ 4.99, 4.99, 4.99, ..., 4.99, 4.99, 4.99]])
z = np.sqrt(xs**2+ys**2)
z
array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
7.06400028],
[7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
7.05692568],
[7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
7.04985815],
...,
[7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
7.04279774],
[7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
7.04985815],
[7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
7.05692568]])
import matplotlib.pyplot as plt
plt.imshow(z)
[Figure: output_399_1.png, image plot of z produced by plt.imshow(z)]
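4.3.1 Expressing Conditional Logic as Array Operations
- The numpy.where function is a vectorized version of the ternary expression x if condition else y.
- The second and third arguments to np.where do not need to be arrays; either or both can be scalars.
- The arrays passed to np.where can be equal-sized arrays, scalars, or a mix of the two.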
xarr = [1.1,1.2,1.3,1.4,1.5]
yarr = [2.1,2.2,2.3,2.4,2.5]
cond = [False,True,False,False,True]
result = [(x if c else y) for x,y,c in zip(xarr,yarr,cond)]
result
[2.1, 1.2, 2.3, 2.4, 1.5]
result = np.where(cond,xarr,yarr)
result
array([2.1, 1.2, 2.3, 2.4, 1.5])
arr = np.random.randn(4,4)
arr
array([[ 0.0279815 , -0.03226408, 0.86484912, -0.08501145],
[ 0.98888936, -0.79318502, 0.23349129, 0.5547258 ],
[ 0.33158546, 0.31742717, -1.59492796, 0.29142641],
[ 0.33042836, -1.78994005, 0.11780007, -1.47899861]])
result = np.where(arr<0,arr,2)
result
array([[ 2. , -0.03226408, 2. , -0.08501145],
[ 2. , -0.79318502, 2. , 2. ],
[ 2. , 2. , -1.59492796, 2. ],
[ 2. , -1.78994005, 2. , -1.47899861]])
result = np.where(arr<0,-2,2)
result
array([[ 2, -2, 2, -2],
[ 2, -2, 2, 2],
[ 2, 2, -2, 2],
[ 2, -2, 2, -2]])
np.where(arr<0,2,arr)
array([[0.0279815 , 2. , 0.86484912, 2. ],
[0.98888936, 2. , 0.23349129, 0.5547258 ],
[0.33158546, 0.31742717, 2. , 0.29142641],
[0.33042836, 2. , 0.11780007, 2. ]])
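4.3.2 Mathematical and Statistical Methods
- Aggregations (often called reductions) such as sum, mean, and std compute a statistic over an entire array or along one axis.
- You can call them either as methods of the array instance or as top-level NumPy functions.
- Functions such as mean and sum accept an optional axis argument.
- cumsum and cumprod do not aggregate; they produce an array of intermediate results.
- Basic array statistical methods
| Method | Description |
| ---- | ---- |
| sum | Sum of all elements in the array or along an axis; zero-length arrays have sum 0 |
| mean | Arithmetic mean; zero-length arrays have NaN mean |
| std, var | Standard deviation and variance, with an optional degrees-of-freedom adjustment (the default denominator is n) |
| min, max | Minimum and maximum |
| argmin, argmax | Positions of the minimum and maximum elements |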
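A short sketch of the axis argument and the degrees-of-freedom adjustment mentioned in the table. The array `data` is an assumed example; `ddof` is the NumPy keyword that controls the denominator (n - ddof).

```python
import numpy as np

# Hypothetical 2 x 3 example array.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

col_sums = data.sum(axis=0)     # collapse the rows: one sum per column
row_means = data.mean(axis=1)   # collapse the columns: one mean per row

pop_std = data.std()            # default ddof=0, denominator n
sample_std = data.std(ddof=1)   # ddof=1, denominator n - 1

flat_pos = data.argmax()        # position of the maximum in the flattened array

print(col_sums, row_means, pop_std, sample_std, flat_pos)
```

Without an axis argument, argmax and argmin report a position in the flattened array, which is why the last value is a single integer rather than a row/column pair.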
arr = np.arange(12).reshape(3,4)
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
arr.mean()
5.5
np.mean(arr)
5.5
arr.sum()
66
np.sum(arr)
66
arr.mean(axis = 1)
array([1.5, 5.5, 9.5])
arr.mean(1)
array([1.5, 5.5, 9.5])
arr.sum(axis = 0)
array([12, 15, 18, 21])
arr.sum(0)
array([12, 15, 18, 21])
arr = np.array([1,2,3,4,5,6,7])
arr.cumsum()
array([ 1, 3, 6, 10, 15, 21, 28], dtype=int32)
arr.cumprod()
array([ 1, 2, 6, 24, 120, 720, 5040], dtype=int32)
arr = np.array([[1,2,3],[4,5,6]])
arr
array([[1, 2, 3],
[4, 5, 6]])
arr.cumsum(1)
array([[ 1, 3, 6],
[ 4, 9, 15]], dtype=int32)
arr.cumsum(0)
array([[1, 2, 3],
[5, 7, 9]], dtype=int32)
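4.3.3 Methods for Boolean Arrays
- Boolean values are coerced to 1 (True) and 0 (False), so sum is often used to count the True values in a boolean array.
- Two methods, any and all, are especially useful for boolean arrays: any tests whether at least one value is True, while all checks whether every value is True.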
arr = np.random.randn(100)
(arr>0).sum()
49
bools = np.array([False,True,False,True])
bools.any()
True
bools.all()
False
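4.3.4 Sorting
- Like Python's built-in list type, NumPy arrays can be sorted in place with the sort method.
- The top-level function np.sort returns a sorted copy of the array instead of sorting the original in place.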
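The difference between the two sorting styles can be sketched as follows. The array `original` is an assumed example.

```python
import numpy as np

original = np.array([3, 1, 2])

# np.sort returns a new sorted array; the input is left untouched.
sorted_copy = np.sort(original)
untouched = original.copy()      # snapshot: still [3, 1, 2]

# The sort method reorders the array in place and returns None.
return_value = original.sort()

print(sorted_copy, untouched, original, return_value)
```

Use np.sort when the original order still matters later on, and the method when the extra copy is not needed.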
arr = np.random.randn(6)
arr
array([-0.09465593, 0.48442829, -0.80286736, 0.29244109, -0.31614414,
-0.4904135 ])
arr.sort()
arr
array([-0.80286736, -0.4904135 , -0.31614414, -0.09465593, 0.29244109,
0.48442829])
arr = np.random.randn(5,3)
arr
array([[ 0.73589519, -0.25249415, -0.18110702],
[ 0.18183926, 0.36041412, -1.20684446],
[ 0.16825528, -0.55328778, 0.77363473],
[-2.38287785, -0.04767401, -0.22954949],
[-0.49612544, 1.74657401, 0.02651629]])
arr.sort()
arr
array([[-0.25249415, -0.18110702, 0.73589519],
[-1.20684446, 0.18183926, 0.36041412],
[-0.55328778, 0.16825528, 0.77363473],
[-2.38287785, -0.22954949, -0.04767401],
[-0.49612544, 0.02651629, 1.74657401]])
arr.sort(1)
arr
array([[-0.25249415, -0.18110702, 0.73589519],
[-1.20684446, 0.18183926, 0.36041412],
[-0.55328778, 0.16825528, 0.77363473],
[-2.38287785, -0.22954949, -0.04767401],
[-0.49612544, 0.02651629, 1.74657401]])
arr.sort(0)
arr
array([[-2.38287785, -0.22954949, -0.04767401],
[-1.20684446, -0.18110702, 0.36041412],
[-0.55328778, 0.02651629, 0.73589519],
[-0.49612544, 0.16825528, 0.77363473],
[-0.25249415, 0.18183926, 1.74657401]])
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(len(large_arr)*0.05)]
-1.667884606593057
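4.3.5 Unique and Other Set Logic
- np.unique returns the sorted unique values in an array.
- np.in1d tests whether the values in one array are present in another array and returns a boolean array.
- Array set operations
| Method | Description |
| ---- | ---- |
|intersect1d(x, y)|Compute the sorted intersection of x and y|
|union1d(x, y)|Compute the sorted union of x and y|
|in1d(x, y)|Compute a boolean array indicating whether each element of x is contained in y|
|setdiff1d(x, y)|Set difference: elements of x that are not in y|
|setxor1d(x, y)|Symmetric difference: elements that are in x or y but not in both|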
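The set operations in the table can be tried on two small arrays. The arrays `x` and `y` are assumed examples. The membership test uses np.isin, which does the same job as in1d; newer NumPy releases deprecate in1d in its favor.

```python
import numpy as np

# Hypothetical inputs; x contains a duplicate on purpose.
x = np.array([1, 2, 3, 4, 4])
y = np.array([3, 4, 5])

common = np.intersect1d(x, y)   # sorted values present in both
either = np.union1d(x, y)       # sorted values present in at least one
only_x = np.setdiff1d(x, y)     # values of x that are not in y
not_both = np.setxor1d(x, y)    # values in exactly one of the two arrays
member = np.isin(x, y)          # one boolean per element of x

print(common, either, only_x, not_both, member)
```

All of these except the membership test return sorted, de-duplicated results, so the repeated 4 in `x` appears only once.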
names = np.array(["Bob","Joe","Will","Bob","Will","Bob","Bob"])
np.unique(names)
array(['Bob', 'Joe', 'Will'], dtype='<U4')
ints = np.array([3,3,3,4,4,2,2,1])
np.unique(ints)
array([1, 2, 3, 4])
sorted(set(ints))
[1, 2, 3, 4]
values = np.array([6,0,0,2,3,5,6])
np.in1d(values,[6,5])
array([ True, False, False, False, False, True, True])
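4.4 File Input and Output with Arrays
- NumPy can save data to disk and load it from disk in either text or binary format.
- np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed format with the file extension .npy.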
arr = np.arange(10)
np.save('some_array',arr)
np.load('some_array.npy')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.savez('array_archive.npz',a = arr,b = arr)
arch = np.load('array_archive.npz')
arch['b']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.savez_compressed('arrays_compressed.npz',a = arr,b = arr)
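4.5 Linear Algebra
- x.dot(y) is equivalent to np.dot(x, y).
- The @ symbol also works as an infix operator that performs matrix multiplication.
- Commonly used numpy.linalg functions
| Function | Description |
| ---- | ---- |
|diag|Return the diagonal (or off-diagonal) elements of a square matrix as a 1D array, or convert a 1D array into a square matrix with zeros off the diagonal|
|dot|Matrix multiplication|
|trace|Compute the sum of the diagonal elements|
|det|Compute the determinant of a matrix|
|eig|Compute the eigenvalues and eigenvectors of a square matrix|
|inv|Compute the inverse of a square matrix|
|pinv|Compute the Moore-Penrose pseudoinverse of a matrix|
|qr|Compute the QR decomposition|
|svd|Compute the singular value decomposition (SVD)|
|solve|Solve the linear system Ax = b for x, where A is a square matrix|
|lstsq|Compute the least-squares solution to Ax = b|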
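A few of the functions from the table, sketched on a small system. The matrix `A` and vector `b` are assumed examples chosen so that the results are easy to check by hand.

```python
import numpy as np
from numpy.linalg import det, inv, qr, solve

# Hypothetical 2 x 2 system: 4x + y = 1, 2x + 3y = 2.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

determinant = det(A)   # 4*3 - 1*2
A_inv = inv(A)         # inverse, so A @ A_inv is the identity
x = solve(A, b)        # solves A x = b without forming the inverse
q, r = qr(A)           # A = q @ r, q orthogonal, r upper triangular

print(determinant)
print(np.allclose(A @ A_inv, np.eye(2)))
print(x)
print(np.allclose(q @ r, A))
```

For solving a linear system, solve is generally preferable to multiplying by inv(A), since it avoids computing the inverse explicitly.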
x = np.array([[1,2,3],[4,5,6]])
y = np.array([[6,23],[-1,7],[8,9]])
x
array([[1, 2, 3],
[4, 5, 6]])
y
array([[ 6, 23],
[-1, 7],
[ 8, 9]])
x.dot(y)
array([[ 28, 64],
[ 67, 181]])
np.dot(x, y)
array([[ 28, 64],
[ 67, 181]])
np.dot(x,np.ones(3))
array([ 6., 15.])
np.dot(x,np.zeros(3))
array([0., 0.])
x @ np.ones(3)
array([ 6., 15.])
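4.6 Pseudorandom Number Generation
- The numpy.random module supplements Python's built-in random module with functions that efficiently generate whole arrays of sample values from many kinds of probability distributions.
- You can change NumPy's random number generation seed with np.random.seed.
- Partial list of numpy.random functions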
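The effect of seeding can be sketched with two independent generators. The seed values and the variable names are arbitrary choices for this illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same stream.
rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)
draw_a = rng_a.normal(size=5)
draw_b = rng_b.normal(size=5)

# A generator with a different seed produces a different stream.
rng_c = np.random.RandomState(4321)
draw_c = rng_c.normal(size=5)

same_seed_equal = np.array_equal(draw_a, draw_b)
other_seed_equal = np.array_equal(draw_a, draw_c)
print(same_seed_equal, other_seed_equal)
```

A RandomState object keeps its own state, so it does not interfere with code that relies on the global seed set through np.random.seed.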
samples = np.random.normal(size = (4,4))
samples
array([[ 0.5380071 , 0.47450501, 0.03845957, -1.02357224],
[ 0.03490259, -0.84530932, 1.13398569, -1.19187065],
[ 1.48156032, 0.46925679, -0.38656015, -1.01776325],
[ 0.44296535, -0.47960224, -1.46982352, 0.45302383]])
np.random.seed(1234)
rng = np.random.RandomState(1234)
rng.randn(10)
array([ 0.47143516, -1.19097569, 1.43270697, -0.3126519 , -0.72058873,
0.88716294, 0.85958841, -0.6365235 , 0.01569637, -2.24268495])
4.7 Example: Random Walks
- Simulating a random walk (https://en.wikipedia.org/wiki/Random_walk) is an illustrative application of array operations.
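A random walk can also be written with array operations instead of a Python loop. The sketch below is one possible version under stated assumptions: the seed, the number of steps, and the threshold of 10 are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.RandomState(12345)       # arbitrary seed for repeatability
nsteps = 1000

draws = rng.randint(0, 2, size=nsteps)   # each draw is 0 or 1
steps = np.where(draws > 0, 1, -1)       # map to +1 / -1 steps
walk = steps.cumsum()                    # position after each step

lowest, highest = walk.min(), walk.max()

# First step at which the walk is at least 10 away from the origin, if any.
crossed = np.abs(walk) >= 10
first_crossing = crossed.argmax() if crossed.any() else None

print(lowest, highest, first_crossing)
```

Unlike a list that starts with the initial position 0, this array holds only the positions after each step, so it has exactly nsteps elements.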
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    # Move one step up or down with equal probability
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)
import matplotlib.pyplot as plt
plt.plot(walk[:100])
[Figure: output_464_1.png, line plot of the first 100 steps of the random walk]
4.8 Conclusion
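Chapter 5 Getting Started with pandas
- The biggest difference is that pandas is designed for working with tabular or heterogeneous data, whereas NumPy is best suited to homogeneous numerical array data.
- It can also be convenient to import Series and DataFrame into the local namespace, since they are used so frequently.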
import pandas as pd
from pandas import Series,DataFrame
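5.1 Introduction to pandas Data Structures
5.1.1 Series
- A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index.
- The string representation of a Series displayed interactively shows the index on the left and the values on the right.
- The default index consists of the integers 0 through N - 1, where N is the length of the data.
- The values and index attributes return the values and the index of a Series, respectively.
- Often you will want to create a Series with an index that identifies each data point with a label.
- Compared with NumPy arrays, you can use labels in the index when selecting values.
- The isnull and notnull functions in pandas detect missing data.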
obj = pd.Series([4,7,-5,3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
obj.index
RangeIndex(start=0, stop=4, step=1)
obj.values
array([ 4, 7, -5, 3], dtype=int64)
obj2 = pd.Series([4,7,-5,3],index = ["a","b","c","d"])
obj2
a 4
b 7
c -5
d 3
dtype: int64
obj2.index
Index(['a', 'b', 'c', 'd'], dtype='object')
obj2["b"]
7
obj2[['c', 'a', 'd']]
c -5
a 4
d 3
dtype: int64
obj2[obj2>0]
a 4
b 7
d 3
dtype: int64
np.exp(obj2)
a 54.598150
b 1096.633158
c 0.006738
d 20.085537
dtype: float64
'b' in obj2
True
'f' in obj2
False
sdata = {"ohin":3500,"Texs":2000,"orenge":9000}
obj3 = pd.Series(sdata)
obj3
ohin 3500
Texs 2000
orenge 9000
dtype: int64
states = ["California","ohin","Texs"]
obj4 = pd.Series(sdata,states)
obj4
California NaN
ohin 3500.0
Texs 2000.0
dtype: float64
pd.isnull(obj4)
California True
ohin False
Texs False
dtype: bool
pd.notnull(obj4)
California False
ohin True
Texs True
dtype: bool
obj4.isnull()
California True
ohin False
Texs False
dtype: bool
obj4.notnull()
California False
ohin True
Texs True
dtype: bool
obj3
ohin 3500
Texs 2000
orenge 9000
dtype: int64
obj4
California NaN
ohin 3500.0
Texs 2000.0
dtype: float64
obj3+obj4
California NaN
Texs 4000.0
ohin 7000.0
orenge NaN
dtype: float64
obj4.name = "population"
obj4
California NaN
ohin 3500.0
Texs 2000.0
Name: population, dtype: float64
obj4.index.name = "states"
obj4
states
California NaN
ohin 3500.0
Texs 2000.0
Name: population, dtype: float64
obj
0 4
1 7
2 -5
3 3
dtype: int64
obj.index = ['Bob','Jef','Steve','Robot']
obj
Bob 4
Jef 7
Steve -5
Robot 3
dtype: int64
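5.1.2 DataFrame
- A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, and so on).
- A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.
- Possible data inputs to the DataFrame constructor
| Input | Description |
| ---- | ---- |
|2D ndarray|A matrix of data; row and column labels are optional arguments|
|Dict of arrays, lists, or tuples|Each sequence becomes a column in the DataFrame; all sequences must be the same length|
|NumPy structured/record array|Treated like the dict of arrays case|
|Dict of Series|Each value becomes a column; the indexes of the Series are unioned to form the row index of the result unless an index is passed explicitly|
|Dict of dicts|Each inner dict becomes a column; the keys are unioned to form the row index of the result|
|List of dicts or Series|Each item becomes a row in the DataFrame; the union of dict keys or Series indexes becomes the column labels|
|List of lists or tuples|Treated like the 2D ndarray case|
|Another DataFrame|The indexes of the original DataFrame are used unless different ones are passed explicitly|
|NumPy MaskedArray|Like the 2D ndarray case, except that masked values become NA/missing in the resulting DataFrame|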
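Three of the input types from the table, sketched with small assumed inputs. All labels and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"],
                          columns=["a", "b", "c"])

# List of dicts: each dict becomes a row; the keys are unioned into columns.
records = [{"a": 1, "b": 2}, {"a": 3, "c": 4}]
from_records = pd.DataFrame(records)

# Dict of Series: the indexes are unioned to form the row index.
from_series = pd.DataFrame({
    "x": pd.Series([1.0, 2.0], index=["p", "q"]),
    "y": pd.Series([3.0], index=["q"]),
})

print(from_array)
print(from_records)
print(from_series)
```

Wherever a row or column has no value in the input, pandas fills the cell with NaN, as with column `b` in the second record and `y` at label `p`.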
data = { 'states':['ohio','ohio','ohio','Nevada','Nevada','Nevada']
,'year':[2001,2002,2003,2004,2005,2006]
,'pop':[1.5,1.7,2.3,3.6,2.9,3.2]
}
fram = pd.DataFrame(data)
fram
| | states | year | pop |
| ---- | ---- | ---- | ---- |
| 0 | ohio | 2001 | 1.5 |
| 1 | ohio | 2002 | 1.7 |
| 2 | ohio | 2003 | 2.3 |
| 3 | Nevada | 2004 | 3.6 |
| 4 | Nevada | 2005 | 2.9 |
| 5 | Nevada | 2006 | 3.2 |
fram.head()
| | states | year | pop |
| ---- | ---- | ---- | ---- |
| 0 | ohio | 2001 | 1.5 |
| 1 | ohio | 2002 | 1.7 |
| 2 | ohio | 2003 | 2.3 |
| 3 | Nevada | 2004 | 3.6 |
| 4 | Nevada | 2005 | 2.9 |
pd.DataFrame(data,columns = ['year','pop','states'])
| | year | pop | states |
| ---- | ---- | ---- | ---- |
| 0 | 2001 | 1.5 | ohio |
| 1 | 2002 | 1.7 | ohio |
| 2 | 2003 | 2.3 | ohio |
| 3 | 2004 | 3.6 | Nevada |
| 4 | 2005 | 2.9 | Nevada |
| 5 | 2006 | 3.2 | Nevada |
fram2 = pd.DataFrame(data,columns = ['year','pop','states','debt'],index = ['one','two','three','four','five','six'])
fram2
| | year | pop | states | debt |
| ---- | ---- | ---- | ---- | ---- |
| one | 2001 | 1.5 | ohio | NaN |
| two | 2002 | 1.7 | ohio | NaN |
| three | 2003 | 2.3 | ohio | NaN |
| four | 2004 | 3.6 | Nevada | NaN |
| five | 2005 | 2.9 | Nevada | NaN |
| six | 2006 | 3.2 | Nevada | NaN |
fram2['states']
one ohio
two ohio
three ohio
four Nevada
five Nevada
six Nevada
Name: states, dtype: object
fram2.year
one 2001
two 2002
three 2003
four 2004
five 2005
six 2006
Name: year, dtype: int64
fram2.loc['three']
year 2003
pop 2.3
states ohio
debt NaN
Name: three, dtype: object
fram2['debt'] = 16.5
fram2
| | year | pop | states | debt |
| ---- | ---- | ---- | ---- | ---- |
| one | 2001 | 1.5 | ohio | 16.5 |
| two | 2002 | 1.7 | ohio | 16.5 |
| three | 2003 | 2.3 | ohio | 16.5 |
| four | 2004 | 3.6 | Nevada | 16.5 |
| five | 2005 | 2.9 | Nevada | 16.5 |
| six | 2006 | 3.2 | Nevada | 16.5 |
fram2['debt'] = np.arange(6.0)
fram2
| | year | pop | states | debt |
| ---- | ---- | ---- | ---- | ---- |
| one | 2001 | 1.5 | ohio | 0.0 |
| two | 2002 | 1.7 | ohio | 1.0 |
| three | 2003 | 2.3 | ohio | 2.0 |
| four | 2004 | 3.6 | Nevada | 3.0 |
| five | 2005 | 2.9 | Nevada | 4.0 |
| six | 2006 | 3.2 | Nevada | 5.0 |
val = pd.Series([-1.2,-1.5,-1.7],index = ['one','four','five'])
fram2['debt'] = val
fram2
| | year | pop | states | debt |
| ---- | ---- | ---- | ---- | ---- |
| one | 2001 | 1.5 | ohio | -1.2 |
| two | 2002 | 1.7 | ohio | NaN |
| three | 2003 | 2.3 | ohio | NaN |
| four | 2004 | 3.6 | Nevada | -1.5 |
| five | 2005 | 2.9 | Nevada | -1.7 |
| six | 2006 | 3.2 | Nevada | NaN |
fram2['eastern'] = fram2.states == 'ohio'
fram2
| | year | pop | states | debt | eastern |
| ---- | ---- | ---- | ---- | ---- | ---- |
| one | 2001 | 1.5 | ohio | -1.2 | True |
| two | 2002 | 1.7 | ohio | NaN | True |
| three | 2003 | 2.3 | ohio | NaN | True |
| four | 2004 | 3.6 | Nevada | -1.5 | False |
| five | 2005 | 2.9 | Nevada | -1.7 | False |
| six | 2006 | 3.2 | Nevada | NaN | False |
del fram2['eastern']
fram2.columns
Index(['year', 'pop', 'states', 'debt'], dtype='object')
pop = { 'Nevada':{2001:2.4,2002:2.9},
'ohio':{2000:1.5,2001:1.7,2002:3.6}}
fram3 = pd.DataFrame(pop)
fram3
| | Nevada | ohio |
| ---- | ---- | ---- |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |
fram3.T
| | 2001 | 2002 | 2000 |
| ---- | ---- | ---- | ---- |
| Nevada | 2.4 | 2.9 | NaN |
| ohio | 1.7 | 3.6 | 1.5 |
fram4 = pd.DataFrame(pop,index = [2001,2002,2003])
fram4
| | Nevada | ohio |
| ---- | ---- | ---- |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2003 | NaN | NaN |
pdata = { 'Nevada':fram3['Nevada'][:-1],
'ohio':fram3['ohio'][:-2]}
pd.DataFrame(pdata)
| | Nevada | ohio |
| ---- | ---- | ---- |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | NaN |
fram3.index.name = 'year'
fram3.columns.name = 'state'
fram3
| state | Nevada | ohio |
| ---- | ---- | ---- |
| year | | |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |
fram3.values
array([[2.4, 1.7],
[2.9, 3.6],
[nan, 1.5]])
fram2.values
array([[2001, 1.5, 'ohio', -1.2],
[2002, 1.7, 'ohio', nan],
[2003, 2.3, 'ohio', nan],
[2004, 3.6, 'Nevada', -1.5],
[2005, 2.9, 'Nevada', -1.7],
[2006, 3.2, 'Nevada', nan]], dtype=object)
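5.1.3 Index Objects
- Selecting with a duplicated label returns every occurrence of that label. Each Index also has a number of methods and properties for set logic, which answer other common questions about the data it contains.
| Method | Description |
| ---- | ---- |
|append|Concatenate additional Index objects onto the original, producing a new Index|
|difference|Compute the set difference of two indexes|
|intersection|Compute the set intersection of two indexes|
|union|Compute the set union of two indexes|
|isin|Compute a boolean array indicating whether each value is contained in the passed collection|
|delete|Compute a new Index with the element at position i deleted|
|drop|Compute a new Index by deleting the passed values|
|insert|Compute a new Index by inserting an element at position i|
|is_monotonic|Return True if the index values are increasing|
|is_unique|Return True if the Index has no duplicate values|
|unique|Compute the sequence of unique values in the Index|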
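Some of the Index methods from the table, plus selection with a duplicated label, sketched on assumed inputs. The names `left`, `right`, and `dup` are made up for this illustration.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.union(right)            # labels in either index
shared = left.intersection(right)   # labels in both indexes
only_left = left.difference(right)  # labels in left but not in right
mask = left.isin(["a", "c"])        # boolean array, one entry per label

# Each of these returns a new Index; left itself is never modified.
inserted = left.insert(1, "z")
deleted = left.delete(0)
dropped = left.drop("c")
joined = left.append(right)         # simple concatenation, duplicates kept

# Selecting a duplicated label returns every matching entry.
dup = pd.Series([1, 2, 3], index=["foo", "foo", "bar"])
foo_values = dup["foo"]             # a Series with two rows
bar_value = dup["bar"]              # a single scalar

print(list(both), list(shared), list(only_left), mask.tolist())
print(list(inserted), list(deleted), list(dropped), joined.is_unique)
print(foo_values.tolist(), bar_value)
```

Because an Index is immutable, every "modifying" method hands back a new object, which is what makes it safe to share one Index between several data structures.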
obj = pd.Series(range(3),index = ['a','b','c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in
1 # Index objects are immutable, so they cannot be modified by the user
----> 2 index[1] = 'd'
D:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
4275 @final
4276 def __setitem__(self, key, value):
-> 4277 raise TypeError("Index does not support mutable operations")
4278
4279 def __getitem__(self, key):
TypeError: Index does not support mutable operations
labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.56,2.0,3.4],index = labels)
obj2
0 1.56
1 2.00
2 3.40
dtype: float64
obj2.index is labels
True
fram3.columns
Index(['Nevada', 'ohio'], dtype='object', name='state')
'ohio' in fram3.columns
True
2003 in fram3.columns
False
dup_labels = ['foo','foo','bar','bar']
dup_labels
['foo', 'foo', 'bar', 'bar']
5.2 Essential Functionality
5.2.1 Reindexing
obj = pd.Series([1.2,1.5,2.7,2.9],index = ['d','b','a','c'])
obj
d 1.2
b 1.5
a 2.7
c 2.9
dtype: float64
obj.reindex(['a','b','c','d','e'])
a 2.7
b 1.5
c 2.9
d 1.2
e NaN
dtype: float64
obj3 = pd.Series(['blue','purple','yellow'],index =[0,2,4])
obj3
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(range(6),method = 'ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
frame = pd.DataFrame(np.arange(9).reshape(3,3),
index = ['a','b','c'],
columns=['ohio','Texas','California'])
frame
| | ohio | Texas | California |
| ---- | ---- | ---- | ---- |
| a | 0 | 1 | 2 |
| b | 3 | 4 | 5 |
| c | 6 | 7 | 8 |
fram2 = frame.reindex(['a','b','c','d'])
fram2
| | ohio | Texas | California |
| ---- | ---- | ---- | ---- |
| a | 0.0 | 1.0 | 2.0 |
| b | 3.0 | 4.0 | 5.0 |
| c | 6.0 | 7.0 | 8.0 |
| d | NaN | NaN | NaN |
states = ['Texas','Utah','California']
frame.reindex(columns = states,copy=False)
frame
| | ohio | Texas | California |
| ---- | ---- | ---- | ---- |
| a | 0 | 1 | 2 |
| b | 3 | 4 | 5 |
| c | 6 | 7 | 8 |
frame.loc[['a','b'],['Texas','California']]
| | Texas | California |
| ---- | ---- | ---- |
| a | 1 | 2 |
| b | 4 | 5 |
5.2.2 Dropping Entries from an Axis
obj = pd.Series(np.arange(5.0),index = ['a','b','c','d','e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['d','e'])
a 0.0
b 1.0
c 2.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape(4,4),
index = ['Texas','Utah','California','New work'],
columns = ['one','two','three','four'])
data
| | one | two | three | four |
| ---- | ---- | ---- | ---- | ---- |
| Texas | 0 | 1 | 2 | 3 |
| Utah | 4 | 5 | 6 | 7 |
| California | 8 | 9 | 10 | 11 |
| New work | 12 | 13 | 14 | 15 |
data.drop(['Utah','Texas'])
| | one | two | three | four |
| ---- | ---- | ---- | ---- | ---- |
| California | 8 | 9 | 10 | 11 |
| New work | 12 | 13 | 14 | 15 |
data.drop('two',axis=1)
| | one | three | four |
| ---- | ---- | ---- | ---- |
| Texas | 0 | 2 | 3 |
| Utah | 4 | 6 | 7 |
| California | 8 | 10 | 11 |
| New work | 12 | 14 | 15 |
data.drop(['two','four'],axis='columns')
| | one | three |
| ---- | ---- | ---- |
| Texas | 0 | 2 |
| Utah | 4 | 6 |
| California | 8 | 10 |
| New work | 12 | 14 |
obj.drop('c',inplace = True)
obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
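5.2.3 Indexing, Selection, and Filtering
- Series indexing (obj[...]) works analogously to NumPy array indexing, except that you can use the Series's index values instead of only integers.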
obj = pd.Series(np.arange(4.0),index = ['a','b','c','d'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
obj['b']
1.0
obj[1]
1.0
obj[2:4]
c 2.0
d 3.0
dtype: float64
obj[['b','a','d']]
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1,3]]
b 1.0
d 3.0
dtype: float64
obj[obj < 2]
a 0.0
b 1.0
dtype: float64
obj['b':'c']
b 1.0
c 2.0
dtype: float64
obj['b':'c'] = 5
obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape(4,4),
index = ['Texas','Utah','California','New work'],
columns = ['one','two','three','four'])
data
            one  two  three  four
Texas         0    1      2     3
Utah          4    5      6     7
California    8    9     10    11
New work     12   13     14    15
data['two']
Texas 1
Utah 5
California 9
New work 13
Name: two, dtype: int32
data[['three','one']]
            three  one
Texas           2    0
Utah            6    4
California     10    8
New work       14   12
data[:2]
       one  two  three  four
Texas    0    1      2     3
Utah     4    5      6     7
data[data['three']>5]
            one  two  three  four
Utah          4    5      6     7
California    8    9     10    11
New work     12   13     14    15
data < 5
              one    two  three   four
Texas        True   True   True   True
Utah         True  False  False  False
California  False  False  False  False
New work    False  False  False  False
data[data < 5] = 0
data
            one  two  three  four
Texas         0    0      0     0
Utah          0    5      6     7
California    8    9     10    11
New work     12   13     14    15
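5.2.3.1 Selection with loc and iloc
- For label-based indexing on the rows of a DataFrame, use the special indexing operators loc and iloc.
- They let you select a subset of the rows and columns of a DataFrame with NumPy-like notation, using either axis labels (loc) or integer positions (iloc).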
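The difference between the two operators is easiest to see on slices. The following is a minimal sketch, assuming a small hypothetical frame (the row labels `r0` to `r3` are made up for illustration): a label slice with `loc` includes its end point, while a position slice with `iloc` excludes it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the `data` example in this section.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['r0', 'r1', 'r2', 'r3'],
                  columns=['one', 'two', 'three', 'four'])

# loc selects by label; a label slice includes its end point ('r2').
by_label = df.loc[:'r2', ['one', 'four']]

# iloc selects by integer position; a position slice excludes its end point (3).
by_position = df.iloc[:3, [0, 3]]

print(by_label.equals(by_position))
```

Both selections pick the same three rows and two columns, which is why the end points differ between the two calls.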
data
            one  two  three  four
Texas         0    0      0     0
Utah          0    5      6     7
California    8    9     10    11
New work     12   13     14    15
data.loc['New work',['one','four']]
one 12
four 15
Name: New work, dtype: int32
data.iloc[3,[0,3]]
one 12
four 15
Name: New work, dtype: int32
data.loc[:'New work',['one','four']]
            one  four
Texas         0     0
Utah          0     7
California    8    11
New work     12    15
data.iloc[:3,[0,3]]
            one  four
Texas         0     0
Utah          0     7
California    8    11
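5.2.4 Integer Indexes
- With an integer index, obj[...] is ambiguous between labels and positions; for precise handling, use loc (for labels) or iloc (for integer positions).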
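As a small sketch of why the explicit operators help (the variable names here are illustrative, not from the source): on a default integer index the same slice bound means different things to `loc` and `iloc`.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))  # default integer index 0, 1, 2

label_slice = ser.loc[:1]      # label-based: the end label 1 is included
position_slice = ser.iloc[:1]  # position-based: the end position 1 is excluded

print(len(label_slice), len(position_slice))
```

Writing `loc` or `iloc` explicitly removes any doubt about which interpretation is intended.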
ser = pd.Series(np.arange(3.0))
ser
0 0.0
1 1.0
2 2.0
dtype: float64
ser[:1]
0 0.0
dtype: float64
ser.loc[:0]
0 0.0
dtype: float64
ser.iloc[:1]
0 0.0
dtype: float64
5.2.5 Arithmetic and Data Alignment
s1 = pd.Series([1.2,1.7,0.3,5.6],index = ['a','c','d','e'])
s2 = pd.Series([2.3,2.6,2.9,3.0],index = ['a','d','f','g'])
s1
a 1.2
c 1.7
d 0.3
e 5.6
dtype: float64
s2
a 2.3
d 2.6
f 2.9
g 3.0
dtype: float64
s1+s2
a 3.5
c NaN
d 2.9
e NaN
f NaN
g NaN
dtype: float64
df1 = pd.DataFrame(np.arange(9.0).reshape(3,3),columns=list('bcd'),
index = ['ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12.0).reshape(4,3),columns=list('bde'),
index = ['Utah','ohio','Texas','Oregon'])
df1
            b    c    d
ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
df2
          b     d     e
Utah    0.0   1.0   2.0
ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
df1+df2
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
ohio      3.0 NaN   6.0 NaN
df1 = pd.DataFrame({'A':[1,2]})
df2 = pd.DataFrame({'B':[3,4]})
df1
df2
df1 - df2
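5.2.5.1 Arithmetic Methods with Fill Values
- In arithmetic between differently indexed objects, you may want to supply a special fill value, such as 0, for an axis label that is found in one object but not in the other.
- Flexible arithmetic methods
|Method|Description|
|---|---|
|add, radd|Addition (+)|
|sub, rsub|Subtraction (-)|
|div, rdiv|Division (/)|
|floordiv, rfloordiv|Floor division (//)|
|mul, rmul|Multiplication (*)|
|pow, rpow|Exponentiation (**)|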
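A minimal sketch of the two ideas in the table, using two small hypothetical frames (`left` and `right` are illustrative names): `fill_value` substitutes for a label missing on one side, and the r-prefixed methods flip the operands.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
right = pd.DataFrame({'a': [10.0, 20.0, 30.0]}, index=[0, 1, 2])

plain = left + right                     # row 2 is missing on the left, so the result is NaN
filled = left.add(right, fill_value=0)   # the missing left value is treated as 0

# The r-prefixed methods swap the operands: left.rsub(1) computes 1 - left.
flipped = left.rsub(1)

print(plain, filled, flipped, sep='\n')
```

Note that `fill_value` only fills a value missing on one side; a label missing on both sides would still produce NaN.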
df1 = pd.DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
df1
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
df2 = pd.DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
df2
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df2.loc[1,'b'] = np.nan
df2
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1+df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
df1.add(df2,fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
1/df1
       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909
df1.rdiv(1)
       a         b         c         d
0    inf  1.000000  0.500000  0.333333
1  0.250  0.200000  0.166667  0.142857
2  0.125  0.111111  0.100000  0.090909
df1.reindex(columns=df2.columns,fill_value=0)
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0
5.2.5.2 Operations Between DataFrame and Series
arr = np.arange(12.).reshape(3,4)
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
array([0., 1., 2., 3.])
arr - arr[0]
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
frame = pd.DataFrame(np.arange(12.).reshape(4,3),
columns=list('bde'),
index=['Utah','ohio','Texas','Orgon'])
frame
         b     d     e
Utah   0.0   1.0   2.0
ohio   3.0   4.0   5.0
Texas  6.0   7.0   8.0
Orgon  9.0  10.0  11.0
series = frame.iloc[0]
series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame.loc['Utah']
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
         b    d    e
Utah   0.0  0.0  0.0
ohio   3.0  3.0  3.0
Texas  6.0  6.0  6.0
Orgon  9.0  9.0  9.0
series2 = pd.Series(range(3),index=['b','e','f'])
series2
b 0
e 1
f 2
dtype: int64
frame + series2
         b   d     e   f
Utah   0.0 NaN   3.0 NaN
ohio   3.0 NaN   6.0 NaN
Texas  6.0 NaN   9.0 NaN
Orgon  9.0 NaN  12.0 NaN
series3 = frame['d']
series3
Utah 1.0
ohio 4.0
Texas 7.0
Orgon 10.0
Name: d, dtype: float64
frame.sub(series3,axis='index')
         b    d    e
Utah  -1.0  0.0  1.0
ohio  -1.0  0.0  1.0
Texas -1.0  0.0  1.0
Orgon -1.0  0.0  1.0
5.2.6 Function Application and Mapping
frame = pd.DataFrame(np.random.randn(4,3),columns=list('bde'),
index=['Utah','ohio','Texas','Orgon'])
frame
              b         d         e
Utah   0.699710  0.082620  0.778282
ohio  -0.702902 -0.590762  1.112169
Texas  0.592566 -0.356857 -0.977514
Orgon  1.817981 -0.212453  0.435315
frame.abs()
              b         d         e
Utah   0.699710  0.082620  0.778282
ohio   0.702902  0.590762  1.112169
Texas  0.592566  0.356857  0.977514
Orgon  1.817981  0.212453  0.435315
np.abs(frame)
              b         d         e
Utah   0.699710  0.082620  0.778282
ohio   0.702902  0.590762  1.112169
Texas  0.592566  0.356857  0.977514
Orgon  1.817981  0.212453  0.435315
f = lambda x:x.max() - x.min()
frame.apply(f)
b 2.520883
d 0.673381
e 2.089682
dtype: float64
frame.apply(f,axis = 'columns')
Utah 0.695662
ohio 1.815071
Texas 1.570080
Orgon 2.030434
dtype: float64
def f(x):
    return pd.Series([x.min(),x.max()],index = ['min','max'])
frame.apply(f)
            b         d         e
min -0.702902 -0.590762 -0.977514
max  1.817981  0.082620  1.112169
format1 = lambda x :'%.2f'%x
frame.applymap(format1)
           b      d      e
Utah    0.70   0.08   0.78
ohio   -0.70  -0.59   1.11
Texas   0.59  -0.36  -0.98
Orgon   1.82  -0.21   0.44
frame['e'].map(format1)
Utah 0.78
ohio 1.11
Texas -0.98
Orgon 0.44
Name: e, dtype: object
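5.2.7 Sorting and Ranking
- Sorting a dataset by some criterion is another important built-in operation.
- To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.
- Tie-breaking methods for rank
|Method|Description|
|---|---|
|'average'|Default: assign the average rank to each entry in the equal group|
|'min'|Use the minimum rank for the whole group|
|'max'|Use the maximum rank for the whole group|
|'first'|Assign ranks in the order the values appear in the data|
|'dense'|Like method='min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group|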
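To compare the tie-breaking methods side by side, here is a small sketch on a hypothetical Series containing one tie (the two 7s); the name `ranks` is illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 9])

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ['average', 'min', 'max', 'first', 'dense']})
print(ranks)
```

The tied values occupy ranks 3 and 4, so 'average' gives both 3.5, 'min' gives 3, 'max' gives 4, and 'first' gives 3 then 4. The value 9 ranks 5 under 'min' but 4 under 'dense', because dense ranks do not skip after a tie.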
obj = pd.Series(range(4),index = ['d','b','a','c'])
obj
d 0
b 1
a 2
c 3
dtype: int64
obj.sort_index()
a 2
b 1
c 3
d 0
dtype: int64
frame = pd.DataFrame(np.arange(8).reshape(2,4),
index = ['three','one'],
columns= ['d','a','b','c'])
frame
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
frame.sort_index()
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
frame.sort_values(by = 'a')
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
frame.sort_index(axis=1,ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
obj = pd.Series([4,7,-3,2])
obj.sort_values()
2 -3
3 2
0 4
1 7
dtype: int64
obj = pd.Series([4,7,np.nan,-3,2,np.nan])
obj.sort_values()
3 -3.0
4 2.0
0 4.0
1 7.0
2 NaN
5 NaN
dtype: float64
frame = pd.DataFrame({'b':[4,7,-3,2],
'a':[0,1,0,1]})
frame.sort_index(axis=1)
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
frame.sort_values(by=['a','b'])
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
frame.sort_index(axis=1).sort_values(by=['a','b'])
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
obj = pd.Series([7,-5,7,4,2,0,4])
obj
0 7
1 -5
2 7
3 4
4 2
5 0
6 4
dtype: int64
obj.rank()
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(ascending=False,method='max')
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
frame = pd.DataFrame({'b':[1,2,3,4],
'a':[2,6,3,0],
'c':[-2,5,8,0]})
frame
   b  a  c
0  1  2 -2
1  2  6  5
2  3  3  8
3  4  0  0
frame.rank(axis='columns')
     b    a    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  1.5  1.5  3.0
3  3.0  1.5  1.5
5.2.8 Axis Indexes with Duplicate Labels
obj = pd.Series(range(5),index=['a','a','b','b','c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int64
obj.index.is_unique
False
obj['a']
a 0
a 1
dtype: int64
df = pd.DataFrame(np.random.randn(4,3),index = ['a','a','b','b'])
df
          0         1         2
a -1.053335  0.546850 -1.340218
a -1.295823 -2.939633 -0.800708
b -0.737360 -1.191385  0.538147
b  2.336073 -0.509076  0.924366
df.loc['b']
          0         1         2
b -0.737360 -1.191385  0.538147
b  2.336073 -0.509076  0.924366
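5.3 Summarizing and Computing Descriptive Statistics
- Options for reduction methods
|Argument|Description|
|---|---|
|axis|Axis to reduce over; 0 for the DataFrame's rows, 1 for its columns|
|skipna|Exclude missing values; True by default|
|level|Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)|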
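The effect of `axis` and `skipna` can be sketched on a tiny hypothetical frame with one missing value (the names `col_sums`, `row_sums`, and `strict` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan], 'two': [3.0, 4.0]}, index=['a', 'b'])

col_sums = df.sum()                             # default axis=0: one value per column, NaN skipped
row_sums = df.sum(axis='columns')               # one value per row, NaN skipped
strict = df.sum(axis='columns', skipna=False)   # any NaN in a row makes that row's result NaN

print(col_sums, row_sums, strict, sep='\n')
```

With `skipna=False`, row 'b' becomes NaN because its 'one' value is missing, while the default silently treats the missing value as absent.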
df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,1.3]],
index = ['a','b','c','d'],
columns = ['one','two'])
df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75  1.3
df.sum()
one 9.25
two -3.20
dtype: float64
df.sum(axis='columns')
a 1.40
b 2.60
c 0.00
d 2.05
dtype: float64
df.mean(axis='columns',skipna = False)
a NaN
b 1.300
c NaN
d 1.025
dtype: float64
df.idxmin()
one d
two b
dtype: object
df.idxmax()
one b
two d
dtype: object
df.cumsum()
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -3.2
df.describe()
            one       two
count  3.000000  2.000000
mean   3.083333 -1.600000
std    3.493685  4.101219
min    0.750000 -4.500000
25%    1.075000 -3.050000
50%    1.400000 -1.600000
75%    4.250000 -0.150000
max    7.100000  1.300000
obj = pd.Series(['a','a','b','c']*4)
obj.describe()
count 16
unique 3
top a
freq 8
dtype: object
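- Table 5-8: Descriptive and summary statistics
|Method|Description|
|---|---|
|count|Number of non-NA values|
|describe|Compute a set of summary statistics for a Series or for each DataFrame column|
|min, max|Compute minimum and maximum values|
|argmin, argmax|Compute the index positions (integers) at which the minimum or maximum value is obtained, respectively|
|idxmin, idxmax|Compute the index labels at which the minimum or maximum value is obtained, respectively|
|quantile|Compute a sample quantile ranging from 0 to 1|
|sum|Sum of values|
|mean|Mean of values|
|median|Arithmetic median (50% quantile) of values|
|mad|Mean absolute deviation from the mean value (removed in pandas 2.0)|
|prod|Product of all values|
|var|Sample variance of values|
|std|Sample standard deviation of values|
|skew|Sample skewness (third moment) of values|
|kurt|Sample kurtosis (fourth moment) of values|
|cumsum|Cumulative sum of values|
|cummin, cummax|Cumulative minimum or maximum of values, respectively|
|cumprod|Cumulative product of values|
|diff|Compute the first arithmetic difference (useful for time series)|
|pct_change|Compute percent changes|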
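A few of the methods in the table, sketched on a small hypothetical Series (the values and labels are made up for illustration):

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 9.0, 18.0], index=['w', 'x', 'y', 'z'])

top_label = s.idxmax()      # index label of the largest value
running = s.cumsum()        # running total
steps = s.diff()            # difference from the previous element (first is NaN)
changes = s.pct_change()    # fractional change from the previous element (first is NaN)
median = s.quantile(0.5)    # 50% quantile, linearly interpolated

print(top_label, median)
print(running, steps, changes, sep='\n')
```

Reductions such as `idxmax` and `quantile` return a single value, while `cumsum`, `diff`, and `pct_change` return a Series of the same length as the input.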
5.3.1 Correlation and Covariance
import pandas_datareader.data as web
all_data = {ticker:web.get_data_yahoo(ticker)
for ticker in ['AAPL','IBM','MSFT','GOOG']}
all_data
{'AAPL': High Low Open Close Volume \
Date
2016-12-20 29.375000 29.170000 29.184999 29.237499 85700000.0
2016-12-21 29.350000 29.195000 29.200001 29.264999 95132800.0
2016-12-22 29.127501 28.910000 29.087500 29.072500 104343600.0
2016-12-23 29.129999 28.897499 28.897499 29.129999 56998000.0
2016-12-27 29.450001 29.122499 29.129999 29.315001 73187600.0
... ... ... ... ... ...
2021-12-13 182.130005 175.529999 181.119995 175.740005 153237000.0
2021-12-14 177.740005 172.210007 175.250000 174.330002 139380400.0
2021-12-15 179.500000 172.309998 175.110001 179.300003 131063300.0
2021-12-16 181.139999 170.750000 179.279999 172.259995 150185800.0
2021-12-17 173.470001 169.690002 169.929993 171.139999 195432700.0
Adj Close
Date
2016-12-20 27.520725
2016-12-21 27.546612
2016-12-22 27.365414
2016-12-23 27.419537
2016-12-27 27.593679
... ...
2021-12-13 175.740005
2021-12-14 174.330002
2021-12-15 179.300003
2021-12-16 172.259995
2021-12-17 171.139999
[1258 rows x 6 columns],
'IBM': High Low Open Close Volume \
Date
2016-12-20 160.850861 159.130020 160.124283 160.229446 2274632.0
2016-12-21 160.554489 157.982788 158.938812 159.971313 3740182.0
2016-12-22 160.831741 159.254303 160.000000 159.713196 2931520.0
2016-12-23 160.124283 159.130020 159.655838 159.378586 1779455.0
2016-12-27 160.592728 159.512421 159.636703 159.789673 1461785.0
... ... ... ... ... ...
2021-12-13 124.360001 120.790001 123.760002 122.580002 6847500.0
2021-12-14 125.029999 122.300003 122.349998 123.760002 5716100.0
2021-12-15 124.820000 122.180000 123.800003 123.110001 4990000.0
2021-12-16 126.639999 123.480003 123.510002 125.930000 7280500.0
2021-12-17 128.639999 125.209999 125.870003 127.400002 10379000.0
Adj Close
Date
2016-12-20 127.391281
2016-12-21 127.186043
2016-12-22 126.980827
2016-12-23 126.714798
2016-12-27 127.041634
... ...
2021-12-13 122.580002
2021-12-14 123.760002
2021-12-15 123.110001
2021-12-16 125.930000
2021-12-17 127.400002
[1258 rows x 6 columns],
'MSFT': High Low Open Close Volume \
Date
2016-12-20 63.799999 63.029999 63.689999 63.540001 26028400.0
2016-12-21 63.700001 63.119999 63.430000 63.540001 17096300.0
2016-12-22 64.099998 63.410000 63.840000 63.549999 22176600.0
2016-12-23 63.540001 62.799999 63.450001 63.240002 12403800.0
2016-12-27 64.070000 63.209999 63.209999 63.279999 11763200.0
... ... ... ... ... ...
2021-12-13 343.790009 339.079987 340.679993 339.399994 28899400.0
2021-12-14 334.640015 324.109985 333.220001 328.339996 44438700.0
2021-12-15 335.190002 324.500000 328.609985 334.649994 35381100.0
2021-12-16 336.760010 323.019989 335.709991 324.899994 35034800.0
2021-12-17 324.920013 317.250000 320.880005 323.799988 47750300.0
Adj Close
Date
2016-12-20 59.078098
2016-12-21 59.078098
2016-12-22 59.087402
2016-12-23 58.799171
2016-12-27 58.836357
... ...
2021-12-13 339.399994
2021-12-14 328.339996
2021-12-15 334.649994
2021-12-16 324.899994
2021-12-17 323.799988
[1258 rows x 6 columns],
'GOOG': High Low Open Close Volume \
Date
2016-12-20 798.650024 793.270020 796.760010 796.419983 951000
2016-12-21 796.676025 787.099976 795.840027 794.559998 1211300
2016-12-22 793.320007 788.580017 792.359985 791.260010 972200
2016-12-23 792.739990 787.280029 790.900024 789.909973 623400
2016-12-27 797.859985 787.656982 790.679993 791.549988 789100
... ... ... ... ... ...
2021-12-13 2971.250000 2927.199951 2968.879883 2934.090088 1205200
2021-12-14 2908.840088 2844.850098 2895.399902 2899.409912 1238900
2021-12-15 2950.344971 2854.110107 2887.320068 2947.370117 1364000
2021-12-16 2971.030029 2881.850098 2961.540039 2896.770020 1370000
2021-12-17 2889.201904 2835.760010 2854.290039 2856.060059 2162800
Adj Close
Date
2016-12-20 796.419983
2016-12-21 794.559998
2016-12-22 791.260010
2016-12-23 789.909973
2016-12-27 791.549988
... ...
2021-12-13 2934.090088
2021-12-14 2899.409912
2021-12-15 2947.370117
2021-12-16 2896.770020
2021-12-17 2856.060059
[1258 rows x 6 columns]}
price = pd.DataFrame({ticker:data['Adj Close']
for ticker,data in all_data.items()})
price
                  AAPL         IBM        MSFT         GOOG
Date
2016-12-20   27.520725  127.391281   59.078098   796.419983
2016-12-21   27.546612  127.186043   59.078098   794.559998
2016-12-22   27.365414  126.980827   59.087402   791.260010
2016-12-23   27.419537  126.714798   58.799171   789.909973
2016-12-27   27.593679  127.041634   58.836357   791.549988
...                ...         ...         ...          ...
2021-12-13  175.740005  122.580002  339.399994  2934.090088
2021-12-14  174.330002  123.760002  328.339996  2899.409912
2021-12-15  179.300003  123.110001  334.649994  2947.370117
2021-12-16  172.259995  125.930000  324.899994  2896.770020
2021-12-17  171.139999  127.400002  323.799988  2856.060059
1258 rows × 4 columns
Volume = pd.DataFrame({ticker:data['Volume']
for ticker,data in all_data.items()})
Volume
                   AAPL         IBM        MSFT     GOOG
Date
2016-12-20   85700000.0   2274632.0  26028400.0   951000
2016-12-21   95132800.0   3740182.0  17096300.0  1211300
2016-12-22  104343600.0   2931520.0  22176600.0   972200
2016-12-23   56998000.0   1779455.0  12403800.0   623400
2016-12-27   73187600.0   1461785.0  11763200.0   789100
...                 ...         ...         ...      ...
2021-12-13  153237000.0   6847500.0  28899400.0  1205200
2021-12-14  139380400.0   5716100.0  44438700.0  1238900
2021-12-15  131063300.0   4990000.0  35381100.0  1364000
2021-12-16  150185800.0   7280500.0  35034800.0  1370000
2021-12-17  195432700.0  10379000.0  47750300.0  2162800
1258 rows × 4 columns
returns = price.pct_change()
returns.tail()
                AAPL       IBM      MSFT      GOOG
Date
2021-12-13 -0.020674 -0.012169 -0.009167 -0.013254
2021-12-14 -0.008023  0.009626 -0.032587 -0.011820
2021-12-15  0.028509 -0.005252  0.019218  0.016541
2021-12-16 -0.039264  0.022906 -0.029135 -0.017168
2021-12-17 -0.006502  0.011673 -0.003386 -0.014054
returns['MSFT'].corr(returns['IBM'])
0.4973985942857425
returns.MSFT.corr(returns['IBM'])
0.4973985942857425
returns['MSFT'].cov(returns['IBM'])
0.00014275531886403115
returns.corr()
          AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.429743  0.736424  0.658843
IBM   0.429743  1.000000  0.497399  0.469242
MSFT  0.736424  0.497399  1.000000  0.778161
GOOG  0.658843  0.469242  0.778161  1.000000
returns.corrwith(returns.IBM)
AAPL 0.429743
IBM 1.000000
MSFT 0.497399
GOOG 0.469242
dtype: float64
returns.corrwith(Volume)
AAPL -0.073139
IBM -0.134489
MSFT -0.065349
GOOG -0.110807
dtype: float64
5.3.2 Unique Values, Value Counts, and Membership
|Method|Description|
|---|---|
|isin|Compute a Boolean array indicating whether each Series value is contained in the passed sequence of values|
|match|Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations|
|unique|Compute an array of the unique values in a Series, returned in the order observed|
|value_counts|Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order|
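A compact sketch of the methods above on a hypothetical Series. For the "match" operation it uses `Index.get_indexer`, which is also what the example later in this section uses.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'b'])

uniques = obj.unique()          # unique values in order of first appearance
counts = obj.value_counts()     # frequencies, most frequent first
mask = obj.isin(['b', 'c'])     # Boolean membership test per element

# Map each value to its position in an array of distinct values.
positions = pd.Index(['c', 'b', 'a']).get_indexer(pd.Series(['a', 'c', 'b']))

print(list(uniques), counts['a'], mask.tolist(), positions.tolist())
```

The Boolean `mask` can be passed straight back to `obj[mask]` to filter the Series.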
obj = pd.Series(['c','a','d','a','b','a','a','d'])
uniques = obj.unique()
uniques.sort()
obj.value_counts()
a 4
d 2
b 1
c 1
dtype: int64
pd.value_counts(obj.values,sort=False)
b 1
d 2
c 1
a 4
dtype: int64
mask = obj.isin(['b','c'])
mask
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
dtype: bool
obj[mask]
0 c
4 b
dtype: object
to_match = pd.Series(['c','a','b','b','c','a'])
unique_vals = pd.Series(['c','b','a'])
pd.Index(unique_vals).get_indexer(to_match)
array([0, 2, 1, 1, 0, 2], dtype=int64)
data = pd.DataFrame({'QU1':[1,3,4,3,4],
'QU2':[2,3,1,2,3],
'QU3':[1,5,2,2,4]})
data
   QU1  QU2  QU3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    2
4    4    3    4
result = data.apply(pd.value_counts).fillna(0)
result
   QU1  QU2  QU3
1  1.0  1.0  1.0
2  0.0  2.0  2.0
3  2.0  2.0  0.0
4  2.0  0.0  1.0
5  0.0  0.0  1.0
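5.4 Conclusion
Chapter 6: Data Loading, Storage, and File Formats
- Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources such as web APIs.
6.1 Reading and Writing Data in Text Format
- Parsing functions in pandas
|Function|Description|
|---|---|
|read_csv|Load delimited data from a file, URL, or file-like object; comma is the default delimiter|
|read_table|Load delimited data from a file, URL, or file-like object; tab ('\t') is the default delimiter|
|read_fwf|Read data in fixed-width column format (no delimiters)|
|read_clipboard|Version of read_table that reads data from the clipboard; useful for converting tables from web pages|
|read_excel|Read tabular data from an Excel XLS or XLSX file|
|read_hdf|Read HDF5 files written by pandas|
|read_html|Read all tables found in the given HTML document|
|read_json|Read data from a JSON (JavaScript Object Notation) string representation|
|read_msgpack|Read pandas data encoded using the MessagePack binary format (removed in pandas 1.0)|
|read_pickle|Read an arbitrary object stored in Python pickle format|
|read_sas|Read a SAS dataset stored in one of the SAS system's custom storage formats|
|read_sql|Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame|
|read_stata|Read a dataset from Stata file format|
|read_feather|Read the Feather binary file format|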
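The first two rows of the table differ only in their default delimiter. A minimal sketch, assuming hypothetical inline text in place of files on disk:

```python
import io
import pandas as pd

csv_text = "a,b,message\n1,2,hello\n3,4,world\n"   # hypothetical inline data
tsv_text = csv_text.replace(",", "\t")             # same data, tab-delimited

from_csv = pd.read_csv(io.StringIO(csv_text))      # comma is the default separator
from_tsv = pd.read_table(io.StringIO(tsv_text))    # tab is the default separator

print(from_csv.equals(from_tsv))
```

Passing `sep=','` to `read_table`, as the examples in this section do, makes it behave like `read_csv`.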
df = pd.read_csv('examples/ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
pd.read_table('examples/ex1.csv',sep=',')
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
pd.read_csv('examples/ex2.csv',header = None,sep=',')
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
names = ['a','b','c','d','message']
pd.read_csv('examples/ex2.csv',names = names,index_col = 'message')
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
parsed = pd.read_csv('examples/csv_mindex.csv',index_col = ['key1','key2'])
parsed
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
list(open('examples/ex3.txt'))
[' A B C\n',
'aaa -0.264438 -1.026059 -0.619500\n',
'bbb 0.927272 0.302904 -0.032399\n',
'ccc -0.264273 -0.386314 -0.217601\n',
'ddd -0.871858 -0.348382 1.100491\n']
result = pd.read_table('examples/ex3.txt',sep = r'\s+')
result
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
pd.read_csv('examples/ex4.csv',skiprows=[0,2,3])
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
result = pd.read_table('examples/ex5.csv',sep = ',')
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
pd.isnull(result)
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False
result = pd.read_table('examples/ex5.csv',sep = ',',na_values=['1'])
result
  something    a   b     c   d message
0       one  NaN   2   3.0   4     NaN
1       two  5.0   6   NaN   8   world
2     three  9.0  10  11.0  12     foo
sentinels = {'message':['foo','NA'],'something':['two']}
result = pd.read_table('examples/ex5.csv',sep = ',',na_values=sentinels)
result
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
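- Arguments of the read_csv and read_table functions
|Argument|Description|
|---|---|
|path|String indicating a filesystem location, a URL, or a file-like object|
|sep or delimiter|Character sequence or regular expression used to split the fields in each row|
|header|Row number to use as column names; defaults to 0 (the first row), and should be None if there is no header row|
|index_col|Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index|
|names|List of column names for the result; combine with header=None|
|skiprows|Number of rows at the beginning of the file to ignore, or a list of row numbers (starting from 0) to skip|
|na_values|Sequence of values to replace with NA|
|comment|Character(s) used to split comments off the end of lines|
|parse_dates|Attempt to parse data to datetime; False by default. If True, attempt to parse all columns. Otherwise a list of column numbers or names to parse can be given. If an element of the list is a tuple or list, multiple columns are combined and parsed together (e.g., when date and time are split across two columns)|
|keep_date_col|If joining columns to parse a date, keep the joined columns; False by default|
|converters|Dict mapping column numbers or names to functions (e.g., {'foo': f} applies the function f to all values in the 'foo' column)|
|dayfirst|When parsing potentially ambiguous dates, treat them as international format (e.g., 7/6/2012 -> June 7, 2012); False by default|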
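Several of these arguments are typically combined in one call. The sketch below assumes hypothetical inline text (the column names `date`, `key`, and `value` and the sentinel string `missing` are made up for illustration):

```python
import io
import pandas as pd

raw = """# exported by a hypothetical tool
date,key,value
2012-06-07,x,1
2012-06-08,y,missing
2012-06-09,z,3
"""

df = pd.read_csv(io.StringIO(raw),
                 skiprows=[0],                       # skip the comment line at the top
                 index_col='date',                   # use the date column as the row index
                 parse_dates=True,                   # try to parse the index as datetimes
                 na_values={'value': ['missing']})   # per-column NA sentinel

print(df)
print(df.index.dtype)
```

Passing a dict to `na_values` restricts each sentinel to its own column, so the string `missing` would be left alone if it appeared in `key`.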
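6.1.1 Reading Text Files in Pieces
- When processing very large files, or when working out the right set of arguments to process a large file correctly, you may want to read only a small piece of the file or iterate through it in smaller chunks.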
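Both options can be sketched without a file on disk. The example below assumes hypothetical inline data with a `key` column: `nrows` reads only the first rows, while `chunksize` yields DataFrames piece by piece so that an aggregate can be accumulated.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a large file on disk.
raw = "key,value\n" + "".join(f"{'ABC'[i % 3]},{i}\n" for i in range(10))

head = pd.read_csv(io.StringIO(raw), nrows=3)   # read only the first 3 rows

totals = pd.Series(dtype='float64')
for piece in pd.read_csv(io.StringIO(raw), chunksize=4):   # DataFrames of up to 4 rows
    totals = totals.add(piece['key'].value_counts(), fill_value=0)
totals = totals.sort_values(ascending=False)

print(len(head))
print(totals)
```

Only one chunk is held in memory at a time, which is the point of the approach when the file does not fit comfortably in memory.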
pd.options.display.max_rows = 10
result = pd.read_table('examples/ex6.csv',sep = ',')
result
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0
10000 rows × 5 columns
result = pd.read_table('examples/ex6.csv',sep = ',',nrows=5)
result
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
chunker = pd.read_table('examples/ex6.csv',sep = ',',chunksize=1000)
chunker
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(),fill_value = 0)
tot = tot.sort_values(ascending = False)
6.1.2 Writing Data to Text Format
data = pd.read_csv('examples/ex5.csv')
data
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
data.to_csv('examples/out.csv')
import sys
data.to_csv(sys.stdout,sep = '|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
data.to_csv(sys.stdout,na_rep = "NULL")
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,columns = ['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
dates = pd.date_range('1/1/2000',periods=7)
ts = pd.Series(np.arange(7),index = dates)
ts.to_csv('examples/tseries.csv')
ts
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 5
2000-01-07 6
Freq: D, dtype: int32
6.1.3 Working with Delimited Formats
import csv
with open('examples/ex7.csv') as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))
lines
[['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3']]
header,values = lines[0],lines[1:]
header
['a', 'b', 'c']
values
[['1', '2', '3'], ['1', '2', '3']]
data_dict = {h:v for h,v in zip(header,zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
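6.1.4 JSON Data
- JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications.
- All of the keys in a JSON object must be strings.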
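The string-key rule has a visible consequence in Python's `json` module. A minimal sketch with a hypothetical record that contains one integer key:

```python
import json

# Hypothetical record; note the integer key 1.
record = {"name": "Wes", "pet": None, "ages": [30, 38], 1: "integer key"}

text = json.dumps(record)      # None becomes null; the integer key 1 becomes "1"
restored = json.loads(text)    # null becomes None; the key stays the string "1"

print(text)
print(restored)
```

The round trip is therefore not exact for non-string keys, which matters when a dict keyed by integers is stored as JSON and read back.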
obj ="""
{"name":"Wes",
"places_lived": ["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":30, "pets": ["Zeus","Zuko"]},
{"age":38,"name": "Katie","pets": ["Sixes","Stache","Cisco"]}]}
"""
import json
result = json.loads(obj)
result
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]}
asjson = json.dumps(result)
asjson
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"age": 38, "name": "Katie", "pets": ["Sixes", "Stache", "Cisco"]}]}'
siblings = pd.DataFrame(result["siblings"],columns = ['name','age'])
siblings
    name  age
0  Scott   30
1  Katie   38
data = pd.read_json('examples/example.json')
data
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
6.1.5 XML and HTML: Web Scraping
- The built-in pandas function read_html can use libraries such as lxml and Beautiful Soup to automatically parse tables in HTML files into DataFrame objects.
import lxml
import bs4  # the beautifulsoup4 package is imported under the name bs4
tables = pd.read_html('examples/fdic_failed_bank_list.html')
len(tables)
1
failures = tables[0]
failures.head()
                      Bank Name             City  ST   CERT                Acquiring Institution        Closing Date       Updated Date
0                   Allied Bank         Mulberry  AR     91                         Today's Bank  September 23, 2016  November 17, 2016
1  The Woodbury Banking Company         Woodbury  GA  11297                          United Bank     August 19, 2016  November 17, 2016
2        First CornerStone Bank  King of Prussia  PA  35312  First-Citizens Bank & Trust Company         May 6, 2016  September 6, 2016
3            Trust Company Bank          Memphis  TN   9956           The Bank of Fayette County      April 29, 2016  September 6, 2016
4    North Milwaukee State Bank        Milwaukee  WI  20364  First-Citizens Bank & Trust Company      March 11, 2016      June 16, 2016
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
2010 157
2009 140
2011 92
2012 51
2008 25
2013 24
2014 18
2002 11
2015 8
2016 5
2001 4
2004 4
2003 3
2007 3
2000 2
Name: Closing Date, dtype: int64
6.1.5.1 Parsing XML with lxml.objectify
- XML (eXtensible Markup Language) is another common structured data format; it supports hierarchical, nested data with metadata.
from lxml import objectify
# The original path 'example/mta_perf/Performance_MNR.xml' raised FileNotFoundError.
# The path below assumes the layout of the book's repository; adjust it to your own.
path = 'datasets/mta_perf/Performance_MNR.xml'
with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()
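This section uses lxml.objectify. As a dependency-free sketch of the same idea, the block below swaps in the standard library's `xml.etree.ElementTree` and flattens repeated child elements into DataFrame rows. The XML text, tag names, and values are hypothetical and are not taken from the MTA dataset.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical XML document; tag names and values are made up for illustration.
xml_text = """<PERFORMANCE>
  <INDICATOR><NAME>On-Time</NAME><VALUE>96.9</VALUE></INDICATOR>
  <INDICATOR><NAME>Ridership</NAME><VALUE>95.0</VALUE></INDICATOR>
</PERFORMANCE>"""

root = ET.fromstring(xml_text)

# One dict per INDICATOR element, mapping each child tag to its text.
records = [{child.tag: child.text for child in indicator}
           for indicator in root.findall('INDICATOR')]
frame = pd.DataFrame(records)

print(frame)
```

Element text is always read as a string, so numeric columns such as `VALUE` still need an explicit conversion (for example with `pd.to_numeric`).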
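6.2 Binary Data Formats
- Using Python's built-in pickle serialization module is one of the most efficient and convenient ways to store data in binary format (also known as serialization).
- pickle is recommended only as a short-term storage format. The problem is that it is hard to guarantee that the format will remain stable over time; an object pickled today may fail to unpickle with a later version of a library. pandas tries to maintain backward compatibility where possible, but at some point in the future it may be necessary to "break" the pickle format.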
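A minimal round-trip sketch, assuming a small hypothetical frame and a temporary directory so that no file is left behind (the file name `frame_pickle` mirrors the example that follows):

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame_pickle')   # hypothetical file name
    frame.to_pickle(path)                      # serialize to disk
    restored = pd.read_pickle(path)            # read it back in the same environment

print(restored.equals(frame))
```

The round trip is reliable here because the same pandas version writes and reads the file; the caveat above concerns files read by a different library version later on.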
frame = pd.read_csv('examples/ex1.csv')
frame
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
frame.to_pickle('examples/frame_pickle')
pd.read_pickle('examples/frame_pickle')
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
6.2.1 Using HDF5 Format
frame = pd.DataFrame({'a':np.random.randn(100)})
frame
           a
0  -0.302048
1   1.382578
2  -0.961553
3  -0.390501
4  -0.441857
..       ...
95 -0.876892
96 -0.447401
97 -0.443915
98 -0.980782
99  1.710659
100 rows × 1 columns
store = pd.HDFStore('mydata.h5')
store
File path: mydata.h5
store['obj1'] = frame
store['obj1_col'] = frame['a']
store
File path: mydata.h5
store['obj1']
           a
0  -0.302048
1   1.382578
2  -0.961553
3  -0.390501
4  -0.441857
..       ...
95 -0.876892
96 -0.447401
97 -0.443915
98 -0.980782
99  1.710659
100 rows × 1 columns
store.put('obj2',frame,format='table')
store.select('obj2',where = ['index>= 10 and index <= 15'])
           a
10  1.428619
11 -0.722998
12 -0.493169
13 -0.731987
14 -0.826249
15 -0.877255
6.2.2 Reading Microsoft Excel Files
xlsx = pd.ExcelFile('examples/ex1.xlsx')
xlsx
pd.read_excel('examples/ex1.xlsx')
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo
frame = pd.read_excel('examples/ex1.xlsx','Sheet1')
frame
   Unnamed: 0  a   b   c   d message
0           0  1   2   3   4   hello
1           1  5   6   7   8   world
2           2  9  10  11  12     foo
writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer,sheet_name='Sheet1')
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() also writes the file
frame.to_excel('examples/ex2.xlsx')
6.3 Interacting with Web APIs
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
6.4 Interacting with Databases
6.5 Conclusion