Data Analysis Study Notes (3): NumPy built-in functions (universal functions, math and statistical methods, sets)

Universal functions

A universal function (ufunc) is a function that performs element-wise operations on the data in an ndarray.

import numpy as np

# Example arrays
a = np.array([-1,2.1,0.2,2.6,9.1])  # [-1.   2.1  0.2  2.6  9.1]
b = np.arange(1,len(a)+1)           # [1 2 3 4 5]
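
To make "element-wise" concrete, here is a minimal sketch comparing a ufunc call with the equivalent Python loop. The variable names (`is_ufunc`, `vectorized`, `looped`) are illustrative only.

```python
import numpy as np

a = np.array([-1, 2.1, 0.2, 2.6, 9.1])

# NumPy's ufuncs are instances of the np.ufunc class
is_ufunc = isinstance(np.abs, np.ufunc)

# A ufunc applies the operation to every element without an explicit loop
vectorized = np.abs(a)

# The same computation written as a plain Python loop, for comparison
looped = np.array([abs(x) for x in a])

print(is_ufunc)                             # True
print(np.array_equal(vectorized, looped))   # True
```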

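Unary functions

| Function | Description | Example | Result |
| --- | --- | --- | --- |
| abs, fabs | Absolute value of integers, floats, and complex numbers; for non-complex values the faster fabs can be used | np.abs(a) | [1. 2.1 0.2 2.6 9.1] |
| sqrt | Square root of each element, equivalent to arr**0.5 | np.sqrt(b) | |
| square | Square of each element, equivalent to arr**2 | np.square(b) | [ 1 4 9 16 25] |
| exp | Exponential e^x of each element | | |
| log, log10, log2, log1p | Natural logarithm (base e), base-10 logarithm, base-2 logarithm, and log(1+x) | | |
| sign | Sign of each element: 1 (positive), 0 (zero), -1 (negative) | np.sign(a) | [-1. 1. 1. 1. 1.] |
| ceil | Round up to the nearest integer | np.ceil(a) | [-1. 3. 1. 3. 10.] |
| floor | Round down to the nearest integer | np.floor(a) | [-1. 2. 0. 2. 9.] |
| rint | Round to the nearest integer (ties go to the nearest even integer), preserving the dtype | np.rint(a) | [-1. 2. 0. 3. 9.] |
| modf | Returns the fractional and integer parts of each element as two separate arrays | np.modf(a) | (array([-0. , 0.1, 0.2, 0.6, 0.1]), array([-1., 2., 0., 2., 9.])) |
| nonzero | Returns the indices of all nonzero elements as a tuple of arrays, one per dimension (row indices and column indices for a 2-D array) | np.nonzero(a) | (array([0, 1, 2, 3, 4]),) |
| clip | Limits values to a range; values outside it are set to the nearest bound | np.clip(a, 0, 5), same as a.clip(0,5) | [0. 2.1 0.2 2.6 5. ] |
| isnan | Returns a boolean array that is True where the element is NaN | | |
| isfinite, isinf | Return boolean arrays that are True where the element is finite (isfinite) or infinite (isinf) | | |
| cos, cosh, sin, sinh, tan, tanh | Regular and hyperbolic trigonometric functions | | |
| arccos, arccosh, arcsin, arcsinh, arctan, arctanh | Inverse trigonometric functions | | |
| logical_not | Truth value of not x for each element; for boolean arrays, equivalent to ~arr | | |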
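
A few of the unary functions above have behavior that is easy to misread, so here is a small sketch. The input arrays (`halves`, `flags`) are made up for illustration and are not the example arrays used in the table.

```python
import numpy as np

# rint rounds ties to the nearest even integer, not always upward
halves = np.array([0.5, 1.5, 2.5, 3.5])
rounded = np.rint(halves)
print(rounded.tolist())        # [0.0, 2.0, 2.0, 4.0]

# For boolean arrays, logical_not and the ~ operator give the same result
flags = np.array([True, False, True])
negated = np.logical_not(flags)
print(np.array_equal(negated, ~flags))   # True

# modf returns (fractional part, integer part); both keep the sign of the input
frac, whole = np.modf(np.array([-1.5, 2.25]))
print(frac.tolist(), whole.tolist())     # [-0.5, 0.25] [-1.0, 2.0]
```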

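Binary functions

| Function | Description | Example | Result |
| --- | --- | --- | --- |
| add | Addition | np.add(a,b), same as a+b | [ 0. 4.1 3.2 6.6 14.1] |
| subtract | First array minus the second array | np.subtract(a,b), same as a-b | [-2. 0.1 -2.8 -1.4 4.1] |
| multiply | Multiplication | np.multiply(a,b), same as a*b | [-1. 4.2 0.6 10.4 45.5] |
| divide, floor_divide | Division, or division followed by rounding down | np.divide(a,b), same as a/b; np.floor_divide(a,b), same as np.floor(a/b) | |
| power | a raised to the power b, like pow(a,b) | np.power(a,b) | |
| maximum, fmax | Element-wise maximum; fmax ignores NaN | np.maximum(a,b), np.fmax(a,b) | |
| minimum, fmin | Element-wise minimum; fmin ignores NaN | | |
| mod | Modulo (remainder) | np.mod(a,b) | |
| copysign | Copies the sign of the values in the second array onto the values in the first array | np.copysign(b,a) | [-1. 2. 3. 4. 5.] |
| greater, greater_equal, less, less_equal, equal, not_equal | Element-wise comparison, equivalent to >, >=, <, <=, ==, != | | |
| logical_and, logical_or, logical_xor | Element-wise logical operations; for boolean arrays, equivalent to the infix operators &, \|, ^ | | |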
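
The difference between maximum and fmax only shows up when NaN is present, so a short sketch may help. The arrays `x` and `y` are hypothetical inputs chosen to contain NaN; `a` and `b` are the example arrays from the top of these notes.

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
y = np.array([2.0, 2.0, np.nan])

propagated = np.maximum(x, y)   # NaN wins: 2, nan, nan
ignored = np.fmax(x, y)         # NaN is skipped when the other value is a number
print(propagated)
print(ignored.tolist())         # [2.0, 2.0, 3.0]

# floor_divide gives the same values as flooring the ordinary quotient here
a = np.array([-1, 2.1, 0.2, 2.6, 9.1])
b = np.arange(1, len(a) + 1)
same_floor = np.array_equal(np.floor_divide(a, b), np.floor(a / b))
print(same_floor)               # True
```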

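Math and statistical methods

| Function | Description | Example | Result |
| --- | --- | --- | --- |
| sum | Sum of all elements, or of the elements along an axis | np.sum(a) or a.sum() | 13.0 |
| mean | Arithmetic mean; the mean of a zero-length array is NaN | np.mean(a) or a.mean() | 2.6 |
| average | Weighted average; with equal weights it is the same as the arithmetic mean | np.average(a) | 2.6 |
| median | Median: the middle value of the sorted data; for an even count, the mean of the two middle values | np.median(a) | 2.1 |
| std, var | Standard deviation and variance; the degrees of freedom can be adjusted with ddof (the divisor defaults to n) | a.std(), a.var() | 3.4991427521608776, 12.244 |
| min, max | Minimum and maximum | | |
| argmin, argmax | Indices of the smallest and largest elements | a.argmin(), a.argmax() | 0, 4 |
| diff | diff(a, n=1, axis=-1): each element minus the one before it; n is the number of times the differencing is repeated; for multidimensional arrays, axis selects the direction | np.diff(a) | [ 3.1 -1.9 2.4 6.5] |
| cumsum | Cumulative sum of the elements (array) | a.cumsum() | [-1. 1.1 1.3 3.9 13. ] |
| cumprod | Cumulative product of the elements (array) | np.cumprod(a) | [-1. -2.1 -0.42 -1.092 -9.9372] |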
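
The "adjustable degrees of freedom" for std and var refers to the ddof parameter. A minimal sketch, using the same example array `a`; the names `pop_var` and `sample_var` are just for illustration.

```python
import numpy as np

a = np.array([-1, 2.1, 0.2, 2.6, 9.1])

pop_var = a.var()             # ddof=0 (default): divide by n
sample_var = a.var(ddof=1)    # ddof=1: divide by n - 1

print(pop_var)      # approximately 12.244
print(sample_var)   # approximately 15.305

# The standard deviation is the square root of the variance with the same ddof
print(np.isclose(a.std(ddof=1), np.sqrt(sample_var)))   # True
```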

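Note:
The examples above use a one-dimensional array. Two-dimensional arrays are handled the same way, but the axis parameter selects the direction: axis=1 works across each row (horizontal), axis=0 works down each column (vertical).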

arr = np.arange(24).reshape(4,6)
'''
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
 '''
# Sum
arr.sum()           # 276   grand total
arr.sum(axis=0)     #  [36 40 44 48 52 56]   column sums (vertical)
# Arithmetic mean
arr.mean()          # 11.5      mean of all elements
arr.mean(axis=1)    # [ 2.5  8.5 14.5 20.5] row means (horizontal)
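
One way to remember the axis convention is that the axis you pass is the one that disappears from the result's shape. The sketch below illustrates this on the same 4x6 array; keepdims is a standard NumPy option that the notes above do not cover.

```python
import numpy as np

arr = np.arange(24).reshape(4, 6)

col_sums = arr.sum(axis=0)               # the 4 rows are collapsed -> shape (6,)
row_sums = arr.sum(axis=1)               # the 6 columns are collapsed -> shape (4,)
kept = arr.sum(axis=1, keepdims=True)    # collapsed axis kept with length 1 -> shape (4, 1)

print(col_sums.tolist())   # [36, 40, 44, 48, 52, 56]
print(row_sums.tolist())   # [15, 51, 87, 123]
print(kept.shape)          # (4, 1)
```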

About the weighted average: the average function

arr = np.arange(10).reshape(2,5)
'''
[[0 1 2 3 4]
 [5 6 7 8 9]]
 '''
arr.mean()          # 4.5 arithmetic mean
np.average(arr)     # 4.5 same as the arithmetic mean when no weights are given
np.average(arr, axis=1) # [2. 7.], with a direction specified
np.average(arr, weights=np.arange(arr.size).reshape(2,5))  # with weights: 6.333333333333333
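
For reference, the weighted average is sum(w * x) / sum(w). The sketch below recomputes the weighted result by hand under that definition; `weights`, `manual`, and `builtin` are illustrative names.

```python
import numpy as np

arr = np.arange(10).reshape(2, 5)
weights = np.arange(arr.size).reshape(2, 5)

# Weighted average by definition: sum(w * x) / sum(w) = 285 / 45
manual = (arr * weights).sum() / weights.sum()
builtin = np.average(arr, weights=weights)

print(np.isclose(manual, builtin))   # True
```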

Sets

# Example arrays
s0 = np.array([1,2,3,2,1,4,5,2])    # [1 2 3 2 1 4 5 2]
s1 = np.arange(0,30,2)  # [ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28]
s2 = np.arange(0,30,3)  # [ 0  3  6  9 12 15 18 21 24 27]

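| Function | Description | Example | Result |
| --- | --- | --- | --- |
| unique(x) | Unique elements of x, returned in sorted order | np.unique(s0) | [1 2 3 4 5] |
| intersect1d(x,y) | Intersection, returned in sorted order | np.intersect1d(s1,s2) | [ 0 6 12 18 24] |
| union1d(x,y) | Union, returned in sorted order | np.union1d(s1,s2) | [ 0 2 3 4 6 8 9 10 12 14 15 16 18 20 21 22 24 26 27 28] |
| setdiff1d(x,y) | Set difference: elements that are in x and not in y | np.setdiff1d(s1,s2) | [ 2 4 8 10 14 16 20 22 26 28] |
| setxor1d(x,y) | Symmetric difference: elements that are in exactly one of x and y | np.setxor1d(s1,s2) | [ 2 3 4 8 9 10 14 15 16 20 21 22 26 27 28] |
| in1d(x,y) | Boolean array telling, for each element of x, whether it is contained in y (recent NumPy versions recommend np.isin instead) | np.in1d(s2,s1) | [ True False True False True False True False True False] |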

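Note: the two arrays may differ in both number of elements and shape.
s1 and s2 above are both one-dimensional, but their lengths differ. To verify that the set operations do not depend on shape, we reshape s1 and s2: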

s1 = s1.reshape(3,5)
'''
[[ 0  2  4  6  8]
 [10 12 14 16 18]
 [20 22 24 26 28]]
 '''
s2 = s2.reshape(2,5)
'''
[[ 0  3  6  9 12]
 [15 18 21 24 27]]
 '''
np.intersect1d(s1,s2)   # [ 0  6 12 18 24]
np.union1d(s1,s2)       # [ 0  2  3  4  6  8  9 10 12 14 15 16 18 20 21 22 24 26 27 28]
np.setdiff1d(s2,s1)     # [ 3  9 15 21 27]
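
The set functions flatten their inputs, which is why the results above are one-dimensional regardless of input shape. As a hedged aside, np.isin (a related function not listed in the table above) returns a membership mask that keeps the shape of its first argument. A minimal sketch:

```python
import numpy as np

s1 = np.arange(0, 30, 2).reshape(3, 5)
s2 = np.arange(0, 30, 3).reshape(2, 5)

common = np.intersect1d(s1, s2)   # inputs are flattened; result is 1-D and sorted
mask = np.isin(s2, s1)            # same shape as s2: True where the element occurs in s1

print(common.tolist())   # [0, 6, 12, 18, 24]
print(mask.shape)        # (2, 5)
print(s2[mask].tolist()) # [0, 6, 12, 18, 24]
```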

Supplement (where, sort, any, all)

where

The where function works like an element-wise ternary operator: where(condition, x, y)
does roughly the following for each element:

if condition:
    x
else:
    y

Example 1: given two arrays xarr and yarr, pick values from one or the other according to cond

xarr = np.array(np.arange(1.1, 1.6, 0.1))
yarr = np.array(np.arange(2.1, 2.6, 0.1))
cond = np.array([True, False, True, True, False])

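In plain Python:

result = [x if c else y for x, y, c in zip(xarr, yarr, cond)]
Output:
[1.1, 2.2, 1.3000000000000003, 1.4000000000000004, 2.5000000000000004]
This is verbose and slow on large arrays. The long decimals are not data errors: they are ordinary floating-point values printed at full precision, and NumPy holds the same values but rounds them for display.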

With NumPy's where function:

result = np.where(cond, xarr, yarr)
Output:
[1.1 2.2 1.3 1.4 2.5]
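
As a quick check that the list comprehension and np.where select exactly the same values and differ only in how they are printed, consider this sketch. The names `from_list` and `from_where` are illustrative.

```python
import numpy as np

xarr = np.arange(1.1, 1.6, 0.1)
yarr = np.arange(2.1, 2.6, 0.1)
cond = np.array([True, False, True, True, False])

from_list = np.array([x if c else y for x, y, c in zip(xarr, yarr, cond)])
from_where = np.where(cond, xarr, yarr)

# Both approaches hold identical floating-point values
identical = np.array_equal(from_list, from_where)
print(identical)   # True

# NumPy's default display rounds to 8 significant digits; the stored values are unchanged
print(from_where)
```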

Example 2: reset the negative values in arr to 0 and keep the rest

arr = np.random.randn(4,4)
Output:
[[ 0.40336609 -1.42094364 -1.1257582   0.2787659 ]
 [-0.64618146 -0.56508989  0.20527747  1.8542685 ]
 [-0.39792887  0.94738928 -0.68713023  0.60328758]
 [-0.94495984 -1.47217366  0.03280616 -0.13120201]]
arr = np.where(arr>0, arr, 0)
Output:
[[0.40336609 0.         0.         0.2787659 ]
 [0.         0.         0.20527747 1.8542685 ]
 [0.         0.94738928 0.         0.60328758]
 [0.         0.         0.03280616 0.        ]]
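
For this particular task, np.maximum and np.clip from the tables earlier give the same result as np.where. The sketch below compares the three on random data; the seeded generator (`np.random.default_rng(0)`) is an assumption made only so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for repeatability only
arr = rng.standard_normal((4, 4))

via_where = np.where(arr > 0, arr, 0)
via_maximum = np.maximum(arr, 0)      # element-wise max against 0
via_clip = np.clip(arr, 0, None)      # lower bound 0, no upper bound

print(np.array_equal(via_where, via_maximum))   # True
print(np.array_equal(via_where, via_clip))      # True
```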

Example 3: nested conditions

cond1 = np.array([True, False, True, True, False])
cond2 = np.array([True, True, True, False, False])
result = []

In plain Python:

for i in range(len(cond1)):
    if cond1[i] and cond2[i]:
        result.append(0)
    elif cond1[i]:
        result.append(1)
    elif cond2[i]:
        result.append(2)
    else:
        result.append(3)
print(result)           # [0, 2, 0, 1, 3]

With NumPy's where function:

result = np.where(cond1&cond2, 0 ,
             np.where(cond1, 1,
                  np.where(cond2, 2, 3)))
list(result)     # [0, 2, 0, 1, 3]
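
Deeply nested where calls can become hard to read. One alternative, not used in the original notes, is np.select, which takes a list of conditions and a list of choices and uses the first condition that matches. A minimal sketch of the same example:

```python
import numpy as np

cond1 = np.array([True, False, True, True, False])
cond2 = np.array([True, True, True, False, False])

# Conditions are tested in order; the first True one decides the value.
# Elements matching no condition get the default.
result = np.select([cond1 & cond2, cond1, cond2], [0, 1, 2], default=3)

print(result.tolist())   # [0, 2, 0, 1, 3]
```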

Note: where can also be called with only the condition, in which case it returns the indices where the condition is True

arr = np.random.randn(10)
np.where(arr>0)      # (array([1, 2, 3, 6, 9]),)

For a multidimensional array the result is again a tuple of arrays, one index array per dimension

cond1 = np.array([True, False, True, True, False])
cond2 = np.array([True, True, True, False, False])
arr = np.array([cond1,cond2])
np.where(arr)
# (array([0, 0, 0, 1, 1, 1]), array([0, 2, 3, 0, 1, 2]))
# i.e. the positions [(0,0),(0,2),(0,3),(1,0),(1,1),(1,2)]
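
To turn the per-dimension index arrays into coordinate pairs, they can be zipped together, or np.argwhere can be used to get the pairs directly. A short sketch (the names `pairs` and `coords` are illustrative):

```python
import numpy as np

arr = np.array([[True, False, True, True, False],
                [True, True, True, False, False]])

rows, cols = np.where(arr)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)   # [(0, 0), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2)]

# argwhere returns one (row, col) pair per True element
coords = np.argwhere(arr)
print(coords.shape)   # (6, 2)
```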

sort

# Multidimensional array: the sort direction can be specified
arr = np.random.randn(20).reshape(4,5)
'''
[[-0.94603557 -0.18393318  0.11450866  0.40325255  0.45881851]
 [ 1.17704035 -0.41401001  0.75339636 -0.43745415  2.7929479 ]
 [-0.28784153 -1.48745643 -0.07142102 -0.5482369  -0.22610164]
 [ 1.35561729 -1.08766432  0.83278514 -1.32299757  0.04410116]]
 '''
np.sort(arr, axis=0)     # sort down each column (the default, axis=-1, sorts along the last axis, i.e. within each row)
'''
[[-0.94603557 -1.48745643 -0.07142102 -1.32299757 -0.22610164]
 [-0.28784153 -1.08766432  0.11450866 -0.5482369   0.04410116]
 [ 1.17704035 -0.41401001  0.75339636 -0.43745415  0.45881851]
 [ 1.35561729 -0.18393318  0.83278514  0.40325255  2.7929479 ]]
 '''
# One-dimensional array
arr = np.array([2,6,4,2,1,4])
arr.sort()      # sorts the original array in place; np.sort() returns a new sorted array and leaves the original unchanged
print(arr)      # [1 2 2 4 4 6]
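
The difference between np.sort (returns a copy) and arr.sort() (in place) can be checked directly. The sketch below also shows np.argsort and a descending sort via slicing, neither of which appears in the notes above.

```python
import numpy as np

arr = np.array([2, 6, 4, 2, 1, 4])

sorted_copy = np.sort(arr)                # new array
before = arr.tolist()                     # arr itself is still unsorted here
order = np.argsort(arr, kind="stable")    # indices that would sort arr
descending = np.sort(arr)[::-1]           # reverse the ascending result

arr.sort()                                # in place, returns None
after = arr.tolist()

print(before)               # [2, 6, 4, 2, 1, 4]
print(after)                # [1, 2, 2, 4, 4, 6]
print(order.tolist())       # [4, 0, 3, 2, 5, 1]
print(descending.tolist())  # [6, 4, 4, 2, 2, 1]
```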

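Example: what is the 25% quantile of a data set?

# Generate some data
arr = np.random.randn(20).reshape(4,5)
# 1. Flatten it into a one-dimensional array
arr = arr.flatten()
# 2. Sort it
arr.sort()
# 3. Take the element 25% of the way through the sorted array
value = arr[int(0.25*len(arr))] # approximate 25% quantile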
print(arr)
'''
[-1.8819284  -1.84223613 -1.55037549 -1.19713841 -0.91661269 -0.69222229
 -0.6796624  -0.65882803 -0.55325753 -0.34502426 -0.1197655   0.36925446
  0.5343373   0.62780224  0.74335279  0.82012463  1.00546263  1.08559715
  1.29212188  1.47629451]
  '''
print(value)    # -0.69222229
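
The index-based approach picks an actual data point, so it is an approximation. NumPy also provides np.percentile, which by default interpolates linearly between neighboring sorted values, so the two results can differ slightly. A sketch on seeded random data (the seed is an arbitrary assumption for repeatability):

```python
import numpy as np

rng = np.random.default_rng(42)    # arbitrary seed, for repeatability only
arr = rng.standard_normal(20)

ordered = np.sort(arr)
by_index = ordered[int(0.25 * arr.size)]    # index-based approach: element at position 5
by_percentile = np.percentile(arr, 25)      # interpolates at position 0.25 * (20 - 1) = 4.75

# The interpolated value lies between the 5th and 6th sorted elements
print(ordered[4] <= by_percentile <= by_index)   # True
```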

all, any

all: returns True if every element is True, otherwise False
any: returns True if at least one element is True, otherwise False

arr = np.array([True,False,True,True,False])
arr.all()    # False
arr.any()    # True
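
Like the statistical methods earlier, all and any accept an axis argument, and they also work on non-boolean arrays, where nonzero values count as True. A small sketch with a made-up 2-D array `flags`:

```python
import numpy as np

flags = np.array([[True, False, True],
                  [True, True, True]])

overall = flags.all()          # False: one element is False
per_row = flags.all(axis=1)    # is every element in each row True?
per_col = flags.any(axis=0)    # does each column contain at least one True?

print(overall)            # False
print(per_row.tolist())   # [False, True]
print(per_col.tolist())   # [True, True, True]

# Non-boolean input: nonzero values are treated as True
has_nonzero = np.array([0, 0, 3]).any()
print(has_nonzero)        # True
```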

Statistics on boolean arrays

arr = np.random.randn(100)
(arr>0).sum()      # count of positive values
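
This works because True counts as 1 and False as 0 in arithmetic. The same idea gives the fraction of positive values via mean, and np.count_nonzero is an equivalent way to count. A sketch on seeded data (the seed is an arbitrary assumption for repeatability):

```python
import numpy as np

rng = np.random.default_rng(1)    # arbitrary seed, for repeatability only
arr = rng.standard_normal(100)

positive = arr > 0
count = positive.sum()                    # number of True values
same_count = np.count_nonzero(positive)   # equivalent count
fraction = positive.mean()                # share of positive values

print(count == same_count)                      # True
print(np.isclose(fraction, count / arr.size))   # True
```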
