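Checking for missing values (NaN, null) in Python | pandas represents missing values as numpy.nan. To test a whole Series or DataFrame for missing values, use isnull(); to test a single value, use np.isnan(). For troubleshooting NaN-related problems, see: http://www.cnblogs.com/itdyb/p/5806688.html |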
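The single-value check can be sketched with NumPy alone. The snippet below is a minimal illustration, not the referenced article's code: it shows why an equality test cannot find NaN and how np.isnan works on both a scalar and a whole float array. The variable names are made up for the example.

```python
import math
import numpy as np

# A single missing value: NaN never compares equal to anything, even itself,
# so an equality test cannot detect it.
value = np.nan
equal_to_itself = (value == value)      # False
is_nan_numpy = bool(np.isnan(value))    # True
is_nan_math = math.isnan(value)         # True

# Element-wise check on a whole float array.
arr = np.array([1.0, np.nan, 3.0])
mask = np.isnan(arr)

print(equal_to_itself, is_nan_numpy, is_nan_math)
print(mask)
```

Note that np.isnan only accepts numeric input; for object or string data the pandas isnull() route mentioned above is the usual choice.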
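np.newaxis | np.newaxis inserts a new axis (dimension) into an array. Consider the following example:
a = np.array([1, 2, 3, 4, 5])
print(a.shape)
Output: (5,) This shows that a is a one-dimensional array.
b = a[np.newaxis, :]
print(a.shape, b.shape)
Output: (5,) (1, 5)
b = a[:, np.newaxis]
print(a.shape, b.shape)
Output: (5,) (5, 1)
np.newaxis adds a dimension along either the rows or the columns: the original array of shape (5,) becomes a two-dimensional array of shape (1, 5) when the new axis is inserted first, and of shape (5, 1) when it is inserted last. The same idiom works on longer arrays, e.g. x_data = np.linspace(-1, 1, 300)[:, np.newaxis] has shape (300, 1).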
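As a small follow-up sketch (not part of the original notes), the snippet below shows two things that often come up with np.newaxis: it is just an alias for None, and a column combined with a row broadcasts to a full table.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# np.newaxis is an alias for None, so both spellings index identically.
same_object = np.newaxis is None
col_a = a[:, np.newaxis]   # shape (5, 1)
col_b = a[:, None]         # shape (5, 1)

# A column (5, 1) combined with a row (1, 5) broadcasts to a (5, 5) table.
table = a[:, np.newaxis] + a[np.newaxis, :]

print(same_object, col_a.shape, table.shape)
```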
dtype = [('Name', 'S10'), ('Height', float), ('Age', int)]
values = [('Li', 1.8, 41), ('Wang', 1.9, 38), ('Duan', 1.7, 38)]
a = np.array(values, dtype=dtype)
np.sort(a, order='Height')
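The snippet above sorts a structured array by a single field. A hedged extension: order also accepts a list of field names, which is handy when the first field has ties. The example swaps 'S10' for 'U10' (an assumption made here so the names print as ordinary strings rather than bytes).

```python
import numpy as np

# 'U10' stores names as Unicode strings (the original uses byte strings, 'S10').
dtype = [('Name', 'U10'), ('Height', float), ('Age', int)]
values = [('Li', 1.8, 41), ('Wang', 1.9, 38), ('Duan', 1.7, 38)]
a = np.array(values, dtype=dtype)

by_height = np.sort(a, order='Height')
# Ties in 'Age' (Wang and Duan are both 38) are broken by 'Height'.
by_age_then_height = np.sort(a, order=['Age', 'Height'])

print(by_height['Name'])
print(by_age_then_height['Name'])
```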
In array a, replace every value greater than 30 with 30 and every value less than 10 with 10.
# Solution 1: Using np.clip
np.clip(a, a_min=10, a_max=30)
# Solution 2: Using np.where
print(np.where(a < 10, 10, np.where(a > 30, 30, a)))
# > [ 27.63 14.64 21.8 30. 10. 10. 30. 30. 10. 29.18 30.
# > 11.25 10.08 10. 11.77 30. 30. 10. 30. 14.43]
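The input array is not shown above, so here is a self-contained sketch with a made-up array (the values are hypothetical) that confirms the two solutions agree.

```python
import numpy as np

# Hypothetical input; the original array `a` is not shown in the text.
a = np.array([27.6, 3.2, 45.0, 10.0, 30.0, 9.9, 30.1])

clipped = np.clip(a, 10, 30)
nested_where = np.where(a < 10, 10, np.where(a > 30, 30, a))

print(clipped)
print(np.array_equal(clipped, nested_where))
```

np.clip is usually the more readable choice; the nested np.where form generalizes more easily when the two replacement values differ from the thresholds.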
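# Position of the first row whose column 3 (petal width) is greater than 1.0
np.argwhere(iris[:, 3].astype(float) > 1.0)[0]
# Solution: most frequent value in column 2 (petal length)
vals, counts = np.unique(iris[:, 2], return_counts=True)
print(vals[np.argmax(counts)])
# Get the second-largest unique value (np.unique already returns sorted values)
np.unique(np.sort(petal_len_setosa))[-2]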
# Approach 1: Generate probabilistically
np.random.seed(100)
a = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
species_out = np.random.choice(a, 150, p=[0.5, 0.25, 0.25])
# Approach 2: Probabilistic sampling (preferred)
# `species` is not defined in this snippet; it is assumed to be the 150-row species column, e.g. iris[:, 4]
np.random.seed(100)
probs = np.r_[np.linspace(0, 0.500, num=50), np.linspace(0.501, .750, num=50), np.linspace(.751, 1.0, num=50)]
index = np.searchsorted(probs, np.random.random(150))
species_out = species[index]
print(np.unique(species_out, return_counts=True))
# **Given:**
arr1 = np.arange(3)
arr2 = np.arange(3,7)
arr3 = np.arange(7,10)
# dtype=object is required for arrays of unequal length in recent NumPy versions
array_of_arrays = np.array([arr1, arr2, arr3], dtype=object)
print('array_of_arrays: ', array_of_arrays)
# Solution 1
arr_2d = np.array([a for arr in array_of_arrays for a in arr])
# Solution 2:
arr_2d = np.concatenate(array_of_arrays)
print(arr_2d)
# > array_of_arrays: [array([0, 1, 2]) array([3, 4, 5, 6]) array([7, 8, 9])]
# > [0 1 2 3 4 5 6 7 8 9]
# **Given:**
np.random.seed(101)
arr = np.random.randint(1,4, size=6)
arr
# > array([2, 3, 2, 2, 2, 1])
# Solution:
def one_hot_encodings(arr):
    uniqs = np.unique(arr)
    out = np.zeros((arr.shape[0], uniqs.shape[0]))
    for i, k in enumerate(arr):
        out[i, k-1] = 1
    return out
one_hot_encodings(arr)
# > array([[ 0., 1., 0.],
# > [ 0., 0., 1.],
# > [ 0., 1., 0.],
# > [ 0., 1., 0.],
# > [ 0., 1., 0.],
# > [ 1., 0., 0.]])
# Method 2:
(arr[:, None] == np.unique(arr)).view(np.int8)
arr[:,]
Out[26]: array([1, 3, 1, 1, 1, 2, 2, 2, 1, 3, 3, 1, 1, 2, 1, 2, 2, 2, 1, 3])
arr[:,None]
Out[27]:
array([[1],
[3],
[1],
[1],
[1],
[2],
[2],
[2],
[1],
[3],
[3],
[1],
[1],
[2],
[1],
[2],
[2],
[2],
[1],
[3]])
arr[:,None]==np.unique(arr)
Out[30]:
array([[ True, False, False],
[False, False, True],
[ True, False, False],
[ True, False, False],
[ True, False, False],
[False, True, False],
[False, True, False],
[False, True, False],
[ True, False, False],
[False, False, True],
[False, False, True],
[ True, False, False],
[ True, False, False],
[False, True, False],
[ True, False, False],
[False, True, False],
[False, True, False],
[False, True, False],
[ True, False, False],
[False, False, True]])
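The session above shows the intermediate pieces of Method 2. A compact sketch of the same idea follows; it uses astype(int) instead of view(np.int8), which is a swapped-in alternative that copies the data rather than reinterpreting it.

```python
import numpy as np

arr = np.array([2, 3, 2, 2, 2, 1])

# Compare every element (as a column) against the sorted unique labels (as a row).
uniqs = np.unique(arr)                  # array([1, 2, 3])
matches = arr[:, None] == uniqs         # boolean, shape (6, 3)

# astype makes a converted copy; .view(np.int8) instead reinterprets the
# one-byte booleans in place without copying.
one_hot = matches.astype(int)

print(one_hot)
```

Unlike the loop version, this form does not assume the labels are the consecutive integers 1..k, because each column is matched against an actual unique value.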
np.random.seed(10)
a = np.random.randint(20, size=10)
print('Array: ', a)
# Solution
print(a.argsort().argsort())
print('Array: ', a)
# > Array: [ 9 4 15 0 17 16 17 8 9 0]
# > [4 2 6 0 8 7 9 3 5 1]
# > Array: [ 9 4 15 0 17 16 17 8 9 0]
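Think of it this way: initially the scores are arranged by student ID; this is array a. Sorting a (a.sort()) produces the scores in ascending order. b = a.argsort() gives the sequence of student IDs in score order, so the position of an ID within b is that student's rank. a.argsort().argsort() is b.argsort(): it takes the score-ordered ID sequence, re-sorts it by ID, and returns for each ID its index in b, which is exactly that student's rank. b.sort() would simply return the IDs in ascending order.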
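The student-ID analogy can be traced on a tiny example. The scores below are hypothetical, chosen so each step is easy to follow by hand.

```python
import numpy as np

# Hypothetical scores, indexed by student ID 0..3.
scores = np.array([90, 60, 75, 82])

ids_in_score_order = scores.argsort()        # IDs from lowest to highest score
ranks = ids_in_score_order.argsort()         # rank of each ID (0 = lowest score)

print(ids_in_score_order)
print(ranks)
```

Student 0 has the highest score and therefore the highest rank (3), while student 1 has the lowest score and rank 0.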
# **Given:**
np.random.seed(10)
a = np.random.randint(20, size=[2,5])
print(a)
# Solution
print(a.ravel().argsort().argsort().reshape(a.shape))
# > [[ 9 4 15 0 17]
# > [16 17 8 9 0]]
# > [[4 2 6 0 8]
# > [7 9 3 5 1]]
# Input
np.random.seed(100)
a = np.random.randint(1,10, [5,3])
a
# > array([[9, 9, 4],
# > [8, 8, 1],
# > [5, 3, 6],
# > [3, 3, 3],
# > [2, 1, 9]])
# Maximum of each row
# Solution 1
np.amax(a, axis=1)
# Solution 2
np.apply_along_axis(np.max, arr=a, axis=1)
# > array([9, 8, 6, 3, 9])
# Ratio of minimum to maximum for each row
# Solution
np.apply_along_axis(lambda x: np.min(x)/np.max(x), arr=a, axis=1)
# > array([ 0.44444444, 0.125 , 0.5 , 1. , 0.11111111])
Problem: Find the duplicate entries (the 2nd occurrence onward) in the given NumPy array and mark them as True. First occurrences should be marked False.
# Input
np.random.seed(100)
a = np.random.randint(0, 5, 10)
## Solution
# There is no direct function to do this as of 1.13.3
# Create an all True array
out = np.full(a.shape[0], True)
# Find the index positions of unique elements
unique_positions = np.unique(a, return_index=True)[1]
# Mark those positions as False
out[unique_positions] = False
print(out)
# > [False True False True False False True True True True]
Difficulty level: L3
Problem: In a 2D NumPy array, find the mean of a numeric column grouped by a categorical column.
Answer:
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')
# Solution
# No direct way to implement this. Just a version of a workaround.
numeric_column = iris[:, 1].astype('float') # sepalwidth
grouping_column = iris[:, 4] # species
# List comprehension version
[[group_val, numeric_column[grouping_column==group_val].mean()] for group_val in np.unique(grouping_column)]
# For Loop version
output = []
for group_val in np.unique(grouping_column):
    output.append([group_val, numeric_column[grouping_column==group_val].mean()])
output
# > [[b'Iris-setosa', 3.418],
# > [b'Iris-versicolor', 2.770],
# > [b'Iris-virginica', 2.974]]
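The workaround above re-scans the whole column once per group. As an alternative sketch (a different technique from the one in the answer), np.unique with return_inverse plus np.bincount computes all group means in one pass. The data here is a hypothetical stand-in so the example runs without downloading iris.

```python
import numpy as np

# Hypothetical stand-in data so the example runs without downloading iris.
grouping_column = np.array(['a', 'b', 'a', 'c', 'b', 'a'])
numeric_column = np.array([1.0, 10.0, 3.0, 7.0, 20.0, 5.0])

# inverse[i] is the index of grouping_column[i] within `groups`.
groups, inverse = np.unique(grouping_column, return_inverse=True)
sums = np.bincount(inverse, weights=numeric_column)
counts = np.bincount(inverse)
means = sums / counts

for group, mean in zip(groups, means):
    print(group, mean)
```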
a = np.array([1,2,3,np.nan,5,6,7,np.nan])
a[~np.isnan(a)]
# > array([ 1., 2., 3., 5., 6., 7.])
# Input
a = np.array([1,2,3,4,5])
b = np.array([4,5,6,7,8])
# Solution
dist = np.linalg.norm(a-b)
dist
# > 6.7082039324993694
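Problem: Find all the peaks in a 1D numeric array a. Peaks are points surrounded by smaller values on both sides.
Idea: take the first difference np.diff(a), reduce it to its sign, and difference again: doublediff = np.diff(np.sign(np.diff(a))). A value of -2 marks a change from rising to falling, i.e. a peak.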
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
doublediff = np.diff(np.sign(np.diff(a)))
peak_locations = np.where(doublediff == -2)[0] + 1
peak_locations
# > array([2, 5])
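To cross-check the sign-of-difference trick, the sketch below compares it against a direct neighbor comparison on the same array. The second approach is an alternative written for illustration, not part of the original solution.

```python
import numpy as np

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

# Sign-of-difference approach from the text.
doublediff = np.diff(np.sign(np.diff(a)))
peaks_diff = np.where(doublediff == -2)[0] + 1

# Direct comparison: an interior point is a peak if it is strictly
# greater than both of its neighbors.
interior = a[1:-1]
is_peak = (interior > a[:-2]) & (interior > a[2:])
peaks_cmp = np.where(is_peak)[0] + 1

print(peaks_diff, peaks_cmp)
```

Both approaches ignore the two endpoints, and both skip flat-topped peaks (plateaus), since neither a -2 step nor a strict inequality occurs there.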
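Difficulty level: L2
Problem: Subtract the 1D array b_1d from the 2D array a_2d, such that each item of b_1d is subtracted from the corresponding row of a_2d.
Answer: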
# Input
a_2d = np.array([[3,3,3],[4,4,4],[5,5,5]])
b_1d = np.array([1,2,3])
# Solution
print(a_2d - b_1d[:,None])
# > [[2 2 2]
# > [2 2 2]
# > [2 2 2]]
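The [:, None] in the solution is what makes the subtraction run per row. A short sketch contrasting it with the plain subtraction, which broadcasts per column instead:

```python
import numpy as np

a_2d = np.array([[3, 3, 3], [4, 4, 4], [5, 5, 5]])
b_1d = np.array([1, 2, 3])

# Shape (3, 1): b_1d[i] is subtracted from every element of row i.
per_row = a_2d - b_1d[:, None]
# Shape (3,): b_1d[j] is subtracted from every element of column j.
per_col = a_2d - b_1d

print(per_row)
print(per_col)
```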
Difficulty level: L2
Problem: Find the index of the 5th occurrence of the number 1 in x.
x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])
Answer:
x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])
n = 5
# Solution 1: List comprehension
[i for i, v in enumerate(x) if v == 1][n-1]
# Solution 2: Numpy version
np.where(x == 1)[0][n-1]
# > 8
Difficulty level: L3
Problem: Compute the moving average of window size 3 for the given 1D array.
Answer:
# Solution
# Source: https://stackoverflow.com/questions/14313510/how-to-calculate-moving-average-using-numpy
def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
np.random.seed(100)
Z = np.random.randint(10, size=10)
print('array: ', Z)
# Method 1
moving_average(Z, n=3).round(2)
# Method 2: # Thanks AlanLRH!
# np.ones(3)/3 gives equal weights. Use np.ones(4)/4 for window size 4.
np.convolve(Z, np.ones(3)/3, mode='valid')
# > array: [8 8 3 7 7 0 4 2 5 2]
# > moving average: [ 6.33 6. 5.67 4.67 3.67 2. 3.67 3. ]
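To make the cumulative-sum trick less opaque, the sketch below spells it out next to an explicit-window version. sliding_window_view is a third approach added here for comparison; it assumes NumPy 1.20 or later. The array is the one printed above, hard-coded so the result does not depend on the random seed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy 1.20+

Z = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
n = 3

# Cumulative-sum approach: the sum of a window is the difference
# of two running totals that are n positions apart.
csum = np.cumsum(Z, dtype=float)
csum[n:] = csum[n:] - csum[:-n]
avg_cumsum = csum[n - 1:] / n

# Explicit windows: each row of `windows` is one length-n window of Z.
windows = sliding_window_view(Z, n)   # shape (8, 3)
avg_windows = windows.mean(axis=1)

print(avg_windows.round(2))
```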
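When operating on and manipulating arrays, their data is sometimes copied into a new array and sometimes not. This is often a source of confusion for beginners. There are three cases:
Simple assignment makes no copy of the array object or of its data.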
>>> a = np.arange(12)
>>> b = a # no new object is created
>>> b is a # a and b are two names for the same ndarray object
True
>>> b.shape = 3,4 # changes the shape of a
>>> a.shape
(3, 4)
Python passes mutable objects as references, so function calls make no copy.
>>> def f(x):
...     print(id(x))
...
>>> id(a) # id is a unique identifier of an object
148293216
>>> f(a)
148293216
Different array objects can share the same data. The view method creates a new array object that looks at the same data.
>>> c = a.view()
>>> c is a
False
>>> c.base is a # c is a view of the data owned by a
True
>>> c.flags.owndata
False
>>>
>>> c.shape = 2,6 # a's shape doesn't change
>>> a.shape
(3, 4)
>>> c[0,4] = 1234 # a's data changes
>>> a
array([[ 0, 1, 2, 3],
[1234, 5, 6, 7],
[ 8, 9, 10, 11]])
Slicing an array returns a view of it:
>>> s = a[ : , 1:3] # spaces added for clarity; could also be written "s = a[:,1:3]"
>>> s[:] = 10 # s[:] is a view of s. Note the difference between s=10 and s[:]=10
>>> a
array([[ 0, 10, 10, 3],
[1234, 10, 10, 7],
[ 8, 10, 10, 11]])
The copy method makes a complete copy of the array and its data.
>>> d = a.copy() # a new array object with new data is created
>>> d is a
False
>>> d.base is a # d doesn't share anything with a
False
>>> d[0,0] = 9999
>>> a
array([[ 0, 10, 10, 3],
[1234, 10, 10, 7],
[ 8, 10, 10, 11]])
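The three cases can be checked programmatically with np.shares_memory. The sketch below is an illustration with made-up variable names; it also includes integer-array ("fancy") indexing, which is not covered above and, unlike basic slicing, returns a copy.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

alias = a                 # same object, no copy at all
view = a[:, 1:3]          # basic slice: new object, shared data
deep = a.copy()           # new object, new data
fancy = a[[0, 2]]         # integer-array indexing: always a copy

print(alias is a)
print(np.shares_memory(a, view), np.shares_memory(a, deep), np.shares_memory(a, fancy))

# Writing through the view changes `a`; writing to the copies does not.
view[0, 0] = -1
deep[0, 0] = 9999
```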
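How do we construct a 2D array from a list of equally sized row vectors? In MATLAB this is easy: if x and y are two vectors of the same length, you only need m=[x;y]. In NumPy this is done with the functions column_stack, dstack, hstack and vstack, depending on the dimension along which the stacking is to be done. For example: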
x = np.arange(0,10,2) # x=([0,2,4,6,8])
y = np.arange(5) # y=([0,1,2,3,4])
m = np.vstack([x,y]) # m=([[0,2,4,6,8],
# [0,1,2,3,4]])
xy = np.hstack([x,y]) # xy =([0,2,4,6,8,0,1,2,3,4])
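The example above covers vstack and hstack only. As a quick sketch of how the four functions differ, the snippet below prints the result shape of each for the same two length-5 vectors.

```python
import numpy as np

x = np.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
y = np.arange(5)          # [0, 1, 2, 3, 4]

rows = np.vstack([x, y])          # each vector becomes a row
flat = np.hstack([x, y])          # vectors are joined end to end
cols = np.column_stack([x, y])    # each vector becomes a column
depth = np.dstack([x, y])         # vectors are paired along a third axis

print(rows.shape, flat.shape, cols.shape, depth.shape)
```

For 1D inputs, column_stack gives the transpose of the vstack result.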
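NumPy's histogram function, applied to an array, returns a pair of vectors: the histogram of the array and the vector of bin edges. Note: matplotlib also has a function for building histograms (called hist, as in Matlab), which differs from the one in NumPy. The main difference is that pylab.hist plots the histogram automatically, while numpy.histogram only generates the data.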
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> # Build a vector of 10000 normal deviates with variance 0.5^2 and mean 2
>>> mu, sigma = 2, 0.5
>>> v = np.random.normal(mu,sigma,10000)
>>> # Plot a normalized histogram with 50 bins
>>> plt.hist(v, bins=50, density=True) # matplotlib version (plot); density replaces the removed normed argument
>>> plt.show()
>>> # Compute the histogram with numpy and then plot it
>>> (n, bins) = np.histogram(v, bins=50, density=True) # NumPy version (no plot)
>>> plt.plot(.5*(bins[1:]+bins[:-1]), n)
>>> plt.show()
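The plotting line uses .5*(bins[1:]+bins[:-1]) because np.histogram returns bin edges, not bin centers. A plot-free sketch of what comes back follows; the seeded default_rng generator is an assumption made here for reproducibility and is not what the session above uses.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, chosen for reproducibility
v = rng.normal(2, 0.5, 10000)

counts, edges = np.histogram(v, bins=50, density=True)

# 50 bins are delimited by 51 edges; bin centers are the midpoints of adjacent edges.
centers = 0.5 * (edges[1:] + edges[:-1])
# With density=True the bar areas (height * width) sum to 1.
total_area = np.sum(counts * np.diff(edges))

print(len(counts), len(edges), total_area)
```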