如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化

一、Numpy

  • Numpy(Numerical Python extensions)是一个第三方的Python包,用于科学计算,前身是1995年就开始开发的一个用于数组运算的库
  • 极大地简化了向量和矩阵的操作处理,是一些主力软件包(如scikit-learn、scipy、pandas和tensorflow)架构的基础部分。
  • Quickstart tutorial:https://docs.scipy.org/doc/numpy/user/quickstart.html
  • A Visual Intro to NumPy and Data Representation:http://jalammar.github.io/visual-numpy/
import numpy as np
np.array([2, 3, 6, 7])
array([2, 3, 6, 7])
a=np.array([0,0,0])
a
array([0, 0, 0])
np.array([2, 3, 6, 7.])
array([2., 3., 6., 7.])
np.array([2, 3, 6, 7+1j])
array([2.+0.j, 3.+0.j, 6.+0.j, 7.+1.j])

等差数列的数组

np.arange(5)
array([0, 1, 2, 3, 4])
np.arange(10, 100, 20, dtype=float)
array([10., 30., 50., 70., 90.])
np.linspace(0., 2.5, 5)
array([0.   , 0.625, 1.25 , 1.875, 2.5  ])
x = np.linspace(0, 2*np.pi, 10)
print(x)
print(x.shape)
print(x.ndim)
f = np.sin(x)
f
[0.         0.6981317  1.3962634  2.0943951  2.7925268  3.4906585
 4.1887902  4.88692191 5.58505361 6.28318531]
(10,)
1





array([ 0.00000000e+00,  6.42787610e-01,  9.84807753e-01,  8.66025404e-01,
        3.42020143e-01, -3.42020143e-01, -8.66025404e-01, -9.84807753e-01,
       -6.42787610e-01, -2.44929360e-16])

二维数组

a = np.array([[1, 2, 3], [4, 5, 6]])
a
array([[1, 2, 3],
       [4, 5, 6]])
a.shape
(2, 3)
a.ndim
2
a.size
6

改变数组的形状

a = np.arange(0, 20, 1)      # 一维数组
a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
b = a.reshape((4, 5))
b
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
c = a.reshape((20, 1))
c
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15],
       [16],
       [17],
       [18],
       [19]])
d = a.reshape((-1, 4))
d
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
print(a)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
a.shape = (4, 5)
print(a)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]

形状(N, ), (N, 1)和(1, N)不同

  • 形状(N, ):数组是一维的
  • 形状(N, 1):数组是二维的,N行一列
  • 形状(1, N):数组是二维的,一行N列
a = np.array([1, 2, 3, 4, 5])    # 一维数组
b = a.copy()

c1 = np.dot(np.transpose(a), b)  # 转置对一维数组不起作用
print(c1)
c2 = np.dot(a, np.transpose(b))  # 转置也可以写成b.T
print(c2)

ax = np.reshape(a, (5, 1))
print(ax)
bx = np.reshape(b, (1, 5))
print(bx)
c = np.dot(ax, bx)
print(c)
55
55
[[1]
 [2]
 [3]
 [4]
 [5]]
[[1 2 3 4 5]]
[[ 1  2  3  4  5]
 [ 2  4  6  8 10]
 [ 3  6  9 12 15]
 [ 4  8 12 16 20]
 [ 5 10 15 20 25]]

填充数组

np.zeros(3)
array([0., 0., 0.])
np.zeros((2, 2), complex)
array([[0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j]])
np.ones((2, 3))
array([[1., 1., 1.],
       [1., 1., 1.]])
np.full((2, 2), 5)
array([[5, 5],
       [5, 5]])
# rand: 0到1之间[0, 1)均匀分布的随机数
np.random.rand(2, 4)
array([[0.47985176, 0.69532184, 0.26390581, 0.43990791],
       [0.05152074, 0.67448969, 0.31955424, 0.61910693]])
# randn:服从均值为0,方差为1的标准正态(高斯)分布的随机数
np.random.randn(2, 4)
array([[ 0.16204318,  0.98753155, -0.53755078,  0.93984252],
       [ 0.08822856, -0.47378803, -0.5818457 ,  0.78371192]])

索引与切片

a = np.array([0, 1, 2, 3, 4])
a[1:3]
array([1, 2])
a[:3]
array([0, 1, 2])
a[1:]
array([1, 2, 3, 4])
a[1:-1]
array([1, 2, 3])
a[:]
array([0, 1, 2, 3, 4])
a[::2]
array([0, 2, 4])
a[1:4:2]
array([1, 3])
a[::-1]
array([4, 3, 2, 1, 0])
a = np.arange(12); a.shape = (3, 4); a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
a[1, 2]
6
a[1, -1]
7
a[:, 1]
array([1, 5, 9])
a[2, :]
array([ 8,  9, 10, 11])
a[1][2]
6
a[2]
array([ 8,  9, 10, 11])
a[0, 1:3]
array([1, 2])
a[1:, 2:]
array([[ 6,  7],
       [10, 11]])
a[::2, 1::2]
array([[ 1,  3],
       [ 9, 11]])

拷贝与视图

a = np.arange(5); a
array([0, 1, 2, 3, 4])
b = a[2:].copy()           # .copy()
b
array([2, 3, 4])
b[0] = 100;
print(b)
print(a)
[100   3   4]
[0 1 2 3 4]

数组运算

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
print(x + y)              # 加法运算
print(np.add(x, y))
[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]
print(x - y)              # 减法运算
print(np.subtract(x, y))
[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]
print(x * y)              # 乘法运算
print(np.multiply(x, y))
[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]
print(x / y)              # 除法运算
print(np.divide(x, y))
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
print(np.sqrt(x))         # 平方根运算
[[1.         1.41421356]
 [1.73205081 2.        ]]

广播机制(broadcasting)

https://www.runoob.com/numpy/numpy-broadcast.html

a = np.array([[ 0, 0, 0],
           [10,10,10],
           [20,20,20],
           [30,30,30]])
b = np.array([1,2,3])
print(a + b)
[[ 1  2  3]
 [11 12 13]
 [21 22 23]
 [31 32 33]]

矩阵乘法

A = np.array([[1, 2], [3, 4]])
print(np.dot(A, A))
print(A*A)
[[ 7 10]
 [15 22]]
[[ 1  4]
 [ 9 16]]
x = np.array([10, 20])
np.dot(A, x)            #等价于A.dot(x)
array([ 50, 110])
np.dot(x, A)            #等价于x.dot(A)
array([ 70, 100])

更高效的数学函数

https://docs.scipy.org/doc/numpy/reference/routines.math.html

x = np.array([[1,2],[3,4]])
x
array([[1, 2],
       [3, 4]])
print(np.sum(x))          # Compute sum of all elements;
print(np.sum(x, axis=0))  # Compute sum of each column;
print(np.sum(x, axis=1))  # Compute sum of each row;
10
[4 6]
[3 7]

二、Matplotlib

  • Matplotlib是Python中最常用的可视化工具之一,可以非常方便地创建海量类型的2D图表和一些基本的3D图表
  • 因为在函数的设计上参考了MATLAB,所以叫做Matplotlib
  • Pyplot tutorial:https://matplotlib.org/stable/tutorials/introductory/pyplot.html
import matplotlib.pyplot as plt

plt.plot([1,2,3,4], [1,4,9,16],  'r--')
plt.axis([0, 6, 0, 20])
plt.show()

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第1张图片

%matplotlib inline

一张图中多条曲线

import numpy as np

t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第2张图片

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3*np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)

plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第3张图片

多张子图

def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure()
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第4张图片

# figure的使用
x = np.linspace(-1, 1, 50)
y1 = 2 * x + 1

# figure 1
plt.figure(1)
plt.plot(x, y1)


# figure 2
y2 = x**2
plt.figure()
plt.plot(x, y2)


# figure 3,指定figure的编号并指定figure的大小, 指定线的颜色, 宽度和类型
y2 = x**2
plt.figure(num = 5, figsize = (4, 4))
plt.plot(x, y1)
plt.plot(x, y2, color = 'red', linewidth = 1.0, linestyle = '--')

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第5张图片
如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第6张图片
如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第7张图片

分类变量的图

names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]

plt.figure(1, figsize=(9, 3))

plt.subplot(131)
plt.bar(names, values)

plt.subplot(132)
plt.scatter(names, values)

plt.subplot(133)
plt.plot(names, values)

plt.suptitle('Categorical Plotting')
Text(0.5, 0.98, 'Categorical Plotting')

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第8张图片

添加文本

mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=1, facecolor='g', alpha=0.75)

plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第9张图片

添加注释

ax = plt.subplot(111)

t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)

line, = plt.plot(t, s, lw=2)

plt.annotate('local max', xy=(2, 1), xytext=(3, 1.5), arrowprops=dict(facecolor='black', shrink=0.05))
plt.ylim(-2,2)
(-2.0, 2.0)

如何运用Numpy&Matplotlib&Pandas进行数据的分析和可视化_第10张图片

三、Pandas

  • Pandas是python的一个数据分析包
  • 由AQR Capital Management于2008年4月开发,并于2009年底开源出来
  • 10 Minutes to pandas:https://pandas.pydata.org/docs/user_guide/10min.html

3.1 Series

  • 一维标记数组,由一组数据以及一组与之相关的数据标签(即索引)组成。

传入列表创建Series

import pandas as pd
# 传入列表,使用缺省整数索引
obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
obj.values
array([ 4,  7, -5,  3], dtype=int64)
obj.index
RangeIndex(start=0, stop=4, step=1)
# 传入列表,并给定索引
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
# 修改索引
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

传入字典创建Series

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

检测缺失数据

pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

访问Series中的元素

# 通过索引访问Series中的元素
print(obj2['a'])
obj2['d']= 6
obj2[['c','a','d']]
-5





c    3
a   -5
d    6
dtype: int64
print('b' in obj2)
print('e' in obj2)
print(3 in obj2.values)
True
False
True

对Series的操作

# 用条件过滤数组
obj2[obj2 > 0]
d    6
b    7
c    3
dtype: int64
# 标量乘法
obj2*2
d    12
b    14
a   -10
c     6
dtype: int64
# 数学函数
np.exp(obj2)
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
print(obj3)
print(obj4)

obj3 + obj4      # 自动对齐索引
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64





California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

3.2 DataFrame

  • 二维表格型数据结构, 含有一组有序的列,每列都有标签,可看成一个Series的字典,既有行索引又有列索引

创建DataFrame,传入由等长列表或数组构成的字典

data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
      'year':[2000, 2001, 2002, 2001, 2002],
      'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
frame
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
pd.DataFrame(data, columns=['year', 'state', 'pop'])
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

创建DataFrame, 传入嵌套字典

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002:3.6}}
frame3 = pd.DataFrame(pop)
frame3
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
frame4 = pd.DataFrame(pop, index=[2001, 2002, 2003])
frame4
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN

缺失数据处理

frame3.dropna(how='all')    # 删除有任何缺失值的行
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
frame4.fillna(value=5)     # 填充缺失值
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 5.0 5.0
frame3.isnull()       # 判断哪些是缺失值
Nevada Ohio
2001 False False
2002 False False
2000 True False

访问单列

frame2['state']          # 字典记法
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
frame2.state           # 属性记法
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

访问单行

frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
frame2.iloc[2]
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

修改列

frame2['debt'] = 16.5
frame2
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
import numpy as np
frame2['debt'] = np.arange(5)
frame2
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
val = pd.Series([-1.2, -1.5, -1.7], index=[ 'two', 'four', 'five'])
frame2['debt'] = val
frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7

增加列

frame2['eastern'] = (frame2.state == 'Ohio')
frame2
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False

删除行和列

del frame2['eastern']
frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
frame2.drop(['pop','debt'], axis=1)
year state
one 2000 Ohio
two 2001 Ohio
three 2002 Ohio
four 2001 Nevada
five 2002 Nevada
frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
frame2.drop(columns=['pop','debt'])
year state
one 2000 Ohio
two 2001 Ohio
three 2002 Ohio
four 2001 Nevada
five 2002 Nevada
frame2.drop(['one', 'three', 'five'], axis=0)
year state pop debt
two 2001 Ohio 1.7 -1.2
four 2001 Nevada 2.4 -1.5
frame2
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
frame2.drop(['pop','debt'], axis=1, inplace=True)
frame2
year state
one 2000 Ohio
two 2001 Ohio
three 2002 Ohio
four 2001 Nevada
five 2002 Nevada

你可能感兴趣的:(机器学习,numpy,pandas,matplotlib,python,数据分析)