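NumPy and Pandas are the two most fundamental libraries for data analysis in Python. NumPy is the foundational library for scientific computing in Python, and a large number of mathematical and scientific packages build on it. Pandas is one of them: it is designed specifically for data analysis and borrows heavily from NumPy's concepts. The built-in tools of the Python standard library are too simple or insufficient for most of the computation that data analysis requires, which is why libraries such as NumPy and Pandas have become so popular.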
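Third-party Python distributions usually ship with NumPy and Pandas already installed. To check whether your Python has both libraries:
import numpy as np # import numpy under the alias np
import pandas as pd # import pandas under the alias pd
If the imports raise no error, numpy and pandas are already installed.
If an error is raised, open the Command Prompt (CMD) and run pip install numpy or pip install pandas to install the missing library.
Alternatively, download the .whl file for the module and use the cd command in CMD to change to the directory that contains it. If the pip directory has not been added to the PATH environment variable, it is easiest to put the .whl file on the desktop, run cd desktop in CMD to enter the desktop directory, and then run python -m pip install followed by the .whl file name to install it.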
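The check described above can also be done without triggering an ImportError. The following is a minimal sketch using the standard library's importlib; the helper name is_installed is hypothetical and introduced only for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two libraries used in this post.
for name in ("numpy", "pandas"):
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```

A library reported as missing can then be installed with pip as described above.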
The foundation of the numpy library is the ndarray (N-dimensional array) object. The simplest way to define an ndarray is with the array() function.
import numpy as np
a = np.array([1,2,3])
print(a)
[1 2 3] # the list has become an array
import numpy as np
a = np.array([1,2,3])
print(type(a))
<class 'numpy.ndarray'>
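The array a we just created is one-dimensional with three elements, so its rank (number of dimensions) is 1. We can of course also define a 2 × 2 two-dimensional array: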
import numpy as np
b = np.array([[1,2],[3,4]])
print(b) # print array b
print(b.size) # number of elements in b
print(b.shape) # shape of b (its size along each dimension)
[[1 2]
[3 4]]
4
(2, 2)
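To make the difference between rank, shape, and size concrete, here is a small sketch that prints the three attributes side by side for a 1-D and a 2-D array. It uses only the standard ndarray attributes ndim, shape, and size.

```python
import numpy as np

a = np.array([1, 2, 3])          # one-dimensional, three elements
b = np.array([[1, 2], [3, 4]])   # two-dimensional, 2 x 2

# ndim is the rank (number of dimensions), shape is the size along
# each dimension, and size is the total number of elements.
print(a.ndim, a.shape, a.size)   # 1 (3,) 3
print(b.ndim, b.shape, b.size)   # 2 (2, 2) 4
```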
import numpy as np
print(np.zeros((3,3)))
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
import numpy as np
a = np.arange(0,12).reshape(3,4)
print(a)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
import numpy as np
a = np.random.random(3) # a 1-D array of 3 random floats in [0, 1)
b = np.random.random((3,3)) # a 3 × 3 array of random floats
print(a)
print(b)
[0.50262628 0.75309293 0.43306456]
[[0.95669556 0.79520557 0.57801797]
[0.13321105 0.37363079 0.03048739]
[0.07016573 0.91489571 0.09046483]]
import numpy as np
a = np.arange(4)
b = np.arange(4,8)
print(a[3])
print(a+4)
print(a+b)
print(a*b)
print(np.dot(a,b)) # dot product (inner product of two 1-D arrays)
3
[4 5 6 7]
[ 4 6 8 10]
[ 0 5 12 21]
38
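For 1-D arrays np.dot returns a scalar, so the difference between the element-wise product and the matrix product is easier to see with 2-D arrays. The following sketch assumes two small 2 × 2 arrays chosen only for illustration.

```python
import numpy as np

m = np.arange(4).reshape(2, 2)      # [[0 1] [2 3]]
n = np.arange(4, 8).reshape(2, 2)   # [[4 5] [6 7]]

elementwise = m * n   # multiplies matching positions
matrix = m @ n        # matrix product, same result as np.dot(m, n)

print(elementwise)    # [[ 0  5] [12 21]]
print(matrix)         # [[ 6  7] [26 31]]
```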
import numpy as np
a= np.arange(4)
a+=4
print(a)
[4 5 6 7]
import numpy as np
a = np.array([1,2,3])
print(a.sum())
print(a.min())
print(a.max())
print(a.mean())
6
1
3
2.0
import numpy as np
a = np.random.random((4,4))
print(a<0.5)
[[ True False False False]
[ True False False False]
[ True False True False]
[ True True False True]]
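A Boolean array like the one above is most often used as a mask to select elements. The sketch below uses a small fixed array instead of random values so that the result is reproducible.

```python
import numpy as np

a = np.array([0.2, 0.7, 0.4, 0.9])
mask = a < 0.5        # Boolean array, one entry per element of a

print(mask)           # [ True False  True False]
print(a[mask])        # keeps only the elements where mask is True
print(mask.sum())     # True counts as 1, so this counts the matches
```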
import numpy as np
a = np.arange(12)
b = a.reshape(3,4) # three rows, four columns
print(a)
print(b)
print(b.transpose()) # the transpose of b
[ 0 1 2 3 4 5 6 7 8 9 10 11]
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[ 0 4 8]
[ 1 5 9]
[ 2 6 10]
[ 3 7 11]]
import numpy as np
a = np.ones((3,3))
b = np.zeros((3,3))
c = np.vstack((a,b))
d = np.hstack((a,b))
print(c)
print(d)
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
[[1. 1. 1. 0. 0. 0.]
[1. 1. 1. 0. 0. 0.]
[1. 1. 1. 0. 0. 0.]]
Joining one-dimensional arrays:
import numpy as np
a = np.array([0,1,2])
b = np.array([3,4,5])
c = np.array([6,7,8])
d = np.column_stack((a,b,c)) # stack the arrays as columns
e = np.vstack((a,b,c)) # stack the arrays as rows (np.row_stack is an older alias of np.vstack)
print(d)
print(e)
[[0 3 6]
[1 4 7]
[2 5 8]]
[[0 1 2]
[3 4 5]
[6 7 8]]
import numpy as np
a = np.arange(20).reshape(5,4)
print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
print(a[:,:]) # take everything along both dimensions; "," separates the dimensions and ":" selects the index range within each one
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
print(a[0:2,1:3]) # first dimension: indices 0 up to 2 (2 excluded); second dimension: indices 1 up to 3 (3 excluded); that is, rows 0 and 1, columns 1 and 2
[[1 2]
 [5 6]]
print(a[:3,1:]) # first dimension: from the start up to 3 (3 excluded); second dimension: from index 1 to the end
[[ 1  2  3]
 [ 5  6  7]
 [ 9 10 11]]
print(a[2:,:3]) # first dimension: from index 2 to the end; second dimension: from the start up to 3 (3 excluded)
[[ 8  9 10]
 [12 13 14]
 [16 17 18]]
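One point worth knowing about the slices above is that basic slicing returns a view of the original array rather than a copy. The sketch below illustrates this; the variable names view and independent are chosen only for illustration.

```python
import numpy as np

a = np.arange(20).reshape(5, 4)

view = a[0:2, 1:3]       # basic slicing returns a view onto a's data
view[0, 0] = 100         # writing through the view changes a as well
print(a[0, 1])           # 100

independent = a[0:2, 1:3].copy()   # an explicit copy owns its own data
independent[0, 0] = -1
print(a[0, 1])           # still 100, the copy does not affect a
```

Calling .copy() is the usual choice when the slice will be modified and the original array must stay unchanged.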
data = np.genfromtxt('data.csv', delimiter=',', names=True)
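Since data.csv is not shown here, the following sketch reads a small in-memory CSV instead to show what names=True does. The column names x and y and the values are made up for illustration.

```python
import io

import numpy as np

# A stand-in for data.csv; the header row supplies the field names.
csv_text = "x,y\n1,2\n3,4\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

print(data.dtype.names)   # ('x', 'y')
print(data["x"])          # the column named x, as floats by default
```

With names=True the result is a structured array, so columns are accessed by field name rather than by integer position.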
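The core of pandas is two data structures, Series and DataFrame. They cannot solve every problem, but they provide effective and powerful tools for most data tasks. Both integrate an Index into their structure, which makes the data easy to manipulate.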
a = pd.Series([1,2,3])
print(a)
0 1
1 2
2 3
dtype: int64
s = pd.Series([1,2,3],index = ['a','b','c'])
print(s)
a 1
b 2
c 3
dtype: int64
print(s['b'])
2
print(s>2)
a False
b False
c True
dtype: bool
print(s[s>2])
c 3
dtype: int64
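One practical consequence of the Index being part of a Series is that arithmetic aligns values by label, not by position. The sketch below uses two made-up Series with partly overlapping labels.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by index label; labels present in only one
# Series produce NaN in the result.
total = s1 + s2
print(total)
```

Here b and c are added label by label, while a and d become NaN because each appears in only one of the two Series.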
s = pd.Series([1,2,3,np.nan])
print(s.isnull())
print(s.notnull())
0 False
1 False
2 False
3 True
dtype: bool
0 True
1 True
2 True
3 False
dtype: bool
When isnull and notnull are used as filter conditions:
s = pd.Series([1,2,3,np.nan])
print(s[s.isnull()])
print(s[s.notnull()])
3 NaN
dtype: float64
0 1.0
1 2.0
2 3.0
dtype: float64
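Filtering with notnull is common enough that pandas offers shortcuts for it. The following sketch shows dropna, which gives the same result as s[s.notnull()], and fillna, which replaces missing values instead of removing them; the fill value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan])

cleaned = s.dropna()    # same rows as s[s.notnull()]
filled = s.fillna(0)    # replace NaN with 0 instead of dropping it

print(cleaned)
print(filled)
```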
data = {'color' : ['bule','green','yellow','red'],
'object':['ball','pen','pencil','paper'],
'price':[1.2,1.0,1.5,1.4]}
frame =pd.DataFrame(data)
print(frame)
    color  object  price
0    bule    ball    1.2
1   green     pen    1.0
2  yellow  pencil    1.5
3     red   paper    1.4
As with Series, if the index is not given explicit labels, a DataFrame index starts from 0.
data = {'color' : ['bule','green','yellow','red'],
'object':['ball','pen','pencil','paper'],
'price':[1.2,1.0,1.5,1.4]}
frame =pd.DataFrame(data,index=['one','two','three','four'])
print(frame)
        color  object  price
one      bule    ball    1.2
two     green     pen    1.0
three  yellow  pencil    1.5
four      red   paper    1.4
print(frame.columns) # get the column labels
Index(['color', 'object', 'price'], dtype='object')
print(frame.values) # get all the elements
[['bule' 'ball' 1.2]
['green' 'pen' 1.0]
['yellow' 'pencil' 1.5]
['red' 'paper' 1.4]]
print(frame.price)
one 1.2
two 1.0
three 1.5
four 1.4
Name: price, dtype: float64
print(frame.iloc[2]) # get a row by its integer position
color yellow
object pencil
price 1.5
Name: three, dtype: object
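Rows can be selected either by label or by integer position. The sketch below rebuilds the same frame and compares the two accessors, loc and iloc, on the row shown above.

```python
import pandas as pd

data = {'color': ['bule', 'green', 'yellow', 'red'],
        'object': ['ball', 'pen', 'pencil', 'paper'],
        'price': [1.2, 1.0, 1.5, 1.4]}
frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])

by_label = frame.loc['three']      # select the row by its index label
by_position = frame.iloc[2]        # select the row by integer position
print(by_label.equals(by_position))   # both refer to the same row

print(frame.loc['two', 'price'])   # a single cell: row label, column label
```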
frame.index.name = 'id'
frame.columns.name='item'
print(frame)
item color object price
id
one bule ball 1.2
two green pen 1.0
three yellow pencil 1.5
four red paper 1.4
print(frame.isin([1,0,'pen']))
item color object price
id
one False False False
two False True True
three False False False
four False False False
print(frame[frame.isin([1,0,'pen'])]) # filtering
item color object price
id
one NaN NaN NaN
two NaN pen 1.0
three NaN NaN NaN
four NaN NaN NaN
del frame['price'] # delete the price column
print(frame)
item color object
id
one bule ball
two green pen
three yellow pencil
four red paper
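The del statement modifies the frame in place. When the original should be kept, one alternative is the drop method, which returns a new DataFrame. The sketch below rebuilds the frame for that purpose; the variable name trimmed is chosen only for illustration.

```python
import pandas as pd

data = {'color': ['bule', 'green', 'yellow', 'red'],
        'object': ['ball', 'pen', 'pencil', 'paper'],
        'price': [1.2, 1.0, 1.5, 1.4]}
frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])

# drop returns a new DataFrame and leaves frame itself unchanged.
trimmed = frame.drop(columns='price')

print(list(trimmed.columns))   # ['color', 'object']
print(list(frame.columns))     # ['color', 'object', 'price']
```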
The remaining important pandas features will be covered in the next installment.