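Pandas is built on top of NumPy and makes NumPy-centric applications simpler to write. Data obtained from the platform mostly arrives as a Pandas DataFrame. Pandas also provides the one-dimensional Series and the three-dimensional Panel (Panel was deprecated and later removed from pandas, so it is not covered here).
Each structure is described below:
Series: a one-dimensional array, similar to a one-dimensional array in NumPy. Both also resemble Python's built-in list. The difference is that the elements of a list can have different data types, whereas a NumPy array or a Series stores its values under a single data type, which uses memory more efficiently and speeds up computation.
DataFrame: a two-dimensional, table-like data structure. Many of its features resemble data.frame in R. A DataFrame can be thought of as a container of Series. The rest of this article focuses mainly on DataFrame.
# Import the libraries first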
import pandas as pd
import numpy as np
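A Series consists of a set of data (of any NumPy data type) and an associated set of labels (the index). The simplest Series can be created from the data alone, for example by passing a list; in that case Pandas creates a default integer index.
# Create a Series: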
s = pd.Series([1,3,5,np.nan,6,8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
# Get the index of the Series:
s.index
RangeIndex(start=0, stop=6, step=1)
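The difference between a list and a Series described above can be checked directly. The following is a minimal sketch; the labels "a", "b", "c" and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A plain Python list may mix types freely.
mixed = [1, "three", 5.0]

# A Series stores its values under one dtype; the labels are illustrative.
s = pd.Series([1, 3, 5], index=["a", "b", "c"])
print(s.dtype)   # an integer dtype
print(s["b"])    # access by label instead of by position

# A NaN cannot be stored as an integer, so the values are upcast to float64,
# which is why the Series in the example above prints 1.0, 3.0, ...
s_nan = pd.Series([1, 3, np.nan])
print(s_nan.dtype)

# Values of different types fall back to the generic object dtype.
s_mixed = pd.Series(mixed)
print(s_mixed.dtype)
```

Note that the object dtype gives up most of the memory and speed benefits mentioned earlier, so it is usually worth keeping each Series to one concrete type.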
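A DataFrame is a table-like data structure containing an ordered set of columns. All values within one column share the same data type, while different columns may have different types (numeric, string, boolean, and so on). By analogy with a database, each row of a DataFrame is a record identified by one element of the Index, and each column is a field, that is, an attribute of the record. A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share the same index.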
dates = pd.date_range("20130101",periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
df
 | A | B | C | D |
---|---|---|---|---|
2013-01-01 | 0.513020 | 1.541941 | -0.787835 | -0.666850 |
2013-01-02 | 0.474004 | 1.059586 | -0.652823 | 0.343018 |
2013-01-03 | 1.230373 | 0.725093 | 0.371367 | -0.215019 |
2013-01-04 | 0.464843 | 0.056014 | -1.149305 | 1.216529 |
2013-01-05 | -0.950310 | 0.388610 | -0.779216 | 1.453014 |
2013-01-06 | -1.081557 | -0.687838 | 1.702892 | -1.268365 |
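The "dictionary of Series sharing one index" view can be made concrete with a small sketch. The column names price, ticker and flag and their values are hypothetical and chosen only to show three different column types.

```python
import pandas as pd

# One shared index for all columns.
idx = pd.date_range("20130101", periods=3)

# Three Series with different data types (illustrative values).
price = pd.Series([1.5, 2.5, 3.5], index=idx)
ticker = pd.Series(["x", "y", "z"], index=idx)
flag = pd.Series([True, False, True], index=idx)

# Building the DataFrame from a dict: keys become column names.
frame = pd.DataFrame({"price": price, "ticker": ticker, "flag": flag})

print(frame.dtypes)           # one dtype per column
print(type(frame["price"]))   # selecting a column returns a Series
```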
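(1) Overview of web crawlers
A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.
(2) Fetching stock data with the financial data module: DataReader()
(3) Saving a DataFrame as a CSV file: dataframe.to_csv()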
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime
# Fetch the data (requires network access; whether the "yahoo" source works depends on the pandas_datareader version and the remote service)
df_csvsave = web.DataReader("AAPL","yahoo",datetime.datetime(2019,4,1),datetime.date.today())
# Save as a CSV file in the current directory
df_csvsave.to_csv("data.csv",index=True)
# Display the data
df_csvsave
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
(1) Loading stock data stored in CSV format
(2) Converting CSV file data into a DataFrame: pd.read_csv()
df = pd.read_csv("data.csv",index_col=0,encoding="gb2312")
df
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
View the first few rows:
# Returns 5 rows by default
df.head()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
View the last few rows:
# Returns the last 5 rows by default
df.tail()
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
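Both head() and tail() accept a row count, so the default of 5 can be overridden. A minimal sketch, using a small stand-in frame whose Open values are rounded from the table above:

```python
import pandas as pd

# Stand-in data: six Open prices rounded from the table above.
frame = pd.DataFrame({"Open": [191.64, 191.09, 193.25, 194.79, 196.45, 196.42]})

first_two = frame.head(2)      # first 2 rows
last_three = frame.tail(3)     # last 3 rows
all_but_last = frame.head(-1)  # negative n: every row except the last one

print(first_two)
print(last_three)
print(all_but_last)
```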
View the index of the DataFrame
df.index
Index(['2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04', '2019-04-05',
'2019-04-08', '2019-04-09', '2019-04-10', '2019-04-11', '2019-04-12'],
dtype='object', name='Date')
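The index above has dtype object, which means the dates were read back from the CSV file as plain strings. If real dates are wanted, read_csv can be asked to parse the index column. The sketch below assumes a cut-down, in-memory CSV (three rows and two columns copied from the table above) instead of the data.csv file, so that it runs without the download step.

```python
import io
import pandas as pd

# Three rows copied from the table above, written as CSV text.
csv_text = """Date,Open,Close
2019-04-01,191.639999,191.240005
2019-04-02,191.089996,194.020004
2019-04-03,193.250000,195.350006
"""

# Without parse_dates the index holds plain strings.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(type(raw.index))

# With parse_dates=True pandas tries to parse the index column as dates.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(type(parsed.index))

# A DatetimeIndex supports partial-string selection, e.g. a whole month.
print(parsed.loc["2019-04"])
```

Whether to parse the dates is a design choice: the string index used in the rest of this article is enough for exact-label lookups, while a DatetimeIndex is more convenient for date arithmetic and resampling.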
View the column names of the DataFrame
df.columns
Index(['High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close'], dtype='object')
View the values of the DataFrame
df.values
array([[1.91679993e+02, 1.88380005e+02, 1.91639999e+02, 1.91240005e+02,
2.78620000e+07, 1.91240005e+02],
[1.94460007e+02, 1.91050003e+02, 1.91089996e+02, 1.94020004e+02,
2.27657000e+07, 1.94020004e+02],
[1.96500000e+02, 1.93149994e+02, 1.93250000e+02, 1.95350006e+02,
2.32718000e+07, 1.95350006e+02],
[1.96369995e+02, 1.93139999e+02, 1.94789993e+02, 1.95690002e+02,
1.91143000e+07, 1.95690002e+02],
[1.97100006e+02, 1.95929993e+02, 1.96449997e+02, 1.97000000e+02,
1.85266000e+07, 1.97000000e+02],
[2.00229996e+02, 1.96339996e+02, 1.96419998e+02, 2.00100006e+02,
2.58817000e+07, 2.00100006e+02],
[2.02850006e+02, 1.99229996e+02, 2.00320007e+02, 1.99500000e+02,
3.57682000e+07, 1.99500000e+02],
[2.00740005e+02, 1.98179993e+02, 1.98679993e+02, 2.00619995e+02,
2.16953000e+07, 2.00619995e+02],
[2.01000000e+02, 1.98440002e+02, 2.00850006e+02, 1.98949997e+02,
2.09008000e+07, 1.98949997e+02],
[2.00139999e+02, 1.96210007e+02, 1.99199997e+02, 1.98869995e+02,
2.77443000e+07, 1.98869995e+02]])
Use describe() for a quick statistical summary of the data
df.describe()
 | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
count | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 1.000000e+01 | 10.000000 |
mean | 198.107001 | 195.004999 | 196.268999 | 197.134001 | 2.435307e+07 | 197.134001 |
std | 3.458631 | 3.500351 | 3.518317 | 3.029131 | 5.169550e+06 | 3.029131 |
min | 191.679993 | 188.380005 | 191.089996 | 191.240005 | 1.852660e+07 | 191.240005 |
25% | 196.402496 | 193.142498 | 193.634998 | 195.435005 | 2.109942e+07 | 195.435005 |
50% | 198.620003 | 196.070000 | 196.434998 | 197.934998 | 2.301875e+07 | 197.934998 |
75% | 200.612503 | 197.719994 | 199.069996 | 199.362499 | 2.727865e+07 | 199.362499 |
max | 202.850006 | 199.229996 | 200.850006 | 200.619995 | 3.576820e+07 | 200.619995 |
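Each row of the describe() output corresponds to an ordinary aggregation method, which can be verified on a single column. The sketch below rebuilds only the Open column from the table above as a standalone Series.

```python
import pandas as pd

# The ten Open prices from the table above.
open_prices = pd.Series(
    [191.639999, 191.089996, 193.250000, 194.789993, 196.449997,
     196.419998, 200.320007, 198.679993, 200.850006, 199.199997],
    name="Open",
)

summary = open_prices.describe()

# Each entry matches the corresponding aggregation method.
print(summary["count"], open_prices.count())
print(summary["mean"], open_prices.mean())
print(summary["std"], open_prices.std())      # sample standard deviation (ddof=1)
print(summary["50%"], open_prices.median())   # the 50% quantile is the median
```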
Transpose the data:
df.T
Date | 2019-04-01 | 2019-04-02 | 2019-04-03 | 2019-04-04 | 2019-04-05 | 2019-04-08 | 2019-04-09 | 2019-04-10 | 2019-04-11 | 2019-04-12 |
---|---|---|---|---|---|---|---|---|---|---|
High | 1.916800e+02 | 1.944600e+02 | 1.965000e+02 | 1.963700e+02 | 1.971000e+02 | 2.002300e+02 | 2.028500e+02 | 2.007400e+02 | 2.010000e+02 | 2.001400e+02 |
Low | 1.883800e+02 | 1.910500e+02 | 1.931500e+02 | 1.931400e+02 | 1.959300e+02 | 1.963400e+02 | 1.992300e+02 | 1.981800e+02 | 1.984400e+02 | 1.962100e+02 |
Open | 1.916400e+02 | 1.910900e+02 | 1.932500e+02 | 1.947900e+02 | 1.964500e+02 | 1.964200e+02 | 2.003200e+02 | 1.986800e+02 | 2.008500e+02 | 1.992000e+02 |
Close | 1.912400e+02 | 1.940200e+02 | 1.953500e+02 | 1.956900e+02 | 1.970000e+02 | 2.001000e+02 | 1.995000e+02 | 2.006200e+02 | 1.989500e+02 | 1.988700e+02 |
Volume | 2.786200e+07 | 2.276570e+07 | 2.327180e+07 | 1.911430e+07 | 1.852660e+07 | 2.588170e+07 | 3.576820e+07 | 2.169530e+07 | 2.090080e+07 | 2.774430e+07 |
Adj Close | 1.912400e+02 | 1.940200e+02 | 1.953500e+02 | 1.956900e+02 | 1.970000e+02 | 2.001000e+02 | 1.995000e+02 | 2.006200e+02 | 1.989500e+02 | 1.988700e+02 |
Sort the DataFrame by a column
df.sort_values("Open",ascending=False)
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
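Sorting returns a new object and leaves the original unchanged, and the index can be sorted in the same way as a column. A minimal sketch on a three-row stand-in frame (Open values rounded from the table above):

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.64, 191.09, 193.25], "Volume": [27862000, 22765700, 23271800]},
    index=["2019-04-01", "2019-04-02", "2019-04-03"],
)

# Ascending order is the default; a new DataFrame is returned.
by_open = frame.sort_values("Open")

# Sort by the row index instead of a column, newest date first.
newest_first = frame.sort_index(ascending=False)

print(by_open)
print(newest_first)
print(frame)  # the original order is untouched
```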
Select a single column:
df["Open"]
Date
2019-04-01 191.639999
2019-04-02 191.089996
2019-04-03 193.250000
2019-04-04 194.789993
2019-04-05 196.449997
2019-04-08 196.419998
2019-04-09 200.320007
2019-04-10 198.679993
2019-04-11 200.850006
2019-04-12 199.199997
Name: Open, dtype: float64
# Returns a DataFrame
df[["Open"]]
Date | Open |
---|---|
2019-04-01 | 191.639999 |
2019-04-02 | 191.089996 |
2019-04-03 | 193.250000 |
2019-04-04 | 194.789993 |
2019-04-05 | 196.449997 |
2019-04-08 | 196.419998 |
2019-04-09 | 200.320007 |
2019-04-10 | 198.679993 |
2019-04-11 | 200.850006 |
2019-04-12 | 199.199997 |
Select multiple columns
df[["Open","High"]]
Date | Open | High |
---|---|---|
2019-04-01 | 191.639999 | 191.679993 |
2019-04-02 | 191.089996 | 194.460007 |
2019-04-03 | 193.250000 | 196.500000 |
2019-04-04 | 194.789993 | 196.369995 |
2019-04-05 | 196.449997 | 197.100006 |
2019-04-08 | 196.419998 | 200.229996 |
2019-04-09 | 200.320007 | 202.850006 |
2019-04-10 | 198.679993 | 200.740005 |
2019-04-11 | 200.850006 | 201.000000 |
2019-04-12 | 199.199997 | 200.139999 |
Select multiple rows
df[0:3]
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
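df.loc[row_label, column_label]
df.loc["a":"b"] # select the rows from label a through label b, both included
df.loc[:,"Open"] # select the Open column
The first argument of df.loc is the row label and the second is the column label (optional; defaults to all columns). Each argument can be a single label, a list of labels, or a label slice. If both arguments are single labels, the result is a single value; if exactly one of them is a single label, the result is a Series; otherwise (a list or slice for both) the result is a DataFrame.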
df.loc["2019-04-01","Open"]
191.63999938964844
df.loc["2019-04-01":"2019-04-03"]
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
df.loc[:,"Open"]
Date
2019-04-01 191.639999
2019-04-02 191.089996
2019-04-03 193.250000
2019-04-04 194.789993
2019-04-05 196.449997
2019-04-08 196.419998
2019-04-09 200.320007
2019-04-10 198.679993
2019-04-11 200.850006
2019-04-12 199.199997
Name: Open, dtype: float64
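The return type of df.loc depends on the kind of selector passed for each axis, which can be checked with type(). The sketch below uses a three-row, two-column stand-in frame with values copied from the table above.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.639999, 191.089996, 193.250000],
     "Close": [191.240005, 194.020004, 195.350006]},
    index=["2019-04-01", "2019-04-02", "2019-04-03"],
)

scalar = frame.loc["2019-04-01", "Open"]    # two single labels -> single value
row = frame.loc["2019-04-01"]               # one row label -> Series
col = frame.loc[:, "Open"]                  # slice + single label -> Series
sub = frame.loc[["2019-04-01"], ["Open"]]   # two lists -> DataFrame, even 1x1

for result in (scalar, row, col, sub):
    print(type(result))
```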
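df.iloc[row_position, column_position]
df.iloc[1,1] # select the value at the second row and second column; returns a single value
df.iloc[[0,2],:] # select the first and third rows
df.iloc[0:2,:] # select the first two rows (positions 0 and 1; position 2, the third row, is excluded)
df.iloc[:,1] # select the second column for all records; returns a Series
df.iloc[1,:] # select the second row; returns a Series
df
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|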
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
# Select the value at the second row and second column; returns a single value
df.iloc[1,1]
191.0500030517578
# Select the first and third rows
df.iloc[[0,2],:]
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
# Select the first two rows (the third row, at position 2, is excluded)
df.iloc[0:2,:]
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
# Select the second column for all records; returns a Series
df.iloc[:,1]
Date
2019-04-01 188.380005
2019-04-02 191.050003
2019-04-03 193.149994
2019-04-04 193.139999
2019-04-05 195.929993
2019-04-08 196.339996
2019-04-09 199.229996
2019-04-10 198.179993
2019-04-11 198.440002
2019-04-12 196.210007
Name: Low, dtype: float64
# Select the second row; returns a Series
df.iloc[1,:]
High 1.944600e+02
Low 1.910500e+02
Open 1.910900e+02
Close 1.940200e+02
Volume 2.276570e+07
Adj Close 1.940200e+02
Name: 2019-04-02, dtype: float64
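One difference between the two indexers is easy to overlook: a label slice with loc includes both endpoints, while a position slice with iloc excludes the stop position, as ordinary Python slicing does. A minimal sketch on a five-row stand-in frame (Open values rounded from the table above):

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.64, 191.09, 193.25, 194.79, 196.45]},
    index=["2019-04-01", "2019-04-02", "2019-04-03", "2019-04-04", "2019-04-05"],
)

# Label slice: both "2019-04-01" and "2019-04-03" are included.
by_label = frame.loc["2019-04-01":"2019-04-03"]

# Position slice: position 2 is excluded, so only two rows come back.
by_position = frame.iloc[0:2]

# Positions start at 0, so iloc[1] is the second row.
second_row = frame.iloc[1]

print(len(by_label), len(by_position), second_row.name)
```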
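df[boolean_condition]
df[df.one >= 2] # a single condition (one is a placeholder column name)
df[(df.one >=1 )&(df.one<3)] # several conditions combined; each one must be wrapped in parentheses
df
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|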
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
df[df.Open > 195]
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
df[(df.Open > 195)&(df.Close < 199)]
Date | High | Low | Open | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
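A condition such as df.Open > 195 produces a boolean Series, and conditions are combined element-wise with & (and), | (or) and ~ (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The sketch below rebuilds only the Open and Close columns from the table above; the query() call at the end is an alternative spelling of the same filter, not something the original example uses.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.639999, 191.089996, 193.250000, 194.789993, 196.449997,
              196.419998, 200.320007, 198.679993, 200.850006, 199.199997],
     "Close": [191.240005, 194.020004, 195.350006, 195.690002, 197.000000,
               200.100006, 199.500000, 200.619995, 198.949997, 198.869995]},
    index=["2019-04-01", "2019-04-02", "2019-04-03", "2019-04-04", "2019-04-05",
           "2019-04-08", "2019-04-09", "2019-04-10", "2019-04-11", "2019-04-12"],
)

# The mask is itself a boolean Series aligned with the rows.
mask = (frame["Open"] > 195) & (frame["Close"] < 199)
selected = frame[mask]

# "or": either price is above 200.
either = frame[(frame["Open"] > 200) | (frame["Close"] > 200)]

# "not": rows where Open is NOT above 195.
negated = frame[~(frame["Open"] > 195)]

# Same filter as `selected`, written as a query string.
same = frame.query("Open > 195 and Close < 199")

print(selected)
print(either)
print(negated)
```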