08 Python Pandas Library: Viewing and Selecting Data

Viewing and Selecting Data with Pandas

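Pandas is built on top of NumPy and makes NumPy-centric applications simpler. Data obtained from the platform mostly arrives as a Pandas DataFrame. Besides DataFrame, Pandas provides the one-dimensional Series. Older versions also offered a three-dimensional Panel, which has been removed in current releases.

Each structure is introduced in detail below:

Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's basic list structure. The difference is that a list may hold elements of different data types, whereas an array or Series is meant to store a single data type, which uses memory more efficiently and speeds up computation.

DataFrame: a two-dimensional, table-like data structure. Many of its features resemble data.frame in R. A DataFrame can be thought of as a container of Series. The rest of this article focuses mainly on DataFrame.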
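
The difference between a list and a Series can be checked directly. The following is a minimal sketch with made-up values; the variable names are illustrative only and do not come from the article.

```python
import pandas as pd

# A plain Python list can mix types freely.
mixed_list = [1, "two", 3.0]

# A Series built from integers gets a single integer dtype.
int_series = pd.Series([1, 2, 3])
print(int_series.dtype)

# Mixed values are still accepted, but pandas falls back to the generic
# object dtype, which gives up the memory and speed benefits of a
# homogeneous array.
mixed_series = pd.Series(mixed_list)
print(mixed_series.dtype)

# Each column of a DataFrame is itself a Series.
frame = pd.DataFrame({"price": [1.5, 2.5], "name": ["a", "b"]})
print(type(frame["price"]))
```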

# Import the libraries first
import pandas as pd
import numpy as np

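1. Series

A Series consists of a set of data (of any NumPy data type) and an associated set of labels (the index). A set of data alone is enough to produce the simplest Series: pass a list to create one, and Pandas creates a default integer index.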

# Create a Series:
s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
# Get the index of the Series:
s.index
RangeIndex(start=0, stop=6, step=1)
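
The default integer index can be replaced by passing explicit labels through the `index` argument. The sketch below uses hypothetical labels and values chosen only for illustration.

```python
import pandas as pd

# Hypothetical prices keyed by made-up labels.
prices = pd.Series([10.5, 20.0, 30.25], index=["aaa", "bbb", "ccc"])

print(prices.index)      # the labels passed in, not 0..2
print(prices["bbb"])     # look up a value by its label
print(prices.iloc[0])    # positional access is still available
```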

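2. DataFrame

A DataFrame is a table-like data structure containing an ordered set of columns. Each column holds a single data type, while different columns may hold different types (numeric, string, Boolean, and so on). In database terms, each row of a DataFrame is a record identified by an element of the Index, and each column is a field, that is, an attribute of the record. A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that share the same index.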
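
The "dictionary of Series sharing one index" view can be made concrete. This is a small sketch with invented column names and row labels, not data from the article.

```python
import pandas as pd

# Three Series with the same index, each with its own data type.
idx = ["r1", "r2", "r3"]
frame = pd.DataFrame({
    "price": pd.Series([1.0, 2.0, 3.0], index=idx),
    "symbol": pd.Series(["x", "y", "z"], index=idx),
    "active": pd.Series([True, False, True], index=idx),
})

print(frame.dtypes)       # one dtype per column
print(frame.loc["r2"])    # one record (row), addressed by its index label
```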

2.1 Creating a DataFrame

dates = pd.date_range("20130101",periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
df
A B C D
2013-01-01 0.513020 1.541941 -0.787835 -0.666850
2013-01-02 0.474004 1.059586 -0.652823 0.343018
2013-01-03 1.230373 0.725093 0.371367 -0.215019
2013-01-04 0.464843 0.056014 -1.149305 1.216529
2013-01-05 -0.950310 0.388610 -0.779216 1.453014
2013-01-06 -1.081557 -0.687838 1.702892 -1.268365
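
Because the values come from a random generator, the numbers differ on every run. If reproducible output is wanted, one option is a seeded NumPy generator. The sketch below assumes the seed value 42, which is arbitrary, and the names `df_seeded` and `df_again` are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# A seeded generator yields the same numbers on every run.
rng = np.random.default_rng(seed=42)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)),
                         index=dates, columns=list("ABCD"))

# Rebuilding with a fresh generator and the same seed gives an identical frame.
rng_again = np.random.default_rng(seed=42)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)),
                        index=dates, columns=list("ABCD"))

print(df_seeded.equals(df_again))
```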

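2.2 Obtaining stock data

1. Fetching through an API

(1) Overview of web crawlers

A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

(2) Fetching stock data with the financial data module: DataReader()

(3) Saving a DataFrame as a CSV file: dataframe.to_csv()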

import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime

# Fetch the data
df_csvsave = web.DataReader("AAPL","yahoo",datetime.datetime(2019,4,1),datetime.date.today())

# Save as a CSV file in the current directory (all columns, index included)
df_csvsave.to_csv("data.csv",index=True)

# Display the data
df_csvsave
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995

2. Loading from a CSV file

(1) Obtain stock data in CSV format

(2) Converting CSV data to a DataFrame: pd.read_csv()

df = pd.read_csv("data.csv",index_col=0,encoding="gb2312")
df
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995
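
When a CSV file is read with `index_col=0` alone, the date column is loaded as plain strings. If a real datetime index is wanted, `read_csv` accepts `parse_dates=True`. The sketch below reads from an in-memory string rather than from data.csv so that it runs on its own; the two rows are copied from the table above, and the variable names are illustrative.

```python
import io
import pandas as pd

# Two rows from the table above, written out as CSV text.
csv_text = (
    "Date,Open,Close\n"
    "2019-04-01,191.639999,191.240005\n"
    "2019-04-02,191.089996,194.020004\n"
)

# Without parse_dates the index holds strings.
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With parse_dates=True pandas parses the index column as dates.
as_dates = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(as_text.index))
print(type(as_dates.index))
```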

2.2 Viewing data

View the first few rows:

# Returns the first 5 rows by default
df.head()
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000

View the last few rows:

# Returns the last 5 rows by default
df.tail()
High Low Open Close Volume Adj Close
Date
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995
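
Both methods accept a row count when the default of 5 is not what is needed. A minimal sketch on a throwaway frame (the column name `value` is made up):

```python
import pandas as pd

frame = pd.DataFrame({"value": range(10)})

first_three = frame.head(3)   # rows at positions 0, 1, 2
last_two = frame.tail(2)      # rows at positions 8, 9

print(first_three)
print(last_two)
```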

View the index of the DataFrame

df.index
Index(['2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04', '2019-04-05',
       '2019-04-08', '2019-04-09', '2019-04-10', '2019-04-11', '2019-04-12'],
      dtype='object', name='Date')

View the column names of the DataFrame

df.columns
Index(['High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close'], dtype='object')

View the values of the DataFrame

df.values
array([[1.91679993e+02, 1.88380005e+02, 1.91639999e+02, 1.91240005e+02,
        2.78620000e+07, 1.91240005e+02],
       [1.94460007e+02, 1.91050003e+02, 1.91089996e+02, 1.94020004e+02,
        2.27657000e+07, 1.94020004e+02],
       [1.96500000e+02, 1.93149994e+02, 1.93250000e+02, 1.95350006e+02,
        2.32718000e+07, 1.95350006e+02],
       [1.96369995e+02, 1.93139999e+02, 1.94789993e+02, 1.95690002e+02,
        1.91143000e+07, 1.95690002e+02],
       [1.97100006e+02, 1.95929993e+02, 1.96449997e+02, 1.97000000e+02,
        1.85266000e+07, 1.97000000e+02],
       [2.00229996e+02, 1.96339996e+02, 1.96419998e+02, 2.00100006e+02,
        2.58817000e+07, 2.00100006e+02],
       [2.02850006e+02, 1.99229996e+02, 2.00320007e+02, 1.99500000e+02,
        3.57682000e+07, 1.99500000e+02],
       [2.00740005e+02, 1.98179993e+02, 1.98679993e+02, 2.00619995e+02,
        2.16953000e+07, 2.00619995e+02],
       [2.01000000e+02, 1.98440002e+02, 2.00850006e+02, 1.98949997e+02,
        2.09008000e+07, 1.98949997e+02],
       [2.00139999e+02, 1.96210007e+02, 1.99199997e+02, 1.98869995e+02,
        2.77443000e+07, 1.98869995e+02]])

Use describe() for a quick statistical summary of the data

df.describe()
High Low Open Close Volume Adj Close
count 10.000000 10.000000 10.000000 10.000000 1.000000e+01 10.000000
mean 198.107001 195.004999 196.268999 197.134001 2.435307e+07 197.134001
std 3.458631 3.500351 3.518317 3.029131 5.169550e+06 3.029131
min 191.679993 188.380005 191.089996 191.240005 1.852660e+07 191.240005
25% 196.402496 193.142498 193.634998 195.435005 2.109942e+07 195.435005
50% 198.620003 196.070000 196.434998 197.934998 2.301875e+07 197.934998
75% 200.612503 197.719994 199.069996 199.362499 2.727865e+07 199.362499
max 202.850006 199.229996 200.850006 200.619995 3.576820e+07 200.619995

Transpose the data:

df.T
Date 2019-04-01 2019-04-02 2019-04-03 2019-04-04 2019-04-05 2019-04-08 2019-04-09 2019-04-10 2019-04-11 2019-04-12
High 1.916800e+02 1.944600e+02 1.965000e+02 1.963700e+02 1.971000e+02 2.002300e+02 2.028500e+02 2.007400e+02 2.010000e+02 2.001400e+02
Low 1.883800e+02 1.910500e+02 1.931500e+02 1.931400e+02 1.959300e+02 1.963400e+02 1.992300e+02 1.981800e+02 1.984400e+02 1.962100e+02
Open 1.916400e+02 1.910900e+02 1.932500e+02 1.947900e+02 1.964500e+02 1.964200e+02 2.003200e+02 1.986800e+02 2.008500e+02 1.992000e+02
Close 1.912400e+02 1.940200e+02 1.953500e+02 1.956900e+02 1.970000e+02 2.001000e+02 1.995000e+02 2.006200e+02 1.989500e+02 1.988700e+02
Volume 2.786200e+07 2.276570e+07 2.327180e+07 1.911430e+07 1.852660e+07 2.588170e+07 3.576820e+07 2.169530e+07 2.090080e+07 2.774430e+07
Adj Close 1.912400e+02 1.940200e+02 1.953500e+02 1.956900e+02 1.970000e+02 2.001000e+02 1.995000e+02 2.006200e+02 1.989500e+02 1.988700e+02

Sort the DataFrame by a column

df.sort_values("Open",ascending=False)
High Low Open Close Volume Adj Close
Date
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
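
Note that `sort_values` returns a new DataFrame and leaves the original untouched, and `sort_index` restores the order by index. The sketch below uses three rows with values rounded from the table above; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.64, 191.09, 193.25],
     "Volume": [27862000, 22765700, 23271800]},
    index=["2019-04-01", "2019-04-02", "2019-04-03"],
)

# Sorting returns a new object; frame itself keeps its original order.
by_open = frame.sort_values("Open", ascending=False)

# Sorting by the index brings the rows back into date order.
restored = by_open.sort_index()

print(by_open)
print(restored)
```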

2.3 Selecting data

2.3.1 Selecting data by subscript:

Select a single column:

df["Open"]
Date
2019-04-01    191.639999
2019-04-02    191.089996
2019-04-03    193.250000
2019-04-04    194.789993
2019-04-05    196.449997
2019-04-08    196.419998
2019-04-09    200.320007
2019-04-10    198.679993
2019-04-11    200.850006
2019-04-12    199.199997
Name: Open, dtype: float64
# A list of column names returns a DataFrame
df[["Open"]]
Open
Date
2019-04-01 191.639999
2019-04-02 191.089996
2019-04-03 193.250000
2019-04-04 194.789993
2019-04-05 196.449997
2019-04-08 196.419998
2019-04-09 200.320007
2019-04-10 198.679993
2019-04-11 200.850006
2019-04-12 199.199997

Select multiple columns

df[["Open","High"]]
Open High
Date
2019-04-01 191.639999 191.679993
2019-04-02 191.089996 194.460007
2019-04-03 193.250000 196.500000
2019-04-04 194.789993 196.369995
2019-04-05 196.449997 197.100006
2019-04-08 196.419998 200.229996
2019-04-09 200.320007 202.850006
2019-04-10 198.679993 200.740005
2019-04-11 200.850006 201.000000
2019-04-12 199.199997 200.139999

Select multiple rows

df[0:3]
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006

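2.3.2 Selecting data by label

df.loc[row_label, column_label]

df.loc["a":"b"] # select rows a through b (both endpoints included)

df.loc[:,"Open"] # select the Open column

The first argument to df.loc is the row label and the second is the column label (optional; all columns by default). Each argument can be a single label, a list of labels, or a label slice. Two single labels return a single value; a single label combined with a list or slice returns a Series; lists or slices for both arguments return a DataFrame.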

df.loc["2019-04-01","Open"]
191.63999938964844
df.loc["2019-04-01":"2019-04-03"]
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
df.loc[:,"Open"]
Date
2019-04-01    191.639999
2019-04-02    191.089996
2019-04-03    193.250000
2019-04-04    194.789993
2019-04-05    196.449997
2019-04-08    196.419998
2019-04-09    200.320007
2019-04-10    198.679993
2019-04-11    200.850006
2019-04-12    199.199997
Name: Open, dtype: float64
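
The return type of `df.loc` depends on the kind of arguments passed. The following sketch checks each case on a three-row frame whose values are rounded from the table above; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.64, 191.09, 193.25],
     "Close": [191.24, 194.02, 195.35]},
    index=["2019-04-01", "2019-04-02", "2019-04-03"],
)

scalar = frame.loc["2019-04-01", "Open"]       # two single labels: one value
row = frame.loc["2019-04-01"]                  # one row label: a Series
column = frame.loc[:, "Open"]                  # one column label: a Series
block = frame.loc[["2019-04-01"], ["Open"]]    # two lists: a 1x1 DataFrame
span = frame.loc["2019-04-01":"2019-04-02"]    # label slice, end included

print(type(row), type(column), type(block))
print(len(span))
```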

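2.3.3 Selecting data by position

df.iloc[row_position, column_position]

df.iloc[1,1] # select the value in the second row, second column; returns a single value

df.iloc[[0,2],:] # select the first and third rows

df.iloc[0:2,:] # select from the first row up to, but not including, the third row

df.iloc[:,1] # select the second column for all records; returns a Series

df.iloc[1,:] # select the second row; returns a Series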

df
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995
# Select the value in the second row, second column; returns a single value
df.iloc[1,1]
191.0500030517578
# Select the first and third rows
df.iloc[[0,2],:]
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
# Select from the first row up to, but not including, the third row
df.iloc[0:2,:]
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
# Select the second column (position 1) for all records; returns a Series
df.iloc[:,1]
Date
2019-04-01    188.380005
2019-04-02    191.050003
2019-04-03    193.149994
2019-04-04    193.139999
2019-04-05    195.929993
2019-04-08    196.339996
2019-04-09    199.229996
2019-04-10    198.179993
2019-04-11    198.440002
2019-04-12    196.210007
Name: Low, dtype: float64
# Select the second row (position 1); returns a Series
df.iloc[1,:]
High         1.944600e+02
Low          1.910500e+02
Open         1.910900e+02
Close        1.940200e+02
Volume       2.276570e+07
Adj Close    1.940200e+02
Name: 2019-04-02, dtype: float64
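
A common source of off-by-one mistakes is that `iloc` slices exclude the end position, like ordinary Python slices, while `loc` slices include the end label. The sketch below compares the two on a three-row frame with values rounded from the table above; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.64, 191.09, 193.25],
     "Close": [191.24, 194.02, 195.35]},
    index=["2019-04-01", "2019-04-02", "2019-04-03"],
)

by_position = frame.iloc[0:2]                      # positions 0 and 1 only
by_label = frame.loc["2019-04-01":"2019-04-03"]    # all three labels
last_row = frame.iloc[-1]                          # negative positions count from the end

print(len(by_position), len(by_label))
print(last_row.name)
```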

2.3.4 Slicing data with Boolean indexing

df[Boolean condition]

df[df.one >= 2] # a single Boolean condition

df[(df.one >=1 )&(df.one<3)] # several Boolean conditions combined

df
High Low Open Close Volume Adj Close
Date
2019-04-01 191.679993 188.380005 191.639999 191.240005 27862000 191.240005
2019-04-02 194.460007 191.050003 191.089996 194.020004 22765700 194.020004
2019-04-03 196.500000 193.149994 193.250000 195.350006 23271800 195.350006
2019-04-04 196.369995 193.139999 194.789993 195.690002 19114300 195.690002
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995
df[df.Open > 195]
High Low Open Close Volume Adj Close
Date
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-08 200.229996 196.339996 196.419998 200.100006 25881700 200.100006
2019-04-09 202.850006 199.229996 200.320007 199.500000 35768200 199.500000
2019-04-10 200.740005 198.179993 198.679993 200.619995 21695300 200.619995
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995
df[(df.Open > 195)&(df.Close < 199)]
High Low Open Close Volume Adj Close
Date
2019-04-05 197.100006 195.929993 196.449997 197.000000 18526600 197.000000
2019-04-11 201.000000 198.440002 200.850006 198.949997 20900800 198.949997
2019-04-12 200.139999 196.210007 199.199997 198.869995 27744300 198.869995
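
Conditions can also be combined with `|` (or) and negated with `~`. Each condition needs its own parentheses because these operators bind more tightly than comparisons. A column name containing a space, such as Adj Close, cannot be reached with attribute access and needs bracket notation instead. The sketch below uses three rows with values rounded from the table above; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Open": [191.64, 191.09, 193.25],
     "Close": [191.24, 194.02, 195.35],
     "Adj Close": [191.24, 194.02, 195.35]},
    index=["2019-04-01", "2019-04-02", "2019-04-03"],
)

# Rows where either condition holds.
either = frame[(frame["Open"] > 193) | (frame["Close"] < 192)]

# Rows where the condition does not hold.
negated = frame[~(frame["Open"] > 193)]

# Bracket notation works for column names that contain a space.
adj = frame[frame["Adj Close"] > 194]

print(either)
print(negated)
print(adj)
```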
