1.1.2 Investing - Programming Basics - pandas

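Table of contents

  • 1. Creating a DataFrame
    • 1.1. Creating a DataFrame from a dict of lists
    • 1.2. Creating a DataFrame from a dict of dicts, with transpose (T)
    • 1.3. Creating a DataFrame from a list of dicts
    • 1.4. Creating a DataFrame from a list of lists
    • 1.5. Creating an empty DataFrame
    • 1.6. Reading and writing CSV with a DataFrame
      • 1.6.1. Reading a CSV from the network
      • 1.6.2. Reading a local CSV
      • 1.6.3. Writing a DataFrame to CSV
  • 2. Accessing a DataFrame
    • 2.1. Selecting data by column
    • 2.2. Selecting data by row
    • 2.3. Selecting data with row slices
    • 2.4. Selecting data by rows and columns
    • 2.5. Conditional filtering
    • 2.6. Iterating over a DataFrame
  • 3. Modifying a DataFrame
    • 3.1. Renaming columns
    • 3.2. Appending rows to a DataFrame
    • 3.3. Modifying row data in a DataFrame
    • 3.4. Combining DataFrames with pd.concat
    • 3.5. Combining DataFrames with pd.merge
  • 4. Common functions
    • 4.1. Getting index and values
    • 4.2. Reading and writing CSV with a DataFrame
    • 4.3. Handling missing data in CSV files
      • 4.3.1. Reading a CSV and specifying which values count as missing
      • 4.3.2. Reading a CSV and dropping rows with missing data in given columns
      • 4.3.3. Reading a CSV and filling or replacing NaN values
      • 4.3.4. Reading a CSV and removing missing data with dropna()
    • 4.4. Grouping and aggregation with groupby
    • 4.5. Combining: concat and merge
    • 4.6. shift: shifting data
    • 4.7. pct_change: rate of change
    • 4.8. rolling: rolling windows and moving averages (MA)
  • 5. Case studies
    • 5.1. Combining DataFrames
    • 5.2. Combining closing prices of multiple stocks
    • 5.3. Returns
      • 5.3.1. Simple return
      • 5.3.2. Cumulative return with cumprod
    • 5.4. Computing rolling annual returns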

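1. Creating a DataFrame

1.1. Creating a DataFrame from a dict of lists

The outer structure is a dict and the inner values are lists. Each dict key becomes a column name, and each list supplies that column's values.

# A simple pandas example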
import pandas as pd
test_data = {
    "country":["China","China","China"],
    "sites":["baidu","sougou","hao123"],
    "rank":[1,3,2]
}
df1 = pd.DataFrame(test_data)
print(df1)

  country   sites  rank
0   China   baidu     1
1   China  sougou     3
2   China  hao123     2
import pandas as pd
# Create a DataFrame from a dict (including Series values)
d={'one':pd.Series([1.,2.,3.],index=['a','b','c']),
   'two':pd.Series([1.,2.,3.,4.,],index=['a','b','c','d']),
   'three':range(4),
   'four':1.,
   'five':'f'}
df=pd.DataFrame(d)
print (df)

# Use DataFrame.index and DataFrame.columns to inspect the row labels and column labels;
# DataFrame.values returns the elements as an array
print ("DataFrame index:\n",df.index)
print ("DataFrame columns:\n",df.columns)
print ("DataFrame values:\n",df.values)

   one  two  three  four five
a  1.0  1.0      0   1.0    f
b  2.0  2.0      1   1.0    f
c  3.0  3.0      2   1.0    f
d  NaN  4.0      3   1.0    f
DataFrame index:
 Index(['a', 'b', 'c', 'd'], dtype='object')
DataFrame columns:
 Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
DataFrame values:
 [[1.0 1.0 0 1.0 'f']
 [2.0 2.0 1 1.0 'f']
 [3.0 3.0 2 1.0 'f']
 [nan 4.0 3 1.0 'f']]
# A DataFrame can also be built from a dict whose values are arrays, but all arrays must have the same length:
d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df)
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

1.2. Creating a DataFrame from a dict of dicts, with transpose (T)

# Build a DataFrame from dict-style JSON (a dict of dicts)
json1 = {
    "2021-07-01":{"name":"kelvin","age":"31","region":"江苏"},
    "2021-07-02":{"name":"tom","age":"29","region":"上海","unkonwn":"111"},
    "2021-07-03":{"name":"kipper","age":"15","region":"杭州"}
}
df = pd.DataFrame(json1).T  # .T transposes, swapping rows and columns; without it, name, age, etc. would be the row labels
print(df)

              name age region unkonwn
2021-07-01  kelvin  31     江苏     NaN
2021-07-02     tom  29     上海     111
2021-07-03  kipper  15     杭州     NaN
json1 = {
    "2021-07-01":{"close":10},
    "2021-07-02":{"close":11},
    "2021-07-03":{"close":6}
}
df = pd.DataFrame(json1).T
print(df)
index_price=pd.DataFrame({'col1':df.close}).dropna()
print(index_price)

            close
2021-07-01     10
2021-07-02     11
2021-07-03      6
            col1
2021-07-01    10
2021-07-02    11
2021-07-03     6
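
As a side note to the transpose approach above, the same row-per-key layout can also be produced directly. The following is a minimal sketch, assuming the hypothetical sample dict `prices` below; it uses `pd.DataFrame.from_dict` with `orient="index"`, so the outer keys become row labels without calling `.T`.

```python
import pandas as pd

# Hypothetical sample data: outer keys are dates, inner dicts hold the columns.
prices = {
    "2021-07-01": {"close": 10},
    "2021-07-02": {"close": 11},
    "2021-07-03": {"close": 6},
}

# orient="index" turns the outer keys into row labels, so no transpose is needed.
df = pd.DataFrame.from_dict(prices, orient="index")
print(df)
```

Either form gives one row per date; which one reads better is largely a matter of taste.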

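1.3. Creating a DataFrame from a list of dicts

# When the values are not arrays (here, a list of dicts), the equal-length restriction does not apply, and missing values are filled with NaN
d= [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6, 'c': 9}]
df = pd.DataFrame(d)
print(df)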
     a  b    c
0  1.6  2  NaN
1  3.0  6  9.0
# Build a DataFrame from JSON-like data
# df = pd.read_json('sites.json') builds a DataFrame from the file sites.json
json1 = [
    {"name":"kelvin","age":"31","region":"江苏"},
    {"name":"tom","age":"29","region":"上海","unkonwn":"111"},
    {"name":"kipper","age":"15","region":"杭州"}
]
df = pd.DataFrame(json1)
print(df)
df.to_json()

     name age region unkonwn
0  kelvin  31     江苏     NaN
1     tom  29     上海     111
2  kipper  15     杭州     NaN
'{"name":{"0":"kelvin","1":"tom","2":"kipper"},"age":{"0":"31","1":"29","2":"15"},"region":{"0":"\\u6c5f\\u82cf","1":"\\u4e0a\\u6d77","2":"\\u676d\\u5dde"},"unkonwn":{"0":null,"1":"111","2":null}}'

1.4. Creating a DataFrame from a list of lists

# Select a row with df.loc[0]
my_data = [["kelvin","31"],["tom","29"],["kipper","13"]]
my_column = ["name", "age"]
df = pd.DataFrame(data = my_data, columns = my_column)
print(df)
print()
print(df.loc[0]) # the row labeled 0
print()
print(df.loc[0]["name"])

     name age
0  kelvin  31
1     tom  29
2  kipper  13

name    kelvin
age         31
Name: 0, dtype: object

kelvin

1.5. Creating an empty DataFrame

# When processing real data you sometimes need an empty DataFrame; it can be created like this
df = pd.DataFrame()
print (df)
Empty DataFrame
Columns: []
Index: []

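1.6. Reading and writing CSV with a DataFrame

1.6.1. Reading a CSV from the network

Note: the code below imports the ssl module and disables HTTPS certificate verification to avoid a URLError when fetching the file. This weakens security, so use it only with sources you trust.

Several pre-fetched CSV files are provided and can be used directly:

CSI 300 history: SH510300.csv

CSI 300 historical closing prices: SH510300-close.csv

CSI 500 historical closing prices: SH510500-close.csv

import pandas as pd
import ssl  # without the next line, read_csv may raise URLError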
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300.csv")
sh300

	date	uuid	volume	open	high	low	close	chg	percent	turnoverrate	amount
0	2012/5/28	SH510300|2012-05-28	1277518769	2.1572	2.2046	2.1513	2.2020	0.0255	1.17	10.45	NaN
1	2012/5/29	SH510300|2012-05-29	714949008	2.2004	2.2503	2.2004	2.2359	0.0339	1.54	5.85	NaN
2	2012/5/30	SH510300|2012-05-30	265887198	2.2342	2.2384	2.2266	2.2291	-0.0068	-0.30	2.17	NaN
3	2012/5/31	SH510300|2012-05-31	178155984	2.2164	2.2367	2.2097	2.2240	-0.0051	-0.23	1.46	NaN
4	2012/6/1	SH510300|2012-06-01	179350035	2.2232	2.2494	2.2156	2.2240	0.0000	0.00	1.47	NaN
...	...	...	...	...	...	...	...	...	...	...	...
2792	2023/11/20	SH510300|2023-11-20	858430360	3.6370	3.6570	3.6100	3.6450	0.0110	0.30	0.00	3.119865e+09
2793	2023/11/21	SH510300|2023-11-21	931605485	3.6550	3.6860	3.6400	3.6500	0.0050	0.14	0.00	3.414863e+09
2794	2023/11/22	SH510300|2023-11-22	762202706	3.6410	3.6460	3.6100	3.6110	-0.0390	-1.07	0.00	2.765608e+09
2795	2023/11/23	SH510300|2023-11-23	774971808	3.6090	3.6320	3.5950	3.6300	0.0190	0.53	0.00	2.800813e+09
2796	2023/11/24	SH510300|2023-11-24	743453294	3.6260	3.6270	3.5980	3.6050	-0.0250	-0.69	0.00	2.684276e+09
2797 rows × 11 columns

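Building on the example above, you can go a step further after reading the CSV:

  1. Convert the date column to pandas datetime format
  2. Set the "date" column as the index, which makes it easy to select data (for example, all rows from 2023)
import pandas as pd
import ssl  # without the next line, read_csv may raise URLError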
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300.csv", parse_dates=['date'], index_col='date')
# Select the 2023 data
sh300.loc['2023']

date	uuid	volume	open	high	low	close	chg	percent	turnoverrate	amount
2023-01-03	SH510300|2023-01-03	656310782	3.8761	3.8998	3.8299	3.8880	0.0109	0.28	0.0	2.576088e+09
2023-01-04	SH510300|2023-01-04	980799721	3.8870	3.9037	3.8702	3.8929	0.0049	0.13	0.0	3.874438e+09
2023-01-05	SH510300|2023-01-05	774502293	3.9136	3.9726	3.9106	3.9667	0.0738	1.90	0.0	3.108571e+09
2023-01-06	SH510300|2023-01-06	541080825	3.9677	3.9992	3.9638	3.9825	0.0158	0.40	0.0	2.187353e+09
2023-01-09	SH510300|2023-01-09	780959941	4.0022	4.0228	3.9894	4.0071	0.0246	0.62	0.0	3.178181e+09
...	...	...	...	...	...	...	...	...	...	...
2023-11-20	SH510300|2023-11-20	858430360	3.6370	3.6570	3.6100	3.6450	0.0110	0.30	0.0	3.119865e+09
2023-11-21	SH510300|2023-11-21	931605485	3.6550	3.6860	3.6400	3.6500	0.0050	0.14	0.0	3.414863e+09
2023-11-22	SH510300|2023-11-22	762202706	3.6410	3.6460	3.6100	3.6110	-0.0390	-1.07	0.0	2.765608e+09
2023-11-23	SH510300|2023-11-23	774971808	3.6090	3.6320	3.5950	3.6300	0.0190	0.53	0.0	2.800813e+09
2023-11-24	SH510300|2023-11-24	743453294	3.6260	3.6270	3.5980	3.6050	-0.0250	-0.69	0.0	2.684276e+09
217 rows × 10 columns
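
The two steps above (parse the date column, then use it as the index) can be tried without network access. This is a small sketch under stated assumptions: the CSV text is a made-up stand-in for the downloaded file, read through `io.StringIO`, and only the `date` and `close` columns are included.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a downloaded price file.
csv_text = """date,close
2022-12-30,3.8700
2023-01-03,3.8880
2023-01-04,3.8929
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# With a DatetimeIndex, a year string selects every row in that year.
rows_2023 = df.loc["2023"]
print(rows_2023)
```

Because the index is a DatetimeIndex, the string "2023" is interpreted as a date range rather than as a literal label.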

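1.6.2. Reading a local CSV

# Read a CSV with pandas: read_csv, head, tail
df = pd.read_csv("nba.csv")
df.head(10)  # returns the first 10 rows; the result is not used here
print(df)    # prints the whole DataFrame (pandas truncates the middle rows)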

              Name            Team  Number Position   Age Height  Weight  \
0    Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0   
1      Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0   
2     John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0   
3      R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0   
4    Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0   
..             ...             ...     ...      ...   ...    ...     ...   
453   Shelvin Mack       Utah Jazz     8.0       PG  26.0    6-3   203.0   
454      Raul Neto       Utah Jazz    25.0       PG  24.0    6-1   179.0   
455   Tibor Pleiss       Utah Jazz    21.0        C  26.0    7-3   256.0   
456    Jeff Withey       Utah Jazz    24.0        C  26.0    7-0   231.0   
457            NaN             NaN     NaN      NaN   NaN    NaN     NaN   

               College     Salary  
0                Texas  7730337.0  
1            Marquette  6796117.0  
2    Boston University        NaN  
3        Georgia State  1148640.0  
4                  NaN  5000000.0  
..                 ...        ...  
453             Butler  2433333.0  
454                NaN   900000.0  
455                NaN  2900000.0  
456             Kansas   947276.0  
457                NaN        NaN  

[458 rows x 9 columns]

1.6.3. Writing a DataFrame to CSV

# Write a DataFrame to a CSV file
name = ["kelvin", "tom", "kipper"]
age = [31, 29, 15]
region = ["江苏", "上海", "杭州"]
dict1 = {"name":name, "age":age, "region":region}
df = pd.DataFrame(dict1)
print(df)
df.to_csv("test_csv.csv")

     name  age region
0  kelvin   31     江苏
1     tom   29     上海
2  kipper   15     杭州

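2. Accessing a DataFrame

2.1. Selecting data by column

A DataFrame is organized around columns: indexing with df[...] selects columns first, so it helps to think of most operations as starting by taking a column out of the DataFrame.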

import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
print('Select by column name:\n',df['calories'])
print(df['calories']['day1'])  # access one row of a specific column, i.e. the day1 element of the calories column
# Select two columns
print('Select two columns:\n',df[['calories', 'duration']])

      calories  duration
day1       420        50
day2       380        40
day3       390        45
Select by column name:
 day1    420
day2    380
day3    390
Name: calories, dtype: int64
420
Select two columns:
       calories  duration
day1       420        50
day2       380        40
day3       390        45

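2.2. Selecting data by row

df.loc[["day1","day2"]]  # select rows by label

print(df.iloc[0])  # select the first row; iloc indexes by integer position, so it works even when row labels are defined

print(df.loc['day2'])  # select the row labeled day2

# Return multiple rows from a DataFrame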
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
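print('Select by row label:\n', df.loc[["day1","day2"]])
print ('calories value of the first row via iloc:\n', df.iloc[0]['calories'])    # first row, then its calories element

      calories  duration
day1       420        50
day2       380        40
day3       390        45
Select by row label:
       calories  duration
day1       420        50
day2       380        40
calories value of the first row via iloc:
 420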

2.3. Selecting data with row slices

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
print('Row slice:\n', df[0:2])  # rows at positions 0 and 1; position 2 is excluded

      calories  duration
day1       420        50
day2       380        40
day3       390        45
Row slice:
       calories  duration
day1       420        50
day2       380        40

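2.4. Selecting data by rows and columns

# Combine row and column selection (df is the 5x5 DataFrame printed below):
print (df[['b', 'd']].iloc[[1, 3]])  # columns b and d, then rows at positions 1 and 3
print (df.iloc[[1, 3]][['b', 'd']])  # rows at positions 1 and 3, then columns b and d
print (df[['b', 'd']].loc[['beta', 'delta']])  # columns b and d, then rows beta and delta
print (df.loc[['beta', 'delta']][['b', 'd']])  # rows beta and delta, then columns b and d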

df: 
          a    b     c     d     e
alpha  0.0  0.0   0.0   0.0   0.0
beta   1.0  2.0   3.0   4.0   5.0
gamma  2.0  4.0   6.0   8.0  10.0
delta  3.0  6.0   9.0  12.0  15.0
eta    4.0  8.0  12.0  16.0  20.0

         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
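# To access a single element at a known location rather than whole rows or columns,
# DataFrame.at and DataFrame.iat
# are the fastest options; at uses row/column labels and iat uses integer positions
print(df)
print (df.iat[2, 3])  # row position 2, column position 3, i.e. the 3rd row and 4th column
print (df.at['gamma', 'd'])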
         a    b     c     d     e
alpha  0.0  0.0   0.0   0.0   0.0
beta   1.0  2.0   3.0   4.0   5.0
gamma  2.0  4.0   6.0   8.0  10.0
delta  3.0  6.0   9.0  12.0  15.0
eta    4.0  8.0  12.0  16.0  20.0
8.0
8.0
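
The chained selections shown above can also be written as a single `.loc` or `.iloc` call that takes rows and columns together. The sketch below is an illustration only: it rebuilds a frame with the same shape and values as the printed `df` (the construction via `np.outer` is an assumption about how such a frame could be made, not the author's code).

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one printed above: value = row position * column number.
df = pd.DataFrame(
    np.outer(np.arange(5), np.arange(1, 6)).astype(float),
    index=["alpha", "beta", "gamma", "delta", "eta"],
    columns=list("abcde"),
)

by_label = df.loc[["beta", "delta"], ["b", "d"]]   # labels for rows and columns
by_position = df.iloc[[1, 3], [1, 3]]              # integer positions for both

print(by_label)
print(by_label.equals(by_position))
```

Selecting rows and columns in one call avoids creating an intermediate object, which also matters when assigning values to the selection.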

2.5. Conditional filtering

import pandas as pd

sh300 = pd.read_csv('SH510300-收盘价.csv')
print('sh300 raw data\n', sh300.head())
sh300.columns = ['date', 'sh300']
print('sh300 after renaming columns\n', sh300.head())
sh500 = pd.read_csv('SH510500-收盘价.csv')
print('sh500 raw data\n', sh500.head())
sh500.columns = ['date', 'sh500']
print('sh500 after renaming columns\n', sh500.head())
merged_df = sh300.merge(sh500, on = 'date', how="outer")
print('After merge:\n', merged_df)
# Conditional filter. Note: date is still a string column here, so this is a text comparison,
# not a date comparison (for example '2012/10/8' compares as smaller than '2012/5/30').
print(merged_df[merged_df.date>'2012/5/30'])

sh300 raw data
         date   close
0  2012/5/28  2.2020
1  2012/5/29  2.2359
2  2012/5/30  2.2291
3  2012/5/31  2.2240
4   2012/6/1  2.2240
sh300 after renaming columns
         date   sh300
0  2012/5/28  2.2020
1  2012/5/29  2.2359
2  2012/5/30  2.2291
3  2012/5/31  2.2240
4   2012/6/1  2.2240
sh500 raw data
         date   close
0  2013/3/15  3.0215
1  2013/3/18  2.9717
2  2013/3/19  2.9904
3  2013/3/20  3.0683
4  2013/3/21  3.0994
sh500 after renaming columns
         date   sh500
0  2013/3/15  3.0215
1  2013/3/18  2.9717
2  2013/3/19  2.9904
3  2013/3/20  3.0683
4  2013/3/21  3.0994
After merge:
             date   sh300  sh500
0      2012/5/28  2.2020    NaN
1      2012/5/29  2.2359    NaN
2      2012/5/30  2.2291    NaN
3      2012/5/31  2.2240    NaN
4       2012/6/1  2.2240    NaN
...          ...     ...    ...
2792  2023/11/20  3.6450  5.758
2793  2023/11/21  3.6500  5.741
2794  2023/11/22  3.6110  5.668
2795  2023/11/23  3.6300  5.719
2796  2023/11/24  3.6050  5.673

[2797 rows x 3 columns]
            date   sh300  sh500
3      2012/5/31  2.2240    NaN
4       2012/6/1  2.2240    NaN
5       2012/6/4  2.1631    NaN
6       2012/6/5  2.1657    NaN
7       2012/6/6  2.1640    NaN
...          ...     ...    ...
2792  2023/11/20  3.6450  5.758
2793  2023/11/21  3.6500  5.741
2794  2023/11/22  3.6110  5.668
2795  2023/11/23  3.6300  5.719
2796  2023/11/24  3.6050  5.673

[2733 rows x 3 columns]
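
The filter above compares the date column as text. A minimal sketch of the difference, using three made-up rows in the same "YYYY/M/D" format (the values are hypothetical), shows why converting with `pd.to_datetime` before filtering is usually what is wanted:

```python
import pandas as pd

# Hypothetical rows in the same "YYYY/M/D" text format as the CSV above.
df = pd.DataFrame({
    "date": ["2012/5/30", "2012/5/31", "2012/10/8"],
    "close": [2.2291, 2.2240, 2.1000],
})

# Text comparison goes character by character, so "2012/10/8" sorts before "2012/5/30".
as_text = df[df["date"] > "2012/5/30"]

# Converting to datetime first compares real calendar dates.
df["date"] = pd.to_datetime(df["date"], format="%Y/%m/%d")
as_dates = df[df["date"] > "2012-05-30"]

print(len(as_text), len(as_dates))
```

With the text comparison the October row is silently dropped; after conversion it is kept.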

2.6. Iterating over a DataFrame

# Iterate over the rows of a DataFrame
person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 200, 12345]    
}
df = pd.DataFrame(person)
print(df)
print(df.index)
for x in df.index:
    print(x)
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 1
        # df.drop(x, inplace = True) would delete the row instead
print(df)

     name    age
0  Google     50
1  Runoob    200
2  Taobao  12345
RangeIndex(start=0, stop=3, step=1)
0
1
2
     name  age
0  Google   50
1  Runoob    1
2  Taobao    1
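
The loop above visits each row label in turn. For a simple rule like this one, a boolean mask can express the same update without an explicit loop. This is a sketch of that alternative on the same sample data, not a replacement for the iteration technique being demonstrated:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Google", "Runoob", "Taobao"], "age": [50, 200, 12345]})

# The boolean mask selects the rows; .loc assigns to the "age" column for those rows only.
df.loc[df["age"] > 120, "age"] = 1
print(df)
```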

3. Modifying a DataFrame

3.1. Renaming columns

  • Method 1: df1.columns = ['date', 'sh300']. This requires listing every column name.
import pandas as pd
json1 = [{"date":"2021-07-01", "close": 3.1},
    {"date":"2021-07-02", "close": 3.3},
    {"date":"2021-07-03", "close": 2.8}
]
df1 = pd.DataFrame(json1)
print(df1)
# Rename the columns of df1
# df1.columns[1] = 'sh300' # a single element cannot be changed: TypeError: Index does not support mutable operations
df1.columns = ['date', 'sh300']
print('df1 after renaming columns:\n', df1)

         date  close
0  2021-07-01    3.1
1  2021-07-02    3.3
2  2021-07-03    2.8
df1 after renaming columns:
          date  sh300
0  2021-07-01    3.1
1  2021-07-02    3.3
2  2021-07-03    2.8
  • Method 2: sh300 = sh300.rename(columns={'close': 'sh300_close'}). This renames only the close column.
import pandas as pd
import ssl  # without the next line, read_csv may raise URLError
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)

sh300 = sh300.rename(columns={'close': 'sh300_close'})
print('After renaming the column:\n', sh300)

             close
date              
2012-05-28  2.2020
2012-05-29  2.2359
2012-05-30  2.2291
2012-05-31  2.2240
2012-06-01  2.2240
...            ...
2023-11-20  3.6450
2023-11-21  3.6500
2023-11-22  3.6110
2023-11-23  3.6300
2023-11-24  3.6050

[2797 rows x 1 columns]
After renaming the column:
             sh300_close
date                   
2012-05-28       2.2020
2012-05-29       2.2359
2012-05-30       2.2291
2012-05-31       2.2240
2012-06-01       2.2240
...                 ...
2023-11-20       3.6450
2023-11-21       3.6500
2023-11-22       3.6110
2023-11-23       3.6300
2023-11-24       3.6050

[2797 rows x 1 columns]

3.2. Appending rows to a DataFrame

import pandas as pd
json1 = [{"date":"2021-07-01", "close": 3.1},
    {"date":"2021-07-02", "close": 3.3},
    {"date":"2021-07-03", "close": 2.8}
]
df1 = pd.DataFrame(json1)
print(df1)

# Build the data for the new row
new_line = {"date":"2021-07-04", "close": 7.4}
# DataFrame.append was removed in pandas 2.0; wrap the dict in a DataFrame and use pd.concat instead
df1 = pd.concat([df1, pd.DataFrame([new_line])], ignore_index=True)
print(df1)

         date  close
0  2021-07-01    3.1
1  2021-07-02    3.3
2  2021-07-03    2.8
         date  close
0  2021-07-01    3.1
1  2021-07-02    3.3
2  2021-07-03    2.8
3  2021-07-04    7.4

3.3. Modifying row data in a DataFrame

# Correct a wrong value in a DataFrame
person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 40, 12345]    # the age 12345 is an error
}
df = pd.DataFrame(person)
print(df)
print()
df.loc[2,"age"] = 30 # overwrite the value at row 2, column age
print(df)

     name    age
0  Google     50
1  Runoob     40
2  Taobao  12345

     name  age
0  Google   50
1  Runoob   40
2  Taobao   30

3.4. Combining DataFrames with pd.concat

import pandas as pd
json1 = {
    "2021-07-01":{"close": 3.1},
    "2021-07-02":{"close": 3.3},
    "2021-07-03":{"close": 2.8}
}
df1 = pd.DataFrame(json1).T
print(df1)
# Rename the column of df1
# df1.columns[0] = 'sh300' # a single element cannot be changed: TypeError: Index does not support mutable operations
df1.columns = ['sh300']
print('df1 after renaming columns:\n', df1)
json2 = {
    "2021-07-01":{"close": 11},
    "2021-07-02":{"close": 8},
    "2021-07-04":{"close": 20}
}
df2 = pd.DataFrame(json2).T
df2.columns = ['sh500']
print(df2)
# Combine the two DataFrames side by side, aligned on the index
df3 = pd.concat([df1, df2], axis=1)
print('df3 after combining:\n', df3)

            close
2021-07-01    3.1
2021-07-02    3.3
2021-07-03    2.8
df1 after renaming columns:
             sh300
2021-07-01    3.1
2021-07-02    3.3
2021-07-03    2.8
            sh500
2021-07-01     11
2021-07-02      8
2021-07-04     20
df3 after combining:
             sh300  sh500
2021-07-01    3.1   11.0
2021-07-02    3.3    8.0
2021-07-03    2.8    NaN
2021-07-04    NaN   20.0

3.5. Combining DataFrames with pd.merge

This article covers the details of merge fairly thoroughly: https://zhuanlan.zhihu.com/p/634229183

The example below uses a common stock closing-price layout to show how to merge the closing prices of two stocks:

import pandas as pd
json1 = [{"date":"2021-07-01", "close": 3.1},
    {"date":"2021-07-02", "close": 3.3},
    {"date":"2021-07-03", "close": 2.8}
]
df1 = pd.DataFrame(json1)
print(df1)
# Rename the columns of df1
# df1.columns[1] = 'sh300' # a single element cannot be changed: TypeError: Index does not support mutable operations
df1.columns = ['date', 'sh300']
print('df1 after renaming columns:\n', df1)
json2 = [{"date":"2021-07-01", "close": 11},
    {"date":"2021-07-02", "close": 8},
    {"date":"2021-07-04", "close": 20}
]
df2 = pd.DataFrame(json2)
df2.columns = ['date', 'sh500']
print('df2 after renaming columns:\n', df2)
merged_df = df1.merge(df2, on = 'date', how="outer")
print('After merge:\n', merged_df)

         date  close
0  2021-07-01    3.1
1  2021-07-02    3.3
2  2021-07-03    2.8
df1 after renaming columns:
          date  sh300
0  2021-07-01    3.1
1  2021-07-02    3.3
2  2021-07-03    2.8
df2 after renaming columns:
          date  sh500
0  2021-07-01     11
1  2021-07-02      8
2  2021-07-04     20
After merge:
          date  sh300  sh500
0  2021-07-01    3.1   11.0
1  2021-07-02    3.3    8.0
2  2021-07-03    2.8    NaN
3  2021-07-04    NaN   20.0
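
The example above uses how="outer", which keeps every date from both sides. As a small sketch of how the `how` argument changes the result, the code below reuses the same sample values (the variable names `left` and `right` are illustrative) and adds `indicator=True`, which appends a `_merge` column saying where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"date": ["2021-07-01", "2021-07-02", "2021-07-03"], "sh300": [3.1, 3.3, 2.8]})
right = pd.DataFrame({"date": ["2021-07-01", "2021-07-02", "2021-07-04"], "sh500": [11, 8, 20]})

inner = left.merge(right, on="date", how="inner")                  # only dates present in both
outer = left.merge(right, on="date", how="outer", indicator=True)  # all dates, plus a _merge column

print(inner)
print(outer)
```

An inner merge is the safer choice when later calculations need both prices on every row; an outer merge keeps gaps visible as NaN.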

4. Common functions

4.1. Getting index and values

import pandas as pd
import ssl  # without the next line, read_csv may raise URLError
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)

sh300.index[0]         # Timestamp('2012-05-28 00:00:00')
sh300.index            # DatetimeIndex(['2012-05-28', '2012-05-29', '2012-05-30', '2012-05-31',...dtype='datetime64[ns]', name='date', length=2797, freq=None)
sh300.index.to_list()  # [Timestamp('2012-05-28 00:00:00'),Timestamp('2012-05-29 00:00:00'),....]
sh300['close']['2023-11-24']  # 3.605
sh300['close'].to_list()      # [2.202, 2.2359,...3.605]
sh300['close'].values         # array([2.202 , 2.2359, 2.2291, ..., 3.611 , 3.63  , 3.605 ])
sh300['close'].values[-1]     # 3.605


             close
date              
2012-05-28  2.2020
2012-05-29  2.2359
2012-05-30  2.2291
2012-05-31  2.2240
2012-06-01  2.2240
...            ...
2023-11-20  3.6450
2023-11-21  3.6500
2023-11-22  3.6110
2023-11-23  3.6300
2023-11-24  3.6050

[2797 rows x 1 columns]
3.605

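4.2. Reading and writing CSV with a DataFrame

See "Creating a DataFrame" above.

4.3. Handling missing data in CSV files

4.3.1. Reading a CSV and specifying which values count as missing

# When reading a CSV you can specify which values should be treated as missing.
# By default "na" and "--" are not treated as missing; once listed in na_values they are read as NaN. NaN, empty cells, NA and n/a are still treated as missing.
missing_value = ["na","--"]
df = pd.read_csv("property-data.csv", na_values = missing_value)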
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED  NUM_BEDROOMS NUM_BATH   SQ_FT
0  100001000.0   104.0      PUTNAM            Y           3.0        1  1000.0
1  100002000.0   197.0   LEXINGTON            N           3.0      1.5     NaN
2  100003000.0     NaN   LEXINGTON            N           NaN        1   850.0
3  100004000.0   201.0    BERKELEY           12           1.0      NaN   700.0
4          NaN   203.0    BERKELEY            Y           3.0        2  1600.0
5  100006000.0   207.0    BERKELEY            Y           NaN        1   800.0
6  100007000.0     NaN  WASHINGTON          NaN           2.0   HURLEY   950.0
7  100008000.0   213.0     TREMONT            Y           1.0        1     NaN
8  100009000.0   215.0     TREMONT            Y           NaN        2  1800.0

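4.3.2. Reading a CSV and dropping rows with missing data in given columns

# Drop the whole row if the given column contains missing data
# DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
df = pd.read_csv("property-data.csv")
print(df)
print()
df.dropna(subset=["ST_NUM"])  # returns a new DataFrame; the result is not assigned here, so df below is unchanged
print(df)                     # to keep the result, write df = df.dropna(subset=["ST_NUM"])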

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800
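
Both printouts above are identical because `dropna` returns a new DataFrame rather than modifying `df`. A minimal sketch on a made-up miniature of the property data (the three rows below are hypothetical) shows the effect once the result is kept:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the property data.
df = pd.DataFrame({
    "ST_NUM": [104.0, np.nan, 201.0],
    "ST_NAME": ["PUTNAM", "LEXINGTON", None],
})

# subset limits the check to ST_NUM; the missing ST_NAME in the last row is ignored.
cleaned = df.dropna(subset=["ST_NUM"])
print(len(df), len(cleaned))
```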

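4.3.3. Reading a CSV and filling or replacing NaN values

# Fill NaN (missing data) with a replacement value
df = pd.read_csv("property-data.csv")
print(df)
df["ST_NUM"] = df["ST_NUM"].fillna(10000)  # assign back instead of calling fillna(..., inplace=True) on a column selection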
print()
print(df)

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

           PID   ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0    104.0      PUTNAM            Y            3        1  1000
1  100002000.0    197.0   LEXINGTON            N            3      1.5    --
2  100003000.0  10000.0   LEXINGTON            N          NaN        1   850
3  100004000.0    201.0    BERKELEY           12            1      NaN   700
4          NaN    203.0    BERKELEY            Y            3        2  1600
5  100006000.0    207.0    BERKELEY            Y          NaN        1   800
6  100007000.0  10000.0  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0    213.0     TREMONT            Y            1        1   NaN
8  100009000.0    215.0     TREMONT            Y           na        2  1800

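4.3.4. Reading a CSV and removing missing data with dropna()

Some common operations (df3 is assumed to be an existing numeric DataFrame with a column 'C', and numpy is imported as np):

df3.iat[3,3]=np.nan # set the value at row position 3, column position 3 to missing (was 0.129151)
df3.iat[1,2]=np.nan # set the value at row position 1, column position 2 to missing (was 1.127064)

# Drop rows that contain any missing value
# (how='all' would drop only rows in which every value is NaN):
df3.dropna(how='any')

# Dropping columns works the same way with axis=1
df3.dropna(how='any',axis=1)

# thresh parameter: with thresh=4, a row is kept only if it has at least 4 non-NaN values
df3.iloc[2,2]=np.nan
df3.dropna(thresh=4)

# DataFrame and Series methods that change data return a new object by default
# and leave the original unchanged; to keep the change, assign the result back.
# Fill missing values with the column mean
df3['C'] = df3['C'].fillna(df3['C'].mean())
# Remove missing data from a DataFrame
# As the output shows, these are not treated as null: na, --
# These are treated as null: NaN, empty cells, NA, n/a
# Empty cells and NA are converted to NaN when read into a DataFrame
df = pd.read_csv("property-data.csv")
print(df)
print(df.isnull()) # not null: na, --; null: NaN, empty cells, NA, n/a
print()
print(df.dropna()) # rows with lowercase na or -- are not dropped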

           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800
     PID  ST_NUM  ST_NAME  OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT
0  False   False    False         False         False     False  False
1  False   False    False         False         False     False  False
2  False    True    False         False          True     False  False
3  False   False    False         False         False      True  False
4   True   False    False         False         False     False  False
5  False   False    False         False          True     False  False
6  False    True    False          True         False     False  False
7  False   False    False         False         False     False   True
8  False   False    False         False         False     False  False

           PID  ST_NUM    ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0  100001000.0   104.0     PUTNAM            Y            3        1  1000
1  100002000.0   197.0  LEXINGTON            N            3      1.5    --
8  100009000.0   215.0    TREMONT            Y           na        2  1800
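
The `how`, `axis` and `thresh` options listed at the start of this section can be tried on a small frame. The sketch below is an illustration under an assumption: `df3` is a made-up 4x4 numeric frame, since the original `df3` is not shown in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x4 frame standing in for df3.
df3 = pd.DataFrame(np.arange(16, dtype=float).reshape(4, 4), columns=list("ABCD"))
df3.iat[3, 3] = np.nan   # row position 3, column D
df3.iat[1, 2] = np.nan   # row position 1, column C

rows_any = df3.dropna(how="any")           # drop rows containing any NaN
cols_any = df3.dropna(how="any", axis=1)   # drop columns containing any NaN
rows_thresh = df3.dropna(thresh=4)         # keep rows with at least 4 non-NaN values

# Fill with the column mean by assigning the result back.
df3["C"] = df3["C"].fillna(df3["C"].mean())
print(df3)
```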

4.4. Grouping and aggregation with groupby

import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['true', 'false', 'true', 'false',
                           'true', 'false', 'true', 'false'],
                    'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C' : np.random.randn(8),
                    'D' : np.random.randn(8)})
print(df)

print(df.groupby(['A'])[['C', 'D']].sum())  # group by column A and sum the numeric columns C and D
print(df.groupby(['A','B']).sum())  # group by columns A and B and sum

       A      B         C         D
0   true    one  0.131962  0.000795
1  false    one -0.282576  0.440043
2   true    two -1.467742 -1.328217
3  false  three  1.228367  0.637844
4   true    two  0.119230  0.894900
5  false    two -0.067859  0.507391
6   true    one  0.870252  1.892529
7  false  three  0.671450  0.736440
              C         D
A                        
false  1.549382  2.321718
true  -0.346298  1.460008
                    C         D
A     B                        
false one   -0.282576  0.440043
      three  1.899816  1.374284
      two   -0.067859  0.507391
true  one    1.002214  1.893324
      two   -1.348512 -0.433317
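
Besides `sum()`, a grouped frame can compute different statistics per column in one call. The following sketch uses fixed, made-up numbers instead of random ones so the result is reproducible, and relies on pandas named aggregation (`agg` with `name=(column, function)` pairs); the output column names `c_sum` and `d_mean` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["true", "false", "true", "false"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: one output column per (input column, function) pair.
summary = df.groupby("A").agg(c_sum=("C", "sum"), d_mean=("D", "mean"))
print(summary)
```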

4.5. Combining: concat and merge

See "Modifying a DataFrame" above.

4.6. shift: shifting data

import pandas as pd
import ssl  # without the next line, read_csv may raise URLError
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)

print('Shifted down by 1:', sh300.shift(1))  # shift(1) moves the data down by one row; shift(-1) moves it up by one.
# Commonly used to compute the rate of change. Day 2's change = (day 2 - day 1) / day 1
ret_daily = (sh300 - sh300.shift(1))/sh300.shift(1)
print('Daily return computed with shift:', ret_daily)

# pct_change computes the same quantity directly, so the two results match.
print('pct_change result:', sh300.pct_change())


             close
date              
2012-05-28  2.2020
2012-05-29  2.2359
2012-05-30  2.2291
2012-05-31  2.2240
2012-06-01  2.2240
...            ...
2023-11-20  3.6450
2023-11-21  3.6500
2023-11-22  3.6110
2023-11-23  3.6300
2023-11-24  3.6050

[2797 rows x 1 columns]
Shifted down by 1:              close
date              
2012-05-28     NaN
2012-05-29  2.2020
2012-05-30  2.2359
2012-05-31  2.2291
2012-06-01  2.2240
...            ...
2023-11-20  3.6340
2023-11-21  3.6450
2023-11-22  3.6500
2023-11-23  3.6110
2023-11-24  3.6300

[2797 rows x 1 columns]
Daily return computed with shift:                close
date                
2012-05-28       NaN
2012-05-29  0.015395
2012-05-30 -0.003041
2012-05-31 -0.002288
2012-06-01  0.000000
...              ...
2023-11-20  0.003027
2023-11-21  0.001372
2023-11-22 -0.010685
2023-11-23  0.005262
2023-11-24 -0.006887

[2797 rows x 1 columns]
pct_change result:                close
date                
2012-05-28       NaN
2012-05-29  0.015395
2012-05-30 -0.003041
2012-05-31 -0.002288
2012-06-01  0.000000
...              ...
2023-11-20  0.003027
2023-11-21  0.001372
2023-11-22 -0.010685
2023-11-23  0.005262
2023-11-24 -0.006887

[2797 rows x 1 columns]
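
The relationship between `shift` and `pct_change` can be checked on a few values. This is a small sketch using the first four closing prices from the data above; it assumes the usual definition of the daily return, (today - yesterday) / yesterday, where the denominator is the previous day's price.

```python
import pandas as pd

close = pd.Series([2.2020, 2.2359, 2.2291, 2.2240], name="close")

# Manual daily return: the denominator is the previous day's close.
manual = (close - close.shift(1)) / close.shift(1)

# Built-in equivalent.
builtin = close.pct_change()

print(manual.round(6).tolist())
print(builtin.round(6).tolist())
```

Dividing by the current day's price instead of the previous day's gives slightly different numbers, which is a formula difference rather than a precision issue.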

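4.7. pct_change: rate of change

See the previous example, "shift: shifting data".

4.8. rolling: rolling windows and moving averages (MA)

import pandas as pd
import ssl  # without the next line, read_csv may raise URLError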
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)

# Moving averages:
ma_day = [5,20,52,252]
for ma in ma_day:
    column_name = "MA%s" %(str(ma))
    sh300[column_name] = sh300["close"].rolling(ma).mean()
print(sh300.head(10))


             close
date              
2012-05-28  2.2020
2012-05-29  2.2359
2012-05-30  2.2291
2012-05-31  2.2240
2012-06-01  2.2240
...            ...
2023-11-20  3.6450
2023-11-21  3.6500
2023-11-22  3.6110
2023-11-23  3.6300
2023-11-24  3.6050

[2797 rows x 1 columns]
             close      MA5  MA20  MA52  MA252
date                                          
2012-05-28  2.2020      NaN   NaN   NaN    NaN
2012-05-29  2.2359      NaN   NaN   NaN    NaN
2012-05-30  2.2291      NaN   NaN   NaN    NaN
2012-05-31  2.2240      NaN   NaN   NaN    NaN
2012-06-01  2.2240  2.22300   NaN   NaN    NaN
2012-06-04  2.1631  2.21522   NaN   NaN    NaN
2012-06-05  2.1657  2.20118   NaN   NaN    NaN
2012-06-06  2.1640  2.18816   NaN   NaN    NaN
2012-06-07  2.1505  2.17346   NaN   NaN    NaN
2012-06-08  2.1429  2.15724   NaN   NaN    NaN
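
The leading NaN values in the output above appear because a window of N rows needs N values before it can produce a mean. A brief sketch with a 3-row window on the first five closing prices shows this, along with the `min_periods` argument of `rolling`, which relaxes that requirement:

```python
import pandas as pd

close = pd.Series([2.2020, 2.2359, 2.2291, 2.2240, 2.2240])

ma3 = close.rolling(3).mean()                          # NaN until 3 values are available
ma3_partial = close.rolling(3, min_periods=1).mean()   # averages whatever is available so far

print(ma3.tolist())
print(ma3_partial.tolist())
```

Whether partial windows are acceptable depends on the use case; for a 252-day average, the early partial values are averages over far fewer days.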

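5. Case studies

5.1. Combining DataFrames

See concat and merge under "Modifying a DataFrame".

5.2. Combining closing prices of multiple stocks

import pandas as pd
import ssl  # without the next line, read_csv may raise URLError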
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/510300.csv", parse_dates=['date'], index_col='date')
sh500 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/510500.csv", parse_dates=['date'], index_col='date')
yiyao512010 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/512010.csv", parse_dates=['date'], index_col='date')

# 只取close列
sh300 = sh300[['close']]  # 这里要有双方括号,否则取出来的是Series,没有列名字。
sh500 = sh500[['close']]
yiyao512010 = yiyao512010[['close']]

# 重命名列
sh300 = sh300.rename(columns={'close': '510300'})
sh500 = sh500.rename(columns={'close': '510500'})
yiyao512010 = yiyao512010.rename(columns={'close': '512010'})

# 拼接数据
merged_df = pd.merge(sh300,sh500, on = 'date', how="outer")
merged_df = pd.merge(merged_df, yiyao512010, on = 'date', how="outer")
print(merged_df)

            510300  510500  512010
date                              
2012-05-28   2.004     NaN     NaN
2012-05-29   2.044     NaN     NaN
2012-05-30   2.036     NaN     NaN
2012-05-31   2.030     NaN     NaN
2012-06-01   2.030     NaN     NaN
...            ...     ...     ...
2023-12-20   3.369   5.412   0.401
2023-12-21   3.400   5.426   0.405
2023-12-22   3.406   5.410   0.403
2023-12-25   3.415   5.403   0.404
2023-12-26   3.392   5.349   0.400

[2819 rows x 3 columns]
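When every frame is indexed by date, pd.concat with axis=1 is another way to get the same kind of table in one call, since it aligns on the index and uses an outer join by default. A minimal sketch with two small hand-made Series; the 510500 values here are placeholders for illustration, not real quotes:

```python
import pandas as pd

# Small hand-made series; the 510500 values are placeholders, not real quotes
a = pd.Series(
    [2.004, 2.044, 2.036],
    index=pd.to_datetime(["2012-05-28", "2012-05-29", "2012-05-30"]),
    name="510300",
)
b = pd.Series(
    [5.10, 5.20],
    index=pd.to_datetime(["2012-05-29", "2012-05-30"]),
    name="510500",
)

# axis=1 places the series side by side; missing dates become NaN (outer join)
merged = pd.concat([a, b], axis=1)
merged.index.name = "date"
print(merged)
```

With many securities this avoids chaining one pd.merge call per frame.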

5.3. Returns

5.3.1. Simple return

Simple return = (current value - previous value) / previous value × 100%

import pandas as pd
import ssl

# Workaround for URLError on the HTTPS download: skip certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
sh300 = sh300.loc['2023']   # use only the 2023 data
print(sh300)

# Simple return: (current value - previous value) / previous value * 100%
last = sh300['close'].iloc[-1]    # .iloc is positional; plain [-1] on a date-indexed Series is deprecated
first = sh300['close'].iloc[0]
simple_return = round((last - first) / first * 100, 2)
print(simple_return)

             close
date              
2023-01-03  3.8880
2023-01-04  3.8929
2023-01-05  3.9667
2023-01-06  3.9825
2023-01-09  4.0071
...            ...
2023-11-20  3.6450
2023-11-21  3.6500
2023-11-22  3.6110
2023-11-23  3.6300
2023-11-24  3.6050

[217 rows x 1 columns]
-7.28
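The same arithmetic can be checked by hand with the two prices printed above (3.8880 on the first trading day and 3.6050 on the last). The helper name simple_return_pct is made up for this sketch:

```python
def simple_return_pct(first: float, last: float) -> float:
    """Simple return in percent: (last - first) / first * 100."""
    return (last - first) / first * 100


# First and last 2023 closes taken from the output above
print(round(simple_return_pct(3.8880, 3.6050), 2))
```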

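5.3.2. Cumulative return with cumprod

The core of the calculation is two lines: compute the daily return with pct_change, then take the running product of (1 + daily return) and subtract 1:

sh300['pct_change'] = sh300['close'].pct_change()
sh300['cum_profit'] = pd.DataFrame(1+sh300['pct_change']).cumprod()-1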

import pandas as pd
import ssl

# Workaround for URLError on the HTTPS download: skip certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/510300.csv", parse_dates=['date'], index_col='date')
sh300 = sh300[['close']]   # only the close column is needed
print(sh300)

# Cumulative return of the CSI 300 ETF from 2012-05-28 to 2023-12-26
sh300['pct_change'] = sh300['close'].pct_change()
sh300['cum_profit'] = pd.DataFrame(1+sh300['pct_change']).cumprod()-1
print('Cumulative since 2012:', sh300) # final value is 0.692615, i.e. about a 69% return over roughly 11 years

# Cumulative return of the CSI 300 ETF from 2023-01-03 to 2023-12-26
sh300 = sh300.loc['2023'].copy()   # copy() so that assigning new columns does not trigger SettingWithCopyWarning
sh300['pct_change'] = sh300['close'].pct_change()
sh300['cum_profit'] = pd.DataFrame(1+sh300['pct_change']).cumprod()-1
print('Cumulative in 2023:', sh300) # final value is -0.126898; 2023 was a rather poor year

            close
date             
2012-05-28  2.004
2012-05-29  2.044
2012-05-30  2.036
2012-05-31  2.030
2012-06-01  2.030
...           ...
2023-12-20  3.369
2023-12-21  3.400
2023-12-22  3.406
2023-12-25  3.415
2023-12-26  3.392

[2819 rows x 1 columns]
Cumulative since 2012:             close  pct_change  cum_profit
date                                     
2012-05-28  2.004         NaN         NaN
2012-05-29  2.044    0.019960    0.019960
2012-05-30  2.036   -0.003914    0.015968
2012-05-31  2.030   -0.002947    0.012974
2012-06-01  2.030    0.000000    0.012974
...           ...         ...         ...
2023-12-20  3.369   -0.009118    0.681138
2023-12-21  3.400    0.009202    0.696607
2023-12-22  3.406    0.001765    0.699601
2023-12-25  3.415    0.002642    0.704092
2023-12-26  3.392   -0.006735    0.692615

[2819 rows x 3 columns]
Cumulative in 2023:             close  pct_change  cum_profit
date                                     
2023-01-03  3.885         NaN         NaN
2023-01-04  3.890    0.001287    0.001287
2023-01-05  3.965    0.019280    0.020592
2023-01-06  3.981    0.004035    0.024710
2023-01-09  4.006    0.006280    0.031145
...           ...         ...         ...
2023-12-20  3.369   -0.009118   -0.132819
2023-12-21  3.400    0.009202   -0.124839
2023-12-22  3.406    0.001765   -0.123295
2023-12-25  3.415    0.002642   -0.120978
2023-12-26  3.392   -0.006735   -0.126898

[239 rows x 3 columns]
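The running product works because each factor (1 + daily return) equals today's price divided by yesterday's, so the product telescopes to the current price divided by the first price. A small check of this, using the first five closes printed above:

```python
import pandas as pd

# First five closes from the output above
close = pd.Series([2.004, 2.044, 2.036, 2.030, 2.030], name="close")

daily = close.pct_change()
cum_profit = (1 + daily).cumprod() - 1   # the first row stays NaN
direct = close / close.iloc[0] - 1       # price relative to the first close

print(pd.DataFrame({"cum_profit": cum_profit, "direct": direct}))
```

Apart from the first row (NaN in one column, 0 in the other), the two columns agree up to floating-point rounding.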

5.4. Computing the cumulative return for each year

Group the daily returns by year and apply cumprod within each group.

import pandas as pd
import ssl

# Workaround for URLError on the HTTPS download: skip certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

sh300 = pd.read_csv("https://gitee.com/kelvin11/public-resources/raw/master/SH510300-close.csv", parse_dates=['date'], index_col='date')
print(sh300)

# Reduce the date index to yearly periods ('Y'; the older alias 'A' is deprecated since pandas 2.2)
y_sh300 = sh300.pct_change().to_period('Y').dropna()
print('Dates reduced to years:', y_sh300)
# Group by year and compound the daily returns within each year.
# Each year's first daily return is measured against the previous year's last close,
# so the 2023 figure differs slightly from the last/first calculation in 5.3.1.
y_ret = (y_sh300.groupby(y_sh300.index).apply(lambda x: ((1+x).cumprod()-1).iloc[-1])).round(4)
print('Cumulative return by year:', y_ret)

             close
date              
2012-05-28  2.2020
2012-05-29  2.2359
2012-05-30  2.2291
2012-05-31  2.2240
2012-06-01  2.2240
...            ...
2023-11-20  3.6450
2023-11-21  3.6500
2023-11-22  3.6110
2023-11-23  3.6300
2023-11-24  3.6050

[2797 rows x 1 columns]
Dates reduced to years:          close
date          
2012  0.015395
2012 -0.003041
2012 -0.002288
2012  0.000000
2012 -0.027383
...        ...
2023  0.003027
2023  0.001372
2023 -0.010685
2023  0.005262
2023 -0.006887

[2796 rows x 1 columns]
Cumulative return by year:        close
date        
2012 -0.0172
2013 -0.0586
2014  0.5376
2015  0.0684
2016 -0.0971
2017  0.2337
2018 -0.2415
2019  0.3860
2020  0.2908
2021 -0.0402
2022 -0.2014
2023 -0.0702
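Because the daily returns are computed before grouping, each year's compounded return should equal that year's last close divided by the previous year's last close, minus 1. The sketch below checks this on a small made-up price series (the dates and prices are invented for illustration), using the integer year of the index as the grouping key:

```python
import pandas as pd

# Made-up prices spanning three calendar years
dates = pd.to_datetime([
    "2021-12-30", "2021-12-31", "2022-01-04",
    "2022-12-30", "2023-01-03", "2023-06-30",
])
close = pd.Series([1.00, 1.10, 1.21, 0.99, 1.00, 1.188], index=dates, name="close")

# Approach from the text: compound the daily returns within each year
daily = close.pct_change().dropna()
by_compounding = daily.groupby(daily.index.year).apply(lambda r: (1 + r).prod() - 1)

# Alternative: last close of each year relative to the previous year's last close
year_end = close.groupby(close.index.year).last()
by_year_end = year_end.pct_change()

print(pd.DataFrame({"by_compounding": by_compounding, "by_year_end": by_year_end}))
```

The two columns agree for every year except the first, where the year-end approach has no previous year to compare against and the compounding approach only covers the part of the year present in the data.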
