Python: A First Look at Pandas (Python Data Analysis Library)
Study materials:
1. Coursera course 《用Python 玩转数据》 https://www.coursera.org/learn/hipython
2. Website: http://pandas.pydata.org/
Pandas Series
>>> from pandas import Series
>>> sa = Series(['a', 'b', 'c'], index = [0, 1, 2])
>>> sb = Series(['a', 'b', 'c'])
>>> sc = Series(['a', 'c', 'b'])
>>> sa.equals(sc)
False
>>> sb.equals(sa)
True
>>> sa*3 + sc*2
0    aaaaa
1    bbbcc
2    cccbb
>>> from pandas import Series, DataFrame
>>> data = {'language': ['Java', 'PHP', 'Python', 'R', 'C#'], 'year': [1995, 1995, 1991, 1993, 2000]}
>>> frame = DataFrame(data)
>>> frame['IDE'] = Series(['Intellij', 'Notepad', 'IPython', 'R studio', 'VS'])
>>> 'VS' in frame['IDE']
False
>>> frame.ix[2]
language     Python
year           1991
IDE         IPython
Name: 2, dtype: object
frame.ix[2] returns the third row of frame.
With the default integer index, frame.ix[i] returns row i+1 of frame.
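The behaviors shown in this session can be sketched against a recent pandas release. This is an illustration under two assumptions: that pandas is imported as `pd`, and that `.ix` is unavailable (it was removed in pandas 1.0), so `.iloc` (by position) and `.loc` (by label) stand in for it. It also shows why `'VS' in frame['IDE']` is False: `in` on a Series tests the index labels, not the stored values.

```python
import pandas as pd

sa = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
sb = pd.Series(['a', 'b', 'c'])   # the default index is also 0, 1, 2
sc = pd.Series(['a', 'c', 'b'])

same_ab = sb.equals(sa)   # same values on the same index labels
same_ac = sa.equals(sc)   # same labels, but the values differ per label

frame = pd.DataFrame({
    'language': ['Java', 'PHP', 'Python', 'R', 'C#'],
    'year': [1995, 1995, 1991, 1993, 2000],
})
frame['IDE'] = pd.Series(['Intellij', 'Notepad', 'IPython', 'R studio', 'VS'])

# `in` on a Series looks at the index labels (0..4 here), not at the values.
in_labels = 'VS' in frame['IDE']
label_hit = 4 in frame['IDE']
in_values = 'VS' in frame['IDE'].values

# Two stand-ins for frame.ix[2] on a default integer index.
third_by_position = frame.iloc[2]
third_by_label = frame.loc[2]
```

On a default integer index the position and the label coincide, so both lookups return the same row; they differ once the index is replaced with other labels.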
Pandas DataFrame
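1. Fetch historical stock data for a listed company from Yahoo Finance: Microsoft's stock data from two years ago today up to today.
Microsoft's ticker symbol can be looked up at http://finance.yahoo.com/q/cp?s=^DJI.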
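from matplotlib.finance import quotes_historical_yahoo
from datetime import date
import pandas as pd  # needed for pd.DataFrame in step 2

today = date.today()
# Start date: the same calendar day two years ago, as a (year, month, day) tuple
start = (today.year-2, today.month, today.day)
quotes = quotes_historical_yahoo('MSFT', start, today)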
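The start date above is built by subtracting 2 from the year. If the script runs on February 29, that tuple names a day that does not exist. A minimal standard-library sketch of a safer calculation follows; `years_before` is a hypothetical helper introduced here for illustration, not part of the original code.

```python
from datetime import date

def years_before(day, years):
    """Return the same calendar day `years` earlier (hypothetical helper)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # February 29 has no counterpart in a non-leap year: fall back to February 28.
        return day.replace(year=day.year - years, day=28)

today = date.today()
start = years_before(today, 2)
```

The result is a `date` object rather than a tuple, so it can be compared with `today` directly.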
2. Add column names to the quotes data
attributes = ['date','open','close','high','low','volume']
quotesdf = pd.DataFrame(quotes, columns = attributes)
Sample of the quotesdf data
>>> quotesdf
     date       open      close       high        low    volume
0  735141  31.243013  31.508104  31.536509  30.958986  39839500
1  735142  31.574376  31.792134  31.820536  31.527039  36718700
2  735143  31.583847  32.114029  32.218173  31.517574  46946800
3  735144  32.076161  32.057226  32.189771  31.640650  38703800
4  735145  31.896275  32.076161  32.180305  31.830002  33008100
5  735148  31.811066  31.527040  31.915211  31.432366  35069300
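Since the download itself depends on a remote service, step 2 can also be tried offline. The sketch below assumes `quotes` is a sequence of 6-field records in the order shown above, and uses the first three rows of the sample output as stand-in data.

```python
import pandas as pd

# First three rows of the sample output, standing in for the downloaded quotes.
quotes = [
    (735141, 31.243013, 31.508104, 31.536509, 30.958986, 39839500),
    (735142, 31.574376, 31.792134, 31.820536, 31.527039, 36718700),
    (735143, 31.583847, 32.114029, 32.218173, 31.517574, 46946800),
]
attributes = ['date', 'open', 'close', 'high', 'low', 'volume']

# Each record becomes one row; `columns` names the fields in order.
quotesdf = pd.DataFrame(quotes, columns=attributes)

column_names = list(quotesdf.columns)
shape = quotesdf.shape
first_close = quotesdf.loc[0, 'close']
```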
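3. Replace the index with the date and drop the original date column. For example, Friday, January 30, 2015 is displayed as '15/01/30,Fri'; note the exact spacing and punctuation.
date.fromordinal and date.strftime are used as follows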
>>> from datetime import date
>>> d=date.fromordinal(735866)
>>> d
datetime.date(2015, 9, 25)
>>> y=date.strftime(d,"%y/%m/%d,%a")
>>> y
'15/09/25,Fri'
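The two calls can be wrapped in one small function. In this sketch `ordinal_to_label` is a hypothetical name, and the weekday abbreviation produced by `%a` assumes the default (English) locale.

```python
from datetime import date

def ordinal_to_label(ordinal):
    """Turn a proleptic Gregorian ordinal into a 'yy/mm/dd,Day' label."""
    # int() allows the ordinal to arrive as a float such as 735141.0.
    return date.fromordinal(int(ordinal)).strftime("%y/%m/%d,%a")

labels = [ordinal_to_label(n) for n in (735141.0, 735142, 735866)]
```

The first two ordinals are the date values of the first two quotesdf rows; the third is the one used in the session above.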
The DataFrame drop method removes the specified rows or columns.
Continuing from the code above
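list1 = []
for i in range(0, len(quotes)):
    # Convert the ordinal in the first field into a label such as '13/09/30,Mon'
    x = date.fromordinal(int(quotes[i][0]))
    y = date.strftime(x, "%y/%m/%d,%a")
    list1.append(y)
quotesdf.index = list1
# axis=1 drops a column; the default axis=0 would drop rows
quotesdf = quotesdf.drop(['date'], axis=1)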
Sample of the quotesdf data
>>> quotesdf
                   open      close       high        low    volume
13/09/30,Mon  31.243013  31.508104  31.536509  30.958986  39839500
13/10/01,Tue  31.574376  31.792134  31.820536  31.527039  36718700
13/10/02,Wed  31.583847  32.114029  32.218173  31.517574  46946800
13/10/03,Thu  32.076161  32.057226  32.189771  31.640650  38703800
13/10/04,Fri  31.896275  32.076161  32.180305  31.830002  33008100
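The same re-indexing can be written more compactly by building the labels from the date column itself. This is a sketch that assumes a pandas version supporting `drop(columns=...)`; the two rows are copied from the sample data above.

```python
from datetime import date
import pandas as pd

quotes = [
    (735141, 31.243013, 31.508104, 31.536509, 30.958986, 39839500),
    (735142, 31.574376, 31.792134, 31.820536, 31.527039, 36718700),
]
quotesdf = pd.DataFrame(
    quotes, columns=['date', 'open', 'close', 'high', 'low', 'volume'])

# Build one label per row directly from the 'date' column.
quotesdf.index = [
    date.fromordinal(int(d)).strftime("%y/%m/%d,%a") for d in quotesdf['date']
]
# drop returns a new frame, so the result has to be assigned back.
quotesdf = quotesdf.drop(columns=['date'])

index_labels = list(quotesdf.index)
remaining_columns = list(quotesdf.columns)
```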
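4. To get the opening and closing prices between January 30 and February 10, 2014, the period in which Microsoft changed its CEO, the following command runs and returns the result we want: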
>>> quotesdf.ix['14/01/30':'14/02/10',['open','close']]
                   open      close
14/01/30,Thu  35.095384  35.162160
14/01/31,Fri  35.248015  36.097019
14/02/03,Mon  36.001626  34.799662
14/02/04,Tue  35.267094  34.675649
14/02/05,Wed  34.618415  34.170063
14/02/06,Thu  34.150985  34.513482
14/02/07,Fri  34.647033  34.875979
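Because the index labels are plain strings, the slice bounds are compared lexicographically. The sketch below, which uses `.loc` as a stand-in for `.ix` and placeholder integers instead of real prices, suggests one consequence: the bound '14/02/10' sorts before the label '14/02/10,Mon', so that day falls outside the slice. This would explain why the output above stops at 14/02/07.

```python
import pandas as pd

labels = [
    '14/01/29,Wed', '14/01/30,Thu', '14/01/31,Fri', '14/02/03,Mon',
    '14/02/04,Tue', '14/02/05,Wed', '14/02/06,Thu', '14/02/07,Fri',
    '14/02/10,Mon', '14/02/11,Tue',
]
# Placeholder values (row positions), not real prices.
s = pd.Series(range(len(labels)), index=labels)

# '14/02/10' < '14/02/10,Mon' as strings, so February 10 is left out.
short_bound = s.loc['14/01/30':'14/02/10']
# Using the full label as the upper bound includes it.
full_bound = s.loc['14/01/30':'14/02/10,Mon']

last_short = short_bound.index[-1]
last_full = full_bound.index[-1]
```

Label slicing with bounds that are not themselves in the index relies on the index being sorted, which holds here because the 'yy/mm/dd' prefix sorts chronologically.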
5. Find the records between June 1 and December 31, 2014 on which Microsoft's closing price was above 45 US dollars.
>>> quotesdf['14/06/01':'14/12/31'][quotesdf.close>45]
                   open      close       high        low    volume
14/09/08,Mon  44.819649  45.257913  45.579304  44.790434  45736700
14/09/09,Tue  45.257913  45.540346  45.744871  45.209214  40302400
14/09/10,Wed  45.598784  45.618262  45.715653  45.072868  27302400
14/09/11,Thu  45.520872  45.774088  45.774088  45.257913  29216400
14/09/12,Fri  45.686436  45.481914  45.793566  45.384519  38244700
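In the command above the boolean mask is computed on the full frame and then applied to a slice of it. A minimal sketch of an alternative computes the mask on the slice itself, so the mask and the rows share the same index. The five rows are copied from outputs shown in this post; `.loc` is used for the label slice.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'open': [35.095384, 35.248015, 44.819649, 45.257913, 47.536879],
        'close': [35.162160, 36.097019, 45.257913, 45.540346, 48.316012],
    },
    index=['14/01/30,Thu', '14/01/31,Fri', '14/09/08,Mon',
           '14/09/09,Tue', '14/11/13,Thu'],
)

# Step 1: restrict to the date range (string labels, sorted index).
period = df.loc['14/06/01':'14/12/31']
# Step 2: build the mask from the restricted frame and apply it.
above_45 = period[period['close'] > 45]

matching_days = list(above_45.index)
```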
6. Find the 5 days with the highest closing price for Microsoft stock in all of 2014 (January 1 to December 31)
>>> quotesdf['14/01/01':'14/12/31'].sort('close', ascending=False)[:5]
                   open      close       high        low    volume
14/11/13,Thu  47.536879  48.316012  48.354970  47.439485  26208800
14/11/14,Fri  48.442622  48.286795  48.744533  48.101748  29081700
14/11/17,Mon  48.121228  48.169923  48.413402  47.858270  30315500
14/12/04,Thu  47.425075  47.866103  48.081717  47.238866  30320400
14/11/18,Tue  48.150321  47.768099  48.346334  47.728896  23995500
ascending=False (or 0) sorts in descending order
ascending=True (or 1) sorts in ascending order
Ascending order is the default
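In recent pandas releases `DataFrame.sort` is no longer available, and `sort_values` takes the same `ascending` argument. The sketch below assumes such a release and also shows `nlargest` as a shortcut for the same top-5 query; the closing prices are copied from outputs in this post.

```python
import pandas as pd

df = pd.DataFrame(
    {'close': [35.162160, 45.257913, 48.316012, 48.286795,
               48.169923, 47.768099, 47.866103]},
    index=['14/01/30,Thu', '14/09/08,Mon', '14/11/13,Thu', '14/11/14,Fri',
           '14/11/17,Mon', '14/11/18,Tue', '14/12/04,Thu'],
)

# Descending sort, then keep the first five rows.
top5_sorted = df.sort_values('close', ascending=False).head(5)
# nlargest expresses the same query in a single call.
top5_nlargest = df.nlargest(5, 'close')

top5_days = list(top5_sorted.index)
```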
7. Count, for each month of 2014 (January 1 to December 31), the number of days on which Microsoft's stock price rose
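list1 = []
# copy() makes tmpdf an independent frame, so adding a column is safe
tmpdf = quotesdf['14/01/01':'14/12/31'].copy()
for i in range(0, len(tmpdf)):
    # Characters 3..4 of a label such as '14/09/08,Mon' hold the month
    list1.append(int(tmpdf.index[i][3:5]))
tmpdf['month'] = list1
# A rising day is one where the close is above the open
print(tmpdf[tmpdf.close > tmpdf.open]['month'].value_counts())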
The result is
9     14
10    12
2     12
11    11
8     11
6     11
3     11
12    10
7     10
4     10
5      9
1      9
value_counts() returns a Series; in the example above it works much like a group by on 'month'.
Using the index
>>> quotesdf.index[0]
'13/09/30,Mon'
>>> quotesdf.index[0][3:5]
'09'
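To make the comparison with group by concrete, the sketch below counts rising days per month twice, once with `value_counts()` and once with `groupby(...).size()`. The seven rows are the open and close prices from the step 4 output; this is an illustration, not a reproduction of the full-year result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'open': [35.095384, 35.248015, 36.001626, 35.267094,
                 34.618415, 34.150985, 34.647033],
        'close': [35.162160, 36.097019, 34.799662, 34.675649,
                  34.170063, 34.513482, 34.875979],
    },
    index=['14/01/30,Thu', '14/01/31,Fri', '14/02/03,Mon', '14/02/04,Tue',
           '14/02/05,Wed', '14/02/06,Thu', '14/02/07,Fri'],
)

# The month is the 'mm' part of each 'yy/mm/dd,Day' label.
df['month'] = [int(label[3:5]) for label in df.index]

rising = df[df['close'] > df['open']]
# value_counts orders by frequency; sort_index puts the months in order.
by_value_counts = rising['month'].value_counts().sort_index()
by_groupby = rising.groupby('month').size()
```

Both results map each month to its number of rising days; `value_counts()` alone orders them by count, which is why the months in the result above appear out of calendar order.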
8. Compute the total trading volume of Microsoft stock for each month of 2014
>>> tmpdf.groupby('month')['volume'].sum()
month
1     930226200
2     705304500
3     778425700
4     746113500
5     574362900
6     555779700
7     731616500
8     513919700
9     860827300
10    853235700
11    522988700
12    605188200
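The grouping step can be checked by hand on a few rows. This sketch uses the five volume figures from the step 3 sample (dates in 2013 rather than 2014), which is enough to see one group with a single row and one with several.

```python
import pandas as pd

df = pd.DataFrame(
    {'volume': [39839500, 36718700, 46946800, 38703800, 33008100]},
    index=['13/09/30,Mon', '13/10/01,Tue', '13/10/02,Wed',
           '13/10/03,Thu', '13/10/04,Fri'],
)
df['month'] = [int(label[3:5]) for label in df.index]

# Rows sharing a month value form one group; sum() adds up their volumes.
monthly_volume = df.groupby('month')['volume'].sum()
october_by_hand = 36718700 + 46946800 + 38703800 + 33008100
```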
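9. List the 5 days with the lowest and the 5 days with the highest closing price for Microsoft stock. The command below sorts the whole of quotesdf, so the result covers the full two-year range rather than 2014 only.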
>>> s = quotesdf.sort('close')
>>> pd.concat([s[:5],s[-5:]])
                   open      close       high        low    volume
13/10/08,Tue  31.536509  31.252479  31.555445  31.053661  41017600
13/10/09,Wed  31.309286  31.309286  31.574376  31.205142  35878600
13/09/30,Mon  31.243013  31.508104  31.536509  30.958986  39839500
13/10/07,Mon  31.811066  31.527040  31.915211  31.432366  35069300
13/10/01,Tue  31.574376  31.792134  31.820536  31.527039  36718700
14/11/17,Mon  48.121228  48.169923  48.413402  47.858270  30315500
14/11/14,Fri  48.442622  48.286795  48.744533  48.101748  29081700
14/11/13,Thu  47.536879  48.316012  48.354970  47.439485  26208800
15/04/29,Wed  48.088304  48.423896  48.670655  47.871156  47804600
15/04/28,Tue  47.160490  48.522598  48.571949  47.081529  60730800
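To limit this query to a single year, the frame can be sliced by date before sorting. The sketch below assumes a recent pandas release (`sort_values` in place of `sort`) and, to stay short, takes the 2 lowest and 2 highest days instead of 5. The six closing prices are copied from outputs in this post.

```python
import pandas as pd

df = pd.DataFrame(
    {'close': [31.252479, 35.162160, 34.170063,
               48.316012, 48.286795, 48.522598]},
    index=['13/10/08,Tue', '14/01/30,Thu', '14/02/05,Wed',
           '14/11/13,Thu', '14/11/14,Fri', '15/04/28,Tue'],
)

# Keep only the 2014 rows, then sort by closing price (ascending).
year_2014 = df.loc['14/01/01':'14/12/31']
s = year_2014.sort_values('close')

# Lowest days come first in s, highest days last.
extremes = pd.concat([s.head(2), s.tail(2)])
extreme_days = list(extremes.index)
```

The 2013 and 2015 rows are excluded by the slice, even though they hold the overall lowest and highest closes in this sample.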