In [8]: !cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Note: here I used the Unix cat shell command to print the raw contents of the file to the screen. If you are on Windows, you can use type to achieve the same effect.
Since the file is comma-delimited, we can use read_csv to read it into a DataFrame:
In [9]: df = pd.read_csv('examples/ex1.csv')
In [10]: df
Out[10]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
We could also have used read_table and specified the delimiter:
In [11]: pd.read_table('examples/ex1.csv', sep=',')
Out[11]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
A file will not always have a header row. Consider this file:
In [12]: !cat examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
To read this file, you have a couple of options. You can allow pandas to assign default column names, or you can specify the names yourself:
In [13]: pd.read_csv('examples/ex2.csv', header=None)
Out[13]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
In [14]: pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[14]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
In [15]: names = ['a', 'b', 'c', 'd', 'message']
In [16]: pd.read_csv('examples/ex2.csv', names=names, index_col='message')
Out[16]:
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
If you want to form a hierarchical index from multiple columns, just pass a list of column numbers or names:
In [17]: !cat examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
In [18]: parsed = pd.read_csv('examples/csv_mindex.csv',
....: index_col=['key1', 'key2'])
In [19]: parsed
Out[19]:
           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
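As a quick illustration (not part of the original example) of why a hierarchical index is handy, partial indexing with .loc on the outer level selects all rows under one key1 value:

# Rows where key1 == 'one', indexed by key2 only
parsed.loc['one']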
In some cases, a table might not use a fixed delimiter, instead using whitespace or some other pattern to separate fields. In those cases you can pass a regular expression as a delimiter:
In [21]: result = pd.read_table('examples/ex3.txt', sep='\s+')
In [22]: result
Out[22]:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
In [23]: !cat examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
You can skip the first, third, and fourth rows of the file with skiprows:
In [24]: pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Out[24]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
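An alternative worth knowing, sketched here on the assumption that every junk line in ex4.csv begins with '#', is the comment option, which drops such lines without you having to know their positions in advance:

import pandas as pd

# Lines (or trailing text) starting with '#' are ignored entirely,
# so the three comment lines in ex4.csv are skipped automatically.
pd.read_csv('examples/ex4.csv', comment='#')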
Handling missing values is an important and frequently nuanced part of the file-parsing process. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:
In [25]: !cat examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
In [26]: result = pd.read_csv('examples/ex5.csv')
In [27]: result
Out[27]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
In [28]: pd.isnull(result)
Out[28]:
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False
The na_values option can take either a list or a set of strings to consider as missing values:
In [29]: result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
In [30]: result
Out[30]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
Different NA sentinels can be specified for each column in a dict:
In [31]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
In [32]: pd.read_csv('examples/ex5.csv', na_values=sentinels)
Out[32]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
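If you want only your own sentinels to be honored, the keep_default_na option can disable pandas' built-in list; a minimal sketch combining it with na_values:

import pandas as pd

# Only 'NULL' is treated as missing; defaults such as 'NA' and the empty
# string are kept as ordinary values.
result = pd.read_csv('examples/ex5.csv', keep_default_na=False,
                     na_values=['NULL'])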
When processing very large files, you may want to read only a small piece of a file or iterate through it in smaller chunks. Let's first look at a larger file in full:
In [34]: result = pd.read_csv('examples/ex6.csv')
In [35]: result
Out[35]:
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0
[10000 rows x 5 columns]
If you want to read only a small number of rows (avoiding reading the entire file), specify that with nrows:
In [36]: pd.read_csv('examples/ex6.csv', nrows=5)
Out[36]:
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
To read a file in pieces, specify a chunksize as a number of rows:
In [874]: chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
In [875]: chunker
Out[875]: <pandas.io.parsers.TextParser at 0x8398150>
The TextParser object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

tot = pd.Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
We then have:
In [40]: tot[:10]
Out[40]:
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
M 338.0
J 337.0
F 335.0
K 334.0
H 330.0
dtype: float64
TextParser also has a get_chunk method that enables you to read pieces of an arbitrary size.
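A small sketch of get_chunk (the chunk size of 10 below is purely illustrative):

import pandas as pd

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

# get_chunk returns the next piece of the file with however many rows
# you ask for, independent of the chunksize given above.
piece = chunker.get_chunk(10)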
Writing Data to Text Format
Data can also be exported to a delimited format. Let's consider one of the CSV files read before:
In [41]: data = pd.read_csv('examples/ex5.csv')
In [42]: data
Out[42]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
Using DataFrame's to_csv method, we can write the data out to a comma-separated file:
In [43]: data.to_csv('examples/out.csv')
In [44]: !cat examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
Other delimiters can be used, of course (writing to sys.stdout so it prints the text result to the console rather than a file):
In [45]: import sys
In [46]: data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:
In [47]: data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
With no other options specified, both the row and column labels are written. Both of these can be disabled:
In [48]: data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
You can also write only a subset of the columns, and in an order of your choosing:
In [49]: data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
Series also has a to_csv method:
In [50]: dates = pd.date_range('1/1/2000', periods=7)
In [51]: ts = pd.Series(np.arange(7), index=dates)
In [52]: ts.to_csv('examples/tseries.csv')
In [53]: !cat examples/tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
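To load this CSV version of the Series back into pandas, you can use read_csv; a sketch assuming the file written above (the column names 'date' and 'value' are just illustrative):

import pandas as pd

# No header row was written, so supply names; parse the first column as
# dates and use it as the index.
ts2 = pd.read_csv('examples/tseries.csv', header=None,
                  names=['date', 'value'], index_col='date',
                  parse_dates=True)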
If you need to export data from pandas to JSON, one way is to use the to_json methods on Series and DataFrame (data here is a small DataFrame with columns a, b, and c):
In [71]: print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
In [72]: print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
XML and HTML: Web Scraping
Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats, including lxml, Beautiful Soup, and html5lib. While lxml is comparatively much faster in general, the other libraries can better handle malformed HTML or XML files.
pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects. To show how this works, I downloaded an HTML file (also used in the pandas documentation) from the United States FDIC showing bank failures. First, you must install the additional libraries used by read_html (lxml, beautifulsoup4, and html5lib):
In [73]: tables = pd.read_html('examples/fdic_failed_bank_list.html')
In [74]: len(tables)
Out[74]: 1
In [75]: failures = tables[0]
In [76]: failures.head()
Out[76]:
                      Bank Name             City  ST   CERT  \
0                   Allied Bank         Mulberry  AR     91
1  The Woodbury Banking Company         Woodbury  GA  11297
2       First CornerStone Bank  King of Prussia  PA  35312
3            Trust Company Bank          Memphis  TN   9956
4    North Milwaukee State Bank        Milwaukee  WI  20364
Acquiring Institution Closing Date Updated Date
0 Today's Bank September 23, 2016 November 17, 2016
1 United Bank August 19, 2016 November 17, 2016
2 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016
3 The Bank of Fayette County April 29, 2016 September 6, 2016
4 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016
Because failures has many columns, pandas inserts a line-break character \.
Here, we can proceed to do some data cleaning and analysis (more on this in later chapters), like computing the number of bank failures by year:
In [77]: close_timestamps = pd.to_datetime(failures['Closing Date'])
In [78]: close_timestamps.dt.year.value_counts()
Out[78]:
2010    157
2009    140
2011     92
2012     51
2008     25
       ...
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, Length: 15, dtype: int64
Parsing XML with lxml.objectify
XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata. The files used in this book actually come from a large XML document.
Earlier, I showed the pandas.read_html function, which uses lxml or Beautiful Soup under the hood to parse data from HTML. XML and HTML are structurally similar, but XML is more general. Here, I will show an example of how to use lxml to parse data from XML format.
The New York Metropolitan Transportation Authority (MTA) publishes a number of data series about its bus and train services (http://www.mta.info/developers/download.html). Here we'll look at the performance data, which is contained in a set of XML files. Each train or bus service has its own file (like Performance_MNR.xml for the Metro-North Railroad), containing monthly data as a series of XML records that look like this:
<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <DESCRIPTION>Percent of the time that escalators are operational
  systemwide. The availability rate is based on physical observations performed
  the morning of regular business days only. This is a new indicator the agency
  began reporting in 2009.</DESCRIPTION>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
  <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>
Using lxml.objectify, we first parse the file and then get a reference to the root node of the XML file with getroot:
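The parsing call itself is not shown in this excerpt; here is a minimal sketch, assuming the XML file has been saved locally as Performance_MNR.xml (adjust the path to wherever you downloaded it):

from lxml import objectify

path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))   # parse the whole document
root = parsed.getroot()                # reference to the root node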
root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding a few tags):
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
Lastly, convert this list of dicts into a DataFrame:
In [81]: perf = pd.DataFrame(data)
In [82]: perf.head()
Out[82]:
Empty DataFrame
Columns: []
Index: []
XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:
from io import StringIO

tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
You can now access any of the fields (like href) in the tag or the link text:
In [84]: root
Out[84]: <Element a at 0x7f6b15817748>
In [85]: root.get('href')
Out[85]: 'http://www.google.com'
In [86]: root.text
Out[86]: 'Google'
6.2 Binary Data Formats
One of the easiest ways to store data efficiently in binary format is with Python's built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format:
In [87]: frame = pd.read_csv('examples/ex1.csv')
In [88]: frame
Out[88]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
In [89]: frame.to_pickle('examples/frame_pickle')
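You can read the pickled object back into Python either with the built-in pickle module or, more conveniently, with pandas.read_pickle; a quick sketch using the file written above:

import pandas as pd

# Reconstructs the DataFrame exactly as it was written.
frame2 = pd.read_pickle('examples/frame_pickle')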
HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. pandas provides a high-level interface, HDFStore, that simplifies storing Series and DataFrame objects:
In [92]: frame = pd.DataFrame({'a': np.random.randn(100)})
In [93]: store = pd.HDFStore('mydata.h5')
In [94]: store['obj1'] = frame
In [95]: store['obj1_col'] = frame['a']
In [96]: store
Out[96]:
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
/obj1 frame (shape->[100,1])
/obj1_col series (shape->[100])
/obj2 frame_table (typ->appendable,nrows->100,ncols->1,indexers->
[index])
/obj3 frame_table (typ->appendable,nrows->100,ncols->1,indexers->
[index])
Objects contained in the HDF5 file can then be retrieved with the same dict-like API:
In [97]: store['obj1']
Out[97]:
           a
0  -0.204708
1   0.478943
2  -0.519439
3  -0.555730
4   1.965781
..       ...
95  0.795253
96  0.118110
97 -0.748532
98  0.584970
99  0.152677
[100 rows x 1 columns]
HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:
In [98]: store.put('obj2', frame, format='table')
In [99]: store.select('obj2', where=['index >= 10 and index <= 15'])
Out[99]:
           a
10  1.007189
11 -1.296221
12  0.274992
13  0.228913
14  1.352917
15  0.886429
In [100]: store.close()
put is an explicit version of the store['obj2'] = frame method, but it allows us to set other options like the storage format.
The pandas.read_hdf function gives you a shortcut to these tools:
In [101]: frame.to_hdf('mydata.h5', 'obj3', format='table')
In [102]: pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])
Out[102]:
          a
0 -0.204708
1  0.478943
2 -0.519439
3 -0.555730
4  1.965781
6.3 Interacting with Web APIs
Many websites have public APIs providing data feeds via JSON or some other format. One easy-to-use way to access these APIs from Python is the requests package. To find the last 30 GitHub issues for pandas, we can make a GET HTTP request using the add-on requests library:
In [113]: import requests
In [114]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
In [115]: resp = requests.get(url)
In [116]: resp
Out[116]: <Response [200]>
The Response object's json method returns a dictionary containing JSON parsed into native Python objects:
In [117]: data = resp.json()
In [118]: data[0]['title']
Out[118]: 'Period does not round down for frequencies less that 1 hour'
Each element in data is a dict containing all of the data found on a GitHub issue page (except for the comments). We can pass data directly to DataFrame and extract fields of interest:
In [119]: issues = pd.DataFrame(data, columns=['number', 'title',
.....: 'labels', 'state'])
In [120]: issues
Out[120]:
    number                                              title  \
0    17666  Period does not round down for frequencies les...
1    17665          DOC: improve docstring of function where
2    17664               COMPAT: skip 32-bit test on int repr
3    17662                          implement Delegator class
4    17654  BUG: Fix series rename called with str alterin...
..     ...                                                ...
25   17603  BUG: Correctly localize naive datetime strings...
26   17599                     core.dtypes.generic --> cython
27   17596   Merge cdate_range functionality into bdate_range
28   17587  Time Grouper bug fix when applied for list gro...
29   17583  BUG: fix tz-aware DatetimeIndex + TimedeltaInd...
labels state
0 [] open
1 [{'id': 134699, 'url': 'https://api.github.com... open
2 [{'id': 563047854, 'url': 'https://api.github.... open
3 [] open
4 [{'id': 76811, 'url': 'https://api.github.com/... open
.. ... ...
25 [{'id': 76811, 'url': 'https://api.github.com/... open
26 [{'id': 49094459, 'url': 'https://api.github.c... open
27 [{'id': 35818298, 'url': 'https://api.github.c... open
28 [{'id': 233160, 'url': 'https://api.github.com... open
29 [{'id': 76811, 'url': 'https://api.github.com/... open
[30 rows x 4 columns]
With a bit of elbow grease, you can create some higher-level interfaces to common web APIs that return DataFrame objects for easy analysis.
6.4 Interacting with Databases
In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I'll create a SQLite database using Python's built-in sqlite3 driver:
In [121]: import sqlite3
In [122]: query = """
.....: CREATE TABLE test
.....: (a VARCHAR(20), b VARCHAR(20),
.....: c REAL, d INTEGER
.....: );"""
In [123]: con = sqlite3.connect('mydata.sqlite')
In [124]: con.execute(query)
Out[124]: <sqlite3.Cursor at 0x7f6b12a50f10>
In [125]: con.commit()
Then, insert a few rows of data:
In [126]: data = [('Atlanta', 'Georgia', 1.25, 6),
.....: ('Tallahassee', 'Florida', 2.6, 3),
.....: ('Sacramento', 'California', 1.7, 5)]
In [127]: stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
In [128]: con.executemany(stmt, data)
Out[128]: <sqlite3.Cursor at 0x7f6b15c66ce0>
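The cursor and rows used below are not created in this excerpt; a minimal sketch of those intermediate steps, assuming the con connection from above:

con.commit()                                # persist the inserted rows

cursor = con.execute('select * from test')  # most drivers return a cursor
rows = cursor.fetchall()                    # list of tuples, one per row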
In [133]: cursor.description
Out[133]:
(('a', None, None, None, None, None, None),
('b', None, None, None, None, None, None),
('c', None, None, None, None, None, None),
('d', None, None, None, None, None, None))
In [134]: pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
Out[134]:
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
This munging is something you don't want to repeat each time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that lets you read data easily from a general SQLAlchemy connection. Here, we'll connect to the same SQLite database with SQLAlchemy and read data from the table created before:
In [135]: import sqlalchemy as sqla
In [136]: db = sqla.create_engine('sqlite:///mydata.sqlite')
In [137]: pd.read_sql('select * from test', db)
Out[137]:
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
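pandas can also write a DataFrame back to a SQL table through the same SQLAlchemy engine with the to_sql method; a hedged sketch (the names test_df and test_copy are purely illustrative):

test_df = pd.read_sql('select * from test', db)

# if_exists='replace' drops any existing table with the same name first.
test_df.to_sql('test_copy', db, index=False, if_exists='replace')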