Input and output typically fall into a few broad categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources via Web APIs.
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
!type C:\\pytest\\ch06\\ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Since the file is comma-delimited, we can use read_csv to read it into a DataFrame:
df=pd.read_csv('C:\\pytest\\ch06\\ex1.csv')
df
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
We could also have used read_table, specifying the delimiter:
pd.read_table('C:\\pytest\\ch06\\ex1.csv',sep=',')
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
A file will not always have a header row:
!type C:\\pytest\\ch06\\ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',header=None)
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',names=['a','b','c','d','message'])
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
names=['a','b','c','d','message']
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',names=names,index_col='message')
| | a | b | c | d |
|---|---|---|---|---|
| message | | | | |
| hello | 1 | 2 | 3 | 4 |
| world | 5 | 6 | 7 | 8 |
| foo | 9 | 10 | 11 | 12 |
To form a hierarchical index from multiple columns, just pass a list of column numbers or names:
!type C:\\pytest\\ch06\\csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed=pd.read_csv('C:\\pytest\\ch06\\csv_mindex.csv',index_col=['key1','key2'])
parsed
| | | value1 | value2 |
|---|---|---|---|
| key1 | key2 | | |
| one | a | 1 | 2 |
| | b | 3 | 4 |
| | c | 5 | 6 |
| | d | 7 | 8 |
| two | a | 9 | 10 |
| | b | 11 | 12 |
| | c | 13 | 14 |
| | d | 15 | 16 |
Some tables do not use a fixed delimiter between fields; in such cases you can pass a regular expression as the delimiter for read_table.
list(open('C:\\pytest\\ch06\\ex3.txt'))
[' A B C\n',
'aaa -0.264438 -1.026059 -0.619500\n',
'bbb 0.927272 0.302904 -0.032399\n',
'ccc -0.264273 -0.386314 -0.217601\n',
'ddd -0.871858 -0.348382 1.100491\n']
Here the fields are separated by a variable amount of whitespace, which can be expressed with the regular expression \s+ (passed as a raw string so the backslash is not interpreted as an escape):
result=pd.read_table('C:\\pytest\\ch06\\ex3.txt',sep=r'\s+')
result
| | A | B | C |
|---|---|---|---|
| aaa | -0.264438 | -1.026059 | -0.619500 |
| bbb | 0.927272 | 0.302904 | -0.032399 |
| ccc | -0.264273 | -0.386314 | -0.217601 |
| ddd | -0.871858 | -0.348382 | 1.100491 |
skiprows can be used to skip the first, third, and fourth lines of a file:
!type C:\\pytest\\ch06\\ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('C:\\pytest\\ch06\\ex4.csv',skiprows=[0,2,3])
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
pandas recognizes a set of commonly occurring sentinel values, such as NA, -1.#IND, and NULL:
!type C:\\pytest\\ch06\\ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result=pd.read_csv('C:\\pytest\\ch06\\ex5.csv')
result
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
pd.isnull(result)
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | True |
| 1 | False | False | False | True | False | False |
| 2 | False | False | False | False | False | False |
na_values accepts a sequence of strings to treat as missing values:
result=pd.read_csv('C:\\pytest\\ch06\\ex5.csv',na_values=['NULL'])
result
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
A dict can specify a different set of NA sentinels for each column:
sentinels={'message':['foo','NA'],'something':['two']}
pd.read_csv('C:\\pytest\\ch06\\ex5.csv',na_values=sentinels)
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | NaN | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | NaN |
Reading a file in pieces, or iterating over it chunk by chunk:
result=pd.read_csv('C:\\pytest\\ch06\\ex6.csv')
result
| | one | two | three | four | key |
|---|---|---|---|---|---|
| 0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
| 1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
| 2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
| 3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
| 4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
| … | … | … | … | … | … |
| 9995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | L |
| 9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
| 9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
| 9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
| 9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |

10000 rows × 5 columns
nrows reads only a small number of rows:
pd.read_csv('C:\\pytest\\ch06\\ex6.csv',nrows=5)
| | one | two | three | four | key |
|---|---|---|---|---|---|
| 0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
| 1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
| 2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
| 3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
| 4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
To read a file in pieces, specify a chunksize as a number of rows:
chunker=pd.read_csv('C:\\pytest\\ch06\\ex6.csv',chunksize=1000)
chunker
tot=Series([],dtype='float64')
for piece in chunker:
    tot=tot.add(piece['key'].value_counts(),fill_value=0)
tot=tot.sort_values(ascending=False)
tot[:10]
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
M 338.0
J 337.0
F 335.0
K 334.0
H 330.0
dtype: float64
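The TextFileReader returned when chunksize is set also supports pulling an arbitrary number of rows with get_chunk. A self-contained sketch, using a small in-memory buffer as a stand-in for ex6.csv:

```python
import io

import pandas as pd

# small stand-in for ex6.csv so the sketch is self-contained
csv_text = "one,two,key\n" + "\n".join(f"{i},{i * 2},K" for i in range(10))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
first = reader.get_chunk(3)   # read the next 3 rows, regardless of chunksize
```

get_chunk draws from the same underlying iterator, so a later loop over reader resumes after the rows already consumed.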
Data can also be written out in a delimited text format:
data=pd.read_csv('C:\\pytest\\ch06\\ex5.csv')
data
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
Using DataFrame's to_csv method, we can write the data out to a comma-separated file:
data.to_csv('C:\\pytest\\ch06\\out.csv')
!type C:\\pytest\\ch06\\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
import sys
data.to_csv(sys.stdout,sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
Missing values appear as empty strings in the output; na_rep denotes them by some other sentinel value:
data.to_csv(sys.stdout,na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
You can also write only a subset of the columns, in an order of your choosing:
data.to_csv(sys.stdout,index=False,columns=['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
Series also has a to_csv method:
dates=pd.date_range('1/1/2000',periods=7)
ts=Series(np.arange(7),index=dates)
ts.to_csv('C:\\pytest\\ch06\\tseries.csv')
!type C:\\pytest\\ch06\\tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
Series.from_csv has since been removed; read_csv plus squeeze does the same job:
pd.read_csv('C:\\pytest\\ch06\\tseries.csv',header=None,parse_dates=True,index_col=0).squeeze('columns')
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 5
2000-01-07 6
dtype: int64
Most tabular data can be loaded with pandas.read_table, but some files need manual processing; this one has a malformed line with an extra field:
!type C:\\pytest\\ch06\\ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"
For any file with a single-character delimiter, you can use Python's built-in csv module directly:
import csv
f=open('C:\\pytest\\ch06\\ex7.csv')
reader=csv.reader(f)
reader
for line in reader:
print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
lines=list(csv.reader(open('C:\\pytest\\ch06\\ex7.csv')))
header,values=lines[0],lines[1:]
data_dict={h: v for h,v in zip(header,zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
CSV files come in many flavors; defining a new format (with its own delimiter, string quoting convention, line terminator, and so on) is a matter of subclassing csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator='\n'
    delimiter=';'
    quotechar='"'
    quoting=csv.QUOTE_MINIMAL  # a Dialect subclass must define quoting
reader=csv.reader(f,dialect=my_dialect)
Individual CSV dialect parameters can also be given as keywords:
reader=csv.reader(f,delimiter='|')
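Dialects work for writing as well. A self-contained sketch with csv.writer (a csv.Dialect subclass must define quoting, or csv raises an error when the dialect is used):

```python
import csv
import io

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL   # required attribute on Dialect subclasses

buf = io.StringIO()               # write to memory instead of a file
writer = csv.writer(buf, dialect=my_dialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2', '3'))
print(buf.getvalue())             # one;two;three
                                  # 1;2;3
```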
JSON is a much more flexible data format than tabular text forms like CSV:
obj="""{"name":"Wes",
"place_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},{"name":"Katie","age":33,"pet":"Cisco"}]}"""
json.loads converts a JSON string to Python form:
import json
result=json.loads(obj)
result
{'name': 'Wes',
'pet': None,
'place_lived': ['United States', 'Spain', 'Germany'],
'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
{'age': 33, 'name': 'Katie', 'pet': 'Cisco'}]}
json.dumps, on the other hand, converts a Python object back to JSON:
asjson=json.dumps(result)
To convert a JSON object (or list of objects) to a DataFrame or another structure for analysis, a convenient approach is to pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:
siblings=DataFrame(result['siblings'],columns=['name','age'])
siblings
| | name | age |
|---|---|---|
| 0 | Scott | 25 |
| 1 | Katie | 33 |
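pandas can also parse JSON itself with pd.read_json. A sketch feeding the sibling records through a StringIO (newer pandas versions expect file-like input rather than a literal string):

```python
import io

import pandas as pd

# same shape as the 'siblings' records above, pared down for illustration
sib_json = '[{"name": "Scott", "age": 25}, {"name": "Katie", "age": 33}]'
sib_frame = pd.read_json(io.StringIO(sib_json))   # one row per JSON object
```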
Python has many libraries for reading and writing HTML and XML; lxml is one of them, and it can parse large files efficiently and reliably. Find the URL you want, open it with urllib.request, then parse the resulting stream with lxml:
from lxml.html import parse
from urllib.request import urlopen
parsed=parse(urlopen('https://finance.yahoo.com/q/op?s=AAPL+Options'))
doc=parsed.getroot()
Through this object you can extract all HTML tags of a particular type, using the document root's findall method with an XPath:
links=doc.findall('.//a')
links[15:20]
[<Element a at 0x...>,
 <Element a at 0x...>,
 <Element a at 0x...>,
 <Element a at 0x...>,
 <Element a at 0x...>]
To get the URL and link text you must use each element's get method (for the URL) and text_content method (for the display text):
lnk=links[28]
lnk
<Element a at 0x...>
lnk.get('href')
'https://itunes.apple.com/us/podcast/yahoo-finance-presents/id1241417570?mt=2'
lnk.text_content()
"LISTENMajor League Soccer: The bear and bull caseYahoo Finance's Jen Rogers speaks to Dan Roberts and Kevin Chupka about the business of MLS"
A list comprehension gets all of the URLs in the document:
urls=[lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:]
['https://smallbusiness.yahoo.com',
'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US&id=SLN2310',
'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US',
'https://yahoo.uservoice.com/forums/382977',
'http://info.yahoo.com/privacy/us/yahoo/',
'http://info.yahoo.com/relevantads/',
'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
'https://twitter.com/YahooFinance',
'https://facebook.com/yahoofinance',
'http://yahoofinance.tumblr.com']
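Since the Yahoo page changes constantly, the same pattern can be exercised offline with lxml.html.fromstring on a made-up snippet of HTML:

```python
from lxml.html import fromstring

# hypothetical document standing in for the downloaded page
html = ('<html><body>'
        '<a href="http://a.example">First</a>'
        '<a href="http://b.example">Second</a>'
        '</body></html>')
doc = fromstring(html)

links = doc.findall('.//a')                       # all anchor tags
urls = [lnk.get('href') for lnk in links]         # href attribute of each
texts = [lnk.text_content() for lnk in links]     # display text of each
```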
One of the easiest ways to store data in binary format is using Python's built-in pickle serialization. Conveniently, pandas objects all have a to_pickle method that writes the data to disk in pickle form:
frame=pd.read_csv('C:\\pytest\\ch06\\ex1.csv')
frame
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
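A sketch of the pickle round trip with to_pickle and read_pickle (the file path here is made up):

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})

# hypothetical path; pickle files are only readable from Python
path = os.path.join(tempfile.gettempdir(), 'frame_pickle')
frame.to_pickle(path)
reread = pd.read_pickle(path)   # identical to frame, dtypes included
```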
pandas has a minimal dict-like HDFStore class that stores pandas objects via PyTables:
store=pd.HDFStore('mydata.h5')
store['obj1']=frame
store['obj1_col']=frame['a']
store
store['obj1']
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
Reading and writing Excel files requires the xlrd and openpyxl packages.
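A sketch of the Excel round trip, assuming openpyxl is installed (the file path and sheet name are made up):

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'a': [1, 5], 'b': [2, 6]})

path = os.path.join(tempfile.gettempdir(), 'sketch.xlsx')   # hypothetical path
frame.to_excel(path, sheet_name='Sheet1', index=False)      # write one sheet
reread = pd.read_excel(path, sheet_name='Sheet1')           # read it back
```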
Many websites offer data feeds in JSON or other formats via public APIs; an easy way to access them from Python is the requests package:
import requests
url = 'https://api.github.com/repos/pydata/pandas/milestones/28/labels'
resp = requests.get(url)
resp
The response's json method parses the body into a list of dicts, one per label:
data=resp.json()
data[:5]
issue_labels=DataFrame(data)
issue_labels
Using Python's built-in sqlite3 driver with an embedded SQLite database:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
data = [('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
cursor.description
(('a', None, None, None, None, None, None),
('b', None, None, None, None, None, None),
('c', None, None, None, None, None, None),
('d', None, None, None, None, None, None))
In Python 3, zip returns an iterator, so extract the column names with a list comprehension instead:
DataFrame(rows,columns=[x[0] for x in cursor.description])
| | a | b | c | d |
|---|---|---|---|---|
| 0 | Atlanta | Georgia | 1.25 | 6 |
| 1 | Tallahassee | Florida | 2.60 | 3 |
| 2 | Sacramento | California | 1.70 | 5 |
import pandas.io.sql as sql
sql.read_sql('select * from test', con)
| | a | b | c | d |
|---|---|---|---|---|
| 0 | Atlanta | Georgia | 1.25 | 6 |
| 1 | Tallahassee | Florida | 2.60 | 3 |
| 2 | Sacramento | California | 1.70 | 5 |
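The reverse direction also works: DataFrame.to_sql writes a table through the same connection. A sketch against an in-memory SQLite database, with a made-up table name test2:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
df = pd.DataFrame({'a': ['Atlanta', 'Sacramento'], 'c': [1.25, 1.7]})

df.to_sql('test2', con, index=False)            # creates the table and inserts rows
back = pd.read_sql('select * from test2', con)  # round-trips the data
```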