"Python for Data Analysis" study notes, ch06 (7)

Chapter 6: Data Loading, Storage, and File Formats

Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources via Web APIs.

Reading and Writing Data in Text Format

import pandas as pd
import numpy as np
from pandas import Series,DataFrame
!type C:\\pytest\\ch06\\ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Since the file is comma-delimited, read_csv can read it into a DataFrame:

df=pd.read_csv('C:\\pytest\\ch06\\ex1.csv')
df
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

read_table works too; it just needs the delimiter specified:

pd.read_table('C:\\pytest\\ch06\\ex1.csv',sep=',')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

A file without a header row:

!type C:\\pytest\\ch06\\ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',header=None)
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',names=['a','b','c','d','message'])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
names=['a','b','c','d','message']
pd.read_csv('C:\\pytest\\ch06\\ex2.csv',names=names,index_col='message')
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

To form a hierarchical index from multiple columns, just pass a list of column numbers or names:

!type C:\\pytest\\ch06\\csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed=pd.read_csv('C:\\pytest\\ch06\\csv_mindex.csv',index_col=['key1','key2'])
parsed
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16
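A minimal sketch of the same hierarchical-index read on in-memory data (a shortened stand-in for csv_mindex.csv):

```python
import io
import pandas as pd

text = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
two,a,9,10
two,b,11,12
"""

parsed = pd.read_csv(io.StringIO(text), index_col=['key1', 'key2'])
# Rows are now addressed by (key1, key2) tuples on a MultiIndex
print(parsed.loc[('two', 'b'), 'value1'])  # 11
```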

Some tables do not use a fixed delimiter between fields; in such cases, a regular expression can be written and passed as the delimiter for read_table.

list(open('C:\\pytest\\ch06\\ex3.txt'))
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

The fields in this file are separated by a variable amount of whitespace, which the regular expression \s+ expresses (written as a raw string so Python leaves the backslash alone):

result=pd.read_table('C:\\pytest\\ch06\\ex3.txt',sep=r'\s+')
result
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
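read_csv accepts the same sep argument; a sketch on an in-memory copy of the first rows of ex3.txt. Because the header row has one fewer field than the data rows, pandas infers that the first column is the index:

```python
import io
import pandas as pd

text = """            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
"""

# r'\s+' matches one or more whitespace characters; regex separators
# force the slower pure-Python parsing engine
result = pd.read_csv(io.StringIO(text), sep=r'\s+')
print(result.index.tolist())   # ['aaa', 'bbb']
print(list(result.columns))    # ['A', 'B', 'C']
```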

skiprows can skip the first, third, and fourth rows of the file:

!type C:\\pytest\\ch06\\ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('C:\\pytest\\ch06\\ex4.csv',skiprows=[0,2,3])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
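The same skiprows call, sketched on in-memory data (the comment lines are made up):

```python
import io
import pandas as pd

text = """# hey!
a,b,c,d,message
# just a comment
# another comment
1,2,3,4,hello
5,6,7,8,world
"""

# Skip rows 0, 2, and 3 (0-based), leaving only the header and data rows
df = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])
print(len(df))  # 2
```

When every junk line starts with the same character, read_csv's comment='#' option drops them all without counting row positions.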

pandas recognizes a set of commonly occurring sentinel values, such as NA, -1.#IND, and NULL:

!type C:\\pytest\\ch06\\ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result=pd.read_csv('C:\\pytest\\ch06\\ex5.csv')
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
pd.isnull(result)
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

na_values accepts a list or set of strings to treat as missing values:

result=pd.read_csv('C:\\pytest\\ch06\\ex5.csv',na_values=['NULL'])
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

A dict can specify different NA sentinels for each column:

sentinels={'message':['foo','NA'],'something':['two']}
pd.read_csv('C:\\pytest\\ch06\\ex5.csv',na_values=sentinels)
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN
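The per-column sentinels, sketched end to end on an in-memory copy of ex5.csv:

```python
import io
import pandas as pd

text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

# 'foo' only counts as NA in the message column,
# 'two' only in the something column
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(io.StringIO(text), na_values=sentinels)
print(result['message'].isnull().tolist())    # [True, False, True]
print(result['something'].isnull().tolist())  # [False, True, False]
```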

Reading Text Files in Pieces

Read only a piece of a file, or iterate through it in small chunks

result=pd.read_csv('C:\\pytest\\ch06\\ex6.csv')
result
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
5 1.817480 0.742273 0.419395 -2.251035 Q
6 -0.776764 0.935518 -0.332872 -1.875641 U
7 -0.913135 1.530624 -0.572657 0.477252 K
8 0.358480 -0.497572 -0.367016 0.507702 S
9 -1.740877 -1.160417 -1.637830 2.172201 G
10 0.240564 -0.328249 1.252155 1.072796 8
11 0.764018 1.165476 -0.639544 1.495258 R
12 0.571035 -0.310537 0.582437 -0.298765 1
13 2.317658 0.430710 -1.334216 0.199679 P
14 1.547771 -1.119753 -2.277634 0.329586 J
15 -1.310608 0.401719 -1.000987 1.156708 E
16 -0.088496 0.634712 0.153324 0.415335 B
17 -0.018663 -0.247487 -1.446522 0.750938 A
18 -0.070127 -1.579097 0.120892 0.671432 F
19 -0.194678 -0.492039 2.359605 0.319810 H
20 -0.248618 0.868707 -0.492226 -0.717959 W
21 -1.091549 -0.867110 -0.647760 -0.832562 C
22 0.641404 -0.138822 -0.621963 -0.284839 C
23 1.216408 0.992687 0.165162 -0.069619 V
24 -0.564474 0.792832 0.747053 0.571675 I
25 1.759879 -0.515666 -0.230481 1.362317 S
26 0.126266 0.309281 0.382820 -0.239199 L
27 1.334360 -0.100152 -0.840731 -0.643967 6
28 -0.737620 0.278087 -0.053235 -0.950972 J
29 -1.148486 -0.986292 -0.144963 0.124362 Y
9970 0.633495 -0.186524 0.927627 0.143164 4
9971 0.308636 -0.112857 0.762842 -1.072977 1
9972 -1.627051 -0.978151 0.154745 -1.229037 Z
9973 0.314847 0.097989 0.199608 0.955193 P
9974 1.666907 0.992005 0.496128 -0.686391 S
9975 0.010603 0.708540 -1.258711 0.226541 K
9976 0.118693 -0.714455 -0.501342 -0.254764 K
9977 0.302616 -2.011527 -0.628085 0.768827 H
9978 -0.098572 1.769086 -0.215027 -0.053076 A
9979 -0.019058 1.964994 0.738538 -0.883776 F
9980 -0.595349 0.001781 -1.423355 -1.458477 M
9981 1.392170 -1.396560 -1.425306 -0.847535 H
9982 -0.896029 -0.152287 1.924483 0.365184 6
9983 -2.274642 -0.901874 1.500352 0.996541 N
9984 -0.301898 1.019906 1.102160 2.624526 I
9985 -2.548389 -0.585374 1.496201 -0.718815 D
9986 -0.064588 0.759292 -1.568415 -0.420933 E
9987 -0.143365 -1.111760 -1.815581 0.435274 2
9988 -0.070412 -1.055921 0.338017 -0.440763 X
9989 0.649148 0.994273 -1.384227 0.485120 Q
9990 -0.370769 0.404356 -1.051628 -1.050899 8
9991 -0.409980 0.155627 -0.818990 1.277350 W
9992 0.301214 -1.111203 0.668258 0.671922 A
9993 1.821117 0.416445 0.173874 0.505118 X
9994 0.068804 1.322759 0.802346 0.223618 H
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns

To read only a small number of rows, specify nrows:

pd.read_csv('C:\\pytest\\ch06\\ex6.csv',nrows=5)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

To read a file in pieces, specify a chunksize (number of rows):

chunker=pd.read_csv('C:\\pytest\\ch06\\ex6.csv',chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader object at 0x...>
tot=Series([],dtype='float64')
for piece in chunker:
    tot=tot.add(piece['key'].value_counts(),fill_value=0)
tot=tot.sort_values(ascending=False)
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
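The chunked aggregation can be sketched end to end on an in-memory file (the key distribution here is made up, and the chunk size is shrunk to 4 rows):

```python
import io
import pandas as pd

# A tiny stand-in for ex6.csv: a 'key' column of 10 single letters
text = "key\n" + "\n".join("ababcabaca")

tot = pd.Series([], dtype='float64')
for piece in pd.read_csv(io.StringIO(text), chunksize=4):
    # Accumulate per-chunk counts; fill_value=0 handles keys
    # not yet present in the running total
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot.to_dict())  # {'a': 5.0, 'b': 3.0, 'c': 2.0}
```

Only one chunk is ever in memory at a time, which is the whole point when the real file has millions of rows.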

Writing Data Out to Text Format

Data can also be exported to a delimited format:

data=pd.read_csv('C:\\pytest\\ch06\\ex5.csv')
data
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

DataFrame's to_csv method writes the data out to a comma-separated file:

data.to_csv('C:\\pytest\\ch06\\out.csv')
!type C:\\pytest\\ch06\\out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
import sys
data.to_csv(sys.stdout,sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output; na_rep denotes them with some other sentinel value instead:

data.to_csv(sys.stdout,na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

Write only a subset of the columns, in an order of your choosing (the keyword is columns; the older cols name, silently ignored in the session above, was later removed):

data.to_csv(sys.stdout,index=False,columns=['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
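A sketch of the column-subset write against an in-memory buffer, with the ex5.csv data rebuilt by hand:

```python
import io
import pandas as pd

data = pd.DataFrame({'something': ['one', 'two', 'three'],
                     'a': [1, 5, 9], 'b': [2, 6, 10],
                     'c': [3.0, None, 11.0], 'd': [4, 8, 12],
                     'message': [None, 'world', 'foo']})

buf = io.StringIO()
# 'columns' selects and orders the output columns
data.to_csv(buf, index=False, columns=['a', 'b', 'c'])
print(buf.getvalue())
```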

Series also has a to_csv method:

dates=pd.date_range('1/1/2000',periods=7)
ts=Series(np.arange(7),index=dates)
ts.to_csv('C:\\pytest\\ch06\\tseries.csv')
!type C:\\pytest\\ch06\\tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
Series.from_csv('C:\\pytest\\ch06\\tseries.csv',parse_dates=True)
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64
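Series.from_csv was deprecated and eventually removed from pandas; the same round trip now goes through pd.read_csv. A sketch using an in-memory buffer instead of the C:\pytest path:

```python
import io
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=3)
ts = pd.Series(np.arange(3), index=dates)

buf = io.StringIO()
ts.to_csv(buf, header=False)  # date,value rows, as in the transcript

# index_col=0 plus parse_dates restores the DatetimeIndex;
# taking the first column turns the one-column frame back into a Series
back = pd.read_csv(io.StringIO(buf.getvalue()), header=None,
                   index_col=0, parse_dates=True).iloc[:, 0]
print(back.tolist())  # [0, 1, 2]
```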

Handling Delimited Formats Manually

Most tabular data can be loaded with pandas.read_table, but sometimes manual processing is needed:

!type C:\\pytest\\ch06\\ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"

For any file with a single-character delimiter, Python's built-in csv module can be used directly:

import csv
f=open('C:\\pytest\\ch06\\ex7.csv')
reader=csv.reader(f)
reader
<_csv.reader object at 0x...>
for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
lines=list(csv.reader(open('C:\\pytest\\ch06\\ex7.csv')))
header,values=lines[0],lines[1:]
data_dict={h: v for h,v in zip(header,zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV files come in many flavors; defining a new format (special delimiter, string-quoting convention, line terminator, and so on) is just a matter of defining a subclass of csv.Dialect:

class my_dialect(csv.Dialect):
    lineterminator='\n'
    delimiter=';'
    quotechar='"'
reader=csv.reader(f,delimiter='|')
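Note that csv.Dialect subclasses should also define quoting (Python 3 rejects a dialect without it). A runnable sketch that both writes and reads with the dialect, using an in-memory buffer:

```python
import csv
import io

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # required for the dialect to validate

buf = io.StringIO()
writer = csv.writer(buf, dialect=my_dialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2', '3'))
print(buf.getvalue())  # one;two;three\n1;2;3\n

# The same dialect drives the reader
rows = list(csv.reader(io.StringIO(buf.getvalue()), dialect=my_dialect))
```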

JSON Data

JSON is a much more flexible data format than tabular text forms like CSV:

obj="""{"name":"Wes",
        "place_lived":["United States","Spain","Germany"],
        "pet":null,
        "siblings":[{"name":"Scott","age":25,"pet":"Zuko"},{"name":"Katie","age":33,"pet":"Cisco"}]}"""
json.loads converts a JSON string to Python form:
import json
result=json.loads(obj)
result
{'name': 'Wes',
 'pet': None,
 'place_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
  {'age': 33, 'name': 'Katie', 'pet': 'Cisco'}]}

json.dumps, on the other hand, converts a Python object back to JSON:

asjson=json.dumps(result)
To convert a JSON object (or list of objects) to a DataFrame or some other structure convenient for analysis, pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:
siblings=DataFrame(result['siblings'],columns=['name','age'])
siblings
name age
0 Scott 25
1 Katie 33
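The JSON round trip, put together as one self-contained sketch (the sibling data mirrors the example above):

```python
import json
import pandas as pd

obj = """{"name": "Wes",
          "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
                       {"name": "Katie", "age": 33, "pet": "Cisco"}]}"""

result = json.loads(obj)     # JSON text -> Python dict
asjson = json.dumps(result)  # and back to a JSON string

# Pass the list of sibling dicts to the DataFrame constructor,
# keeping only the fields of interest
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
print(siblings['age'].tolist())  # [25, 33]
```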

XML and HTML: Web Scraping

Python has many libraries for reading and writing data in HTML and XML formats; lxml is one of them, and it can parse large files efficiently and reliably. Find the URL you want, open it with urllib, then parse the resulting data stream with lxml:

from lxml.html import parse
from urllib.request import urlopen
parsed=parse(urlopen('https://finance.yahoo.com/q/op?s=AAPL+Options'))
doc=parsed.getroot()
Through this object you can get all HTML tags of a particular type, using the document root's findall method along with an XPath:
links=doc.findall('.//a')
links[15:20]
[<Element a at 0x...>,
 <Element a at 0x...>,
 <Element a at 0x...>,
 <Element a at 0x...>,
 <Element a at 0x...>]

To get the URL and link text, use each element's get method (for the href) and text_content method (for the display text):

lnk=links[28]
lnk
<Element a at 0x...>
lnk.get('href')
'https://itunes.apple.com/us/podcast/yahoo-finance-presents/id1241417570?mt=2'
lnk.text_content()
"LISTENMajor League Soccer: The bear and bull caseYahoo Finance's Jen Rogers speaks to Dan Roberts and Kevin Chupka about the business of MLS"

The following list comprehension retrieves all of the URLs in the document:

urls=[lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:]
['https://smallbusiness.yahoo.com',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US&id=SLN2310',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US',
 'https://yahoo.uservoice.com/forums/382977',
 'http://info.yahoo.com/privacy/us/yahoo/',
 'http://info.yahoo.com/relevantads/',
 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
 'https://twitter.com/YahooFinance',
 'https://facebook.com/yahoofinance',
 'http://yahoofinance.tumblr.com']
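The Yahoo page above changes constantly and the snippet needs lxml plus network access; as a hedged, self-contained alternative, the same collect-every-href idea can be sketched with the standard library's html.parser on a static snippet (the example.com URLs are made up):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, like doc.findall('.//a')."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:   # skip anchors without an href
                self.urls.append(href)

html = """<html><body>
<a href="https://example.com/one">one</a>
<a href="https://example.com/two">two</a>
<a name="anchor-without-href">skip me</a>
</body></html>"""

collector = LinkCollector()
collector.feed(html)
print(collector.urls)  # ['https://example.com/one', 'https://example.com/two']
```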

Binary Data Formats

One of the easiest ways to store data in binary format is Python's built-in pickle serialization. Conveniently, pandas objects all have a save method that writes the data to disk as a pickle (renamed to to_pickle in later pandas versions):

frame=pd.read_csv('C:\\pytest\\ch06\\ex1.csv')
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
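A sketch of the pickle round trip with the current pandas names (save/load became to_pickle/read_pickle), using a temporary file instead of the C:\pytest paths and a hand-built stand-in frame:

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})

# to_pickle/read_pickle replaced the older save/load methods
path = os.path.join(tempfile.mkdtemp(), 'frame_pickle')
frame.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.equals(frame))  # True
```

Pickle is best suited to short-term storage: the format is not guaranteed to stay stable across library versions.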

Using HDF5 Format

pandas has a minimal dict-like HDFStore class, which stores pandas objects via PyTables:

store=pd.HDFStore('mydata.h5')
store['obj1']=frame
store['obj1_col']=frame['a']
store
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
store['obj1']
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Reading Microsoft Excel Files

Requires the xlrd and openpyxl packages.

Interacting with HTML and Web APIs

import requests
url = 'https://api.github.com/repos/pydata/pandas/milestones/28/labels'
resp = requests.get(url)
data = resp.json()
data[:5]

resp.json() parses the response body into a list of dicts, one per label, each with fields such as url, name, and color. Passing the list to the DataFrame constructor yields one row per label:

issue_labels = DataFrame(data)
issue_labels

Using Databases

Using the embedded SQLite database:

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]
cursor.description
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))
DataFrame(rows, columns=zip(*cursor.description)[0])
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input> in <module>()
----> 1 DataFrame(rows, columns=zip(*cursor.description)[0])


TypeError: 'zip' object is not subscriptable
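The fix: in Python 3, zip returns an iterator, which cannot be indexed; wrap it in list() first. A self-contained rerun of the example:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test '
            '(a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)')
con.executemany('INSERT INTO test VALUES(?, ?, ?, ?)',
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3)])
cursor = con.execute('select * from test')
rows = cursor.fetchall()

# cursor.description is a sequence of 7-tuples whose first element is
# the column name; zip(*...) must be materialized with list() in Python 3
columns = list(zip(*cursor.description))[0]
df = pd.DataFrame(rows, columns=columns)
print(list(df.columns))  # ['a', 'b', 'c', 'd']
```

An equivalent one-liner is `[col[0] for col in cursor.description]`.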
import pandas.io.sql as sql
sql.read_sql('select * from test', con)
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5

Storing and Loading Data in MongoDB
