Pandas Methods
1. Reading Files
pandas provides a number of functions for reading tabular data into a DataFrame; a few of them are shown below. read_csv and read_table are the ones used most often:
import pandas as pd
import numpy as np
df = pd.read_csv('../examples/ex1.csv')
df
|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |
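read_table can read the same file if the delimiter is given explicitly; a small sketch, assuming the same ex1.csv as above:
pd.read_table('../examples/ex1.csv', sep=',')   # same result as read_csv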
pd.read_csv('../examples/ex2.csv', header=None)
|   | 0 | 1  | 2  | 3  | 4     |
|---|---|----|----|----|-------|
| 0 | 1 | 2  | 3  | 4  | hello |
| 1 | 5 | 6  | 7  | 8  | world |
| 2 | 9 | 10 | 11 | 12 | foo   |
pd.read_csv('../examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |
To build a hierarchical index from multiple columns, pass a list of the column names:
!type "csv_mindex.csv"
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv('csv_mindex.csv',
                     index_col=['key1', 'key2'])
parsed
| key1 | key2 | value1 | value2 |
|------|------|--------|--------|
| one  | a    | 1      | 2      |
|      | b    | 3      | 4      |
|      | c    | 5      | 6      |
|      | d    | 7      | 8      |
| two  | a    | 9      | 10     |
|      | b    | 11     | 12     |
|      | c    | 13     | 14     |
|      | d    | 15     | 16     |
pd.read_csv('../examples/ex6.csv', nrows=5)
|   | one       | two       | three     | four      | key |
|---|-----------|-----------|-----------|-----------|-----|
| 0 | 0.467976  | -0.038649 | -0.295344 | -1.824726 | L   |
| 1 | -0.358893 | 1.404453  | 0.704965  | -0.200638 | B   |
| 2 | -0.501840 | 0.659254  | -0.421691 | -0.057688 | G   |
| 3 | 0.204886  | 1.074134  | 1.388361  | -0.982404 | R   |
| 4 | 0.354628  | -0.133116 | 0.283763  | -0.837063 | Q   |
chunker = pd.read_csv('../examples/ex6.csv', chunksize=1000)
chunker.get_chunk(10)
|   | one       | two       | three     | four      | key |
|---|-----------|-----------|-----------|-----------|-----|
| 0 | 0.467976  | -0.038649 | -0.295344 | -1.824726 | L   |
| 1 | -0.358893 | 1.404453  | 0.704965  | -0.200638 | B   |
| 2 | -0.501840 | 0.659254  | -0.421691 | -0.057688 | G   |
| 3 | 0.204886  | 1.074134  | 1.388361  | -0.982404 | R   |
| 4 | 0.354628  | -0.133116 | 0.283763  | -0.837063 | Q   |
| 5 | 1.817480  | 0.742273  | 0.419395  | -2.251035 | Q   |
| 6 | -0.776764 | 0.935518  | -0.332872 | -1.875641 | U   |
| 7 | -0.913135 | 1.530624  | -0.572657 | 0.477252  | K   |
| 8 | 0.358480  | -0.497572 | -0.367016 | 0.507702  | S   |
| 9 | -1.740877 | -1.160417 | -1.637830 | 2.172201  | G   |
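Besides get_chunk, the reader returned when chunksize is given can also be iterated, processing the file piece by piece. A minimal sketch, assuming ex6.csv has the key column shown above:
chunker = pd.read_csv('../examples/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype=float)
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)   # accumulate counts chunk by chunk
tot = tot.sort_values(ascending=False)
tot[:5]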
result = pd.read_table('../examples/ex3.txt', sep='\s+')
result
|     | A         | B         | C         |
|-----|-----------|-----------|-----------|
| aaa | -0.264438 | -1.026059 | -0.619500 |
| bbb | 0.927272  | 0.302904  | -0.032399 |
| ccc | -0.264273 | -0.386314 | -0.217601 |
| ddd | -0.871858 | -0.348382 | 1.100491  |
frame = pd.read_excel('../examples/ex1.xlsx', 'Sheet1')
frame
|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |
2. Saving Files
data = pd.read_csv('../examples/ex5.csv')
data
|   | something | a | b  | c    | d  | message |
|---|-----------|---|----|------|----|---------|
| 0 | one       | 1 | 2  | 3.0  | 4  | NaN     |
| 1 | two       | 5 | 6  | NaN  | 8  | world   |
| 2 | three     | 9 | 10 | 11.0 | 12 | foo     |
frame.to_excel('examples.xlsx')
data.to_csv('../examples/out.csv')
Other delimiters can be used as well (writing to sys.stdout prints the text directly, which makes it easy to check the result):
import sys
data.to_csv(sys.stdout, sep='|', na_rep='NULL')
|something|a|b|c|d|message
0|one|1|2|3.0|4|NULL
1|two|5|6|NULL|8|world
2|three|9|10|11.0|12|foo
If nothing else is specified, the row and column labels are both written out. They can also be suppressed:
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
3. JSON Data
import json
path = '../datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]
records[0]
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
'al': 'en-US,en;q=0.8',
'c': 'US',
'cy': 'Danvers',
'g': 'A6qOVH',
'gr': 'MA',
'h': 'wfLQtf',
'hc': 1331822918,
'hh': '1.usa.gov',
'l': 'orofrog',
'll': [42.576698, -70.954903],
'nk': 1,
'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
't': 1331923247,
'tz': 'America/New_York',
'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
records[0]['tz']
'America/New_York'
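json.dumps goes the other way, turning a Python object back into a JSON string; a quick check on the first record:
asjson = json.dumps(records[0])   # back to a JSON string
asjson[:60]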
pandas.read_json automatically converts JSON data into a Series or DataFrame:
data = pd.read_json('../examples/example.json')
data
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
orient='records' means the output is structured as a list of column -> value mappings:
records : list like [{column -> value}, ... , {column -> value}]
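A small round-trip sketch: the records-oriented JSON string can be read back into the same DataFrame (StringIO makes explicit that a string, not a path, is being passed):
from io import StringIO
pd.read_json(StringIO(data.to_json(orient='records')), orient='records')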
4. XML and HTML: Web Scraping
Python has many packages for reading and writing HTML and XML, for example lxml, Beautiful Soup, and html5lib. lxml is comparatively fast, while the others are better at handling messy HTML and XML files.
pandas has a built-in function, read_html, which uses packages such as lxml and Beautiful Soup to automatically parse tables in HTML into DataFrames. These packages must be installed before read_html can be used:
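If the parsers are not already available, they can usually be installed with pip (the usual PyPI package names are shown; the exact setup may differ per environment):
!pip install lxml beautifulsoup4 html5lib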
tables = pd.read_html('../examples/fdic_failed_bank_list.html')
len(tables)
1
failures = tables[0]
failures.head()
|   | Bank Name                    | City            | ST | CERT  | Acquiring Institution               | Closing Date       | Updated Date      |
|---|------------------------------|-----------------|----|-------|-------------------------------------|--------------------|-------------------|
| 0 | Allied Bank                  | Mulberry        | AR | 91    | Today’s Bank                        | September 23, 2016 | November 17, 2016 |
| 1 | The Woodbury Banking Company | Woodbury        | GA | 11297 | United Bank                         | August 19, 2016    | November 17, 2016 |
| 2 | First CornerStone Bank       | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016        | September 6, 2016 |
| 3 | Trust Company Bank           | Memphis         | TN | 9956  | The Bank of Fayette County          | April 29, 2016     | September 6, 2016 |
| 4 | North Milwaukee State Bank   | Milwaukee       | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016     | June 16, 2016     |
Here we do a little data cleaning and analysis, for example computing the number of bank failures per year:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
2010 157
2009 140
2011 92
2012 51
2008 25
2013 24
2014 18
2002 11
2015 8
2016 5
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, dtype: int64
XML (eXtensible Markup Language) is another common data format, supporting hierarchical, nested data.
Using lxml.objectify we can parse the file, and getroot gives us a reference to the root node of the XML file:
from lxml import objectify
path = '../datasets/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
root.INDICATOR returns a generator that yields one XML element at a time. For each record we build a dict mapping tag names (such as YTD_ACTUAL) to data values:
perf = pd.DataFrame(data)
perf.head()
|   | AGENCY_NAME | CATEGORY | DESCRIPTION | FREQUENCY | INDICATOR_NAME | INDICATOR_UNIT | MONTHLY_ACTUAL | MONTHLY_TARGET | PERIOD_MONTH | PERIOD_YEAR | YTD_ACTUAL | YTD_TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei… | M | On-Time Performance (West of Hudson) | % | 96.9 | 95 | 1 | 2008 | 96.9 | 95 |
| 1 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei… | M | On-Time Performance (West of Hudson) | % | 95 | 95 | 2 | 2008 | 96 | 95 |
| 2 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei… | M | On-Time Performance (West of Hudson) | % | 96.9 | 95 | 3 | 2008 | 96.3 | 95 |
| 3 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei… | M | On-Time Performance (West of Hudson) | % | 98.3 | 95 | 4 | 2008 | 96.8 | 95 |
| 4 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei… | M | On-Time Performance (West of Hudson) | % | 95.8 | 95 | 5 | 2008 | 96.6 | 95 |
NumPy Methods
arr = np.arange(10)
np.save('example', arr)                                      # writes example.npy
np.load('example.npy')
np.savez('example_archive.npz', a=arr, b=arr)                # several arrays in one uncompressed archive
np.savez_compressed('example_compressed.npz', a=arr, b=arr)  # same, but compressed
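Loading the .npz archive written above returns a dict-like object whose arrays are loaded lazily by key (a short sketch continuing the code above):
arch = np.load('example_archive.npz')
arch['b']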
Plain Python Methods
!type "ex1.csv"
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
filename = '../datasets/bitly_usagov/example.txt'
f = open(filename)
f.readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
| Mode | Description |
|------|-------------|
| r    | Read-only mode |
| w    | Write-only mode; creates a new file (erasing any data in a file of the same name) |
| x    | Write-only mode; creates a new file, but fails if the path already exists |
| a    | Append to an existing file (creates the file if it does not exist) |
| r+   | Read and write |
| b    | Add to a mode for binary files ('rb' or 'wb') |
| t    | Text mode (automatically decoding bytes to Unicode); the default if not otherwise specified. Append t to other modes ('rt' or 'xt') |
f.close()
with open(filename) as f:
    lines = [x.rstrip() for x in f]
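A quick sketch contrasting the 'rt' and 'rb' modes from the table above, reusing the same example file (the printed prefixes follow from the first line shown earlier):
with open(filename, 'rt') as f:
    print(f.read(10))    # str, e.g. '{ "a": "Mo'
with open(filename, 'rb') as f:
    print(f.read(10))    # bytes, e.g. b'{ "a": "Mo'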
Binary Data Formats
The simplest way to store data in binary format (also called serialization) is to use Python's built-in pickle. Every pandas object has a to_pickle method that writes the data in pickle format:
frame = pd.read_csv('../examples/ex1.csv')
frame
|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |
frame.to_pickle('frame_pickle')
pd.read_pickle('frame_pickle')
|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |
Note: pickle is only recommended for short-term storage, because the format is not guaranteed to stay stable over time; a file pickled today might not load after a library upgrade.
Python supports two other binary data formats: HDF5 and MessagePack. Some other available storage formats are bcolz and Feather.
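A minimal sketch of the HDF5 route, assuming the PyTables package is installed (the file and key names here are illustrative, not from the examples above):
store_frame = pd.DataFrame({'a': np.random.randn(100)})
store_frame.to_hdf('mydata.h5', key='obj1', format='table')   # write in the queryable 'table' format
pd.read_hdf('mydata.h5', 'obj1', where=['index < 5'])         # read back only a slice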
Interacting with Web APIs
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
data = resp.json()
data[0]['title']
'BUG: Fix isna cannot handle ambiguous typed list'
data[0]
{'assignee': None,
'assignees': [],
'author_association': 'CONTRIBUTOR',
'body': '- [x] closes #20675\r\n- [x] tests added / passed\r\n- [x] passes `git diff upstream/master -u -- "*.py" | flake8 --diff`\r\n- [ ] whatsnew entry\r\n',
'closed_at': None,
'comments': 1,
'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971/comments',
'created_at': '2018-05-07T00:34:35Z',
'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971/events',
'html_url': 'https://github.com/pandas-dev/pandas/pull/20971',
'id': 320643434,
'labels': [],
'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971/labels{/name}',
'locked': False,
'milestone': None,
'number': 20971,
'pull_request': {'diff_url': 'https://github.com/pandas-dev/pandas/pull/20971.diff',
'html_url': 'https://github.com/pandas-dev/pandas/pull/20971',
'patch_url': 'https://github.com/pandas-dev/pandas/pull/20971.patch',
'url': 'https://api.github.com/repos/pandas-dev/pandas/pulls/20971'},
'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
'state': 'open',
'title': 'BUG: Fix isna cannot handle ambiguous typed list',
'updated_at': '2018-05-07T07:33:15Z',
'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971',
'user': {'avatar_url': 'https://avatars0.githubusercontent.com/u/1455030?v=4',
'events_url': 'https://api.github.com/users/Licht-T/events{/privacy}',
'followers_url': 'https://api.github.com/users/Licht-T/followers',
'following_url': 'https://api.github.com/users/Licht-T/following{/other_user}',
'gists_url': 'https://api.github.com/users/Licht-T/gists{/gist_id}',
'gravatar_id': '',
'html_url': 'https://github.com/Licht-T',
'id': 1455030,
'login': 'Licht-T',
'organizations_url': 'https://api.github.com/users/Licht-T/orgs',
'received_events_url': 'https://api.github.com/users/Licht-T/received_events',
'repos_url': 'https://api.github.com/users/Licht-T/repos',
'site_admin': False,
'starred_url': 'https://api.github.com/users/Licht-T/starred{/owner}{/repo}',
'subscriptions_url': 'https://api.github.com/users/Licht-T/subscriptions',
'type': 'User',
'url': 'https://api.github.com/users/Licht-T'}}
Each element in data is a dict containing all the information found on a GitHub issue page. We can pass data to DataFrame and extract the fields of interest:
issues = pd.DataFrame(data, columns=['number', 'title',
                                     'html_url', 'state'])
issues
|    | number | title | html_url | state |
|----|--------|-------|----------|-------|
| 0  | 20971 | BUG: Fix isna cannot handle ambiguous typed list | https://github.com/pandas-dev/pandas/pull/20971 | open |
| 1  | 20970 | performance when “combining” categories | https://github.com/pandas-dev/pandas/issues/20970 | open |
| 2  | 20969 | Getting a … in my CSV when using to_csv() | https://github.com/pandas-dev/pandas/issues/20969 | open |
| 3  | 20968 | Sharey keyword for boxplot | https://github.com/pandas-dev/pandas/pull/20968 | open |
| 4  | 20967 | Column can be extended beyond DataFrame’s length | https://github.com/pandas-dev/pandas/issues/20967 | open |
| 5  | 20966 | BUG: Fix wrong khash method definition | https://github.com/pandas-dev/pandas/pull/20966 | open |
| 6  | 20965 | BUG: Fix combine_first converts other columns … | https://github.com/pandas-dev/pandas/pull/20965 | open |
| 7  | 20964 | BUG: Index with datetime64[ns, tz] dtype does … | https://github.com/pandas-dev/pandas/issues/20964 | open |
| 8  | 20962 | df.rank() has different behavior for python2 a… | https://github.com/pandas-dev/pandas/issues/20962 | open |
| 9  | 20960 | DOC: add reshaping visuals to the docs (Reshap… | https://github.com/pandas-dev/pandas/pull/20960 | open |
| 10 | 20959 | BUG in .groupby.apply when applying a function… | https://github.com/pandas-dev/pandas/pull/20959 | open |
| 11 | 20958 | Using apply on a grouper works only if done af… | https://github.com/pandas-dev/pandas/issues/20958 | open |
| 12 | 20956 | ENH: Return DatetimeIndex or TimedeltaIndex bi… | https://github.com/pandas-dev/pandas/pull/20956 | open |
| 13 | 20955 | Building documentation fails with `ImportError… | https://github.com/pandas-dev/pandas/issues/20955 | open |
| 14 | 20954 | Correlation inconsistencies between Series and… | https://github.com/pandas-dev/pandas/issues/20954 | open |
| 15 | 20951 | Addressing multiindex raises TypeError if indi… | https://github.com/pandas-dev/pandas/issues/20951 | open |
| 16 | 20949 | pandas.core.groupby.GroupBy.apply fails | https://github.com/pandas-dev/pandas/issues/20949 | open |
| 17 | 20948 | Indexing DataFrame with DateOffset is nearly i… | https://github.com/pandas-dev/pandas/issues/20948 | open |
| 18 | 20947 | Allow drop bins when using the cut function | https://github.com/pandas-dev/pandas/pull/20947 | open |
| 19 | 20945 | Data is mismatched with labels after stack wit… | https://github.com/pandas-dev/pandas/issues/20945 | open |
| 20 | 20944 | Drop rows based on condition | https://github.com/pandas-dev/pandas/issues/20944 | open |
| 21 | 20943 | pd.read_sql does not handle queries that retur… | https://github.com/pandas-dev/pandas/issues/20943 | open |
| 22 | 20940 | unclear error with read_sas | https://github.com/pandas-dev/pandas/issues/20940 | open |
| 23 | 20939 | BUG: cant modify df with duplicate index (#17105) | https://github.com/pandas-dev/pandas/pull/20939 | open |
| 24 | 20937 | BUG: Series(DTI/TDI) loses the frequency infor… | https://github.com/pandas-dev/pandas/issues/20937 | open |
| 25 | 20936 | __new__() missing 1 required positional argume… | https://github.com/pandas-dev/pandas/issues/20936 | open |
| 26 | 20933 | Fixes Update docs on reserved attributes #20878 | https://github.com/pandas-dev/pandas/pull/20933 | open |
| 27 | 20932 | API: expose `axis` keyword in pandas.core.algo… | https://github.com/pandas-dev/pandas/issues/20932 | open |
| 28 | 20930 | DOC: Index.get_loc Cannot Accept List-like Tol… | https://github.com/pandas-dev/pandas/issues/20930 | open |
| 29 | 20928 | Added script to fetch wheels [ci skip] | https://github.com/pandas-dev/pandas/pull/20928 | open |
Interacting with Databases
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
|   | a           | b          | c    | d |
|---|-------------|------------|------|---|
| 0 | Atlanta     | Georgia    | 1.25 | 6 |
| 1 | Tallahassee | Florida    | 2.60 | 3 |
| 2 | Sacramento  | California | 1.70 | 5 |
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test', db)
|   | a           | b          | c    | d |
|---|-------------|------------|------|---|
| 0 | Atlanta     | Georgia    | 1.25 | 6 |
| 1 | Tallahassee | Florida    | 2.60 | 3 |
| 2 | Sacramento  | California | 1.70 | 5 |
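pandas can also write a DataFrame back through the same SQLAlchemy engine; a hedged sketch (the table name test_copy is illustrative):
result = pd.read_sql('select * from test', db)
result.to_sql('test_copy', db, index=False, if_exists='replace')   # write the DataFrame to a new table
pd.read_sql('select * from test_copy', db)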