Notes on Python for Data Analysis: Reading and Writing Data

Pandas Methods

1. Reading Files

pandas has a number of functions for reading tabular data into a DataFrame; a few are shown below. read_csv and read_table are the most frequently used:

import pandas as pd
import numpy as np

# the read_csv method
df = pd.read_csv('../examples/ex1.csv')
df
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# read_csv on a file without a header row
pd.read_csv('../examples/ex2.csv', header=None)
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# read_csv with custom column names
pd.read_csv('../examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
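
read_csv also has many options for dealing with messier files. The sketch below is illustrative only: the file name messy.csv, the skipped rows, and the sentinel strings are hypothetical and not part of the original examples.

# skip selected physical lines and treat certain strings as missing values (hypothetical file)
pd.read_csv('messy.csv',
            skiprows=[0, 2],              # ignore the 1st and 3rd lines of the file
            na_values=['NULL', 'NA'])     # parse these strings as NaN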

To build a hierarchical index from multiple columns, pass index_col a list of column names:

# inspect the raw file structure with the type command
!type "csv_mindex.csv"
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv('csv_mindex.csv',
                     index_col=['key1', 'key2'])
parsed
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16
# read only the first five rows
pd.read_csv('../examples/ex6.csv', nrows=5)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
# to read a file in pieces, specify a chunksize (number of rows per piece)
chunker = pd.read_csv('../examples/ex6.csv', chunksize=1000)

# the get_chunk method returns a piece of any requested size
chunker.get_chunk(10)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
5 1.817480 0.742273 0.419395 -2.251035 Q
6 -0.776764 0.935518 -0.332872 -1.875641 U
7 -0.913135 1.530624 -0.572657 0.477252 K
8 0.358480 -0.497572 -0.367016 0.507702 S
9 -1.740877 -1.160417 -1.637830 2.172201 G
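The chunker object returned with chunksize can also be iterated over directly; the following is a small sketch (assuming the same ex6.csv and its key column) that aggregates value counts across the pieces:

# iterate over the TextFileReader, accumulating counts of the 'key' column piece by piece
chunker = pd.read_csv('../examples/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)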
# the read_table method
result = pd.read_table('../examples/ex3.txt', sep=r'\s+')  # the regex \s+ matches any whitespace
result
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491
# the read_excel method
frame = pd.read_excel('../examples/ex1.xlsx', 'Sheet1')
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
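
When several sheets must be read from the same workbook, it can be faster to create a pandas.ExcelFile object once and pass it to read_excel; a minimal sketch using the same ex1.xlsx:

# parse the workbook once, then read individual sheets from it
xlsx = pd.ExcelFile('../examples/ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')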

2. Writing Files

data = pd.read_csv('../examples/ex5.csv')
data
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
# the to_excel method
frame.to_excel('examples.xlsx')
# the to_csv method
data.to_csv('../examples/out.csv')
Other delimiters can be used as well (writing to sys.stdout prints the text directly, which makes the effect easy to inspect):
import sys
# to_csv uses ',' as the default delimiter
# here the delimiter is '|' and missing values are written out as 'NULL'
data.to_csv(sys.stdout, sep='|',  na_rep='NULL')
|something|a|b|c|d|message
0|one|1|2|3.0|4|NULL
1|two|5|6|NULL|8|world
2|three|9|10|11.0|12|foo

If nothing else is specified, both the row and column labels are written. Both can also be disabled:

data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
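
to_csv can also write just a subset of the columns, in an order of your choosing; a short sketch with the same data:

# write only columns a, b, c, in that order
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0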

3. JSON Data

# using the built-in json module
import json
path = '../datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]
records[0]
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'al': 'en-US,en;q=0.8',
 'c': 'US',
 'cy': 'Danvers',
 'g': 'A6qOVH',
 'gr': 'MA',
 'h': 'wfLQtf',
 'hc': 1331822918,
 'hh': '1.usa.gov',
 'l': 'orofrog',
 'll': [42.576698, -70.954903],
 'nk': 1,
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 't': 1331923247,
 'tz': 'America/New_York',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
records[0]['tz']
'America/New_York'
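
Because each line of example.txt is itself a JSON object, the same file can also be loaded directly into a DataFrame; a minimal sketch using read_json with lines=True (equivalent to the manual json.loads loop above):

# one JSON object per line -> one DataFrame row; show a few columns
frame = pd.read_json(path, lines=True)
frame[['tz', 'c', 'cy']].head()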

pandas.read_json can automatically convert JSON data into a Series or DataFrame:

data = pd.read_json('../examples/example.json')
data
a b c
0 1 2 3
1 4 5 6
2 7 8 9
# to export from pandas to JSON, use the to_json method
print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]

orient='records' means the output is structured as a list of column -> value mappings:

records : list like [{column -> value}, ... , {column -> value}]

4. XML and HTML: Web Scraping

Python has many libraries for reading and writing data in HTML and XML formats, for example lxml, Beautiful Soup, and html5lib. lxml is comparatively fast, while the others are better at handling messy HTML and XML files.

pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful Soup to automatically parse the tables in an HTML file into DataFrame objects. These packages must be installed before read_html can be used:

tables = pd.read_html('../examples/fdic_failed_bank_list.html')
len(tables)
1
failures = tables[0]
failures.head()
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
0 Allied Bank Mulberry AR 91 Today’s Bank September 23, 2016 November 17, 2016
1 The Woodbury Banking Company Woodbury GA 11297 United Bank August 19, 2016 November 17, 2016
2 First CornerStone Bank King of Prussia PA 35312 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016
3 Trust Company Bank Memphis TN 9956 The Bank of Fayette County April 29, 2016 September 6, 2016
4 North Milwaukee State Bank Milwaukee WI 20364 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016

Here we can do some data cleaning and analysis, such as computing the number of bank failures by year:

close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64

XML (eXtensible Markup Language) is another common data format that supports hierarchical, nested data.

Using lxml.objectify we can parse the file and, via getroot, get a reference to the root node of the XML document:

from lxml import objectify

path = '../datasets/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

root.INDICATOR returns a generator yielding each INDICATOR XML element. For each record we populate a dict of tag names (like YTD_ACTUAL) to data values:

# build a DataFrame from the list of dicts
perf = pd.DataFrame(data)
perf.head()
AGENCY_NAME CATEGORY DESCRIPTION FREQUENCY INDICATOR_NAME INDICATOR_UNIT MONTHLY_ACTUAL MONTHLY_TARGET PERIOD_MONTH PERIOD_YEAR YTD_ACTUAL YTD_TARGET
0 Metro-North Railroad Service Indicators Percent of commuter trains that arrive at thei… M On-Time Performance (West of Hudson) % 96.9 95 1 2008 96.9 95
1 Metro-North Railroad Service Indicators Percent of commuter trains that arrive at thei… M On-Time Performance (West of Hudson) % 95 95 2 2008 96 95
2 Metro-North Railroad Service Indicators Percent of commuter trains that arrive at thei… M On-Time Performance (West of Hudson) % 96.9 95 3 2008 96.3 95
3 Metro-North Railroad Service Indicators Percent of commuter trains that arrive at thei… M On-Time Performance (West of Hudson) % 98.3 95 4 2008 96.8 95
4 Metro-North Railroad Service Indicators Percent of commuter trains that arrive at thei… M On-Time Performance (West of Hudson) % 95.8 95 5 2008 96.6 95
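
lxml.objectify can also parse XML given as a string; a small sketch that reads a single link tag and then pulls out its attribute and text:

from io import StringIO

tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
root.get('href')   # 'http://www.google.com'
root.text          # 'Google'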

NumPy Methods

# np.save stores a single array on disk in binary .npy format
arr = np.arange(10)
np.save('example.npy', arr)
# np.load reads it back (np.load works on binary .npy/.npz files, not plain text)
np.load('example.npy')
# np.savez saves several arrays into a single uncompressed .npz archive
np.savez('example_archive.npz', a=arr, b=arr)
# np.savez_compressed writes the same kind of archive in compressed form
np.savez_compressed('example_compressed.npz', a=arr, b=arr)
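
np.load on an .npz archive returns a dict-like object from which the individual arrays can be pulled out lazily; a short sketch continuing the example above:

# the archive behaves like a dict keyed by the names used when saving
arch = np.load('example_archive.npz')
arch['b']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])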

Plain Python Methods

# !cat "ex1.csv"   # cat is the equivalent shell command on Unix
!type "ex1.csv"
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
# "../" means go up one directory level
# "./" means the current directory
filename = '../datasets/bitly_usagov/example.txt'
# read the file directly
# f = open(filename, 'w')  # 'w' is write mode: this would create a new example.txt in the directory, overwriting any existing file with the same name
f = open(filename)
f.readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
Mode  Description
r     Read-only mode
w     Write-only mode; creates a new file (erasing the data of any file with the same name)
x     Write-only mode; creates a new file, but fails if the path already exists
a     Append to an existing file (creating the file if it does not exist)
r+    Read and write
b     Add to mode for binary files ('rb' or 'wb')
t     Text mode (automatically decoding bytes to Unicode); the default if not specified. Add t to other modes ('rt' or 'xt')
# after opening a file, always remember to close it when finished so it stops tying up operating system resources
f.close()
# with a with statement, the file is closed automatically when the block ends
with open(filename) as f:
    lines = [x.rstrip() for x in f]
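
A minimal sketch of the write modes from the table above (tmp.txt is a hypothetical scratch file): 'w' creates or overwrites the file, and the with blocks close it automatically in both directions:

with open('tmp.txt', 'w') as handle:     # 'w' creates tmp.txt, overwriting any existing file
    handle.write('hello\nworld\n')
with open('tmp.txt') as handle:          # default mode is 'rt'
    lines = [x.rstrip() for x in handle]
lines
['hello', 'world']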

Binary Data Formats

The simplest way to store data in binary format (also known as serialization) is to use Python's built-in pickle. All pandas objects have a to_pickle method that stores the data in pickle format:

frame = pd.read_csv('../examples/ex1.csv')
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
# write in pickle format
frame.to_pickle('frame_pickle')
# read the pickle file back
pd.read_pickle('frame_pickle')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Note: pickle is recommended only for short-term storage, because the format is not guaranteed to stay stable over time; a file pickled today may no longer load after a library upgrade.

pandas has built-in support for two more binary data formats: HDF5 and MessagePack. Other usable storage formats include bcolz and Feather.
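
As a sketch of the HDF5 route (this requires the PyTables package; the file name mydata.h5 and the key 'obj' are arbitrary choices here, not from the original examples):

# write the DataFrame to an HDF5 file and read it back
frame.to_hdf('mydata.h5', key='obj', format='table')
pd.read_hdf('mydata.h5', 'obj')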

Interacting with Web APIs

import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
data = resp.json()
data[0]['title']
'BUG: Fix isna cannot handle ambiguous typed list'
data[0]
{'assignee': None,
 'assignees': [],
 'author_association': 'CONTRIBUTOR',
 'body': '- [x] closes #20675\r\n- [x] tests added / passed\r\n- [x] passes `git diff upstream/master -u -- "*.py" | flake8 --diff`\r\n- [ ] whatsnew entry\r\n',
 'closed_at': None,
 'comments': 1,
 'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971/comments',
 'created_at': '2018-05-07T00:34:35Z',
 'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971/events',
 'html_url': 'https://github.com/pandas-dev/pandas/pull/20971',
 'id': 320643434,
 'labels': [],
 'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971/labels{/name}',
 'locked': False,
 'milestone': None,
 'number': 20971,
 'pull_request': {'diff_url': 'https://github.com/pandas-dev/pandas/pull/20971.diff',
  'html_url': 'https://github.com/pandas-dev/pandas/pull/20971',
  'patch_url': 'https://github.com/pandas-dev/pandas/pull/20971.patch',
  'url': 'https://api.github.com/repos/pandas-dev/pandas/pulls/20971'},
 'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
 'state': 'open',
 'title': 'BUG: Fix isna cannot handle ambiguous typed list',
 'updated_at': '2018-05-07T07:33:15Z',
 'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/20971',
 'user': {'avatar_url': 'https://avatars0.githubusercontent.com/u/1455030?v=4',
  'events_url': 'https://api.github.com/users/Licht-T/events{/privacy}',
  'followers_url': 'https://api.github.com/users/Licht-T/followers',
  'following_url': 'https://api.github.com/users/Licht-T/following{/other_user}',
  'gists_url': 'https://api.github.com/users/Licht-T/gists{/gist_id}',
  'gravatar_id': '',
  'html_url': 'https://github.com/Licht-T',
  'id': 1455030,
  'login': 'Licht-T',
  'organizations_url': 'https://api.github.com/users/Licht-T/orgs',
  'received_events_url': 'https://api.github.com/users/Licht-T/received_events',
  'repos_url': 'https://api.github.com/users/Licht-T/repos',
  'site_admin': False,
  'starred_url': 'https://api.github.com/users/Licht-T/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/Licht-T/subscriptions',
  'type': 'User',
  'url': 'https://api.github.com/users/Licht-T'}}

Each element of data is a dict containing all of the information found on a GitHub issue page. We can pass data to DataFrame and extract the fields of interest:

issues = pd.DataFrame(data, columns=['number', 'title', 
                                    'html_url', 'state'])
issues
number title html_url state
0 20971 BUG: Fix isna cannot handle ambiguous typed list https://github.com/pandas-dev/pandas/pull/20971 open
1 20970 performance when “combining” categories https://github.com/pandas-dev/pandas/issues/20970 open
2 20969 Getting a … in my CSV when using to_csv() https://github.com/pandas-dev/pandas/issues/20969 open
3 20968 Sharey keyword for boxplot https://github.com/pandas-dev/pandas/pull/20968 open
4 20967 Column can be extended beyond DataFrame’s length https://github.com/pandas-dev/pandas/issues/20967 open
5 20966 BUG: Fix wrong khash method definition https://github.com/pandas-dev/pandas/pull/20966 open
6 20965 BUG: Fix combine_first converts other columns … https://github.com/pandas-dev/pandas/pull/20965 open
7 20964 BUG: Index with datetime64[ns, tz] dtype does … https://github.com/pandas-dev/pandas/issues/20964 open
8 20962 df.rank() has different behavior for python2 a… https://github.com/pandas-dev/pandas/issues/20962 open
9 20960 DOC: add reshaping visuals to the docs (Reshap… https://github.com/pandas-dev/pandas/pull/20960 open
10 20959 BUG in .groupby.apply when applying a function… https://github.com/pandas-dev/pandas/pull/20959 open
11 20958 Using apply on a grouper works only if done af… https://github.com/pandas-dev/pandas/issues/20958 open
12 20956 ENH: Return DatetimeIndex or TimedeltaIndex bi… https://github.com/pandas-dev/pandas/pull/20956 open
13 20955 Building documentation fails with `ImportError… https://github.com/pandas-dev/pandas/issues/20955 open
14 20954 Correlation inconsistencies between Series and… https://github.com/pandas-dev/pandas/issues/20954 open
15 20951 Addressing multiindex raises TypeError if indi… https://github.com/pandas-dev/pandas/issues/20951 open
16 20949 pandas.core.groupby.GroupBy.apply fails https://github.com/pandas-dev/pandas/issues/20949 open
17 20948 Indexing DataFrame with DateOffset is nearly i… https://github.com/pandas-dev/pandas/issues/20948 open
18 20947 Allow drop bins when using the cut function https://github.com/pandas-dev/pandas/pull/20947 open
19 20945 Data is mismatched with labels after stack wit… https://github.com/pandas-dev/pandas/issues/20945 open
20 20944 Drop rows based on condition https://github.com/pandas-dev/pandas/issues/20944 open
21 20943 pd.read_sql does not handle queries that retur… https://github.com/pandas-dev/pandas/issues/20943 open
22 20940 unclear error with read_sas https://github.com/pandas-dev/pandas/issues/20940 open
23 20939 BUG: cant modify df with duplicate index (#17105) https://github.com/pandas-dev/pandas/pull/20939 open
24 20937 BUG: Series(DTI/TDI) loses the frequency infor… https://github.com/pandas-dev/pandas/issues/20937 open
25 20936 __new__() missing 1 required positional argume… https://github.com/pandas-dev/pandas/issues/20936 open
26 20933 Fixes Update docs on reserved attributes #20878 https://github.com/pandas-dev/pandas/pull/20933 open
27 20932 API: expose `axis` keyword in pandas.core.algo… https://github.com/pandas-dev/pandas/issues/20932 open
28 20930 DOC: Index.get_loc Cannot Accept List-like Tol… https://github.com/pandas-dev/pandas/issues/20930 open
29 20928 Added script to fetch wheels [ci skip] https://github.com/pandas-dev/pandas/pull/20928 open

Interacting with Databases

import sqlite3

# create a SQLite database and a table
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()
# insert some rows of data
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
# select the data back out
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]
# convert the rows to a DataFrame
# cursor.description
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5
import sqlalchemy as sqla

# connect to the same database with SQLAlchemy
db = sqla.create_engine('sqlite:///mydata.sqlite')
# read the query result straight into a DataFrame
pd.read_sql('select * from test', db)
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5
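
Going the other direction, a DataFrame can be written back into the database with to_sql; a minimal sketch reusing the same SQLAlchemy engine with a hypothetical table name test_copy:

# write the query result back as a new table, then read it again
frame = pd.read_sql('select * from test', db)
frame.to_sql('test_copy', db, index=False)
pd.read_sql('select * from test_copy', db)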
