Many websites offer public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method is the requests package. To find the 30 most recent GitHub issues for pandas, we can make a GET HTTP request using requests:
import pandas as pd
import numpy as np
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
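Before parsing the response body, it is worth confirming that the request actually succeeded. This check is not part of the original example; it just uses the standard requests API:
resp.status_code           # 200 on success
resp.raise_for_status()    # raises requests.HTTPError for 4xx/5xx responses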
The Response object's json method returns the JSON content parsed into native Python objects (here, a list of dicts):
data = resp.json()
data[0]['title']
'Optimize data type'
data[0]
{'assignee': None,
'assignees': [],
'author_association': 'NONE',
'body': 'Hi guys, i\'m user of mysql\r\nwe have an "function" PROCEDURE ANALYSE\r\nhttps://dev.mysql.com/doc/refman/5.5/en/procedure-analyse.html\r\n\r\nit get all "dataframe" and show what\'s the best "dtype", could we do something like it in Pandas?\r\n\r\nthanks!',
'closed_at': None,
'comments': 1,
'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/comments',
'created_at': '2017-11-13T22:51:32Z',
'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/events',
'html_url': 'https://github.com/pandas-dev/pandas/issues/18272',
'id': 273606786,
'labels': [],
'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/labels{/name}',
'locked': False,
'milestone': None,
'number': 18272,
'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
'state': 'open',
'title': 'Optimize data type',
'updated_at': '2017-11-13T22:57:27Z',
'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272',
'user': {'avatar_url': 'https://avatars0.githubusercontent.com/u/2468782?v=4',
'events_url': 'https://api.github.com/users/rspadim/events{/privacy}',
'followers_url': 'https://api.github.com/users/rspadim/followers',
'following_url': 'https://api.github.com/users/rspadim/following{/other_user}',
'gists_url': 'https://api.github.com/users/rspadim/gists{/gist_id}',
'gravatar_id': '',
'html_url': 'https://github.com/rspadim',
'id': 2468782,
'login': 'rspadim',
'organizations_url': 'https://api.github.com/users/rspadim/orgs',
'received_events_url': 'https://api.github.com/users/rspadim/received_events',
'repos_url': 'https://api.github.com/users/rspadim/repos',
'site_admin': False,
'starred_url': 'https://api.github.com/users/rspadim/starred{/owner}{/repo}',
'subscriptions_url': 'https://api.github.com/users/rspadim/subscriptions',
'type': 'User',
'url': 'https://api.github.com/users/rspadim'}}
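The nested user field is itself a dict, so its values can be pulled out the same way; for example, the reporter's GitHub login, visible in the output above:
data[0]['user']['login']   # 'rspadim'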
Each element in data is a dict containing all of the information found on a GitHub issue page (except for the comments). We can pass data directly to DataFrame and extract the fields of interest:
issues = pd.DataFrame(data, columns=['number', 'title',
'labels', 'state'])
issues
|   | number | title | labels | state |
---|---|---|---|---|
0 | 18272 | Optimize data type | [] | open |
1 | 18271 | BUG: Series.rank(pct=True).max() != 1 for a la... | [] | open |
2 | 18270 | (Series|DataFrame) datetimelike ops | [] | open |
3 | 18268 | DOC: update Series.combine/DataFrame.combine d... | [] | open |
4 | 18266 | DOC: updated .combine_first doc strings | [{'url': 'https://api.github.com/repos/pandas-... | open |
5 | 18265 | Calling DataFrame.stack on an out-of-order col... | [] | open |
6 | 18264 | cleaned up imports | [{'url': 'https://api.github.com/repos/pandas-... | open |
7 | 18263 | Tslibs offsets paramd | [] | open |
8 | 18262 | DEPR: let's deprecate | [{'url': 'https://api.github.com/repos/pandas-... | open |
9 | 18258 | DEPR: deprecate (Sparse)Series.from_array | [{'url': 'https://api.github.com/repos/pandas-... | open |
10 | 18255 | ENH/PERF: Add cache='infer' to to_datetime | [{'url': 'https://api.github.com/repos/pandas-... | open |
11 | 18250 | Categorical.replace() unexpectedly returns non... | [{'url': 'https://api.github.com/repos/pandas-... | open |
12 | 18246 | pandas.MultiIndex.reorder_levels has no inplac... | [] | open |
13 | 18245 | TST: test tz-aware DatetimeIndex as separate m... | [{'url': 'https://api.github.com/repos/pandas-... | open |
14 | 18244 | RLS 0.21.1 | [{'url': 'https://api.github.com/repos/pandas-... | open |
15 | 18243 | DEPR: deprecate .ftypes, get_ftype_counts | [{'url': 'https://api.github.com/repos/pandas-... | open |
16 | 18242 | CLN: Remove days, seconds and microseconds pro... | [{'url': 'https://api.github.com/repos/pandas-... | open |
17 | 18241 | DEPS: drop 2.7 support | [{'url': 'https://api.github.com/repos/pandas-... | open |
18 | 18238 | BUG: Fix filter method so that accepts byte an... | [{'url': 'https://api.github.com/repos/pandas-... | open |
19 | 18237 | Deprecate Series.asobject, Index.asobject, ren... | [{'url': 'https://api.github.com/repos/pandas-... | open |
20 | 18236 | df.plot() very slow compared to explicit matpl... | [{'url': 'https://api.github.com/repos/pandas-... | open |
21 | 18235 | Quarter.onOffset looks fishy | [] | open |
22 | 18231 | Reduce copying of input data on Series constru... | [{'url': 'https://api.github.com/repos/pandas-... | open |
23 | 18226 | Patch __init__ to prevent passing invalid kwds | [{'url': 'https://api.github.com/repos/pandas-... | open |
24 | 18222 | DataFrame.plot() produces incorrect legend lab... | [{'url': 'https://api.github.com/repos/pandas-... | open |
25 | 18220 | DataFrame.groupy renames columns when given a ... | [] | open |
26 | 18217 | Deprecate Index.summary | [{'url': 'https://api.github.com/repos/pandas-... | open |
27 | 18216 | Pass kwargs from read_parquet() to the underly... | [{'url': 'https://api.github.com/repos/pandas-... | open |
28 | 18215 | DOC/DEPR: ensure that @deprecated functions ha... | [{'url': 'https://api.github.com/repos/pandas-... | open |
29 | 18213 | Deprecate Series.from_array ? | [{'url': 'https://api.github.com/repos/pandas-... | open |
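With a little extra work, this pattern can be wrapped in a reusable function. The sketch below is not part of the original example; get_issues is a hypothetical helper that hits the same GitHub endpoint and uses the documented state and per_page query parameters to control which issues come back:
def get_issues(repo='pandas-dev/pandas', state='open', per_page=30):
    # fetch one page of issues for a repo and keep only the columns of interest
    url = 'https://api.github.com/repos/{}/issues'.format(repo)
    resp = requests.get(url, params={'state': state, 'per_page': per_page})
    resp.raise_for_status()
    return pd.DataFrame(resp.json(),
                        columns=['number', 'title', 'labels', 'state'])
issues = get_issues(per_page=30)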
In a business setting, most data may not be stored in text or Excel files; SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, let's create a SQLite database using Python's built-in sqlite3 driver:
import sqlite3
import pandas as pd
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect('../examples/mydata.sqlite')
con.execute(query)
con.commit()
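One caveat not mentioned above: re-running this script fails, because the test table already exists in mydata.sqlite. If the setup needs to be repeatable, SQLite's CREATE TABLE IF NOT EXISTS form is one option (a sketch, not part of the original example):
query = """
CREATE TABLE IF NOT EXISTS test
(a VARCHAR(20), b VARCHAR(20),
 c REAL, d INTEGER
);"""
con.execute(query)
con.commit()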
Then, insert a few rows of data:
data = [('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
You can pass this list of tuples to the DataFrame constructor, but you also need the column names, which are contained in the cursor's description attribute:
cursor.description
(('a', None, None, None, None, None, None),
('b', None, None, None, None, None, None),
('c', None, None, None, None, None, None),
('d', None, None, None, None, None, None))
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
|   | a | b | c | d |
---|---|---|---|---|
0 | Atlanta | Georgia | 1.25 | 6 |
1 | Tallahassee | Florida | 2.60 | 3 |
2 | Sacramento | California | 1.70 | 5 |
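If you query through the DBAPI connection often, one option is to fold the boilerplate above into a small helper. query_to_dataframe is a hypothetical name, not a pandas or sqlite3 API, and the SQLAlchemy-based approach described next is generally the nicer solution:
def query_to_dataframe(con, sql):
    # run a query on the sqlite3 connection and build a DataFrame from
    # the rows plus the column names found in cursor.description
    cursor = con.execute(sql)
    return pd.DataFrame(cursor.fetchall(),
                        columns=[x[0] for x in cursor.description])
query_to_dataframe(con, 'select * from test')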
You probably don't want to repeat this munging every time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that lets you read data easily from a general SQLAlchemy connection. Here we connect to the same SQLite database with SQLAlchemy and read data from the table created before:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///../examples/mydata.sqlite')
pd.read_sql('select * from test', db)
|   | a | b | c | d |
---|---|---|---|---|
0 | Atlanta | Georgia | 1.25 | 6 |
1 | Tallahassee | Florida | 2.60 | 3 |
2 | Sacramento | California | 1.70 | 5 |
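Going in the other direction is not covered above, but a DataFrame also has a to_sql method for writing a table back through the same SQLAlchemy engine. A minimal sketch, where test_copy is a hypothetical table name:
frame = pd.read_sql('select * from test', db)
frame.to_sql('test_copy', db, index=False)   # writes the DataFrame as a new table
pd.read_sql('select * from test_copy', db)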