pandas教程:Interacting with Web APIs API和数据库的交互

文章目录

  • 6.3 Interacting with Web APIs (网络相关的API交互)
  • 6.4 Interacting with Databases(与数据库的交互)

6.3 Interacting with Web APIs (网络相关的API交互)

很多网站都有公开的API,通过JSON等格式提供数据流。有很多方法可以访问这些API,这里推荐一个易用的requests包。

找到githubpandas最新的30个issues,制作一个GET HTTP request, 通过使用requests包:

import pandas as pd
import numpy as np
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

responsejson方法能返回一个dict,包含可以解析为python objectJSON

data = resp.json()
data[0]['title']
'Optimize data type'
data[0]
{'assignee': None,
 'assignees': [],
 'author_association': 'NONE',
 'body': 'Hi guys, i\'m user of mysql\r\nwe have an "function" PROCEDURE ANALYSE\r\nhttps://dev.mysql.com/doc/refman/5.5/en/procedure-analyse.html\r\n\r\nit get all "dataframe" and show what\'s the best "dtype", could we do something like it in Pandas?\r\n\r\nthanks!',
 'closed_at': None,
 'comments': 1,
 'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/comments',
 'created_at': '2017-11-13T22:51:32Z',
 'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/events',
 'html_url': 'https://github.com/pandas-dev/pandas/issues/18272',
 'id': 273606786,
 'labels': [],
 'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/labels{/name}',
 'locked': False,
 'milestone': None,
 'number': 18272,
 'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
 'state': 'open',
 'title': 'Optimize data type',
 'updated_at': '2017-11-13T22:57:27Z',
 'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272',
 'user': {'avatar_url': 'https://avatars0.githubusercontent.com/u/2468782?v=4',
  'events_url': 'https://api.github.com/users/rspadim/events{/privacy}',
  'followers_url': 'https://api.github.com/users/rspadim/followers',
  'following_url': 'https://api.github.com/users/rspadim/following{/other_user}',
  'gists_url': 'https://api.github.com/users/rspadim/gists{/gist_id}',
  'gravatar_id': '',
  'html_url': 'https://github.com/rspadim',
  'id': 2468782,
  'login': 'rspadim',
  'organizations_url': 'https://api.github.com/users/rspadim/orgs',
  'received_events_url': 'https://api.github.com/users/rspadim/received_events',
  'repos_url': 'https://api.github.com/users/rspadim/repos',
  'site_admin': False,
  'starred_url': 'https://api.github.com/users/rspadim/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/rspadim/subscriptions',
  'type': 'User',
  'url': 'https://api.github.com/users/rspadim'}}

data中的每一个元素都是一个dict,这个dict就是在github上找到的issue页面上的信息。我们可以把data传给DataFrame并提取感兴趣的部分:

issues = pd.DataFrame(data, columns=['number', 'title', 
                                    'labels', 'state'])
issues
number title labels state
0 18272 Optimize data type [] open
1 18271 BUG: Series.rank(pct=True).max() != 1 for a la... [] open
2 18270 (Series|DataFrame) datetimelike ops [] open
3 18268 DOC: update Series.combine/DataFrame.combine d... [] open
4 18266 DOC: updated .combine_first doc strings [{'url': 'https://api.github.com/repos/pandas-... open
5 18265 Calling DataFrame.stack on an out-of-order col... [] open
6 18264 cleaned up imports [{'url': 'https://api.github.com/repos/pandas-... open
7 18263 Tslibs offsets paramd [] open
8 18262 DEPR: let's deprecate [{'url': 'https://api.github.com/repos/pandas-... open
9 18258 DEPR: deprecate (Sparse)Series.from_array [{'url': 'https://api.github.com/repos/pandas-... open
10 18255 ENH/PERF: Add cache='infer' to to_datetime [{'url': 'https://api.github.com/repos/pandas-... open
11 18250 Categorical.replace() unexpectedly returns non... [{'url': 'https://api.github.com/repos/pandas-... open
12 18246 pandas.MultiIndex.reorder_levels has no inplac... [] open
13 18245 TST: test tz-aware DatetimeIndex as separate m... [{'url': 'https://api.github.com/repos/pandas-... open
14 18244 RLS 0.21.1 [{'url': 'https://api.github.com/repos/pandas-... open
15 18243 DEPR: deprecate .ftypes, get_ftype_counts [{'url': 'https://api.github.com/repos/pandas-... open
16 18242 CLN: Remove days, seconds and microseconds pro... [{'url': 'https://api.github.com/repos/pandas-... open
17 18241 DEPS: drop 2.7 support [{'url': 'https://api.github.com/repos/pandas-... open
18 18238 BUG: Fix filter method so that accepts byte an... [{'url': 'https://api.github.com/repos/pandas-... open
19 18237 Deprecate Series.asobject, Index.asobject, ren... [{'url': 'https://api.github.com/repos/pandas-... open
20 18236 df.plot() very slow compared to explicit matpl... [{'url': 'https://api.github.com/repos/pandas-... open
21 18235 Quarter.onOffset looks fishy [] open
22 18231 Reduce copying of input data on Series constru... [{'url': 'https://api.github.com/repos/pandas-... open
23 18226 Patch __init__ to prevent passing invalid kwds [{'url': 'https://api.github.com/repos/pandas-... open
24 18222 DataFrame.plot() produces incorrect legend lab... [{'url': 'https://api.github.com/repos/pandas-... open
25 18220 DataFrame.groupy renames columns when given a ... [] open
26 18217 Deprecate Index.summary [{'url': 'https://api.github.com/repos/pandas-... open
27 18216 Pass kwargs from read_parquet() to the underly... [{'url': 'https://api.github.com/repos/pandas-... open
28 18215 DOC/DEPR: ensure that @deprecated functions ha... [{'url': 'https://api.github.com/repos/pandas-... open
29 18213 Deprecate Series.from_array ? [{'url': 'https://api.github.com/repos/pandas-... open

6.4 Interacting with Databases(与数据库的交互)

如果在工作中,大部分数据并不会以textexcel的格式存储。最广泛使用的是SQL-based的关系型数据库(SQL Server,PostgreSQL,MySQL)。选择数据库通常取决于性能,数据整合性,实际应用的可扩展性。

读取SQLDataFrame非常直观,pandas中有一些函数能简化这个过程。举个例子,这里创建一个SQLite数据库,通过使用python内建的sqlite3 driver

import sqlite3
import pandas as pd
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('../examples/mydata.sqlite')
con.execute(query)

con.commit()

然后我们插入几行数据:

data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)

con.commit()

大部分pythonSQL驱动(PyODBC, psycopg2, MySQLdb, pymssql, 等)返回a list of tuple,当从一个表格选择数据的时候:

cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

我们可以把list of tuples传递给DataFrame,但是我们也需要column names,包含cursordescription属性:

cursor.description
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5

我们不希望每次询问数据库的时候都重复以上步骤,这样对计算机很不好(逐步对计算机系统或文件做小改动导致大的损害)。SQLAlchemy计划是一个六星的Python SQL工具箱,它能抽象出不同SQL数据库之间的不同。pandas有一个read_sql函数,能让我们从SQLAlchemy connection从读取数据。这里我们用SQLAlchemy连接到同一个SQLite数据库,并从之前创建的表格读取数据:

import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///../examples/mydata.sqlite')
pd.read_sql('select * from test', db)
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5

你可能感兴趣的:(pandas使用教程,pandas,前端,数据库,python,开发语言,transformer,requests)