python 连接Mysql和hive

python 连接Mysql和hive

1、python 连接Mysql

import pymysql
import pandas as pd
# 连接database
conn = pymysql.connect(
    host=“你的数据库地址”,
    port=3306,
    user=“用户名”,password=“密码”,
    database=“数据库名”,
    charset=“utf8”)
# 得到一个可以执行SQL语句的光标对象
cursor = conn.cursor()  # 执行完毕返回的结果集默认以元组显示
# 得到一个可以执行SQL语句并且将结果作为字典返回的游标
#cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)
# 定义要执行的SQL语句
sql = 
"""
show tables
"""
# 执行SQL语句
cursor.execute(sql)
# 获取数据方式一:
all_line = cursor.fetchall()
# 获取数据方式二:
result_df = pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, 
                    parse_dates=None, columns=None, chunksize=None)
# 关闭光标对象
cursor.close()
# 关闭数据库连接
conn.close()

常用的参数是sql:SQL命令或者表名字,con:连接数据库的引擎,可以用SQLAlchemy或者pymysql建立,从数据库读数据的基本用法给出sql和con就可以了。其它都是默认参数,有特殊需求才会用到,有兴趣的话可以查看文档。

2、python 写入mysql

# engine = create_engine('mysql+pymysql://user:password@hosts:port/database?charset=utf8')
# 查询语句,选出cp_settlepay_report表中的所有数据
# sql = 'select * from db_finance.cp_settlepay_report'
# # read_sql_query的两个参数: sql语句, 数据库连接
# df = pd.read_sql_query(sql, engine)
# 将新建的DataFrame储存为MySQL中的数据表,不储存index列(index=False)
# if_exists:
# 1.fail:如果表存在,啥也不做
# 2.replace:如果表存在,删了表,再建立一个新表,把数据插入
# 3.append:如果表存在,把数据插入,如果表不存在创建一个表!!
# pd.io.sql.to_sql(df, 'example', con=engine, index=False, if_exists='replace')
# df.to_sql('example', con=engine,  if_exists='replace')这种形式也可以
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://xiaowu:[email protected]:3306/db_finance?charset=utf8')
try:
    # 执行sql
    pd.io.sql.to_sql(df,'cp_settlepay_report', engine, schema='db_finance', if_exists='append',index = False)
except Exception as e:
    print(e)
    #有异常就回滚
    engine.rollback()

# 关闭链接
cursor.close()
conn.close()

3、python 连接hive

Win7平台Python3使用安装impyla
(1) pip install pure-sasl

(2) pip install thrift_sasl==0.2.1 --no-deps

(3) pip install thrift==0.9.3

(4) pip install impyla

执行数据库连接后,出现问题

ThriftParserError: ThriftPy does not support generating module with path in protocol ‘c’

定位到 \Lib\site-packages\thriftpy\parser\parser.py的

if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))

更改为

if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('c', 'd','e','f''):
    with open(path) as fh:
        data = fh.read()


elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))

执行数据库连接后,再次出现问题

TypeError: can’t concat str to bytes

定位到错误的最后一条,在\lib\site-packages\thrift_sasl_init_.py第94行

...
header = struct.pack(">BI", status, len(body))
self._trans.write(header + body)
...

改为

...
header = struct.pack(">BI", status, len(body))
if(type(body) is str):
	body = body.encode() 
self._trans.write(header + body)
...

python 连接Mysql和hive_第1张图片
连接hive

from impala.dbapi import connect
from impala.util import as_pandas

conn = connect(host='***', port=10000, auth_mechanism='PLAIN', user='***', password='***',
                      database='***')
cursor = conn.cursor()
cursor.execute('show databases')
result_df = as_pandas(cursor)
# all_line = cursor.fetchall()

4、python 写入hive
关键流程主要分为两步:

(1)将pandas dataframe转换为sparkdataframe:这一步骤主要使用spark自带的接口:

spark_df = spark.createDataFrame(pd_df)

(2)将spark_df写入到hive的几种方式

spark_df.write.mode('overwrite').format("hive").saveAsTable("dbname.tablename")

完整代码:

import pandas as pd
import numpy as np
from pyspark import SparkContext,SparkConf
from pyspark.sql import HiveContext,SparkSession
from pyspark.sql import SQLContext

pd_df = pd.DataFrame(np.random.randint(0,10,(3,4)),columns=['a','b','c'])

spark = SparkSession.builder.appName('pd_2_hive').master('local').enableHiveSupport().getOrCreate()
spark_df = spark.createDataFrame(pd_df)

#spark dataframe 有接口可以直接写入到hive
spark_df.write.mode('overwrite').format("hive").saveAsTable("dbname.tablename")
'''
其中 overwrite 代表如果表中存在数据,那么新数据会将原来的数据覆盖,此外还有append等模式,详细介绍如下:
        * `append`: Append contents of this :class:`DataFrame` to existing data.
        * `overwrite`: Overwrite existing data.
        * `error` or `errorifexists`: Throw an exception if data already exists.
        * `ignore`: Silently ignore this operation if data already exists.
'''


#此外还可以将spark_df 注册为临时表,之后通过sql的方式写到hive里
spark_df.registerTempTable('tmp_table')
tmp_sql = '''create table dbname.tablename as select * from tmp_table'''
spark.sql(tmp_sql)
spark.stop()

你可能感兴趣的:(python代码,python,mysql,数据库)