Pitfalls when accessing Hive from Python

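I have recently been upgrading our data ETL system: the existing ETL jobs had to be migrated to the new system, and the migrated data had to be checked for accuracy.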

Contents

Installing the dependencies

Core code

Pitfalls

Dependency versions

Installing the dependencies

My machine runs Windows, so impyla is the only option for me; on other systems you can also connect with the PyHive package.

pip install impyla
pip install pure-sasl
pip install thrift_sasl
pip install thrift
pip install sasl

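Note:

Installing sasl directly with pip usually fails. Instead, download the wheel that matches your Python version from https://www.lfd.uci.edu/~gohlke/pythonlibs/#sasl and install that. (At the time of writing, the newest supported version is Python 3.7; later versions cannot be installed, and it is not yet known whether support will be added.)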

Core code

from impala.dbapi import connect
from pandas.testing import assert_frame_equal
import pandas as pd

# Placeholder values for illustration; replace them with your own settings
db_name, user, password = 'default', 'hive', 'secret'
table_name, pt_col, date = 'my_table', 'dt', '20210101'

# Connect to Hive
hive_conn = connect(host='127.0.0.1', port=12446, database=db_name,
                    user=user, password=password, auth_mechanism='PLAIN')
cursor = hive_conn.cursor()
# Count the matching rows
cursor.execute('select count(1) from %s where %s = %s' % (table_name, pt_col, date))
ret = cursor.fetchall()
for j in ret:
    print(j)
ret = pd.DataFrame(ret)
print(ret)
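Since the goal of the migration is to verify that the new system produces the same data, the imported assert_frame_equal can be used to compare result sets. The following is only a minimal sketch: the helper name frames_match, the key column "id", and the sample rows are all made up for illustration, and in practice the two DataFrames would be built from cursor.fetchall() on the old and the new system.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(old_df, new_df, key_cols):
    """Return True if both frames hold the same rows, ignoring row order."""
    # Sort by the key columns and reset the index so that a different
    # row order in the two systems does not count as a mismatch.
    old_sorted = old_df.sort_values(key_cols).reset_index(drop=True)
    new_sorted = new_df.sort_values(key_cols).reset_index(drop=True)
    try:
        # check_dtype=False tolerates e.g. int64 vs. int32 columns
        assert_frame_equal(old_sorted, new_sorted, check_dtype=False)
    except AssertionError:
        return False
    return True


# Stand-in data (hypothetical); real data would come from the two systems.
old_rows = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
new_rows = pd.DataFrame({"id": [3, 1, 2], "amount": [30.0, 10.0, 20.0]})
print(frames_match(old_rows, new_rows, ["id"]))
```

Sorting before the comparison is a design choice: Hive does not guarantee row order without an ORDER BY, so comparing unsorted results would report false mismatches.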

Pitfalls

1.

TypeError: can't concat str to bytes

The traceback points to line 94 of \lib\site-packages\thrift_sasl\__init__.py:


header = struct.pack(">BI", status, len(body))
self._trans.write(header + body)

Change it to:

if isinstance(body, str):
    # Encode first so that len(body) counts bytes, not characters
    body = body.encode()
header = struct.pack(">BI", status, len(body))
self._trans.write(header + body)
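The error and the fix can be reproduced without a Hive server. The sketch below is not the thrift_sasl source; build_sasl_frame is a hypothetical stand-alone function that mirrors the two patched lines, returning the frame instead of writing it to a transport.

```python
import struct


def build_sasl_frame(status, body):
    """Pack a 1-byte status and a 4-byte big-endian length, then append body."""
    if isinstance(body, str):
        # Python 3 cannot concatenate bytes and str, so encode first
        body = body.encode()
    header = struct.pack(">BI", status, len(body))
    return header + body


# Without the isinstance check, bytes + str raises the TypeError above
try:
    struct.pack(">BI", 1, 5) + "PLAIN"
except TypeError as exc:
    print("unpatched:", exc)

# With the check, str and bytes payloads give the same frame
print(build_sasl_frame(1, "PLAIN"))
print(build_sasl_frame(1, b"PLAIN"))
```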

2.

thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'

This error is caused by the sasl package version. Uninstall the sasl package you installed earlier, then reinstall thrift-sasl at version 0.2.0.

3.

ThriftParserError: ThriftPy does not support generating module with path in protocol 'c'

The traceback points to \Lib\site-packages\thriftpy\parser\parser.py:

if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))

Change it to:

if url_scheme == '':
    with open(path) as fh:
        data = fh.read()
# Windows drive letters (C:, D:, ...) are parsed as a URL scheme;
# treat them as local files as well
elif url_scheme in ('c', 'd', 'e', 'f'):
    with open(path) as fh:
        data = fh.read()
elif url_scheme in ('http', 'https'):
    data = urlopen(path).read()
else:
    raise ThriftParserError('ThriftPy does not support generating module '
                            'with path in protocol \'{}\''.format(
                                url_scheme))
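Why 'c' shows up as a "protocol" is easy to see in isolation. The sketch below assumes that url_scheme in the parser comes from urllib.parse.urlparse(path).scheme; the file path is made up, and looks_like_drive_letter is a hypothetical helper showing a more general check than listing 'c' to 'f' by hand.

```python
from urllib.parse import urlparse


def looks_like_drive_letter(url_scheme):
    """True for a single-letter scheme, i.e. a Windows drive such as C: or Z:."""
    return len(url_scheme) == 1 and url_scheme.isalpha()


# Hypothetical Windows path to a .thrift file
windows_path = r"C:\thrift\hive.thrift"

# urlparse reads everything before the first colon as the scheme,
# so the drive letter ends up (lower-cased) in .scheme
scheme = urlparse(windows_path).scheme
print(repr(scheme), looks_like_drive_letter(scheme))

# Real URL schemes and plain relative paths are not affected
print(looks_like_drive_letter(urlparse("https://example.com/a.thrift").scheme))
print(looks_like_drive_letter(urlparse("hive.thrift").scheme))
```

With such a helper, the patched branch could read elif looks_like_drive_letter(url_scheme): and would then cover every drive letter rather than only C to F.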

Dependency versions

thrift                  0.13.0
thrift-sasl             0.2.1
thriftpy                0.3.9
thriftpy2               0.4.0
bit-array               0.1.0
bitarray                2.2.3
pure-sasl               0.6.2
impyla                  0.15a1
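To check whether an environment matches the list above, the installed versions can be compared against it. This is a rough sketch, not part of the original setup: the PINNED dictionary simply copies the table, and find_mismatches and installed_versions are hypothetical helper names. It uses importlib.metadata, which requires Python 3.8 or later; on Python 3.7 the importlib_metadata backport offers the same API.

```python
from importlib import metadata

# Versions copied from the list above
PINNED = {
    "thrift": "0.13.0",
    "thrift-sasl": "0.2.1",
    "thriftpy": "0.3.9",
    "thriftpy2": "0.4.0",
    "bit-array": "0.1.0",
    "bitarray": "2.2.3",
    "pure-sasl": "0.6.2",
    "impyla": "0.15a1",
}


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(pinned, installed):
    """Return {package: (wanted, found)} for every package that differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pinned.items()
        if installed.get(name) != wanted
    }


# An empty dict means the environment matches the list
print(find_mismatches(PINNED, installed_versions(PINNED)))
```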


 
