Using AWS Athena

There are two ways to query Athena from Python: the first calls it through boto3, the second uses pyathenajdbc. Each approach has its own pros and cons.
Method 1: boto3

import boto3
import pandas as pd
import io
import time
athena = boto3.client('athena', region_name='ap-southeast-1')
s3 = boto3.client('s3')
def get_data_by_boto3(sql):
    bucket='bigdata-athena'
    print(sql)
    s3_output="s3://bigdata-athena/staging/parquet/mysql/"
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={
            'Database': 'systemlogs'
        },
        ResultConfiguration={
            'OutputLocation': s3_output
        }
    )
    print(response)
    # SELECT queries write their result as <QueryExecutionId>.csv;
    # utility queries such as SHOW TABLES write a <QueryExecutionId>.txt file
    file_name = response['QueryExecutionId'] + '.txt'
    key = "staging/parquet/mysql/" + file_name
    print(key)
    df = None
    for i in range(2):
        try:
            # give Athena a moment to write the result file, then read it from S3
            time.sleep(5)
            obj = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
            df = pd.read_csv(io.BytesIO(obj))
            df.to_csv("txt", index=False)
            print(df)
            break
        except Exception as e:
            print(e)
if __name__ == '__main__':
    sql="show tables"
    get_data_by_boto3(sql)

Query time is relatively long with this approach, but it suits large result sets: as the data volume grows, the query time does not grow much. Note that when querying tables (e.g. SHOW TABLES) the result file written to S3 is a .txt file; if you build the key with a .csv extension it cannot be read.
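
The fixed time.sleep(5) retry above is fragile: a slower query may not have written its result file yet. A more reliable variant polls the query status with get_query_execution until Athena reports a terminal state. A minimal sketch, reusing the athena client defined above (wait_for_query is an illustrative helper name, not part of boto3):

import time

def wait_for_query(query_execution_id, poll_interval=2):
    # poll Athena until the query reaches a terminal state
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=query_execution_id
        )['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_interval)

Calling wait_for_query(response['QueryExecutionId']) right after start_query_execution, and only then reading the S3 object, removes the need for the blind retry loop.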
Method 2: pyathenajdbc

import pandas as pd
from pyathenajdbc import connect

def get_data_by_athena(sql):
    # schema_name is the Athena database (namespace) to query
    conn = connect(s3_staging_dir="s3://bigdata-athena/staging/parquet/mysql/",
                   region_name="ap-southeast-1",
                   schema_name='systemlogs')
    # Option 1: the simplest way is to let pandas drive the connection
    # df = pd.read_sql(sql, con=conn)
    # Option 2: use a cursor and collect rows and column names manually
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
            results = cursor.fetchall()
            # cursor.description holds column metadata; the first field is the column name
            columns = [desc[0] for desc in cursor.description]
            print(results, columns)
            return {"results": results, "columns": columns}
    finally:
        conn.close()
if __name__ == '__main__':
    sql="show tables"
    get_data_by_athena(sql)

Query time depends on the amount of data. With this second method you can also simply use df = pandas.read_sql(sql, con=conn). When the data volume is large the query takes too long, so this approach is best suited to small result sets.
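
If you use the cursor variant, the dict it returns can be turned into a DataFrame in one line; a quick usage sketch based on the get_data_by_athena function above:

import pandas as pd

out = get_data_by_athena("show tables")
df = pd.DataFrame(out["results"], columns=out["columns"])
print(df)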
