Reading an HDFS file in Python and returning a DataFrame

Without further ado, here is the code.

from hdfs import Client
import pandas as pd

HDFSHOST = "http://xxx:50070"  # WebHDFS address of the NameNode
FILENAME = "/tmp/preprocess/part-00000"  # path of the file on HDFS
COLUMNNAMES = ['xx']  # column names for the DataFrame


 

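def readHDFS():
    '''
    Read a file from HDFS.

    Returns:
        df: DataFrame holding the HDFS data
    '''
    client = Client(HDFSHOST)
    # Current approach to reading the HDFS file:
    # 1. Read the file from HDFS as a binary stream
    # 2. Save the decoded content as a local .csv file
    # 3. Read the .csv file with pandas
    with client.read(FILENAME) as fs:
        content = fs.read()
    s = str(content, 'utf-8')
    # Write inside a with block so the file is flushed and closed
    # before pandas reads it (the directory data/tmp must already exist)
    with open("data/tmp/data.csv", "w", encoding="utf-8") as file:
        file.write(s)
    df = pd.read_csv("data/tmp/data.csv", names=COLUMNNAMES)
    return df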
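
The three-step approach in readHDFS() goes through a temporary .csv file on local disk. As a possible alternative, the bytes returned by fs.read() can be handed to pandas through an in-memory buffer. The sketch below is only an illustration under assumptions: the helper names bytes_to_dataframe and read_hdfs_in_memory, the sample bytes, and the column names "id" and "label" are hypothetical and are not part of the original code.

```python
import io

import pandas as pd


def bytes_to_dataframe(content, column_names):
    """Parse CSV bytes (such as the result of fs.read()) into a DataFrame."""
    # io.BytesIO wraps the bytes in a file-like object that pandas can read
    return pd.read_csv(io.BytesIO(content), names=column_names, encoding="utf-8")


def read_hdfs_in_memory(host, path, column_names):
    """Hypothetical variant of readHDFS() that skips the temporary file."""
    # Imported here so the parsing helper above works without the hdfs package
    from hdfs import Client

    client = Client(host)
    with client.read(path) as fs:
        content = fs.read()
    return bytes_to_dataframe(content, column_names)


# Example with in-memory sample data standing in for the HDFS file content
sample = b"1,a\n2,b\n"
df = bytes_to_dataframe(sample, ["id", "label"])
print(df.shape)
```

With a reachable cluster, the call would look like read_hdfs_in_memory(HDFSHOST, FILENAME, COLUMNNAMES). This variant still holds the whole file in memory, so it suits files of moderate size, just like the original function.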

Zhihu: https://zhuanlan.zhihu.com/albertwang

WeChat official account: AI-Research-Studio
