1. Problem 1 description: When working with HDFS through Python's hdfs library, listing an HDFS directory works fine:
from hdfs import *
client = Client("http://10.0.30.9:50070")
print(client.list('/'))
# output: ['test.txt']
But reading the file fails with hdfs.util.HdfsError: File /user/dr.who/test.txt not found. Trying pyhdfs instead produces the same error, and the second problem described below as well:
from hdfs import *
client = Client("http://10.0.30.9:50070")
print(client.list('/'))
with client.read('test.txt') as reader:
    content = reader.read()
    print(content)
Traceback (most recent call last):
File "E:/pycharm/workspace/hadoopforwin/myhdfs.py", line 5, in
with client.read('test.txt') as reader:
File "D:\python3.6\lib\contextlib.py", line 81, in __enter__
return next(self.gen)
File "D:\python3.6\lib\site-packages\hdfs\client.py", line 678, in read
buffersize=buffer_size,
File "D:\python3.6\lib\site-packages\hdfs\client.py", line 112, in api_handler
raise err
File "D:\python3.6\lib\site-packages\hdfs\client.py", line 107, in api_handler
**self.kwargs
File "D:\python3.6\lib\site-packages\hdfs\client.py", line 210, in _request
_on_error(response)
File "D:\python3.6\lib\site-packages\hdfs\client.py", line 50, in _on_error
raise HdfsError(message, exception=exception)
hdfs.util.HdfsError: File /user/dr.who/test.txt not found.
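Before fixing anything, you can confirm where the file actually lives by querying WebHDFS directly with requests (a diagnostic sketch; GETFILESTATUS is a standard WebHDFS operation, and /webhdfs/v1/test.txt addresses the absolute path /test.txt):
import requests

# A 200 response here, while the library call fails, shows the problem is
# path resolution (relative paths default to /user/dr.who), not a missing file.
resp = requests.get("http://10.0.30.9:50070/webhdfs/v1/test.txt?op=GETFILESTATUS")
print(resp.status_code)
print(resp.json())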
2. Solution to problem 1: The error occurs because no root path was specified, so the relative path test.txt is resolved on the server side against the default WebHDFS user's home directory, /user/dr.who. Specify the root path when creating the Client:
from hdfs import *
client = Client("http://10.0.30.9:50070", root='/')
print(client.list('/'))
with client.read('test.txt') as reader:
    content = reader.read()
    print(content)
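As an aside, a minimal sketch of an alternative (assuming test.txt sits directly under the HDFS root): passing an absolute path to read() should sidestep the fallback to /user/dr.who even without setting root, since absolute paths are sent to WebHDFS as-is.
from hdfs import Client

client = Client("http://10.0.30.9:50070")
# An absolute path is not resolved against any home directory,
# so the server never falls back to /user/dr.who.
with client.read('/test.txt') as reader:
    print(reader.read())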
Running the code with root='/' set, however, raised a new problem...
3. Problem 2 description: The last line of the new error is shown below. Here hmaster is the hostname of the Hadoop master: the namenode redirects the read to a datanode by hostname (hmaster:50075), and the machine running the program cannot resolve that hostname to the correct IP.
requests.exceptions.ConnectionError: HTTPConnectionPool(host='hmaster', port=50075): Max retries exceeded with url: /webhdfs/v1/test.txt?op=OPEN&namenoderpcaddress=hMaster:9000&offset=0 (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
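To confirm that name resolution is really the culprit, here is a quick diagnostic sketch (hmaster is the hostname taken from the error above):
import socket

# If this raises socket.gaierror (the same "getaddrinfo failed" as above),
# this machine cannot resolve the datanode's hostname.
try:
    print(socket.gethostbyname('hmaster'))
except socket.gaierror as err:
    print('cannot resolve hmaster:', err)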
4. Solution to problem 2: Add a hostname-to-IP mapping to the hosts file of the machine running the Python program. On Windows, which I am using, the hosts file is at C:\Windows\System32\drivers\etc\hosts. Append a line of the form
ip hostname
to the end of the file. For the setup in this article, that is:
10.0.30.9 hmaster
Remember to save the file after editing. Running the program again now reads the file successfully.
5. With the hdfs and pyhdfs libraries, some methods other than reading a file can run into these same two problems; the fixes are the same.
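For example (a sketch against the same cluster; /test2.txt and its contents are invented for illustration), writing a file is also redirected to the datanode, so it needs both the root path and the hosts entry:
from hdfs import Client

client = Client("http://10.0.30.9:50070", root='/')
# write() is redirected to the datanode (hmaster:50075) just like read(),
# so it hits the same two errors until both fixes are in place.
client.write('/test2.txt', data=b'hello hdfs', overwrite=True)
print(client.list('/'))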