Reading and Writing HDFS with pyarrow

Official API docs: https://arrow.apache.org/docs/python/index.html
1. Verify that the server can connect to HDFS:

>hadoop fs -ls /
Found 5 items
drwxrwxrwx   - hbase supergroup          0 2021-09-15 13:58 /hbase
drwxr-xr-x   - root  root                0 2021-12-08 09:38 /hive
drwxrwxrwx   - root  root                0 2021-12-02 15:15 /system
drwxrwxrwx   - hdfs  supergroup          0 2021-12-08 15:54 /tmp
drwxrwxrwx   - hdfs  supergroup          0 2022-03-09 10:28 /user

2. Install pyarrow:

pip install pyarrow
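A quick sanity check that the install worked (any recent version number should print):

>python -c "import pyarrow; print(pyarrow.__version__)"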

3. Read and write HDFS files
Goal: read the file '/user/data/df_raw_label.csv' from HDFS into a pandas DataFrame (so pandas can be used for further analysis), then write it back to '/user/data' in Parquet format.
csv_2_parquet.py

import pyarrow as pa
import pyarrow.parquet as parquet
import pyarrow.csv as csv


if __name__ == '__main__':
    hdfs_host = 'your-hdfs-server-ip'
    hdfs_port = 8020
    file = '/user/data/df_raw_label.csv'
    # Connect to HDFS (legacy API, deprecated since pyarrow 2.0.0).
    fs = pa.hdfs.connect(host=hdfs_host, port=hdfs_port, user='hdfs')
    # Read the CSV from HDFS into an Arrow table.
    with fs.open(file, mode='rb') as f:
        arrow_table = csv.read_csv(f)
    # Convert to a pandas DataFrame for further analysis.
    df = arrow_table.to_pandas()
    print(df.head())
    # Write the same table back to HDFS in Parquet format.
    parquet_file = '/user/data/df_raw_label.parquet'
    with fs.open(parquet_file, mode='wb') as f:
        parquet.write_table(arrow_table, f)
    # List the directory to confirm the write.
    file_list = fs.ls('/user/data')
    print(file_list)
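To spot-check the Parquet file just written, a minimal read-back sketch (a hypothetical follow-up reusing the fs handle from the script above):

    # Read the Parquet file back and compare row counts as a sanity check.
    with fs.open(parquet_file, mode='rb') as f:
        check_table = parquet.read_table(f)
    print(check_table.num_rows == arrow_table.num_rows)  # expect True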

Error: libhdfs.so cannot be found under /usr/local/lib64/python3.6/site-packages/pyarrow/:

__main__:1: FutureWarning: pyarrow.hdfs.HadoopFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 49, in __init__
    self._connect(host, port, user, kerb_ticket, extra_conf)
  File "pyarrow/_hdfsio.pyx", line 85, in pyarrow._hdfsio.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Unable to load libhdfs: ./libhdfs.so: cannot open shared object file: No such file or directory

Solution:
(1) Locate the file on the system:

>find /opt/ -name libhdfs.so
/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/libhdfs.so

(2) Quick fix
Copy libhdfs.so to the location pyarrow expects:

cp /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/libhdfs.so /usr/local/lib64/python3.6/site-packages/pyarrow

(3) Official approach (not yet verified successfully):
Reference: https://arrow.apache.org/docs/python/filesystems_deprecated.html?highlight=arrow_libhdfs_dir
Add the environment variables:

vi /etc/profile

Append at the end of the file:

export CLASSPATH=`hadoop classpath --glob`
export ARROW_LIBHDFS_DIR="/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/"

Make the environment variables take effect:

>source /etc/profile
>export
......
declare -x ARROW_LIBHDFS_DIR="/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/"
declare -x CLASSPATH="/etc/hadoop/conf:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/libexec/../../hadoop/lib/kerb-client-1.0.0.jar
......
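As the FutureWarning above suggests, newer pyarrow versions replace pyarrow.hdfs with pyarrow.fs.HadoopFileSystem. Below is a minimal sketch of the same CSV-to-Parquet flow on the new API (untested here; it still loads libhdfs, so ARROW_LIBHDFS_DIR and CLASSPATH must be set as in step (3), and 'your-hdfs-server-ip' is a placeholder):

import pyarrow.csv as csv
import pyarrow.parquet as parquet
from pyarrow import fs

# The new-style filesystem also loads libhdfs under the hood, so the
# environment variables from step (3) are still required.
hdfs = fs.HadoopFileSystem(host='your-hdfs-server-ip', port=8020, user='hdfs')

# Read the CSV from HDFS into an Arrow table.
with hdfs.open_input_stream('/user/data/df_raw_label.csv') as f:
    table = csv.read_csv(f)

# Write it back in Parquet format.
with hdfs.open_output_stream('/user/data/df_raw_label.parquet') as f:
    parquet.write_table(table, f)

# List the directory to confirm the write.
for info in hdfs.get_file_info(fs.FileSelector('/user/data')):
    print(info.path)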
