Python: Arrow, the pyarrow library, and reading files to and from Julia

Sample test.csv: [image]
I. From CSV => Table => Arrow file

import pyarrow as pa
from pyarrow import csv

csv_path = 'C:\\Users\\songroom\\Desktop\\test.csv'
table = csv.read_csv(csv_path)  # parse the CSV into a pyarrow Table
# Optional round trip through pandas:
# df = table.to_pandas()
# table = pa.Table.from_pandas(df)

path = 'C:\\Users\\songroom\\Desktop\\py.arrow'
# Write the table in the Arrow IPC file format; the writer is closed on exit.
with pa.RecordBatchFileWriter(path, table.schema) as writer:
    writer.write_table(table)

II. Reading the Arrow file and converting it to a DataFrame

The py.arrow file written by pyarrow can be read as follows:

import pyarrow as pa
path = 'C:\\Users\\songroom\\Desktop\\py.arrow'
df = pa.ipc.open_file(path).read_pandas()
print(df)

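III. Reading Arrow files between Julia and Python

1. pyarrow: reading a test.arrow file generated by Julia.

Note that the following two approaches behave very differently on the Julia-generated file. The first uses the IPC file reader and the second uses the IPC stream reader; since only the second succeeds, this test.arrow file is evidently in the Arrow streaming format rather than the file format.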

import time as t

import pyarrow as pa

# Fails on the Julia-generated test.arrow file (expects the IPC file format)
def read_arrow_to_df_julia_not_ok(path):
    df = pa.ipc.open_file(path).read_pandas()
    return df

# Succeeds on the Julia-generated test.arrow file (reads the IPC stream format)
def read_arrow_to_df_julia_ok(path):
    with open(path, "rb") as f:
        r = pa.ipc.RecordBatchStreamReader(f)
        df = r.read_pandas()
    return df

>>> path = 'C:\\Users\\songroom\\Desktop\\test.arrow'
>>> t0 = t.time()
>>> df = read_arrow_to_df_julia_ok(path)
>>> t1 = t.time()
>>> print("read julia arrow file cost time: ",t1-t0)
read julia arrow file cost time:  0.23099970817565918

However, this seems rather slow: 0.23 s.

2. Julia: reading the py.arrow file generated by pyarrow

The code is as follows:

using DataFrames
using Arrow

# Define the reader before calling it.
function read_arrow_file(arrow_path::String)
    println("read arrow file ")
    df = DataFrame(Arrow.Table(arrow_path))
    return df
end

arrow_path = "C:\\Users\\songroom\\Desktop\\py.arrow"
@time df = read_arrow_file(arrow_path)

The file is read correctly.
