Sample test.csv:
I. CSV => Table => Arrow file
import pyarrow as pa
from pyarrow import csv
csv_path = 'C:\\Users\\songroom\\Desktop\\test.csv'
table = csv.read_csv(csv_path)
#df = table.to_pandas()
#table = pa.Table.from_pandas(df)
path = 'C:\\Users\\songroom\\Desktop\\py.arrow'
writer = pa.RecordBatchFileWriter(path, table.schema)
writer.write_table(table)
writer.close()
II. Reading the Arrow file back into a DataFrame
The py.arrow file written by pyarrow can be read as follows:
import pyarrow as pa
path = 'C:\\Users\\songroom\\Desktop\\py.arrow'
df = pa.ipc.open_file(path).read_pandas()
print(df)
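III. Reading Arrow files across Julia and Python
1. pyarrow: reading a test.arrow file generated by Julia.
Note that the following two approaches behave very differently on the Julia-generated file: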
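import time as t  # used by the timing session below
import pyarrow as pa

# Cannot read the test.arrow file written by Julia
# (open_file expects the Arrow IPC file format)
def read_arrow_to_df_julia_not_ok(path):
    df = pa.ipc.open_file(path).read_pandas()
    return df

# Can read the test.arrow file written by Julia
# (RecordBatchStreamReader expects the Arrow IPC streaming format)
def read_arrow_to_df_julia_ok(path):
    with open(path, "rb") as f:
        r = pa.ipc.RecordBatchStreamReader(f)
        df = r.read_pandas()
    return df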
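One likely reason for the difference: `pa.ipc.open_file` expects the Arrow IPC file format, while `RecordBatchStreamReader` expects the streaming format, so a file that only the stream reader accepts was presumably written as a stream. The IPC file format begins with the magic bytes `ARROW1`, which allows a quick check before choosing a reader. The helper name `looks_like_arrow_file_format` and the dummy file contents below are made up for illustration.

```python
import os
import tempfile

ARROW_FILE_MAGIC = b"ARROW1"  # leading bytes of the Arrow IPC file format


def looks_like_arrow_file_format(path):
    """Return True if the file starts with the Arrow IPC file-format magic."""
    with open(path, "rb") as f:
        return f.read(len(ARROW_FILE_MAGIC)) == ARROW_FILE_MAGIC


# Usage with two dummy files (fake contents, not real Arrow data).
with tempfile.TemporaryDirectory() as tmp:
    file_like = os.path.join(tmp, "file_like.arrow")
    stream_like = os.path.join(tmp, "stream_like.arrow")
    with open(file_like, "wb") as f:
        f.write(b"ARROW1\x00\x00dummy")
    with open(stream_like, "wb") as f:
        f.write(b"\xff\xff\xff\xffdummy")
    results = (looks_like_arrow_file_format(file_like),
               looks_like_arrow_file_format(stream_like))

print(results)  # → (True, False)
```

If the check returns False for a real Arrow file, trying the stream reader is a reasonable next step.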
>>> path = 'C:\\Users\\songroom\\Desktop\\test.arrow'
>>> t0 = t.time()
>>> df = read_arrow_to_df_julia_ok(path)
>>> t1 = t.time()
>>> print("read julia arrow file cost time: ",t1-t0)
read julia arrow file cost time: 0.23099970817565918
However, this seems rather slow: about 0.23 s.
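A single `time.time()` measurement can include one-off costs such as the first disk read, so it may overstate the steady-state read time. A minimal sketch of a repeated measurement with `time.perf_counter` follows. The helper name `best_time` and the stand-in workload are hypothetical; in practice one would pass `read_arrow_to_df_julia_ok` and the path instead.

```python
import time


def best_time(fn, *args, repeat=5):
    """Call fn(*args) `repeat` times; return (best seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


# Stand-in workload so the sketch runs on its own.
seconds, total = best_time(sum, range(100_000))
print(f"best of 5 runs: {seconds:.6f} s, result = {total}")
```

Taking the best of several runs reduces the influence of cold caches, though it says nothing about whether 0.23 s is typical for this file.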
2. Julia: reading the py.arrow file generated by pyarrow
The code is as follows:
using DataFrames
using Arrow

# Define the function before it is called
function read_arrow_file(arrow_path::String)
    println("read arrow file ")
    df = DataFrame(Arrow.Table(arrow_path))
    return df
end

arrow_path = "C:\\Users\\songroom\\Desktop\\py.arrow"
@time df = read_arrow_file(arrow_path)
The file is read correctly.