Optimization notes: writing 11 DataFrames (up to 1M rows each) into 11 worksheets of one Excel workbook

Problem: there are 11 DataFrames, the largest holding about 1,000,000 rows, and all 11 need to be written into 11 worksheets of the same Excel workbook.

What I tried:

1. Use pd.ExcelWriter; the default engine was xlsxwriter, change it to openpyxl:

P.S. Writing 1,000,000 rows takes about 2 minutes locally, but around 20 minutes on the server...

with pd.ExcelWriter(self.file_name, engine='openpyxl') as file_writer:
    sheet_name1 = 'aaa'
    frame1.to_excel(file_writer, sheet_name=sheet_name1, index=False)

    sheet_name2 = 'bbb'
    frame2.to_excel(file_writer, sheet_name=sheet_name2, index=False)

    ...

This just writes the data naively and is slow.

2. On top of the above, write the sheets from multiple threads; this cuts the time by a bit under half.
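To put numbers like the ones quoted above on your own machine, a small harness can generate synthetic data and time a single-sheet write. This is my own sketch, not from the original post; the 5-column random frame and the file name are assumptions.

```python
import time

import numpy as np
import pandas as pd


def benchmark_single_sheet(n_rows, path="bench.xlsx"):
    """Hypothetical harness: time a naive one-sheet write of n_rows
    random rows through pd.ExcelWriter with the openpyxl engine."""
    frame = pd.DataFrame(np.random.rand(n_rows, 5), columns=list("abcde"))
    t0 = time.perf_counter()
    with pd.ExcelWriter(path, engine="openpyxl") as file_writer:
        frame.to_excel(file_writer, sheet_name="aaa", index=False)
    return time.perf_counter() - t0


# e.g. benchmark_single_sheet(1_000_000) to reproduce the ~2 minute local run
```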

with pd.ExcelWriter(self.file_name, engine="openpyxl") as file_writer:
    # Use one thread per sheet to speed up writing.
    # (Note: openpyxl is not thread-safe in general; this relies on
    # each thread touching only its own sheet.)
    thread1 = threading.Thread(target=data_to_excel, args=(frame1, file_writer, 'aaa', 1))
    thread2 = threading.Thread(target=data_to_excel, args=(frame2, file_writer, 'bbb'))

    ...

    thread1.start()
    thread2.start()
    ...

    thread1.join()
    thread2.join()
    ...
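The data_to_excel helper used at this stage is not shown (the version defined later in the post targets a raw openpyxl Workbook instead). A minimal sketch of what it would look like here, based on the commented-out to_excel call in the later function, is the following; the signature mirrors the thread calls above, and the unused target parameter is kept only for compatibility with those calls.

```python
import pandas as pd


def data_to_excel(frame, writer, sheet_name, target=None):
    """Sketch (my assumption): each thread hands its DataFrame to
    pandas' to_excel on the shared ExcelWriter. `target` mirrors the
    thread calls above but is unused at this stage."""
    frame.to_excel(writer, sheet_name=sheet_name, index=False)
```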

3. Even with multiple threads, the final save() call still takes a long time, so the next step was to drop pd.ExcelWriter and write with openpyxl directly:

import threading

from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows


file_writer = Workbook()
thread1 = threading.Thread(target=data_to_excel, args=(cell_del_frame, file_writer, 'aaa', 1))
thread2 = threading.Thread(target=data_to_excel, args=(cell_data, file_writer, 'bbb'))

...


thread1.start()
thread2.start()
...

thread1.join()
thread2.join()
...

file_writer.save('data.xlsx')



"""""""""""""
data_to_excel
"""""""""""""

def data_to_excel(frame, writer, sheet_name, target=None):
    """Write one DataFrame into a worksheet of the workbook."""
    print("start writing:", sheet_name)
    print("monitoring report row count:", len(frame))
    # frame.to_excel(writer, sheet_name=sheet_name, index=False)  # old pd.ExcelWriter path
    if target == 1:
        # Reuse the default sheet that Workbook() creates for the first frame.
        sheet = writer.active
        sheet.title = sheet_name
    else:
        sheet = writer.create_sheet(sheet_name)
    for row in dataframe_to_rows(frame, index=False, header=True):
        sheet.append(row)
    print("finished writing:", sheet_name)

Overall this cuts the write time by about 1/3: from 18 minutes down to 12.

4. One more round: another ~1/3 faster than the above (time down by 1/3 again), this time using the pyexcelerate library. The official description of the library:

PyExcelerate documentation

Example code:

import datetime

import numpy as np
import pyexcelerate as pe

t1 = datetime.datetime.now()
workbook = pe.Workbook()
for i, df in enumerate([data1, data2, data3]):
    sheet_name = f"sheet{i + 1}"
    df_columns = df.columns.tolist()
    # df.values is a 2-D array without the header row, so the column
    # names have to be stacked on top manually!
    value = np.concatenate(([np.array(df_columns)], df.values), axis=0)
    workbook.new_sheet(sheet_name, data=value)
workbook.save("output.xlsx")
t2 = datetime.datetime.now()
print('pyexcelerate elapsed:', t2 - t1)
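The header-stacking trick is worth seeing in isolation (pure numpy/pandas, no pyexcelerate needed). One caveat: if df.values is all-numeric, np.concatenate upcasts the whole result to strings; with mixed-dtype frames the result is an object array and the values keep their original types.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# Stack the column names on top of the values, as in the snippet above.
header = np.array(df.columns.tolist())
value = np.concatenate(([header], df.values), axis=0)
# All-numeric values + string header: numpy upcasts everything to strings.
print(value.shape)  # (3, 2)
```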
