Sorting a Huge Dataset with Python

Background

We had roughly 100 GB of data that needed to be sorted by a key field. The first idea was to simply add ORDER BY to the SELECT when pulling the data out of the database, but that ran out of memory, so we exported the data and sorted it with Python instead.

Algorithm Design

We use a split-sort-merge approach, i.e. an external merge sort: split the 100 GB file into 40 data files of roughly 2.5 GB each, sort each file independently, and then merge the sorted files. The idea is exactly the same as the LeetCode problem of merging n sorted lists.
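
The splitting and per-chunk sorting step is not shown in the merge code below, so here is a minimal sketch of it. The input file name big.csv, the chunk size of 5000000 lines, and the chunk_<n>.csv naming are illustrative assumptions, not from the original; the sort key (the integer ad_id in column 120) matches the merge code in the next section.

import os

CHUNK_LINES = 5000000  # illustrative; pick a size whose lines fit comfortably in memory

def flush_chunk(lines, idx):
    # sort the in-memory chunk by ad_id (column 120) and write it to ./csv/
    lines.sort(key=lambda x: int(x.split(',')[120]))
    with open('./csv/chunk_%d.csv' % idx, 'w') as f:
        f.writelines(lines)

os.makedirs('./csv', exist_ok=True)
chunk, idx = [], 0
with open('big.csv', 'rt') as f:  # 'big.csv' is a hypothetical export file name
    for line in f:
        chunk.append(line)
        if len(chunk) == CHUNK_LINES:
            flush_chunk(chunk, idx)
            chunk, idx = [], idx + 1
if chunk:  # flush the final, possibly partial chunk
    flush_chunk(chunk, idx)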

Merge Code

import glob
import heapq
import os
import sys

if __name__ == '__main__':
    csv_list = glob.glob('./csv/*.csv')
    print('found %s CSV files' % len(csv_list))
    # if there are fewer than 2 CSV files there is nothing to merge, so exit
    # (a bare return is a syntax error at this level; sys.exit is used instead)
    if len(csv_list) < 2:
        sys.exit(0)
        
    # open every chunk file and keep the handles for the merge
    print('processing............')
    file_handlers = []
    for path in csv_list:
        print('opening ' + path)
        file_handlers.append(open(path, 'rt'))

    # merge all chunk files in a single n-way merge, ordered by ad_id (column
    # index 120); heapq.merge is lazy, so the data is streamed rather than
    # loaded into memory, and one call beats chaining pairwise merges
    res = heapq.merge(*file_handlers, key=lambda x: int(x.split(',')[120]))

    # cnt:      lines written to the current output file
    # file_ptr: handle of the output file currently being written
    # cntfile:  index of the current output file
    os.makedirs('./csv_sort', exist_ok=True)  # make sure the output dir exists
    cnt = 0
    file_ptr = None
    cntfile = 0
    for line in res:
        if cnt == 0:
            print('creating file ' + str(cntfile))
            file_ptr = open('./csv_sort/file_' + str(cntfile) + '.csv', 'w')
        file_ptr.write(line)
        cnt += 1
        if cnt % 20000 == 0:
            print('already written: ' + str(cnt))
        # start a new output file every 540000 lines
        if cnt == 540000:
            print('file ' + str(cntfile) + ' done')
            cnt = 0
            file_ptr.close()
            cntfile += 1
    # close the last, possibly partial output file
    if cnt != 0:
        print('file ' + str(cntfile) + ' done')
        file_ptr.close()
    # close all input chunk files
    for fr in file_handlers:
        fr.close()

    print('done')
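
Once the merge finishes, a quick linear pass can confirm that the ad_id column is non-decreasing across the numbered output files. This is a small sanity-check sketch, assuming the ./csv_sort/file_<n>.csv layout produced above:

import glob

prev = None
ok = True
# iterate over output files in numeric order (file_0, file_1, ..., file_10, ...)
for path in sorted(glob.glob('./csv_sort/file_*.csv'),
                   key=lambda p: int(p.rsplit('_', 1)[1].split('.')[0])):
    with open(path, 'rt') as f:
        for line in f:
            cur = int(line.split(',')[120])
            if prev is not None and cur < prev:
                print('out of order in ' + path)
                ok = False
            prev = cur
print('sorted OK' if ok else 'sort check failed')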
