We have the following three files:
wc -l breakfast_all cheap_all receptions_all
3345271 breakfast_all
955890 cheap_all
505504 receptions_all
4806665 total
head -3 cheap_all
a true
b true
c true
All three files have the same structure: the first column is a uid. The goal is to count how many distinct uids appear across the three files in total. I wrote both a Python and an awk version to compare how fast each one processes the text.
The Python code:
#!/usr/bin/env python
#coding:utf-8
import time

def t1():
    # Count distinct uids with a dict.
    dic = {}
    filelist = ["breakfast_all", "receptions_all", "cheap_all"]
    start = time.clock()
    for each in filelist:
        f = open(each, 'r')
        for line in f.readlines():
            key = line.strip().split()[0]
            if key not in dic:
                dic[key] = 1
        f.close()
    end = time.clock()
    print len(dic)
    print 'cost time is: %f' % (end - start)

def t2():
    # Count distinct uids with a set.
    uid_set = set()
    filelist = ["breakfast_all", "receptions_all", "cheap_all"]
    start = time.clock()
    for each in filelist:
        f = open(each, 'r')
        for line in f.readlines():
            key = line.strip().split()[0]
            uid_set.add(key)
        f.close()
    end = time.clock()
    print len(uid_set)
    print 'cost time is: %f' % (end - start)

t1()
t2()
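Two notes on the timing: on Unix, time.clock() measures CPU time rather than wall-clock time (the shell script below measures wall-clock time), and it was removed entirely in Python 3.8. A minimal Python 3 sketch of the set version, using time.perf_counter() for wall-clock timing and streaming each file line by line instead of loading it whole with readlines():

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import time

def count_uids(filelist):
    uid_set = set()
    start = time.perf_counter()          # wall-clock timer
    for each in filelist:
        with open(each) as f:            # closes the file automatically
            for line in f:               # streams lines; avoids reading millions of lines into memory
                uid_set.add(line.split()[0])
    end = time.perf_counter()
    print(len(uid_set))
    print('cost time is: %f' % (end - start))

count_uids(["breakfast_all", "receptions_all", "cheap_all"])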
Processing it with awk:
#!/bin/bash
function handle() {
    start=$(date +%s%N)        # timestamp in nanoseconds
    start_ms=${start:0:16}     # first 16 digits, i.e. microseconds (despite the _ms name)
    # a[$1]++ builds an associative array keyed by the first column;
    # length(a) (a gawk extension) is the number of distinct keys.
    awk '{a[$1]++} END{print length(a)}' breakfast_all receptions_all cheap_all
    end=$(date +%s%N)
    end_ms=${end:0:16}
    echo "cost time is:"
    echo "scale=6;($end_ms - $start_ms)/1000000" | bc    # microseconds -> seconds
}
handle
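For an apples-to-apples number, the awk pipeline can also be timed with the same wall-clock timer from Python. A minimal sketch, assuming the same three files and an awk that supports length() on arrays (gawk does):

#!/usr/bin/env python3
import subprocess
import time

# Same awk program and input files as in the shell script above.
cmd = ["awk", "{a[$1]++} END{print length(a)}",
       "breakfast_all", "receptions_all", "cheap_all"]

start = time.perf_counter()                          # wall-clock timer
result = subprocess.run(cmd, capture_output=True, text=True)
end = time.perf_counter()

print(result.stdout.strip())                         # distinct uid count
print('cost time is: %f' % (end - start))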
Running the Python script:
./test.py
3685715
cost time is: 4.890000
3685715
cost time is: 4.480000
Running the shell script:
./zzz.sh
3685715
cost time is:
4.865822
As the numbers show, Python's set is marginally faster than the dict here. Overall, awk's processing speed is roughly on par with Python's.