Original thread:
http://topic.csdn.net/u/20100917/22/b7186188-f403-463e-974e-1ce827c14c28.html?42239
The gist: given a file containing at least one billion IPs, find the IP(s) that are repeated the greatest number of times. Files may not be used as intermediate storage, and limited memory must be assumed. Below are my two solutions:
Solution 1:
IpCount = {}
# single pass: count occurrences of each line (one IP per line)
for a in open('ip.txt', 'r'):
    if a in IpCount:
        IpCount[a] += 1
    else:
        IpCount[a] = 1
# find the highest count, then print every IP that reached it
MaxIpCount = max(IpCount.iteritems(), key = lambda x : x[1])
MaxCount = MaxIpCount[1]
print 'Max count is %d' % MaxCount
print 'Below is the ip list that appeared %d times:' % MaxCount
for key in IpCount:
    if IpCount[key] == MaxCount:
        print '\t%s' % key,
The advantages of this algorithm: it is simple and clear, it uses the keys of Python's built-in dict as a hash of the IPs, it reads the file only once, and it is fast. If the number of distinct IPs is small, it also uses very little memory.
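For comparison, here is a minimal sketch of the same single-pass counting written with collections.Counter (available since Python 2.7). This is an equivalent form I am adding for illustration, not part of the original post, and it assumes ip.txt holds one IP per line:

from collections import Counter

# count every line in one pass; Counter is a dict subclass
counts = Counter(line.strip() for line in open('ip.txt', 'r'))
max_count = max(counts.itervalues())
print 'Max count is %d' % max_count
for ip, n in counts.iteritems():
    if n == max_count:
        print '\t%s' % ip

The strip() call also normalizes the trailing newline, so the printed IPs come out clean.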
If the one billion IPs include on the order of 100 million distinct IPs, the drawback of solution 1 is that it must maintain a dict with 100 million entries, which takes a large amount of memory.
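(A side note of my own, not from the original post: a large share of that footprint is the string keys themselves. Packing each dotted quad into a 32-bit integer keeps the single pass while shrinking every dict entry. The sketch below assumes well-formed a.b.c.d lines; ip_to_int is a name I made up.)

import socket
import struct

def ip_to_int(ip):
    # pack 'a.b.c.d' into an unsigned 32-bit integer key
    return struct.unpack('!I', socket.inet_aton(ip.strip()))[0]

IpCount = {}
for line in open('ip.txt', 'r'):
    key = ip_to_int(line)
    IpCount[key] = IpCount.get(key, 0) + 1

When even integer keys do not fit, the remaining lever is to trade extra passes over the file for memory, which is what the next solution does. Below is a solution that uses only a small amount of memory.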
IpCount = {}
ip_dict = {}
# pass 1: collect the distinct first octets as keys
for line in open('ip.txt', 'r'):
    linelist = line.split('.')
    if linelist[0] not in ip_dict:
        ip_dict[linelist[0]] = 0
# one pass per first-octet block: find that block's most frequent IPs
for ipkey in ip_dict:
    print 'start to count %s.xxx.xxx.xxx' % (ipkey)
    Ip1stCount = {}
    Ip2ndCount = {}
    for line in open('ip.txt', 'r'):
        linelist = line.split('.')
        if ipkey == linelist[0]:
            if (linelist[1], linelist[2], linelist[3]) in Ip2ndCount:
                Ip2ndCount[linelist[1], linelist[2], linelist[3]] += 1
            else:
                Ip2ndCount[linelist[1], linelist[2], linelist[3]] = 1
    if len(Ip2ndCount) > 0:
        MaxItems = max(Ip2ndCount.iteritems(), key = lambda x : x[1])
        MaxNum = MaxItems[1]
        Ip2ndList = []
        for item in Ip2ndCount.iteritems():
            if item[1] == MaxNum:
                Ip2ndList.append(item[0])
        Ip1stCount[MaxNum] = Ip2ndList
        ip_dict[ipkey] = Ip1stCount
        print '%s.xxx.xxx.xxx %d\n %s' % (ipkey, MaxNum, ip_dict[ipkey])
# compare the per-block maxima and print the overall winners
IpCount = max(ip_dict.iteritems(), key = lambda x : x[1])
MaxIpCount = list(IpCount[1])[0]
for ipkey in ip_dict:
    Count = list(ip_dict[ipkey])[0]
    if Count == MaxIpCount:
        for ip in ip_dict[ipkey][Count]:
            print '%s.%s.%s.%s %d' % (ipkey, ip[0], ip[1], ip[2].strip(), Count)
For an IP a.b.c.d, first build a dict keyed by the IP's first octet. Then, for each key, re-read the file and find the list of IPs in that block that appear most often, together with the count. Finally, comparing the per-block maxima yields the list of the most frequently repeated IPs in the whole file.
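To make the flow concrete, here is a toy trace on hypothetical data of my own, not from the original post. If ip.txt contained the four lines 1.2.3.4, 2.3.4.5, 1.2.3.4, 1.2.3.5, the first pass would collect the keys {'1', '2'}; the per-block passes would record a best count of 2 (for 1.2.3.4) in block 1.xxx.xxx.xxx and 1 in block 2.xxx.xxx.xxx; the final comparison would then print 1.2.3.4 2.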
For the second algorithm, the only benefit is the smaller memory footprint. Everything else is hard to tolerate: the algorithmic complexity grows, the ip file has to be read many times (one pass per distinct first octet, up to 256 passes), and performance drops dramatically.
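A middle ground is possible by making the memory/pass trade-off an explicit knob. The sketch below is my own variation on the same partitioning idea, not from the original post: it splits the IPs into NUM_PASSES hash buckets (an assumed parameter) and scans the file once per bucket, so each pass holds only roughly 1/NUM_PASSES of the distinct IPs in memory.

# my own variation: partition by hash bucket instead of first octet
NUM_PASSES = 16  # assumed knob: more passes -> less memory per pass

best_count = 0
best_ips = []
for p in range(NUM_PASSES):
    counts = {}
    for line in open('ip.txt', 'r'):
        ip = line.strip()
        if hash(ip) % NUM_PASSES == p:   # only this pass's bucket
            counts[ip] = counts.get(ip, 0) + 1
    # fold this bucket's counts into the running global maximum
    for ip, n in counts.iteritems():
        if n > best_count:
            best_count, best_ips = n, [ip]
        elif n == best_count:
            best_ips.append(ip)
print '%d %s' % (best_count, best_ips)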