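At work I happened to need to process an nginx log and run a simple analysis on it:
Background:
The developers already have a log analysis platform and tools, but tracking down one particular problem required analyzing the raw log.
Requirement:
For lines of the raw log whose second-to-last field is neither empty nor '-', count the distinct values of the fourth-to-last field, ignoring values that are empty or '-'.
The Python script is as follows: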
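To make the rule concrete, here is a minimal sketch applied to a few made-up lines. The helper name wanted_uid and the sample data are illustrative assumptions and are not taken from the real log. Fields are split on a single space, so an empty field appears wherever two spaces are adjacent. Skipping lines with fewer than four fields is also an assumption.

```python
def wanted_uid(line):
    """Return the fourth-to-last field if the line qualifies, else None."""
    fields = line.rstrip('\n').split(' ')
    if len(fields) < 4:
        # Assumption: lines too short to have both fields are ignored.
        return None
    flag, uid = fields[-2], fields[-4]
    if flag in ('', '-') or uid in ('', '-'):
        return None
    return uid


# Made-up sample lines (hypothetical, not the real log format).
samples = [
    'GET /a 200 u1 x ok y',   # qualifies, uid u1
    'GET /b 200 u1 x ok y',   # qualifies, uid u1 again (duplicate)
    'GET /c 200 u2 x - y',    # second-to-last field is '-', skipped
    'GET /d 200 - x ok y',    # fourth-to-last field is '-', skipped
    'GET /e 200 u3 x ok y',   # qualifies, uid u3
]

distinct = {uid for uid in map(wanted_uid, samples) if uid is not None}
print(len(distinct))  # prints 2
```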
#!/usr/bin/env python
#encoding=utf-8
# nginx_log_analysis.py

# Read the whole log into memory.
FileHd = open('aaa.com_access.log-20160506','r')
FileText = FileHd.readlines()
FileTextTemp = []
FileTextTempSplit = []
AAA_UID = []
FileHd.close()

# Split every line into space-separated fields.
for i in range(len(FileText)):
    FileTextTemp.append(FileText[i])
    FileTextTempSplit.append(FileTextTemp[i].split(' '))

# Collect the fourth-to-last field of every line whose second-to-last
# and fourth-to-last fields are both non-empty and not '-'.
for i in range(len(FileTextTempSplit)):
    # NOTE: j is never used. This inner loop repeats the same test once per
    # field, so a matching value is appended many times. The set() below
    # hides the duplicates, but the non-deduplicated variant would be inflated.
    for j in range(len(FileTextTempSplit[i])):
        length = len(FileTextTempSplit[i])
        if FileTextTempSplit[i][length-2] != '-'\
           and len(FileTextTempSplit[i][length-2]) != 0\
           and FileTextTempSplit[i][length-4] != '-'\
           and len(FileTextTempSplit[i][length-4]) != 0:
            AAA_UID.append(FileTextTempSplit[i][length-4])

# Non-deduplicated aaa_uid statistics (disabled):
# STATS_FD = open('stats.txt','w')
# for aaa_uid in AAA_UID:
#     STATS_FD.writelines(aaa_uid+'\n')
# STATS_FD.close()

# Deduplicated aaa_uid statistics.
count = 0
STATS_FD = open('stats_uniq.txt','w')
AAA_UID_UNIQ = list(set(AAA_UID))
for aaa_uid in AAA_UID_UNIQ:
    STATS_FD.writelines(aaa_uid+'\n')
    count += 1
STATS_FD.close()
print(count)
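Processing a log of just under 280 MB this way, with the script run under time:
time python nginx_log_analysis.py
takes a little over 14 seconds (on a virtual machine with 2 cores and 4 GB of memory).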
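For comparison, the same count can be sketched as a single streaming pass that never holds the whole file in memory and tests each line only once. This is a sketch under stated assumptions, not a measured replacement: it has not been timed against the figure above, the function names count_distinct_uids and write_uids are made up for illustration, and it skips lines with fewer than four fields.

```python
def count_distinct_uids(log_path):
    """Stream the log once and return the set of qualifying uids."""
    uids = set()
    with open(log_path, 'r') as log:
        for line in log:  # iterating the file reads one line at a time
            fields = line.rstrip('\n').split(' ')
            if len(fields) < 4:
                # Assumption: malformed short lines are skipped.
                continue
            if fields[-2] in ('', '-') or fields[-4] in ('', '-'):
                continue
            uids.add(fields[-4])  # the set deduplicates as it goes
    return uids


def write_uids(uids, out_path):
    """Write one uid per line, in sorted order for a stable output file."""
    with open(out_path, 'w') as out:
        out.writelines(uid + '\n' for uid in sorted(uids))
```

A typical call would be uids = count_distinct_uids('aaa.com_access.log-20160506'), followed by write_uids(uids, 'stats_uniq.txt') and print(len(uids)).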