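Today a colleague tossed me a file and asked me to filter out certain lines. Time was short, so my first thought was the shell's grep command, but I quickly found that grep couldn't do the job. I switched to Python, and after delivering the Python version I realized that awk could have done it faster. This post records the whole exercise.
My colleague asked me to split the following file data (a sample, anonymized) into 3 files: lines whose third column starts with BS, BV, BX, or BT; lines whose second column contains 鸡 or 鱼; and all remaining lines.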
3,鱼涌公园,B200FFABGOSU
3,鲤鱼门咀,2200
3,鲗鱼涌,BS22325433
3,鲗鱼涌,BV2200324333
3,鸟湖,B20073C200XJ2L
3,鸭兜排,200
3,鸭洲,200
3,鹤咀,200
3,鸡公排,200
3,鸡公头,200
3,鸡宜环,200
3,鸡山,200
3,鸡洲,200
3,鸡脷排,200
3,鸡脷洲,200
3,鸡髀下,200
3,鸬鹚排,200
3,鸿日升科技,B200FFJE2F7I
3,鸿升办馆,B20073C2004U22
3,鹿湖郊游径,B20073C20035LT
3,麦径3段,B20073C2002QI3
3,麦理浩径,B20073C2002VDY
3,麦理浩径,B20073C20035LW
3,黄埔,BS22737322
3,黄埔,BV2200789232
3,黄大仙,BS2200327829
3,黃大仙,BV220032423200
...
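Python implementation
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import sys

# Column 3 prefixes that send a line to the ".0" file
PREFIXES = ('BS', 'BV', 'BX', 'BT')
# Column 2 keywords that send a line to the ".1" file
PATTERN = re.compile('鸡|鱼')


def main():
    origin_filepath = sys.argv[1]
    filepath1 = '{}.{}'.format(origin_filepath, '0')
    filepath2 = '{}.{}'.format(origin_filepath, '1')
    filepath3 = '{}.{}'.format(origin_filepath, '2')
    # Open every file once, always as UTF-8; 'w' keeps reruns from appending duplicates
    with open(origin_filepath, 'r', encoding='utf-8') as f, \
            open(filepath1, 'w', encoding='utf-8') as f1, \
            open(filepath2, 'w', encoding='utf-8') as f2, \
            open(filepath3, 'w', encoding='utf-8') as f3:
        for line in f:
            fields = line.rstrip('\n').split(',')
            # Pad short lines so a malformed row lands in the "rest" file
            # instead of raising IndexError
            fields += [''] * (3 - len(fields))
            flag = True  # stays True until the line matches one of the two rules
            if fields[2].startswith(PREFIXES):
                flag = False
                f1.write(line)
            if PATTERN.search(fields[1]):
                flag = False
                f2.write(line)
            if flag:
                f3.write(line)


if __name__ == '__main__':
    main()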
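One thing the script does not make obvious is that the first two rules are not mutually exclusive: a line such as 3,鲗鱼涌,BS22325433 matches both and is written to both the .0 and .1 files. The sketch below pulls the rules into a pure function so that this is easy to see and to test. The name classify and the choice to send malformed lines to the "rest" bucket are my own assumptions, not part of the original script.

```python
import re

# Prefixes and keywords copied from the script above.
PREFIXES = ('BS', 'BV', 'BX', 'BT')
KEYWORDS = re.compile('鸡|鱼')


def classify(line):
    """Return the set of output-file suffixes (0, 1, 2) that a line belongs to."""
    fields = line.rstrip('\n').split(',')
    if len(fields) < 3:
        # Assumption: a malformed line (for example a literal "...") counts as "rest".
        return {2}
    buckets = set()
    if fields[2].startswith(PREFIXES):
        buckets.add(0)
    if KEYWORDS.search(fields[1]):
        buckets.add(1)
    # A line that matched neither rule goes to the "rest" file.
    return buckets or {2}


if __name__ == '__main__':
    samples = (
        '3,鲗鱼涌,BS22325433',
        '3,鸡山,200',
        '3,黄埔,BV2200789232',
        '3,鸟湖,B20073C200XJ2L',
    )
    for sample in samples:
        print(sample, sorted(classify(sample)))
```

Running it on the four sample lines shows one line landing in two files, two lines landing in exactly one rule file, and one line falling through to the rest.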
awk implementation
# Select lines whose 3rd column starts with "BS", "BV", "BX", or "BT"
awk -F "," '$3 ~ /^(BS|BV|BX|BT)/' test.txt > ./test.txt.0
# Select lines whose 2nd column contains "鸡" or "鱼"
awk -F "," '$2 ~ /鸡|鱼/' test.txt > ./test.txt.1
# Everything left over after the two rules above
awk -F "," '$2 !~ /鸡|鱼/ && $3 !~ /^(BS|BV|BX|BT)/' test.txt > ./test.txt.2
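Since the three commands run independently, it is worth a quick sanity check that their outputs really form a consistent split of the source. Here is a minimal sketch of such a check in Python; the names check_split and read_lines are hypothetical helpers of mine, and the check compares sets of lines, so it ignores how many times a duplicated line appears.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines (newlines kept)."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.readlines()


def check_split(lines, matched_prefix, matched_keyword, rest):
    """Return True if the three outputs form a consistent split of lines.

    Raises ValueError when a matched line also shows up in the rest file,
    or when the three outputs together do not equal the source.
    """
    source = set(lines)
    hit = set(matched_prefix) | set(matched_keyword)
    leftover = set(rest)
    if hit & leftover:
        raise ValueError('a matched line also appears in the rest file')
    if hit | leftover != source:
        raise ValueError('the outputs do not cover the source exactly')
    return True


if __name__ == '__main__':
    import sys
    base = sys.argv[1]  # e.g. test.txt, with test.txt.0/.1/.2 next to it
    print(check_split(read_lines(base),
                      read_lines(base + '.0'),
                      read_lines(base + '.1'),
                      read_lines(base + '.2')))
```

The .0 and .1 files are allowed to overlap with each other, which is why only the rest file is checked for disjointness.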