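I needed to batch-modify Chinese characters in a number of files, so I looked into regex replacement tools. Trying them out showed the following: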
[root@Centos ~]# echo Hello World | sed 's/Hello/hi/g'
hi World
[root@Centos ~]# echo Hello World | sed 's/Hello \(\w*\)/\1/g'
World
[root@Centos ~]# echo 你好World | sed 's/[\u4e00-\u9fa5]/ /g'
sed: -e expression #1, char 21: Invalid range end
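The perl command behaves similarly: it produces garbled output when handling Chinese characters. Python's re module, on the other hand, does not interpret $1 in the replacement string as a group reference:
>>> import re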
>>> re.sub('(Hello).', 'hi', 'Hello World')
'hiWorld'
>>> re.sub('(Hello).', '$1', 'Hello World')
'$1World'
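For comparison, re.sub does support group references, only with a different syntax: \1 or \g<1> rather than $1. A minimal sketch reusing the strings from the sessions above (the variable names are only illustrative), which also shows that the \uXXXX range rejected by sed is accepted by re:

```python
import re

# Python's own syntax for group references in the replacement string
backref = re.sub(r'(Hello).', r'\1,', 'Hello World')
explicit = re.sub(r'(Hello).', r'\g<1>,', 'Hello World')

# The character range that sed rejected works here:
# each of the two Chinese characters is replaced by one space
chinese = re.sub(r'[\u4e00-\u9fa5]', ' ', '你好World')

print(backref)        # Hello,World
print(explicit)       # Hello,World
print(repr(chinese))  # '  World'
```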
Given all this, I decided to write my own regex replacement function on top of Python's re module, as follows:
import re


def replace(string: str, src: str, dst: str) -> str:
    r"""
    Replace `src` with `dst` in `string`, based on regular expressions.
    Groups captured by `src` are referenced in `dst` as $1 ... $9.

    Sample:
    >>> replace('Hello World', 'Hello', 'hi')
    'hi World'
    >>> replace('Hello World', '(Hello).', 'hi')
    'hiWorld'
    >>> replace('Hello World', '(Hello).', '$1,')
    'Hello,World'
    >>> replace('Hello World', 'Hello', '$1')
    Traceback (most recent call last):
        ...
    ValueError: group id out of range : $1
    >>> replace('你好World', r'([\u4e00-\u9fa5])(\w)', '$1 $2')
    '你好 World'
    """
    # re.A keeps \w ASCII-only, so it does not match Chinese characters
    pattern = re.compile(src, re.A)

    # Check that every $N in dst refers to a group that exists in src
    dst_group_ids = [int(i) for i in re.findall(r'\$(\d)', dst)]
    if dst_group_ids and max(dst_group_ids) > pattern.groups:
        raise ValueError('group id out of range : ${}'.format(max(dst_group_ids)))

    # No group reference in dst: plain regex substitution
    if not dst_group_ids:
        return pattern.sub(dst, string)

    # Expand $N per match, so that each match is replaced exactly once,
    # in place (a global str.replace would also rewrite unrelated text)
    def expand(match):
        return re.sub(r'\$(\d)', lambda m: match.group(int(m.group(1))) or '', dst)

    return pattern.sub(expand, string)
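An alternative way to support $1-style references is to translate them into Python's own \g<N> syntax and let re.sub do the rest. The sketch below assumes only single-digit references are needed; `to_python_template` is a hypothetical helper name introduced for illustration, not part of the function above.

```python
import re


def to_python_template(dst: str) -> str:
    """Hypothetical helper: convert $N references into re.sub's \\g<N> form."""
    # Double existing backslashes so re.sub treats them as literal text
    escaped = dst.replace('\\', '\\\\')
    # '$1' -> '\g<1>'; only single-digit group ids are handled (assumption)
    return re.sub(r'\$(\d)', r'\\g<\1>', escaped)


template = to_python_template('$1 $2')
# re.A keeps \w ASCII-only, so it cannot match a Chinese character
spaced = re.sub(r'([\u4e00-\u9fa5])(\w)', template, '你好World', flags=re.A)
print(template)  # \g<1> \g<2>
print(spaced)    # 你好 World
```

With this approach re.sub itself raises an error when the template refers to a group that the pattern does not define, so the explicit range check is not needed.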
Turn it into a script:
import argparse

parser = argparse.ArgumentParser(description=r"""This script is used to replace strings in a file. Sample: python replace.py --file 1.py --src "([\u4e00-\u9fa5])(\w)" --dst "$1 $2" """)
parser.add_argument('--file', help='a valid file path', type=str, required=True)
parser.add_argument('--src', help='the source string, which is a regular expression.', type=str, required=True)
parser.add_argument('--dst', help='the destination string', type=str, required=True)
parser.add_argument('--encoding', help='the encoding of the original file, which is utf-8 by default.', type=str, default='utf-8')
args = parser.parse_args()

try:
    # Read the file
    with open(args.file, 'r', encoding=args.encoding) as f:
        text = f.read()
    print('Handling file: {} ...'.format(args.file), end='\t\t')
    # Apply the replacement
    result = replace(text, args.src, args.dst)
    # Save the result back to the same file
    with open(args.file, 'w', encoding=args.encoding) as f:
        f.write(result)
    print('done')
except Exception as e:
    print('Error: {}'.format(str(e)))
In use, it works like a sed command that can handle Chinese characters:
for file in `find . -name "*.md"`
do
    python3 replace.py --file "$file" --src '([\u4e00-\u9fa5])(\w)' --dst '$1 $2'
done
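The shell loop splits the output of find on whitespace, so file names containing spaces would be broken apart. As a rough sketch, the same batch run could be done from Python with pathlib. `batch_sub` is a hypothetical helper, not part of replace.py, and it takes the replacement in Python's own syntax (for example r'\1 \2') rather than $1.

```python
import re
from pathlib import Path


def batch_sub(root: str, src: str, repl: str,
              glob: str = '*.md', encoding: str = 'utf-8') -> int:
    """Hypothetical helper: apply a regex substitution to files under root.

    Returns the number of files whose content changed.
    """
    # re.A keeps \w ASCII-only, matching the behaviour of replace() above
    pattern = re.compile(src, re.A)
    changed = 0
    for path in Path(root).rglob(glob):
        if not path.is_file():
            continue
        text = path.read_text(encoding=encoding)
        result = pattern.sub(repl, text)
        # Only rewrite files that actually changed
        if result != text:
            path.write_text(result, encoding=encoding)
            changed += 1
    return changed


# Example call (modifies files in place, so it is left commented out):
# batch_sub('.', r'([\u4e00-\u9fa5])(\w)', r'\1 \2')
```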