[Python] Filtering HTML tags, URL hyperlinks, and img links out of text with regular expressions

Test text:

"给大家看看原始文本。。。 ----------------------------@vczh 轮子哥求扩散。

---------------”
 

Code:

# coding: utf-8
import os
import re


def filter_file(path, filename):
    """Clean the file at path/filename and write the result to '<filename>.filter'."""

    def filter_text(text):
        re_tag = re.compile(r'</?\w+[^>]*>')  # HTML tags (opening and closing)
        new_text = re.sub(re_tag, '', text)
        new_text = re.sub(r",+", ",", new_text)   # collapse repeated commas
        new_text = re.sub(r" +", " ", new_text)   # collapse repeated spaces
        # collapse runs of ellipses ("...", "…", "。。。") into a single "..."
        new_text = re.sub(r"(?:\.{3,}|…|。{3,})+", "...", new_text)
        new_text = re.sub(r"-{2,}", "--", new_text)  # collapse runs of hyphens
        new_text = re.sub(r"———+", "———", new_text)  # collapse runs of three or more long dashes
        return new_text

    print("Start!")
    file_path = os.path.join(path, filename)
    with open(file_path, "r", encoding="utf-8") as fr:
        data = fr.readlines()
        print(len(data))  # number of lines read
    with open(file_path + ".filter", "w", encoding="utf-8") as fw:
        for line in data:
            new_line = filter_text(line)
            fw.write(new_line)
    print("Done!")
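
The title also mentions URL hyperlinks and img links, which the code above has no separate step for. The following is a minimal sketch of how those could be stripped, not the original author's code. The names `strip_markup`, `RE_IMG`, `RE_TAG`, and `RE_URL` are hypothetical, and the patterns are simplified assumptions that will not cover every form of HTML or URL.

```python
import re

# Hypothetical patterns for illustration only; deliberately simple.
RE_IMG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)  # whole <img ...> tags
RE_TAG = re.compile(r'</?\w+[^>]*>')                 # any other opening/closing tag
# Bare http(s) URLs. re.ASCII keeps \w ASCII-only, so the match stops at
# adjacent Chinese characters instead of swallowing them.
RE_URL = re.compile(r'https?://[-\w.~:/?#@!$&*+,;=%]+', re.ASCII)


def strip_markup(text):
    """Remove <img> tags, then remaining HTML tags, then bare URLs."""
    text = RE_IMG.sub('', text)
    text = RE_TAG.sub('', text)
    text = RE_URL.sub('', text)
    return text


sample = ('<p>See <a href="http://example.com/a">this</a> '
          '<img src="http://example.com/x.png"> or http://example.com/b now</p>')
print(strip_markup(sample))
```

The order matters here: tags are removed before bare URLs, so a URL inside an `href` or `src` attribute disappears together with its tag. One known limitation of this sketch is that punctuation such as a trailing `,` or `.` directly after a URL is treated as part of the URL.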

 

 
