Beginning Python (3rd Edition, Chinese translation), Chapter 20, Project 1: Instant Markup (plain text to HTML) (notes)

                     Chapter 20  Project 1: Instant Markup (plain text to HTML)


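1. Problem description


Add HTML tags to a plain-text file so that it becomes an HTML document.
The task is to classify the text elements and then mark each one up.
Goals:
The input should not need to contain any artificial codes or tags.
The program should handle different kinds of text blocks.
It should be extensible, that is, able to support other markup languages.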

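2. Useful tools


Required: reading and writing files, producing output.
Possibly useful: iterating over the input lines, string methods, generators, the re module.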

3. Preparation


A plain-text file for testing, test_input.txt:

Welcome to World Wide Spam, Inc.

These are the corporate web pages of *World Wide Spam*, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.

A short history of the company

World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.

After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.

Destinations

From this page you may visit several of our interesting web pages:

 - What is SPAM? (http://wwspam.fu/whatisspam)

 - How do they make it? (http://wwspam.fu/howtomakeit)

 - Why should I eat it? (http://wwspam.fu/whyeatit)

How to get in touch with us

You can get in touch with us in *many* ways: By phone (555-1234), by
email ([email protected]) or by visiting our customer feedback page
(http://wwspam.fu/feedback).


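4. First implementation


First, split the text into paragraphs, that is, find the text blocks.

In the test file, blocks are separated by one or more blank lines.
So a block can be obtained by collecting the lines that come before a blank line. Create util.py to produce the text blocks: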

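# Line generator: yields every line of the file, then one extra blank
# line, so that the last block is always followed by a blank line.
def lines(file):
    for line in file:
        yield line
    yield '\n'

# Block generator: collects consecutive non-blank lines and yields them
# as a single string with leading and trailing whitespace removed.
def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []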
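
As a quick check of the two generators above, the sketch below feeds blocks() an in-memory string instead of a file. The sample text and the names sample and result are made up for illustration; io.StringIO simply stands in for an open text file.

```python
import io

def lines(file):
    """Yield every line of the file, then one extra blank line as a sentinel."""
    for line in file:
        yield line
    yield '\n'

def blocks(file):
    """Yield each run of consecutive non-blank lines as one stripped string."""
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

# Illustrative input: three blocks, the last one without a trailing newline.
sample = io.StringIO("Title\n\nfirst line\nsecond line\n\n\nlast block")
result = list(blocks(sample))
print(result)
```

Without the extra blank line yielded by lines(), the final block ("last block") would be collected but never yielded, because the input ends without a blank line.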

Next, add markup to the text blocks.
Create the markup program simple_markup.py:

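import sys, re
from util import *

# Opening boilerplate of the HTML page
print('<html><head><title>...</title></head><body>')

title = True
for block in blocks(sys.stdin):
    # Turn *emphasized* text into <em>emphasized</em>
    block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
    if title:
        # The first block is treated as the heading
        print('<h1>')
        print(block)
        print('</h1>')
        title = False
    else:
        # Every other block is treated as a paragraph
        print('<p>')
        print(block)
        print('</p>')

print('</body></html>')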

Run the markup program from the command prompt (cmd): python simple_markup.py < test_input.txt > test_output.html

This produces test_output.html. Opened in a browser, it shows an article with a heading and paragraphs.
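
To try the same logic without the command line, here is a minimal sketch that works on a string and returns the marked-up blocks. It rests on some assumptions: to_html is a hypothetical helper name, emphasis is assumed to be rendered with <em> tags, and the first block is assumed to become an <h1> heading and every later block a <p> paragraph. The blocks() variant here flushes the last block after the loop instead of relying on an extra blank line, which is a different technique from util.py.

```python
import io
import re

def blocks(file):
    """Yield each run of non-blank lines as one stripped string."""
    block = []
    for line in file:
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []
    if block:  # flush the last block when the input has no trailing blank line
        yield ''.join(block).strip()

def to_html(text):
    """Hypothetical helper: return the marked-up blocks of text as a list."""
    parts = []
    for i, block in enumerate(blocks(io.StringIO(text))):
        # Non-greedy (.+?) keeps two emphasized words on one line separate.
        block = re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)
        tag = 'h1' if i == 0 else 'p'  # assumption: first block is the heading
        parts.append('<{0}>{1}</{0}>'.format(tag, block))
    return parts

html = to_html("Welcome\n\nThis is *very* important, *really*.\n")
print('\n'.join(html))
```

With a greedy pattern such as \*(.+)\* the whole span from the first asterisk to the last would be wrapped in a single pair of tags, which is why the non-greedy form is used.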
