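Cleaning Text Strings

Because of how text is collected (for example, typos in form input, or even deliberately wrong entries), text strings often need to be cleaned before they can be used reliably in later processing.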

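Removing unwanted characters from a string

Text strings often contain unwanted characters, such as whitespace, at the beginning, at the end, or in the middle.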

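strip(), lstrip(), rstrip()

The strip() method removes characters from the beginning and end of a string. It accepts one parameter, chars, which specifies the set of characters to remove. If chars is omitted or None, whitespace is removed.

lstrip() and rstrip() take the same parameter as strip(); lstrip() removes only from the left end, and rstrip() only from the right end.

The following example shows how the three methods are used and what they return: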

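>>> # Removing whitespace
... s = ' hello world \n'
>>> s.strip()  # remove leading and trailing whitespace
'hello world'
>>> s.lstrip()  # remove leading whitespace
'hello world \n'
>>> s.rstrip()  # remove trailing whitespace
' hello world'
>>>
>>> # Removing other specified characters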
... s = 'www.example.com'
>>> s.strip('cmowz.')
'example'
>>> s.lstrip('cmowz.')
'example.com'
>>> s.rstrip('cmowz.')
'www.example'

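Removing whitespace is straightforward. When chars is given, it is treated as a set of characters, not as a prefix or suffix: characters are removed from the end(s) one at a time, and stripping stops at the first character that is not in the set. (In the lstrip() example, the character e is not in the set cmowz., so stripping stops there and the rest of the string is returned.)

These strip() methods are called frequently when preparing data for later use. Note, however, that they have no effect on text in the middle of a string, as the following example shows: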
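
Because chars is a set of characters rather than a literal prefix, lstrip() can remove more than intended. The following is a minimal sketch of the difference, assuming Python 3.9 or later for str.removeprefix(); the sample host name is made up for illustration.

```python
host = 'www.wow.example'  # hypothetical sample value

# lstrip() keeps removing characters for as long as they are in the set {'w', '.'}
stripped = host.lstrip('w.')

# removeprefix() (Python 3.9+) removes the exact prefix once, if it is present
unprefixed = host.removeprefix('www.')

print(stripped)    # ow.example
print(unprefixed)  # wow.example
```

If the goal is to drop a fixed prefix or suffix, removeprefix() and removesuffix() are usually the safer choice; strip() with chars fits best when any mix of the listed characters may appear at the ends.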

>>> s = ' hello      world \n'
>>> s.strip()
'hello      world'

To remove whitespace in the middle of a string, consider the replace() method or re.sub() from the regular expression module:

>>> s = 'hello      world'
>>> s.replace(' ','')
'helloworld'
>>> import re
>>> re.sub(r'\s+', ' ', s)
'hello world'
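
The two steps (trimming the ends and collapsing runs of whitespace in the middle) are often combined. The sketch below shows two possible ways to do that; the function names are hypothetical and chosen only for illustration.

```python
import re

def normalize_spaces(text):
    """Trim both ends, then collapse each internal whitespace run to one space."""
    return re.sub(r'\s+', ' ', text.strip())

def normalize_spaces_split(text):
    """Same result without a regex: split on whitespace and rejoin."""
    # str.split() with no arguments splits on whitespace runs and
    # ignores leading and trailing whitespace.
    return ' '.join(text.split())

print(normalize_spaces(' hello      world \n'))        # hello world
print(normalize_spaces_split(' hello      world \n'))  # hello world
```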

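Sanitizing text strings

Some users, as a prank, enter unusual Unicode text such as Un̄ićŏdè into a web form. The requirement here is to clean up such characters.

Cleaning text touches on a range of text-parsing and data-processing issues. In simple cases, you can convert the text to a standard form (for example, with str.upper() or str.lower()) and then use replacement operations (such as str.replace() or re.sub()) to delete or replace specific characters.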
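
As a rough illustration of the simple case, the sketch below lowercases the text, replaces tabs with spaces, and then deletes everything except ASCII letters, digits, and spaces. The function name simple_clean and the choice of which characters to keep are assumptions made for this example, not a general recommendation.

```python
import re

def simple_clean(text):
    """Lowercase, replace tabs with spaces, drop anything but a-z, 0-9 and space."""
    text = text.lower()
    text = text.replace('\t', ' ')
    return re.sub(r'[^a-z0-9 ]', '', text)

print(simple_clean('Hello,\tWorld!'))  # hello world
```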

str.translate()

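str.translate(table) returns a copy of the string in which each character has been mapped through the given translation table. The table must be an object that implements indexing via __getitem__(), typically a mapping or a sequence, and it is indexed by Unicode code point (an integer). Each entry may map to a code point, a string, or None (which deletes the character). For example: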

>>> s = 'Un̄ićŏdè, the\fworld standard\tfor text and emoji\r\n'
>>> given_map = {
...     ord('\t'): ' ',
...     ord('\f'): ' ',
...     ord('\r'): None
... }
>>> a = s.translate(given_map)
>>> a
'Un̄ićŏdè, the world standard for text and emoji\n'

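In the result, the whitespace characters \t and \f have each been remapped to a single space, and the entry mapped to None causes \r to be deleted.

The unicodedata module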
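
An equivalent table can also be built with the static method str.maketrans() instead of calling ord() by hand. The following sketch uses its three-argument form on a shortened, ASCII-only sample string:

```python
# Three-argument form: each character in the first string maps to the character
# at the same position in the second; every character in the third is deleted.
table = str.maketrans('\t\f', '  ', '\r')

s = 'the\fworld standard\tfor text\r\n'
cleaned = s.translate(table)
print(repr(cleaned))  # 'the world standard for text\n'
```

The result of str.maketrans() is a plain dictionary keyed by code point, so it can be inspected or extended like the hand-written table.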

unicodedata.normalize()

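For a string with combining diacritical marks, such as the Un̄ićŏdè example mentioned above, you can use str.translate() with a more complete translation table to remove the marks. For example: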

>>> import sys
>>> import unicodedata
>>> a = 'Un̄ićŏdè, the world standard for text and emoji\n'
>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))
>>> b = unicodedata.normalize('NFD', a)
>>> b
'Un̄ićŏdè, the world standard for text and emoji\n'
>>> b.translate(cmb_chrs)
'Unicode, the world standard for text and emoji\n'
>>>

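dict.fromkeys(iterable[, value]) returns a new dictionary whose keys come from the iterable; value defaults to None. In this example, dict.fromkeys() builds a dictionary in which the code point of every Unicode combining character is a key and every value is None.

unicodedata.normalize(form, unistr) returns the normal form form of the Unicode string unistr. Valid values for form are NFC, NFKC, NFD, and NFKD. unicodedata.combining() returns the canonical combining class of a character, which is nonzero for combining characters and 0 otherwise. The text in the example consists of letters combined with diacritical marks: NFD converts it to its fully decomposed form, and translate() then deletes all of the combining marks.

encode() and decode()

Another approach uses the encoding and decoding functions normally associated with I/O. For example: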
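
The same decompose-then-delete idea can be written without building a table that covers the whole Unicode range. The sketch below filters characters one by one; the function name strip_marks is hypothetical, and the sample strings are written with escape sequences so that the combining marks are visible in the source.

```python
import unicodedata

def strip_marks(text):
    """Decompose with NFD, then drop every character whose combining class is nonzero."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# 'n' + combining macron, 'c' + combining acute, 'o' + combining breve, 'e' + combining grave
print(strip_marks('Un\u0304ic\u0301o\u0306de\u0300'))  # Unicode

# A precomposed character (U+00E9) is decomposed first, so its mark is removed as well
print(strip_marks('caf\u00e9'))  # cafe
```

This variant avoids the up-front cost of scanning every code point, at the price of a Python-level loop per string; a prebuilt table may still be preferable when many strings are processed.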

>>> a
'Un̄ićŏdè, the world standard for text and emoji\n'
>>> b = unicodedata.normalize('NFD', a)
>>> b.encode('ascii', 'ignore').decode('ascii')
'Unicode, the world standard for text and emoji\n'

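The flow starts the same way: normalize() decomposes the accented characters. The ASCII encode/decode round trip with the 'ignore' error handler then discards every non-ASCII character, including the combining marks. This approach only works when the goal is an ASCII representation of the text.

On performance: for simple replacements, str.replace() usually has the clearer advantage. If you need to clean up more complex sets of characters by remapping or deleting them, translate() tends to perform better. Which method to use is best decided by trying and measuring the alternatives.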
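
The ASCII-only limitation is easy to overlook: characters that have no decomposition into an ASCII letter plus combining marks are dropped entirely rather than simplified. A small sketch of that behavior follows; the function name to_ascii and the sample words are chosen for illustration.

```python
import unicodedata

def to_ascii(text):
    """Decompose with NFD, then silently drop everything that is not ASCII."""
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('caf\u00e9'))    # cafe   (e + combining acute: the mark is dropped)
print(to_ascii('Stra\u00dfe'))  # Strae  (U+00DF has no decomposition and is lost)
print(to_ascii('S\u00f8ren'))   # Sren   (U+00F8 has no decomposition and is lost)
```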
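
One way to evaluate the alternatives is to time them on representative data with the standard timeit module. The sketch below is only a starting point: the sample text, repetition counts, and function names are assumptions, and the timings depend on the Python version and on the input, so no expected numbers are shown.

```python
import timeit

# Hypothetical sample input: a short line repeated to get a measurable workload
text = 'the\fworld standard\tfor text and emoji\r\n' * 1000

table = {ord('\t'): ' ', ord('\f'): ' ', ord('\r'): None}

def clean_with_replace(s):
    """Chain one str.replace() call per character to clean."""
    return s.replace('\t', ' ').replace('\f', ' ').replace('\r', '')

def clean_with_translate(s):
    """Handle all characters in a single str.translate() pass."""
    return s.translate(table)

# Both approaches must produce the same result before timing them is meaningful
assert clean_with_replace(text) == clean_with_translate(text)

for func in (clean_with_replace, clean_with_translate):
    seconds = timeit.timeit(lambda: func(text), number=200)
    print(f'{func.__name__}: {seconds:.4f} s')
```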

