[Sentence splitting] split

  • I saw this written in fairseq and didn't quite understand it; after checking the official documentation I found:

If you use .split() instead, the program will take into account multiple spaces, newlines, tabs and all other forms of whitespace. That should get you what you're looking for.

The splitting method in fairseq

  • re.compile(r'\s+').sub(' ', line) replaces any run of consecutive spaces, tabs, and newlines with a single space; the result is then split with the standard split().
import re

# matches one or more consecutive whitespace characters (spaces, tabs, newlines)
SPACE_NORMALIZER = re.compile(r"\s+")


def tokenize_line(line):
    # collapse every whitespace run into a single space, trim the ends,
    # then split on the remaining single spaces
    line = SPACE_NORMALIZER.sub(" ", line)
    line = line.strip()
    return line.split()

In fact this is equivalent to a plain split() with no arguments:

>>> teststr = "a   v w   ef sdv   \n   wef"
>>> print(teststr)
a   v w   ef sdv   
   wef
>>> teststr.split()
['a', 'v', 'w', 'ef', 'sdv', 'wef']
>>> teststr.split(" ")
['a', '', '', 'v', 'w', '', '', 'ef', 'sdv', '', '', '\n', '', '', 'wef']
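The equivalence claim can be checked directly. A minimal sketch that reproduces fairseq's tokenize_line (names copied from the snippet above) and asserts it produces the same tokens as a bare split() on a string with mixed whitespace:

```python
import re

# same pattern as fairseq's tokenize_line
SPACE_NORMALIZER = re.compile(r"\s+")


def tokenize_line(line):
    # normalize all whitespace runs to single spaces, strip, then split
    line = SPACE_NORMALIZER.sub(" ", line)
    line = line.strip()
    return line.split()


# a string containing repeated spaces and a newline
teststr = "a   v w   ef sdv   \n   wef"

# both paths yield the same token list
assert tokenize_line(teststr) == teststr.split()
print(tokenize_line(teststr))  # ['a', 'v', 'w', 'ef', 'sdv', 'wef']
```

So for tokenization the normalize-then-split dance adds nothing over split() itself; the pre-normalized line is only useful if the code also wants the cleaned string for other purposes.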
