In [1]: import re
In [2]: text
Out[2]: " \nALL this shows is that YOU don't know much about SCSI.\n\nSCSI-1 {with a SCSI-1 controler chip} range is indeed 0-5MB/s\nand that is ALL you have right about SCSI\nSCSI-1 {With a SCSI-2 controller chip}: 4-6MB/s with 10MB/s burst {8-bit}\n Note the INCREASE in SPEED, the Mac Quadra uses this version of SCSI-1\n so it DOES exist. Some PC use this set up too.\nSCSI-2 {8-bit/SCSI-1 mode}: "
In [3]: re.sub(r'[^A-Za-z0-9]+',' ',text)
Out[3]: ' ALL this shows is that YOU don t know much about SCSI SCSI 1 with a SCSI 1 controler chip range is indeed 0 5MB s and that is ALL you have right about SCSI SCSI 1 With a SCSI 2 controller chip 4 6MB s with 10MB s burst 8 bit Note the INCREASE in SPEED the Mac Quadra uses this version of SCSI 1 so it DOES exist Some PC use this set up too SCSI 2 8 bit SCSI 1 mode '
In [4]: re.sub(r'\W+', ' ',text)
Out[4]: ' ALL this shows is that YOU don t know much about SCSI SCSI 1 with a SCSI 1 controler chip range is indeed 0 5MB s and that is ALL you have right about SCSI SCSI 1 With a SCSI 2 controller chip 4 6MB s with 10MB s burst 8 bit Note the INCREASE in SPEED the Mac Quadra uses this version of SCSI 1 so it DOES exist Some PC use this set up too SCSI 2 8 bit SCSI 1 mode '
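Both calls replace every run of special characters and whitespace (including newlines) with a single space; letters and digits are kept. The two patterns are not strictly equivalent: \W also leaves underscores and, on str input, non-ASCII letters in place, but this sample contains neither, so the outputs are identical. The second approach runs roughly twice as fast as the first, although the exact ratio depends on the input and the Python version. After this processing, however, sentence delimiters such as commas and periods are gone. These delimiters can be useful when working with text data, so we can keep the common ones instead.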
In [5]: re.sub(r'[^a-zA-Z0-9,.\'!?]+',' ',text)
Out[5]: " ALL this shows is that YOU don't know much about SCSI. SCSI 1 with a SCSI 1 controler chip range is indeed 0 5MB s and that is ALL you have right about SCSI SCSI 1 With a SCSI 2 controller chip 4 6MB s with 10MB s burst 8 bit Note the INCREASE in SPEED, the Mac Quadra uses this version of SCSI 1 so it DOES exist. Some PC use this set up too. SCSI 2 8 bit SCSI 1 mode "
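The identical results of In [3] and In [4] depend on the sample being plain ASCII with no underscores. Below is a minimal sketch of where the patterns diverge, plus a small helper for the punctuation-preserving variant and a way to measure the speed difference yourself. The sample string and the names clean_text and time_patterns are hypothetical and introduced only for illustration; they are not part of the original session.

```python
import re
import timeit

# Hypothetical sample (not from the session above): underscore plus non-ASCII letters.
sample = "snake_case naïve café, 100% done!"

# Explicit ASCII class: '_' and accented letters are replaced too.
ascii_only = re.sub(r"[^A-Za-z0-9]+", " ", sample)
# \W on a str pattern is Unicode-aware and treats '_' as a word character.
word_chars = re.sub(r"\W+", " ", sample)
# re.ASCII narrows \W to [^a-zA-Z0-9_], so only '_' still survives.
word_ascii = re.sub(r"\W+", " ", sample, flags=re.ASCII)

print(repr(ascii_only))  # 'snake case na ve caf 100 done '
print(repr(word_chars))  # 'snake_case naïve café 100 done '
print(repr(word_ascii))  # 'snake_case na ve caf 100 done '


def clean_text(text, keep=",.'!?"):
    """Replace each run of characters outside A-Z, a-z, 0-9 and `keep` with one space."""
    # re.escape guards characters such as '-' or ']' that are special inside [...].
    pattern = re.compile("[^A-Za-z0-9" + re.escape(keep) + "]+")
    return pattern.sub(" ", text).strip()


def time_patterns(text, number=1000):
    """Return the seconds each pattern takes for `number` substitutions."""
    return {
        "char_class": timeit.timeit(
            lambda: re.sub(r"[^A-Za-z0-9]+", " ", text), number=number),
        "non_word": timeit.timeit(
            lambda: re.sub(r"\W+", " ", text), number=number),
    }


print(clean_text("SCSI-1 {8-bit}: 4-6MB/s, so it DOES exist.\n"))
# SCSI 1 8 bit 4 6MB s, so it DOES exist.
print(time_patterns(sample * 100))  # timings vary by machine and Python version
```

The helper also calls strip(), which removes the leading and trailing space visible in Out[3] through Out[5]; drop that call if the padding matters for your use.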
Reference:
https://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string