What to Do When Text Analysis Runs into Mojibake (ง'⌣')ง

Mojibake comes up all the time in text analysis. When we hit an encoding problem there is usually not much we can do, so we just throw the broken text away:

text = open(file, errors='ignore').read()

But that discards information. So how do we deal with the encoding gremlins that so often wreak havoc in text analysis?
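To see exactly what gets lost, here is a small illustration of my own (not from the original post): decoding bytes with errors='ignore' silently drops every byte the codec cannot handle.

```python
# A UTF-8 encoded string containing one non-ASCII character.
data = 'café'.encode('utf-8')  # b'caf\xc3\xa9'

# errors='ignore' silently drops the two bytes of 'é'.
lossy = data.decode('ascii', errors='ignore')
print(lossy)  # caf
```

The accented character vanishes without a warning, which is precisely the information loss the paragraph above warns about.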

Silently chant "Python is great" and reach for ftfy (fixes text for you), a library that cleans up mojibake for us.
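A quick aside (my own illustration, not from the post): most mojibake arises when UTF-8 bytes are decoded with the wrong codec, such as Latin-1. Reproducing that mistake makes it clear what ftfy has to undo:

```python
s = 'ünicode'

# Encode as UTF-8, then (wrongly) decode as Latin-1: classic mojibake.
mojibake = s.encode('utf-8').decode('latin-1')
print(mojibake)  # Ã¼nicode
```

ftfy's job is to detect and reverse exactly this kind of double-encoding damage.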

Installation

!pip3 install ftfy==5.6

Examples of mojibake (ง'⌣')ง

I found these strange-looking strings in the official documentation; chances are you have run into data like this yourself.

(ง'⌣')ง
ünicode
Broken text… it’s flubberific!
HTML entities &lt;3
¯\\_(ã\x83\x84)_/¯
\ufeffParty like\nit’s 1999!
LOUD NOISES
This â€” should be an em dash
This text was never UTF-8 at all\x85
\033[36;44mI'm blue, da ba dee da ba doo...\033[0m
\u201chere\u2019s a test\u201d
This string is made of two things:\u2029 1. Unicode\u2028 2. Spite

ftfy.fix_text: a cure for broken strings

The fix_text function in ftfy can tame the vast majority of mojibake like (ง'⌣')ง.

from ftfy import fix_text
fix_text("(ง'⌣')ง")
"(ง'⌣')ง"

fix_text('ünicode')
'ünicode'

fix_text('Broken text… it’s flubberific!')
"Broken text… it's flubberific!"

fix_text('HTML entities &lt;3')
'HTML entities <3'

fix_text("¯\\_(ã\x83\x84)_/¯")
'¯\\_(ツ)_/¯'

fix_text('\ufeffParty like\nit’s 1999!')
"Party like\nit's 1999!"

fix_text('LOUD NOISES')
'LOUD NOISES'

fix_text('único')
'único'

fix_text('This â€” should be an em dash')
'This — should be an em dash'

fix_text('This text is sad .â\x81”.')
'This text is sad .⁔.'

fix_text('The more you know 🌠')
'The more you know 🌠'

fix_text('This text was never UTF-8 at all\x85')
'This text was never UTF-8 at all…'

fix_text("\033[36;44mI'm blue, da ba dee da ba doo...\033[0m")
"I'm blue, da ba dee da ba doo..."

fix_text('\u201chere\u2019s a test\u201d')
'"here\'s a test"'

text = "This string is made of two things:\u2029 1. Unicode\u2028 2. Spite"
fix_text(text)
'This string is made of two things:\n 1. Unicode\n 2. Spite'

ftfy.fix_file: a cure for broken files

The examples above all fix individual strings, but ftfy can also process a mojibake file directly. I won't demonstrate it here; the next time you run into mojibake, just remember that there is a library called ftfy (fixes text for you) whose fix_text and fix_file can help.
