A Python version of html_entity_decode(text):

While coding, I needed to convert all HTML entities to their applicable characters.

A quick Google search showed that PHP has a ready-made function for this:

string html_entity_decode ( string $string [, int $quote_style = ENT_COMPAT [, string $charset ]] )

Since I was modifying a Python program at the time, I wrote a Python version of the same function:


def html_entity_decode(text):
    """
    Removes HTML or XML character references and entities from a text string.

    @param text The HTML (or XML) source text.
    @return The plain text, as a Unicode string, if necessary.
    """
    # Python 2 only: htmlentitydefs and unichr became html.entities and chr in Python 3.
    import re, htmlentitydefs
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
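The function above relies on Python 2 names (htmlentitydefs, unichr). For readers on Python 3, here is a minimal sketch of the same logic, assuming only the standard library. The name html_entity_decode_py3 is hypothetical and chosen just for this illustration. The standard library's html.unescape (available since Python 3.4) covers the same use case and is probably the simpler choice in new code.

```python
import html
import re
from html.entities import name2codepoint


def html_entity_decode_py3(text):
    """Replace HTML/XML character references and named entities in text.

    Hypothetical Python 3 port of the same approach; unknown or malformed
    references are left unchanged.
    """
    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # numeric character reference, decimal or hexadecimal
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except ValueError:
                pass
        else:
            # named entity such as &amp; or &lt;
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass
        return ref  # leave as is

    return re.sub(r"&#?\w+;", fixup, text)


sample = "&lt;p&gt; &amp; &#65;&#x42; &bogus;"
print(html_entity_decode_py3(sample))  # <p> & AB &bogus;
print(html.unescape("&lt;p&gt; &amp; &#65;&#x42;"))  # <p> & AB
```

Both calls decode named, decimal, and hexadecimal references. The hand-written version leaves anything it does not recognize (here &bogus;) untouched, which matches the "leave as is" fallback in the original function.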
