Unescaping HTML entities in Python

This code may be useful to someone:

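import sys
import xml.etree.ElementTree as ET

def parsefile(path):
   """Read a UTF-8 file, unescape its HTML entities and parse it as XML."""
   try:
      # "with" closes the file even if reading fails
      with open(path, "r") as f:
         fileread = f.read()
      # decode to unicode for unescape(), then re-encode for ElementTree
      fileread = unescape(fileread.decode('utf-8')).encode('utf-8')
   except (IOError, UnicodeDecodeError):
      print "Reading File Bug"
      sys.exit(1)
   return ET.fromstring(fileread)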

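The routine for unescaping HTML entities comes from Fredrik Lundh's website. As published, it did too much for my purposes, because it also converted &amp;, &gt; and &lt;. I wanted to keep those escaped in URLs and in the places where I had escaped code snippets, so I modified it slightly to suit my own needs.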
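
To illustrate why unescaping everything is a problem when the result is fed to an XML parser, here is a minimal Python 3 sketch. The sample string and the helper name `parses_as_xml` are made up for this illustration, and the standard library's `html.unescape` stands in for an "unescape everything" routine; it is not the function from Fredrik Lundh's site.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample: a fragment whose URL contains an escaped ampersand
# and one named HTML entity (&eacute;) that XML itself does not define.
SAMPLE = '<a href="http://example.com/?a=1&amp;b=2">caf&eacute;</a>'

def parses_as_xml(text):
    """Return True if ElementTree accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# As written, the fragment is rejected: XML does not know &eacute;.
print(parses_as_xml(SAMPLE))
# Unescaping everything resolves &eacute; but also turns &amp; into a
# bare "&", which is not well-formed XML either.
print(parses_as_xml(html.unescape(SAMPLE)))
# Resolving only &eacute; and leaving &amp; alone gives parseable XML.
print(parses_as_xml(SAMPLE.replace("&eacute;", "\u00e9")))
```

Only the last variant parses, which is the behaviour the modified routine is after: resolve the HTML-only entities, but leave the ones XML needs in escaped form.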

import re
import htmlentitydefs

def unescape(text):
   """Removes HTML or XML character references
      and entities from a text string.
      Keeps &amp;, &gt; and &lt; in the source code.
   from Fredrik Lundh
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # character reference
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            print "value error"
      else:
         # named entity
         try:
            # keep the three entities XML needs in escaped form
            if text[1:-1] == "amp":
               text = "&amp;"
            elif text[1:-1] == "gt":
               text = "&gt;"
            elif text[1:-1] == "lt":
               text = "&lt;"
            else:
               print text[1:-1]  # debug output: name being converted
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            print "keyerror"
      return text # leave as is
   return re.sub(r"&#?\w+;", fixup, text)
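
The listing above is written for Python 2 (`unichr`, `htmlentitydefs`, the `print` statement). As a rough sketch of the same idea in Python 3, assuming `html.entities.name2codepoint` as the entity table and using the hypothetical names `unescape_keep_markup` and `KEEP`, it might look like this, without the debug output:

```python
import re
from html.entities import name2codepoint

# Entities that stay escaped so the result is still well-formed XML.
KEEP = {"amp", "gt", "lt"}

def unescape_keep_markup(text):
    """Python 3 sketch: resolve entities except &amp;, &gt; and &lt;."""
    def fixup(m):
        ref = m.group(0)
        if ref.startswith("&#"):
            # numeric character reference, hexadecimal or decimal
            try:
                if ref.startswith("&#x"):
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # malformed or out of range: leave as is
        name = ref[1:-1]
        if name in KEEP or name not in name2codepoint:
            return ref  # kept on purpose, or unknown: leave as is
        return chr(name2codepoint[name])
    return re.sub(r"&#?\w+;", fixup, text)

print(unescape_keep_markup("caf&eacute; &amp; &lt;b&gt; &#233; &#xe9; &bogus;"))
```

Like the listing above, this sketch still converts numeric references, so something like `&#38;` becomes a bare `&`. If the input may contain those, they would need the same treatment as the named entities.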

Hope this helps.
