Character encoding in web pages (HTML Unicode entity encoding)

1. Encoding conversion (to Unicode)

(The code below was collected from the web.)

 

JavaScript version
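
The JavaScript listing is missing from this copy of the article. As a stand-in, here is a minimal sketch (not the original code) of what the VBScript functions in this section do: one function emits \uXXXX escapes, the other emits decimal numeric references. The names toUnicodeEscapes and toHtmlEntities are hypothetical and chosen only for this example.

```javascript
// Hypothetical helpers, written for illustration only.

// Turn every UTF-16 code unit into a \uXXXX escape (4 uppercase hex digits).
function toUnicodeEscapes(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    out += "\\u" + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

// Replace every non-ASCII character with a decimal numeric reference.
// for...of walks the string by code point, so surrogate pairs stay together.
function toHtmlEntities(str) {
  let out = "";
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    out += cp > 127 ? "&#" + cp + ";" : ch;
  }
  return out;
}

console.log(toUnicodeEscapes("你好"));    // \u4F60\u597D
console.log(toHtmlEntities("你好 ok"));   // &#20320;&#22909; ok
```

Working on code points rather than code units means a character outside the Basic Multilingual Plane becomes one reference instead of two surrogate halves.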



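VBScript version

' Convert every character of str1 to a \uXXXX escape (4 hex digits).
Function Unicode(str1)
      Dim str, temp, i
      str = ""
      For i = 1 To Len(str1)
       temp = Hex(AscW(Mid(str1, i, 1)))
       ' Left-pad to four hex digits
       If Len(temp) < 4 Then temp = Right("0000" & temp, 4)
       str = str & "\u" & temp
      Next
      Unicode = str
End Function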


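' Replace every non-ASCII character of str with a decimal numeric reference.
Function htmlentities(str)
      Dim i, char, code
      htmlentities = ""
      For i = 1 To Len(str)
          char = Mid(str, i, 1)
          code = AscW(char)
          ' AscW returns a signed 16-bit value, so U+8000 and above come back negative
          If code < 0 Then code = code + 65536
          If code > 127 Then
              htmlentities = htmlentities & "&#" & code & ";"
          Else
              htmlentities = htmlentities & char
          End If
      Next
End Function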

 

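ColdFusion version

function nochaoscode(str)
{
      var new_str = "";
      var i = 0;
      // Copy ASCII characters as-is; replace everything else with a numeric reference
      for(i=1; i lte len(str); i=i+1){
          if(asc(mid(str,i,1)) lt 128){
              new_str = new_str & mid(str,i,1);
          }else{
              // "##" is an escaped "#" inside a CFML string literal
              new_str = new_str & "&##" & asc(mid(str,i,1)) & ";";
          }
      }
      return new_str;
}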

 


 

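Appendix:

In PHP, the mb_convert_encoding function from the mbstring extension performs this conversion in both directions. For example (assuming the script file itself is saved as gb2312):

echo mb_convert_encoding("你好", "HTML-ENTITIES", "gb2312"); // outputs: &#20320;&#22909;
echo mb_convert_encoding("&#20320;&#22909;", "gb2312", "HTML-ENTITIES"); // outputs: 你好

Note that the HTML-ENTITIES pseudo-encoding is deprecated as of PHP 8.2.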
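
For comparison, the reverse direction (numeric references back to characters) can be sketched in JavaScript as follows. This is an illustration only, not a port of mbstring: the function name decodeNumericEntities is hypothetical, and only numeric references (decimal and hexadecimal) are handled, not named entities.

```javascript
// Hypothetical helper: decode &#20320; (decimal) and &#x4F60; (hex) references.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, (match, hex, dec) => {
    const cp = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Values beyond the Unicode range are left untouched instead of throwing.
    return cp <= 0x10FFFF ? String.fromCodePoint(cp) : match;
  });
}

console.log(decodeNumericEntities("&#20320;&#22909;")); // 你好
console.log(decodeNumericEntities("&#x4F60;"));         // 你
```

Named references such as &amp; pass through unchanged, so this sketch can be applied to text that mixes both kinds.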

 

To convert an entire page, add these three lines at the top of the PHP file:

mb_internal_encoding("gb2312"); // gb2312 here is your site's original encoding
mb_http_output("HTML-ENTITIES");
ob_start('mb_output_handler');


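If the mbstring extension is not enabled, see these two articles on coolcode.cn:
在任意字符集下正常显示网页的方法 (How to display web pages correctly under any character set)
在任意字符集下正常显示网页的方法(续) (How to display web pages correctly under any character set, continued)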


 

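2. HTML entities

HTML 4.01 supports the ISO 8859-1 (Latin-1) character set.

Tip: entity names are case-sensitive.

Note: the same symbol can be referenced either by entity name or by entity number. Names are easier to remember, but not every browser is guaranteed to recognize every name. Numbers do not have that problem, but they are hard to remember.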
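
To make the name/number relationship concrete, here is a small sketch that rewrites named references as numeric ones. The NAMED table is a deliberately tiny, illustrative subset and namedToNumeric is a hypothetical helper name. The lookup is case-sensitive, matching the tip above.

```javascript
// Illustrative subset of entity names and their code points (not a full list).
const NAMED = { quot: 34, amp: 38, lt: 60, gt: 62, nbsp: 160, copy: 169, Agrave: 192, agrave: 224 };

// Rewrite known named references as numeric ones; unknown names are left alone.
function namedToNumeric(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(NAMED, name) ? "&#" + NAMED[name] + ";" : match);
}

console.log(namedToNumeric("&copy; 2012 &Agrave; &agrave; &AGRAVE;"));
// &#169; 2012 &#192; &#224; &AGRAVE;
```

&Agrave; and &agrave; map to different numbers, while &AGRAVE; is not a known name at all and is passed through untouched.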


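Entity names for some ASCII characters

Display | Description | Entity name | Entity number
------- | ----------- | ----------- | -------------
" | quotation mark | &quot; | &#34;
' | apostrophe | &apos; (does not work in IE) | &#39;
& | ampersand | &amp; | &#38;
< | less-than | &lt; | &#60;
> | greater-than | &gt; | &#62;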
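
These five characters are the ones that usually need escaping before text is placed into HTML. A minimal sketch, assuming a hypothetical helper named escapeHtml; it uses the numeric form &#39; for the apostrophe because, as noted in the table, &apos; does not work in IE.

```javascript
// Map each special character to its entity; the apostrophe uses the numeric form.
const ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

// Hypothetical helper: escape text for use in HTML content or attribute values.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => ESCAPES[ch]);
}

console.log(escapeHtml(`<a href="x">Tom & Jerry's</a>`));
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;
```

All five replacements happen in a single pass, so the & inside an entity that was just produced is never escaped a second time.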

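ISO 8859-1 symbol entities

Display | Description | Entity name | Entity number
------- | ----------- | ----------- | -------------
(space) | non-breaking space | &nbsp; | &#160;
¡ | inverted exclamation mark | &iexcl; | &#161;
¤ | currency | &curren; | &#164;
¢ | cent | &cent; | &#162;
£ | pound | &pound; | &#163;
¥ | yen | &yen; | &#165;
¦ | broken vertical bar | &brvbar; | &#166;
§ | section | &sect; | &#167;
¨ | spacing diaeresis | &uml; | &#168;
© | copyright | &copy; | &#169;
ª | feminine ordinal indicator | &ordf; | &#170;
« | angle quotation mark (left) | &laquo; | &#171;
¬ | negation | &not; | &#172;
(invisible) | soft hyphen | &shy; | &#173;
® | registered trademark | &reg; | &#174;
™ | trademark | &trade; | &#8482;
¯ | spacing macron | &macr; | &#175;
° | degree | &deg; | &#176;
± | plus-or-minus | &plusmn; | &#177;
² | superscript 2 | &sup2; | &#178;
³ | superscript 3 | &sup3; | &#179;
´ | spacing acute | &acute; | &#180;
µ | micro | &micro; | &#181;
¶ | paragraph | &para; | &#182;
· | middle dot | &middot; | &#183;
¸ | spacing cedilla | &cedil; | &#184;
¹ | superscript 1 | &sup1; | &#185;
º | masculine ordinal indicator | &ordm; | &#186;
» | angle quotation mark (right) | &raquo; | &#187;
¼ | fraction 1/4 | &frac14; | &#188;
½ | fraction 1/2 | &frac12; | &#189;
¾ | fraction 3/4 | &frac34; | &#190;
¿ | inverted question mark | &iquest; | &#191;
× | multiplication | &times; | &#215;
÷ | division | &divide; | &#247;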
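
Apart from the trademark sign, the entity numbers in this table are simply the decimal ISO 8859-1 code points, which coincide with the first 256 Unicode code points. A short sketch that regenerates the "display" and "entity number" columns for that range; latin1Rows is a hypothetical helper name.

```javascript
// Hypothetical helper: list characters and numeric references for a code point range.
function latin1Rows(from = 160, to = 255) {
  const rows = [];
  for (let cp = from; cp <= to; cp++) {
    rows.push({ char: String.fromCharCode(cp), number: "&#" + cp + ";" });
  }
  return rows;
}

// Print a few rows, e.g. the copyright sign through the soft hyphen.
for (const row of latin1Rows(169, 173)) {
  console.log(row.char + " | " + row.number);
}
```

Entity names cannot be derived this way; they have to come from a lookup table such as the one above.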

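ISO 8859-1 character entities

Display | Description | Entity name | Entity number
------- | ----------- | ----------- | -------------
À | capital a, grave accent | &Agrave; | &#192;
Á | capital a, acute accent | &Aacute; | &#193;
Â | capital a, circumflex accent | &Acirc; | &#194;
Ã | capital a, tilde | &Atilde; | &#195;
Ä | capital a, umlaut mark | &Auml; | &#196;
Å | capital a, ring | &Aring; | &#197;
Æ | capital ae | &AElig; | &#198;
Ç | capital c, cedilla | &Ccedil; | &#199;
È | capital e, grave accent | &Egrave; | &#200;
É | capital e, acute accent | &Eacute; | &#201;
Ê | capital e, circumflex accent | &Ecirc; | &#202;
Ë | capital e, umlaut mark | &Euml; | &#203;
Ì | capital i, grave accent | &Igrave; | &#204;
Í | capital i, acute accent | &Iacute; | &#205;
Î | capital i, circumflex accent | &Icirc; | &#206;
Ï | capital i, umlaut mark | &Iuml; | &#207;
Ð | capital eth, Icelandic | &ETH; | &#208;
Ñ | capital n, tilde | &Ntilde; | &#209;
Ò | capital o, grave accent | &Ograve; | &#210;
Ó | capital o, acute accent | &Oacute; | &#211;
Ô | capital o, circumflex accent | &Ocirc; | &#212;
Õ | capital o, tilde | &Otilde; | &#213;
Ö | capital o, umlaut mark | &Ouml; | &#214;
Ø | capital o, slash | &Oslash; | &#216;
Ù | capital u, grave accent | &Ugrave; | &#217;
Ú | capital u, acute accent | &Uacute; | &#218;
Û | capital u, circumflex accent | &Ucirc; | &#219;
Ü | capital u, umlaut mark | &Uuml; | &#220;
Ý | capital y, acute accent | &Yacute; | &#221;
Þ | capital THORN, Icelandic | &THORN; | &#222;
ß | small sharp s, German | &szlig; | &#223;
à | small a, grave accent | &agrave; | &#224;
á | small a, acute accent | &aacute; | &#225;
â | small a, circumflex accent | &acirc; | &#226;
ã | small a, tilde | &atilde; | &#227;
ä | small a, umlaut mark | &auml; | &#228;
å | small a, ring | &aring; | &#229;
æ | small ae | &aelig; | &#230;
ç | small c, cedilla | &ccedil; | &#231;
è | small e, grave accent | &egrave; | &#232;
é | small e, acute accent | &eacute; | &#233;
ê | small e, circumflex accent | &ecirc; | &#234;
ë | small e, umlaut mark | &euml; | &#235;
ì | small i, grave accent | &igrave; | &#236;
í | small i, acute accent | &iacute; | &#237;
î | small i, circumflex accent | &icirc; | &#238;
ï | small i, umlaut mark | &iuml; | &#239;
ð | small eth, Icelandic | &eth; | &#240;
ñ | small n, tilde | &ntilde; | &#241;
ò | small o, grave accent | &ograve; | &#242;
ó | small o, acute accent | &oacute; | &#243;
ô | small o, circumflex accent | &ocirc; | &#244;
õ | small o, tilde | &otilde; | &#245;
ö | small o, umlaut mark | &ouml; | &#246;
ø | small o, slash | &oslash; | &#248;
ù | small u, grave accent | &ugrave; | &#249;
ú | small u, acute accent | &uacute; | &#250;
û | small u, circumflex accent | &ucirc; | &#251;
ü | small u, umlaut mark | &uuml; | &#252;
ý | small y, acute accent | &yacute; | &#253;
þ | small thorn, Icelandic | &thorn; | &#254;
ÿ | small y, umlaut mark | &yuml; | &#255;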
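
One regularity in this table: for the capital letters numbered 192 to 222 (215 is the multiplication sign and is skipped), the matching small letter is exactly 32 higher, and its entity name is the same apart from letter case. A small sketch of that arithmetic; lowercaseNumber is a hypothetical helper name.

```javascript
// Hypothetical helper: entity number of the small letter for a Latin-1 capital letter.
function lowercaseNumber(upperNumber) {
  if (upperNumber < 192 || upperNumber > 222 || upperNumber === 215) {
    throw new RangeError("not a Latin-1 capital letter: " + upperNumber);
  }
  return upperNumber + 32;
}

for (const n of [192, 209, 222]) {
  const lower = lowercaseNumber(n);
  console.log(`&#${n}; ${String.fromCharCode(n)} -> &#${lower}; ${String.fromCharCode(lower)}`);
}
// &#192; À -> &#224; à
// &#209; Ñ -> &#241; ñ
// &#222; Þ -> &#254; þ
```

The sharp s (223) and the small y with umlaut (255) have no capital counterpart inside ISO 8859-1, which is why the range stops at 222.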

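Other entities supported by HTML

Display | Description | Entity name | Entity number
------- | ----------- | ----------- | -------------
Œ | capital ligature OE | &OElig; | &#338;
œ | small ligature oe | &oelig; | &#339;
Š | capital S with caron | &Scaron; | &#352;
š | small s with caron | &scaron; | &#353;
Ÿ | capital Y with diaeresis | &Yuml; | &#376;
ˆ | modifier letter circumflex accent | &circ; | &#710;
˜ | small tilde | &tilde; | &#732;
(space) | en space | &ensp; | &#8194;
(space) | em space | &emsp; | &#8195;
(space) | thin space | &thinsp; | &#8201;
(invisible) | zero width non-joiner | &zwnj; | &#8204;
(invisible) | zero width joiner | &zwj; | &#8205;
(invisible) | left-to-right mark | &lrm; | &#8206;
(invisible) | right-to-left mark | &rlm; | &#8207;
– | en dash | &ndash; | &#8211;
— | em dash | &mdash; | &#8212;
‘ | left single quotation mark | &lsquo; | &#8216;
’ | right single quotation mark | &rsquo; | &#8217;
‚ | single low-9 quotation mark | &sbquo; | &#8218;
“ | left double quotation mark | &ldquo; | &#8220;
” | right double quotation mark | &rdquo; | &#8221;
„ | double low-9 quotation mark | &bdquo; | &#8222;
† | dagger | &dagger; | &#8224;
‡ | double dagger | &Dagger; | &#8225;
… | horizontal ellipsis | &hellip; | &#8230;
‰ | per mille | &permil; | &#8240;
‹ | single left-pointing angle quotation | &lsaquo; | &#8249;
› | single right-pointing angle quotation | &rsaquo; | &#8250;
€ | euro | &euro; | &#8364;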
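
The entities in this last table lie outside ISO 8859-1, and their numbers are Unicode code points written in decimal. Unicode charts usually give code points in hexadecimal, and HTML also accepts hexadecimal references of the form &#xHHHH;. A minimal sketch of the conversion; toHexReference is a hypothetical helper name.

```javascript
// Hypothetical helper: turn a decimal entity number into a hexadecimal reference.
function toHexReference(decimalNumber) {
  return "&#x" + decimalNumber.toString(16).toUpperCase() + ";";
}

console.log(toHexReference(8364)); // &#x20AC;  (euro, U+20AC)
console.log(toHexReference(8212)); // &#x2014;  (em dash, U+2014)
```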

Reposted from: https://www.cnblogs.com/zccee/archive/2012/02/04/2338515.html
