HTML URL Encoding (URL Encode and URL Decode)

URL encoding - from %00 to %8f

Character        URL-encode  Character  URL-encode  Character  URL-encode
NUL              %00         0          %30         `          %60
SOH              %01         1          %31         a          %61
STX              %02         2          %32         b          %62
ETX              %03         3          %33         c          %63
EOT              %04         4          %34         d          %64
ENQ              %05         5          %35         e          %65
ACK              %06         6          %36         f          %66
BEL              %07         7          %37         g          %67
backspace        %08         8          %38         h          %68
tab              %09         9          %39         i          %69
linefeed         %0a         :          %3a         j          %6a
VT               %0b         ;          %3b         k          %6b
FF               %0c         <          %3c         l          %6c
carriage return  %0d         =          %3d         m          %6d
SO               %0e         >          %3e         n          %6e
SI               %0f         ?          %3f         o          %6f
DLE              %10         @          %40         p          %70
DC1              %11         A          %41         q          %71
DC2              %12         B          %42         r          %72
DC3              %13         C          %43         s          %73
DC4              %14         D          %44         t          %74
NAK              %15         E          %45         u          %75
SYN              %16         F          %46         v          %76
ETB              %17         G          %47         w          %77
CAN              %18         H          %48         x          %78
EM               %19         I          %49         y          %79
SUB              %1a         J          %4a         z          %7a
ESC              %1b         K          %4b         {          %7b
FS               %1c         L          %4c         |          %7c
GS               %1d         M          %4d         }          %7d
RS               %1e         N          %4e         ~          %7e
US               %1f         O          %4f         DEL        %7f
space            %20         P          %50         €          %80
!                %21         Q          %51         (unused)   %81
"                %22         R          %52         ‚          %82
#                %23         S          %53         ƒ          %83
$                %24         T          %54         „          %84
%                %25         U          %55         …          %85
&                %26         V          %56         †          %86
'                %27         W          %57         ‡          %87
(                %28         X          %58         ˆ          %88
)                %29         Y          %59         ‰          %89
*                %2a         Z          %5a         Š          %8a
+                %2b         [          %5b         ‹          %8b
,                %2c         \          %5c         Œ          %8c
-                %2d         ]          %5d         (unused)   %8d
.                %2e         ^          %5e         Ž          %8e
/                %2f         _          %5f         (unused)   %8f

URL encoding - from %90 to %ff

Character        URL-encode  Character  URL-encode  Character  URL-encode
(unused)         %90         À          %c0         ð          %f0
‘                %91         Á          %c1         ñ          %f1
’                %92         Â          %c2         ò          %f2
“                %93         Ã          %c3         ó          %f3
”                %94         Ä          %c4         ô          %f4
•                %95         Å          %c5         õ          %f5
en dash          %96         Æ          %c6         ö          %f6
em dash          %97         Ç          %c7         ÷          %f7
˜                %98         È          %c8         ø          %f8
™                %99         É          %c9         ù          %f9
š                %9a         Ê          %ca         ú          %fa
›                %9b         Ë          %cb         û          %fb
œ                %9c         Ì          %cc         ü          %fc
(unused)         %9d         Í          %cd         ý          %fd
ž                %9e         Î          %ce         þ          %fe
Ÿ                %9f         Ï          %cf         ÿ          %ff
nbsp             %a0         Ð          %d0
¡                %a1         Ñ          %d1
¢                %a2         Ò          %d2
£                %a3         Ó          %d3
¤                %a4         Ô          %d4
¥                %a5         Õ          %d5
¦                %a6         Ö          %d6
§                %a7         ×          %d7
¨                %a8         Ø          %d8
©                %a9         Ù          %d9
ª                %aa         Ú          %da
«                %ab         Û          %db
¬                %ac         Ü          %dc
soft hyphen      %ad         Ý          %dd
®                %ae         Þ          %de
¯                %af         ß          %df
°                %b0         à          %e0
±                %b1         á          %e1
²                %b2         â          %e2
³                %b3         ã          %e3
´                %b4         ä          %e4
µ                %b5         å          %e5
¶                %b6         æ          %e6
·                %b7         ç          %e7
¸                %b8         è          %e8
¹                %b9         é          %e9
º                %ba         ê          %ea
»                %bb         ë          %eb
¼                %bc         ì          %ec
½                %bd         í          %ed
¾                %be         î          %ee
¿                %bf         ï          %ef
from : http://www.w3school.com.cn/tags/html_ref_urlencode.html
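
Which byte value a character gets depends on the character set applied before percent-encoding. The sketch below is a minimal Python illustration of that point, using entries such as "é %e9" from the table. `percent_encode` is a hypothetical helper written for this example, not a library function.

```python
from urllib.parse import quote, unquote

def percent_encode(text: str, charset: str) -> str:
    """Percent-encode every byte of text after encoding it with charset."""
    return "".join(f"%{b:02x}" for b in text.encode(charset))

print(percent_encode(" ", "ascii"))      # %20
print(percent_encode("é", "latin-1"))    # %e9, as in the table
print(percent_encode("é", "utf-8"))      # %c3%a9, two bytes in UTF-8

# urllib.parse.quote does the same job, but leaves unreserved characters
# alone and writes the hex digits in upper case.
print(quote("é", encoding="latin-1"))      # %E9
print(unquote("%e9", encoding="latin-1"))  # é
```

The hex digits are case-insensitive, so %e9 and %E9 decode to the same byte.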


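About URL Encode and URL Decode

Most websites encode Chinese characters that appear in a URL. This section looks at the two common encodings and at how to implement encoding and decoding for each in Perl.

The two common character encodings are GBK and UTF-8, so URL encoding comes in two variants as well: GBK-based and UTF-8-based.

1: URL encoding with GBK.

1) First encode the string as GBK. Note that when the Chinese text already arrives as GBK bytes (as it does in a GBK environment), it must not be GBK-encoded again. In practice, to URL-encode a URL that contains Chinese characters, just call the URL-encoding function directly.

2) Then URL-encode the result.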

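use strict;
use warnings;
use Encode ();
use URI::Escape ();

while (<>) {
        chomp;
        my $gbkec = Encode::encode("gbk", $_); # GBK-encode the string; skip this step for input that is already GBK bytes, otherwise it is encoded twice
        my $gbkuec = URI::Escape::uri_escape($gbkec); # URL-encode the GBK-encoded string
        my $encode = URI::Escape::uri_escape($_); # input that is already GBK bytes can be URL-encoded directly
        print "$_ ->[GBK]$gbkec ->[URLencode]$gbkuec\n";
        print "$_ -> [URLencode]$encode\n";
}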
Test output:
@# ->[GBK]@# ->[URLencode]%40%23
@# -> [URLencode]%40%23
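
For comparison, the same two steps (turn the text into GBK bytes, then percent-encode the bytes) can be sketched in Python with the standard `urllib.parse.quote`. This is an illustration rather than a port of the Perl script, and the sample string 测试 is an assumption chosen for the demo.

```python
from urllib.parse import quote

text = "测试"

# Step 1: turn the characters into GBK bytes.
gbk_bytes = text.encode("gbk")

# Step 2: percent-encode the bytes; safe="" escapes every reserved character.
encoded = quote(gbk_bytes, safe="")
print(encoded)  # %B2%E2%CA%D4

# quote() can also do both steps at once through its encoding argument.
assert quote(text, safe="", encoding="gbk") == encoded

# Pure ASCII input gives the same result whichever charset is chosen.
print(quote("@#", safe="", encoding="gbk"))  # %40%23
```

Because Python strings are sequences of characters rather than bytes, the "do not encode twice" pitfall shows up here as a type distinction: `quote` takes either bytes or a string plus a charset.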

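Next comes decoding: URL-decode first, then GBK-decode.
When decoding, again take care that Chinese text is not GBK-decoded a second time. The code is as follows: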

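use strict;
use warnings;
use Encode ();
use URI::Escape ();

while (<>) {
        chomp;
        my $gbkec = URI::Escape::uri_unescape($_); # URL-decode the input
        my $chstr = Encode::decode("gbk", $gbkec); # GBK-decode the URL-decoded string; skip this step if the GBK bytes are to be used as they are

        my $decode = URI::Escape::uri_unescape($_); # for a URL that encodes GBK Chinese text, this step alone is enough
        print "$_ ->[URLdecode]$gbkec ->[GBKDecode]$chstr\n";
        print "$_ -> [URLdecode]$decode\n";
}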

Test output:
%40%23
%40%23 ->[URLdecode]@# ->[GBKDecode]@#
%40%23 -> [URLdecode]@#
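
The decoding direction can be sketched the same way in Python. The input %B2%E2%CA%D4 is an assumed sample (the GBK percent-encoding of 测试), not output taken from the Perl run above.

```python
from urllib.parse import unquote, unquote_to_bytes

encoded = "%B2%E2%CA%D4"

# Step 1: undo the percent-encoding to recover the raw bytes.
raw = unquote_to_bytes(encoded)
print(raw)  # b'\xb2\xe2\xca\xd4'

# Step 2: interpret those bytes as GBK to get the characters back.
print(raw.decode("gbk"))  # 测试

# unquote() combines both steps when it is told the charset.
print(unquote(encoded, encoding="gbk"))  # 测试
```

The order matters: the percent-encoding wraps the GBK bytes, so it has to be removed before the bytes can be read as GBK.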

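2: URL encoding with UTF-8.

Because the Chinese input arrives as GBK bytes, a conversion function has to be called both before URL encoding and after URL decoding. Note that utf8::encode applied directly to GBK bytes treats each byte as a Latin-1 character. The result round-trips correctly, but it is not the standard UTF-8 percent-encoding of the text; to get that, first decode the GBK bytes into characters (for example with Encode::decode("gbk", ...)).
For UTF-8 encoding and decoding we use the functions of Perl's utf8 module (utf8::encode and utf8::decode).
The following encodes and then decodes again: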

use strict;
use warnings;
use URI::Escape ();

while (<>) {
        chomp;

        my $utf8str = $_;
        utf8::encode($utf8str); # UTF-8 encode; the function modifies the variable in place
        my $encode = URI::Escape::uri_escape($utf8str); # URL-encode the UTF-8 encoded string
        print "UTF-8 Encode:[$_] -> [$utf8str] -> [$encode]\n";

        my $urldecode = URI::Escape::uri_unescape($encode); # URL-decode the UTF-8 based URL encoding
        my $decode = $urldecode;
        utf8::decode($decode); # UTF-8 decode the URL-decoded string, again in place
        print "UTF-8 Decode:[$encode] -> [$urldecode] -> [$decode]\n";
}

Test output:
测试
UTF-8 Encode:[测试] -> [???????”] -> [%C2%B2%C3%A2%C3%8A%C3%94]
UTF-8 Decode:[%C2%B2%C3%A2%C3%8A%C3%94] -> [???????”] -> [测试]
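
The value %C2%B2%C3%A2%C3%8A%C3%94 in the output above is what results when the four GBK bytes of 测试 are each treated as a Latin-1 character and then UTF-8 encoded. The Python sketch below reproduces that value and contrasts it with the percent-encoding of the UTF-8 bytes of the characters themselves. It assumes the Perl script read its input as raw GBK bytes, which the output suggests but the text does not state.

```python
from urllib.parse import quote, unquote

text = "测试"

# Percent-encoding of the UTF-8 bytes of the characters.
utf8_url = quote(text, safe="", encoding="utf-8")
print(utf8_url)  # %E6%B5%8B%E8%AF%95

# GBK bytes mistaken for Latin-1 characters and then UTF-8 encoded,
# as assumed for the Perl run above.
as_latin1 = text.encode("gbk").decode("latin-1")
double_url = quote(as_latin1, safe="", encoding="utf-8")
print(double_url)  # %C2%B2%C3%A2%C3%8A%C3%94

# Both forms round-trip as long as each step is undone in reverse order.
assert unquote(utf8_url, encoding="utf-8") == text
restored = unquote(double_url, encoding="utf-8").encode("latin-1").decode("gbk")
assert restored == text
```

A server that expects UTF-8 URLs would read the first form as 测试 and the second as four unrelated Latin-1 characters, so the two are not interchangeable.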

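Summary: most websites in China use GBK for Chinese text, because it saves one encoding step. Baidu, for example, uses GBK.

Most websites outside China, however, use UTF-8, because UTF-8 is more widely used than GBK.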


from :  http://hi.baidu.com/youzhch/item/744df75338a741948c12ed96

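  • HTML encoding covers 252 characters in the form '&name;', where name is case-sensitive;
  • XML encoding covers only 5 characters: &, <, >, ' and ". The format is the same as in HTML;
  • URL encoding mainly covers ASCII control characters, non-ASCII characters, characters with a special meaning in URLs (such as /, ? and &), and unsafe characters (those that can cause ambiguity). The encoding rule is a % followed by two hexadecimal digits giving the character's position in ISO-Latin. A space, for example, is %20, because its position is 32.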
  • HTML character references

    Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string.

    The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup.
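
A short Python sketch of the points above, using the standard `html` module. The sample markup string is made up for the demo.

```python
import html
from html.entities import name2codepoint

# The four references mentioned above map to the markup delimiters.
for name in ("lt", "gt", "quot", "amp"):
    print(f"&{name}; -> {chr(name2codepoint[name])}")

# Names are case-sensitive: &Aacute; and &aacute; are different characters.
print(chr(name2codepoint["Aacute"]), chr(name2codepoint["aacute"]))  # Á á

# html.escape() and html.unescape() convert in both directions.
escaped = html.escape('<a href="x">Tom & Jerry</a>')
print(escaped)  # &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
print(html.unescape(escaped))
```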

    XML character references

    Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:[7]

    • &amp; → & (ampersand, U+0026)
    • &lt; → < (less-than sign, U+003C)
    • &gt; → > (greater-than sign, U+003E)
    • &quot; → " (quotation mark, U+0022)
    • &apos; → ' (apostrophe, U+0027)

    &amp; has the special problem that it starts with the character to be escaped. A simple Internet search finds thousands of sequences &amp;amp;amp;amp; … in HTML pages for which the algorithm to replace an ampersand by the corresponding character entity reference was applied too often.
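
The five predefined references and the repeated-escaping problem can be illustrated with Python's `xml.sax.saxutils`. This is a minimal sketch: `xml_escape` and the `QUOTES` mapping are hypothetical names introduced for the example.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the two quote characters
# are added through the optional entities mapping.
QUOTES = {'"': "&quot;", "'": "&apos;"}

def xml_escape(text: str) -> str:
    """Escape the five characters that have predefined XML references."""
    return escape(text, QUOTES)

print(xml_escape("""R&D <"beta"> 'v2'"""))
# R&amp;D &lt;&quot;beta&quot;&gt; &apos;v2&apos;

# Escaping text that is already escaped adds one more "amp;" each time.
once = xml_escape("&")
twice = xml_escape(once)
thrice = xml_escape(twice)
print(once, twice, thrice)  # &amp; &amp;amp; &amp;amp;amp;

# Unescaping removes exactly one layer per call.
assert unescape(thrice) == twice
assert unescape(once) == "&"
```

Escaping exactly once, at the point where text is written into markup, avoids the growing &amp;amp; chains.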

    http://en.wikipedia.org/wiki/Character_encodings_in_HTML

    List of XML and HTML character entity references

    • The XML specification defines five "predefined entities" representing special characters, and requires that all XML processors honor them.
    • The HTML 4 DTDs define 252 named entities, references to which act as mnemonic aliases for certain Unicode characters.

    http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

    URL encode

    ASCII Control characters
    Why: These characters are not printable.

    Non-ASCII characters
    Why: These are by definition not legal in URLs since they are not in the ASCII set.

    "Reserved characters"
    Why: URLs use some characters for special use in defining their syntax. When these characters are not used in their special role inside a URL, they need to be encoded.

    "Unsafe characters"
    Why: Some characters present the possibility of being misunderstood within URLs for various reasons. These characters should also always be encoded.

    How are characters URL encoded?
    URL encoding of a character consists of a "%" symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.

    Example

    • Space = decimal code point 32 in the ISO-Latin set.
    • 32 decimal = 20 in hexadecimal
    • The URL encoded representation will be "%20"
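
The rule and the space example above can be sketched by hand in Python. `url_encode` is a hypothetical helper, and its set of characters left unencoded (ASCII letters, digits and "-._~") is an assumption made for this demo rather than something stated in the text.

```python
import string

# Characters left as they are in this sketch (an assumption for the demo).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def url_encode(text: str, charset: str = "latin-1") -> str:
    """Replace every byte outside UNRESERVED with % plus two hex digits."""
    out = []
    for byte in text.encode(charset):
        ch = chr(byte)
        out.append(ch if ch in UNRESERVED else f"%{byte:02X}")
    return "".join(out)

print(f"{ord(' ')} decimal = {ord(' '):02X} hex")  # 32 decimal = 20 hex
print(url_encode(" "))          # %20
print(url_encode("a b&c/d"))    # a%20b%26c%2Fd
print(url_encode("\t"))         # %09, a control character
print(url_encode("é"))          # %E9, a non-ASCII character in ISO-Latin-1
```

In real code the standard `urllib.parse.quote` does this job; the hand-written loop is only there to make the "% plus two hex digits" rule visible.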

    XSS (Cross Site Scripting) Prevention Cheat Sheet
    http://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet
