Encode: HTML, XML, and URL escaping

Today I ran into a problem: a URL inside an HTML page had been encoded, so my test tool could not recognize it. That prompted me to read up on HTML, XML, and URL encoding.

 

In short:

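  • HTML encoding covers 252 named characters (the HTML 4 set). The format is "&name;", where the name is case-sensitive.
  • XML encoding covers only 5 characters: &, <, >, ' and ". The format is the same as in HTML.
  • URL encoding mainly covers ASCII control characters, non-ASCII characters, characters with a special meaning in URLs (such as /, ? and &), and unsafe characters (ones that can cause ambiguity). The rule is a % followed by two hexadecimal digits giving the character's code point in ISO-Latin. For example, a space is %20, because its code point is 32.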

 

HTML character references

Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string.

The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup.
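
As a rough illustration of the problem that started this post, here is a minimal Python sketch using the standard library's html module. The URL is a made-up example, not the one from the real page.

```python
import html

# Hypothetical URL as it might appear inside an HTML attribute,
# with the ampersand written as a character entity reference.
encoded_url = "http://example.com/search?q=a&amp;page=2"

# html.unescape turns character references back into characters.
decoded_url = html.unescape(encoded_url)
print(decoded_url)  # http://example.com/search?q=a&page=2

# html.escape goes the other way for &, < and > (and quotes by default).
escaped_tag = html.escape('<a href="x">')
print(escaped_tag)  # &lt;a href=&quot;x&quot;&gt;

# Entity names are case-sensitive: these give two different code points.
upper = ord(html.unescape("&Eacute;"))
lower = ord(html.unescape("&eacute;"))
print(upper, lower)  # 201 233
```

A tool that reads URLs out of raw HTML source would probably need a decoding step like this before using them.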

XML character references

Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:[7]

  • &amp; → & (ampersand, U+0026)
  • &lt; → < (less-than sign, U+003C)
  • &gt; → > (greater-than sign, U+003E)
  • &quot; → " (quotation mark, U+0022)
  • &apos; → ' (apostrophe, U+0027)

&amp; has the special problem that it starts with the character to be escaped. A simple Internet search finds thousands of sequences &amp;amp;amp; … in HTML pages for which the algorithm to replace an ampersand by the corresponding character entity reference was applied too often.
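
The five XML references and the double-escaping problem can be sketched in Python with xml.sax.saxutils. The sample text and the QUOTES and UNQUOTES names are my own, chosen only for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Extra mappings for the two quote characters; by default escape()
# only handles &, < and >.
QUOTES = {'"': "&quot;", "'": "&apos;"}

text = "Tom & Jerry's <\"show\">"
once = escape(text, QUOTES)
print(once)  # Tom &amp; Jerry&apos;s &lt;&quot;show&quot;&gt;

# Escaping already-escaped text produces the &amp;amp; pattern.
twice = escape(once, QUOTES)
print(twice.startswith("Tom &amp;amp; Jerry"))  # True

# Unescaping once recovers the original text.
UNQUOTES = {v: k for k, v in QUOTES.items()}
restored = unescape(once, UNQUOTES)
print(restored == text)  # True
```

The usual way to avoid the repeated-escape problem is to escape exactly once, at the point where text is written into markup.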

http://en.wikipedia.org/wiki/Character_encodings_in_HTML

List of XML and HTML character entity references

  • The XML specification defines five "predefined entities" representing special characters, and requires that all XML processors honor them.
  • The HTML 4 DTDs define 252 named entities, references to which act as mnemonic aliases for certain Unicode characters.
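
For anyone who wants to poke at the named entities directly, Python ships a table of them in html.entities. This is only a small lookup sketch. I am assuming the name2codepoint table reflects the HTML 4 definitions, which is how the Python documentation describes it.

```python
from html.entities import name2codepoint

# name2codepoint maps entity names (without & and ;) to code points.
lookups = {name: name2codepoint[name] for name in ("amp", "lt", "gt", "quot")}
for name, codepoint in lookups.items():
    print(f"&{name}; -> U+{codepoint:04X}")

# Check which of the five XML predefined names this table knows about.
xml_names = ("amp", "lt", "gt", "quot", "apos")
print({name: name in name2codepoint for name in xml_names})
```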

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

URL encode

ASCII Control characters
Why: These characters are not printable.

Non-ASCII characters
Why: These are by definition not legal in URLs since they are not in the ASCII set.

"Reserved characters"
Why: URLs use some characters for special use in defining their syntax. When these characters are not used in their special role inside a URL, they need to be encoded.

"Unsafe characters"
Why: Some characters present the possibility of being misunderstood within URLs for various reasons. These characters should also always be encoded.

How are characters URL encoded?
URL encoding of a character consists of a "%" symbol, followed by the two-digit hexadecimal representation (case-insensitive) of the ISO-Latin code point for the character.

Example

  • Space = decimal code point 32 in the ISO-Latin set.
  • 32 decimal = 20 in hexadecimal
  • The URL encoded representation will be "%20"
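
The same steps can be tried in Python with urllib.parse. This is a small sketch, not a full treatment. Note that quote() uses UTF-8 by default, so the ISO-Latin rule above only matches its output for non-ASCII characters if the encoding is passed explicitly.

```python
from urllib.parse import quote, unquote

# The worked example: space is code point 32, which is 20 in hex.
space_hex = format(ord(" "), "02X")
print(ord(" "), space_hex)  # 32 20
print(quote(" "))  # %20

# Reserved characters are encoded when safe="" (quote keeps "/" by default).
reserved = quote("/?&", safe="")
print(reserved)  # %2F%3F%26

# Decoding accepts upper- or lower-case hex digits.
print(unquote("%2f") == unquote("%2F"))  # True

# Non-ASCII: the encoding matters. "\u00e9" is e with an acute accent.
latin1 = quote("\u00e9", encoding="latin-1")
utf8 = quote("\u00e9")
print(latin1)  # %E9
print(utf8)  # %C3%A9
```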

XSS (Cross Site Scripting) Prevention Cheat Sheet
http://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet
