Chapters 1~4 give a brief overview of the HTTP protocol, mainly covering how connections are established over TCP.
Chapters 5~10 cover the web server, HTTP proxy, web cache, HTTP encryption, web client, and the future of HTTP.
Section 2.4
1. Why is URL encoding needed?
Applications need to walk a fine line. It is best for client applications to convert any unsafe or restricted characters before sending any URL to any other application. Once all the unsafe characters have been encoded, the URL is in a canonical (standard) form that can be shared between applications; there is no need to worry about the other application getting confused by any of the characters' special meanings.
2. The principle for extensibility:
"Be conservative in what you send; be liberal in what you accept." (The Robustness Principle)
In other words: be strict about your own output, and be flexible about the input you receive from others.
6.1.1 Types of proxies
Public proxies and private proxies (for example, the Sogou browser's proxy).
6.1.2 Proxies versus gateways
Strictly speaking, proxies connect two or more applications that speak the same protocol, while gateways hook up two or more parties that speak different protocols. A gateway acts as a "protocol converter", allowing a client to complete a transaction with a server even when the client and server speak different protocols.
In practice, the difference between proxies and gateways is blurry. Because browsers and servers implement different versions of HTTP, proxies often do some amount of protocol conversion. And commercial proxy servers implement gateway functionality to support SSL security protocols, SOCKS firewalls, FTP access, and web-based applications.
9.1.5
A good reference book for implementing huge data structures is Managing Gigabytes: Compressing and Indexing Documents and Images, by Witten et al. (Morgan Kaufmann). This book is full of tricks and techniques for managing large amounts of data.
9.4 Excluding Robots
The robot community understood the problems that robotic web site access could cause. In 1994, a simple, voluntary technique was proposed to keep robots out of where they don't belong and to provide webmasters with a mechanism to better control their behavior. The standard was named the "Robots Exclusion Standard" but is often just called robots.txt, after the file where the access-control information is stored.
The idea of robots.txt is simple. Any web server can provide an optional file named robots.txt in the document root of the server. This file contains information about which robots can access which parts of the server. If a robot follows this voluntary standard, it will request the robots.txt file from the web site before accessing any other resource from the site.
Before visiting any URLs on a web site, a robot must retrieve and process the robots.txt file on the web site, if it is present. There is a single robots.txt resource for the entire web site defined by the hostname and port number. If the site is virtually hosted, there can be a different robots.txt file for each virtual docroot, as with any other file.
9.4.2.1 Fetching robots.txt
Robots fetch the robots.txt resource using the HTTP GET method, like any other file on the web server. The server returns the robots.txt file, if present, in a text/plain body. If the server responds with a 404 Not Found HTTP status code, the robot can assume that there are no robotic access restrictions and that it can request any file.
Robots should pass along identifying information in the From and User-Agent headers to help site administrators track robotic accesses and to provide contact information in the event that the site administrator needs to inquire or complain about the robot. Here's an example HTTP crawler request from a commercial web robot:
GET /robots.txt HTTP/1.0 CRLF
Host: www.joes-hardware.com CRLF
User-Agent: Slurp/2.0 CRLF
Date: Wed Oct 3 20:22:48 EST 2001 CRLF
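A minimal sketch of the fetch step in Python, assuming the standard-library urllib.request module. The function names and the contact address are hypothetical, chosen only for illustration. The request is built but not sent, and the only status rule taken from the notes is the 404 case; how to treat other status codes is left undecided here.

```python
import urllib.request


def build_robots_request(host, robot_agent, contact_email):
    """Build (but do not send) a GET request for a site's robots.txt.

    The From and User-Agent headers identify the robot and give the
    site administrator a contact address.
    """
    url = f"http://{host}/robots.txt"
    headers = {"User-Agent": robot_agent, "From": contact_email}
    return urllib.request.Request(url, headers=headers, method="GET")


def policy_for_status(status):
    """Map the robots.txt response status to what the robot does next."""
    if status == 404:
        return "unrestricted"  # no robots.txt: no robotic access restrictions
    if status == 200:
        return "parse"         # a robots.txt body was returned: read the rules
    return "undecided"         # other codes are not covered by these notes


# Hypothetical contact address; the host and agent come from the example above.
req = build_robots_request(
    "www.joes-hardware.com", "Slurp/2.0", "crawler-admin@example.com"
)
```

Sending the request would be a call such as urllib.request.urlopen(req); it is left out so the sketch runs without network access.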
9.4.3 robots.txt File Format
The robots.txt file has a very simple, line-oriented syntax. There are three types of lines in a robots.txt file: blank lines, comment lines, and rule lines. Rule lines look like HTTP headers (<Field>: <value>) and are used for pattern matching. For example:
User-Agent: slurp
User-Agent: webcrawler
Disallow: /private

User-Agent: *
Disallow: /
The example shows a robots.txt file that allows the Slurp and Webcrawler robots to access any file except those files in the private subdirectory. The same file also prevents any other robots from accessing anything on the site.
9.4.3.1 The User-Agent line
Each robot's record starts with one or more User-Agent lines, of the form:
User-Agent: <robot-name>
or:
User-Agent: *
The robot name (chosen by the robot implementor) is sent in the User-Agent header of the robot's HTTP GET request.
When a robot processes a robots.txt file, it must obey the record with either:
The first robot name that is a case-insensitive substring of the robot's name
The first robot name that is "*"
If the robot can't find a User-Agent line that matches its name, and can't find a wildcarded "User-Agent: *" line, no record matches, and access is unlimited.
Because the robot name matches case-insensitive substrings, be careful about false matches. For example, "User-Agent: bot" matches all the robots named Bot, Robot, Bottom-Feeder, Spambot, and Dont-Bother-Me.
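The record-selection rules can be sketched in Python as follows. This is an illustration under stated assumptions, not a complete robots.txt parser: the function names are hypothetical, only User-Agent and Disallow lines are read, and a named match is assumed to take precedence over the "*" wildcard.

```python
def parse_records(text):
    """Split robots.txt text into records of (agent names, disallow paths)."""
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue                      # comment line: ignore it
        line = stripped.split("#", 1)[0].strip()
        if not line:                      # blank line closes the current record
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def select_record(records, robot_name):
    """Return the Disallow list the robot must obey, or None if unrestricted."""
    name = robot_name.lower()
    # Assumption: look for a named (substring) match first.
    for agents, rules in records:
        if any(a and a != "*" and a.lower() in name for a in agents):
            return rules
    # Otherwise fall back to the first wildcard record.
    for agents, rules in records:
        if "*" in agents:
            return rules
    return None                           # no record matches: access is unlimited


SAMPLE = """\
User-Agent: slurp
User-Agent: webcrawler
Disallow: /private

User-Agent: *
Disallow: /
"""

records = parse_records(SAMPLE)
slurp_rules = select_record(records, "Slurp/2.0")
other_rules = select_record(records, "SomeOtherRobot")
```

Because matching is by substring, a record for "bot" would also be selected for a robot named Dont-Bother-Me, which is the false-match risk described above.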
11.6.4 Different Cookies for Different Sites
In general, a browser sends to a server only those cookies that the server generated. Cookies generated by joes-hardware.com are sent to joes-hardware.com and not to bobs-books.com or marys-movies.com.
Many web sites contract with third-party vendors to manage advertisements. These advertisements are made to look like they are integral parts of the web site and do push persistent cookies. When the user goes to a different web site serviced by the same advertisement company, the persistent cookie set earlier is sent back again by the browser (because the domains match). A marketing company could use this technique, combined with the Referer header, to potentially build an exhaustive data set of user profiles and browsing habits. Modern browsers allow you to configure privacy settings to restrict third-party cookies.
11.6.4.1 Cookie Domain attribute
A server generating a cookie can control which sites get to see that cookie by adding a Domain attribute to the Set-Cookie response header. For example, the following HTTP response header tells the browser to send the cookie user="mary17" to any site in the domain .airtravelbargains.com:
Set-cookie: user="mary17"; domain="airtravelbargains.com"
If the user visits www.airtravelbargains.com, specials.airtravelbargains.com, or any site ending in .airtravelbargains.com, the following Cookie header will be issued:
Cookie: user="mary17"
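A rough sketch of the domain check a browser might perform before attaching the cookie. The function name is hypothetical, and treating the bare host airtravelbargains.com as a match is an assumption; the notes only mention hosts ending in .airtravelbargains.com.

```python
def cookie_domain_matches(request_host, cookie_domain):
    """Return True if a cookie scoped to cookie_domain goes to request_host."""
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().lstrip(".")
    # Match on a whole-label boundary so that "evilairtravelbargains.com"
    # does not count as being inside "airtravelbargains.com".
    return host == domain or host.endswith("." + domain)


hosts = [
    "www.airtravelbargains.com",
    "specials.airtravelbargains.com",
    "bobs-books.com",
]
# Hosts that would receive the Cookie header for this Domain attribute.
recipients = [h for h in hosts if cookie_domain_matches(h, "airtravelbargains.com")]
```

Comparing against "." + domain rather than the bare suffix is the design choice that keeps unrelated domains with a similar ending from receiving the cookie.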
11.6.5 Cookie Ingredients
11.6.10 Cookies, Security, and Privacy
Still, it is good to be cautious when dealing with privacy and user tracking, because there is always potential for abuse. The biggest misuse comes from third-party web sites using persistent cookies to track users. This practice, combined with IP addresses and information from the Referer header, has enabled these marketing companies to build fairly accurate user profiles and browsing patterns.
Chapter 15. Entities and Encodings
Chapter 16. Internationalization -- charset and character encoding
16.2.1 Charset is a Character-to-bits Encoding
16.2.5 Content-Type Charset Header and Meta Tags
Web servers send the client the MIME charset tag in the Content-Type header, using the charset parameter: Content-Type: text/html; charset=iso-2022-jp
If no charset is explicitly listed, the receiver may try to infer the character set from the document contents. For HTML content, character sets might be found in <META HTTP-EQUIV="Content-Type"> tags that describe the charset.
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">
(This specifies the encoding of the HTML file and of the bit stream in the entity body.)
The snippet above shows how HTML META tags set the charset to the Japanese encoding iso-2022-jp. If the document is not HTML, or there is no META Content-Type tag, software may attempt to infer the charset encoding by scanning the actual text for common patterns indicative of languages and encodings.
If a client cannot infer a character encoding, it assumes iso-8859-1.
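The lookup order described here (Content-Type charset parameter, then the META tag, then the iso-8859-1 default) could be sketched in Python like this. The function name, the regular expression, and the 2048-byte scan window are assumptions made for illustration, and the pattern-scanning heuristic mentioned above is left out.

```python
import re
from email.message import Message

# Rough pattern for a charset inside a META tag; an assumption, not a full HTML parser.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def pick_charset(content_type, body):
    """Choose a charset for decoding an entity body (given as bytes)."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()   # lowercased charset, or None
        if charset:
            return charset
    match = META_CHARSET.search(body[:2048])  # only scan the start of the document
    if match:
        return match.group(1).decode("ascii").lower()
    return "iso-8859-1"                       # default when nothing can be inferred


html = b'<HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp"></HEAD>'
from_header = pick_charset("text/html; charset=iso-2022-jp", b"")
from_meta = pick_charset("text/html", html)
fallback = pick_charset(None, b"plain text with no hints")
```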
16.2.6 The Accept-Charset Header
There are thousands of defined character encoding and decoding methods, developed over the past several decades. Most clients do not support all the various character coding and mapping systems. HTTP clients can tell servers precisely which character systems they support, using the Accept-Charset request header. The Accept-Charset header value provides a list of character encoding schemes that the client supports. For example, the following HTTP request header indicates that a client accepts the Western European iso-8859-1 character system as well as the UTF-8 variable-length Unicode compatibility system. A server is free to return content in either of these character encoding schemes.
Accept-Charset: iso-8859-1, utf-8
Note that there is no Content-Charset response header to match the Accept-Charset request header. The response character set is carried back from the server by the charset parameter of the Content-Type response header, to be compatible with MIME. It's too bad this isn't symmetric, but all the information still is there.
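A small sketch of the asymmetry: the server reads Accept-Charset from the request but answers through the charset parameter of Content-Type. The function names and the server's preference list are hypothetical, and any parameters after a ";" in the header value are simply dropped to keep the example short.

```python
def parse_accept_charset(value):
    """Return the charsets listed in an Accept-Charset header value, lowercased."""
    items = (item.split(";", 1)[0].strip().lower() for item in value.split(","))
    return [item for item in items if item]


def choose_charset(accept_charset, server_charsets):
    """Pick the first charset the server can produce that the client lists."""
    accepted = parse_accept_charset(accept_charset)
    for charset in server_charsets:
        if charset.lower() in accepted:
            return charset.lower()
    return None


def content_type_header(media_type, charset):
    """The answer travels in Content-Type, not in a header mirroring the request."""
    return f"Content-Type: {media_type}; charset={charset}"


chosen = choose_charset("iso-8859-1, utf-8", ["utf-8", "iso-2022-jp"])
response_header = content_type_header("text/html", chosen)
```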
16.3.5.1 US-ASCII: The mother of all character sets
HTTP messages (headers, URIs, etc.) use US-ASCII.
16.3.5.2 iso-8859
iso-8859-1, also known as Latin-1, is the default character set for HTML.
16.5 Internationalized URIs
Today's URIs are comprised of a restricted subset of US-ASCII characters (basic Latin alphabet letters, digits, and a few special characters). Anything outside this subset must go through URL encoding.
16.5.2 URI Character Repertoire
The subset of US-ASCII characters permitted in URLs can be divided into reserved, unreserved, and escape character classes.
URI character syntax
Unreserved: [A-Za-z0-9] | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Reserved: ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
Escape: "%" <HEX> <HEX>
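As an illustration, the three classes can be written down as Python sets and used to classify single characters. The names are hypothetical; the unreserved marks include the apostrophe, following RFC 2396.

```python
import string

# Character classes from the URI character syntax table.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
RESERVED = set(";/?:@&=+$,")


def classify(ch):
    """Name the class a single character falls into."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    if ch == "%":
        return "escape"            # introduces a %<HEX><HEX> sequence
    return "must be escaped"       # e.g. spaces and non-ASCII characters


classes = {ch: classify(ch) for ch in "a~?% "}
```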
16.5.3 Escaping and Unescaping (for characters inside the ASCII range (0~127) that may not be used directly; these are escaped as %<hex><hex>)
URI "escapes" provide a way to safely insert reserved characters and other unsupported characters (such as spaces) inside URIs. An escape is a three-character sequence, consisting of a percent character (%) followed by two hexadecimal digit characters. The two hex digits represent the code for a US-ASCII character.
For example, to insert a space (ASCII 32) in a URL, you could use the escape "%20", because 20 is the hexadecimal representation of 32. Similarly, if you wanted to include a percent sign and have it not be treated as an escape, you could enter "%25", where 25 is the hexadecimal value of the ASCII code for percent.
Internally, HTTP applications should transport and forward URIs with the escapes in place. HTTP applications should unescape the URIs only when the data is needed. And, more importantly, the applications should ensure that no URI ever is unescaped twice, because percent signs that might have been encoded in an escape will themselves be unescaped, leading to loss of data.
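A short Python illustration of the escapes and of the double-unescape hazard, using quote and unquote from the standard urllib.parse module. The sample value "%41" is made up for the demonstration: it stands for data that really contains a percent sign followed by "41".

```python
from urllib.parse import quote, unquote

space_escape = quote(" ")     # "%20": hex 20 is decimal 32
percent_escape = quote("%")   # "%25": hex 25 is the code for the percent sign

literal = "%41"               # the data itself is the three characters % 4 1
escaped = quote(literal)      # "%2541": the percent sign is escaped
once = unquote(escaped)       # "%41": the original data, correctly restored
twice = unquote(once)         # "A": a second unescape has destroyed the data
```

The second unquote call treats the restored percent sign as the start of a new escape, which is exactly the loss of data the paragraph warns about.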
URL encoding: URLs can only be sent over the Internet using ASCII characters. Since URLs often contain characters outside the ASCII set, the URL has to be converted. URL encoding converts the URL into a valid ASCII format. URL encoding replaces unsafe ASCII characters with "%" followed by two hexadecimal digits corresponding to the character values in the ISO-8859-1 character set.
"中文" ==encodeURI==> "%E4%B8%AD%E6%96%87" (in the page's encoding) ==encodeURI (page encoding converted to ISO-8859-1, the default HTTP transfer encoding; the % signs get encoded)==> "%25E4%25B8%25AD%25E6%2596%2587" (ISO-8859-1) ==Tomcat decode (ISO-8859-1)==> "%E4%B8%AD%E6%96%87" ==Java decode (UTF-8)==> "中文"
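The same round trip can be reproduced with Python's urllib.parse, as a sketch of the idea rather than of the JavaScript and Java calls involved. The server-side ISO-8859-1 decode is modeled with a plain unquote call, and the page encoding is assumed to be UTF-8.

```python
from urllib.parse import quote, unquote

text = "中文"

once = quote(text, encoding="utf-8")    # first encode: the UTF-8 bytes become escapes
twice = quote(once)                     # second encode: each "%" becomes "%25"

# The container decodes once as ISO-8859-1. Only ASCII escapes are left,
# so nothing is garbled at this step.
after_server = unquote(twice, encoding="iso-8859-1")

# The application decodes the remaining escapes as UTF-8.
restored = unquote(after_server, encoding="utf-8")
```

Encoding twice keeps the payload pure ASCII through the server's ISO-8859-1 decode, which is why the original text survives the trip.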
16.5.4 Escaping International Characters (characters outside the ASCII range need URL encoding, and the encoding used should be the one the document was saved in)
Note that escape values should be in the range of US-ASCII codes (0-127). Some applications attempt to use escape values to represent iso-8859-1 extended characters (128-255); for example, web servers might erroneously use escapes to code filenames that contain international characters. This is incorrect and may cause problems with some applications.
For example, the filename Sven Ölssen.html (containing an umlaut) might be encoded by a web server as Sven%20%D6lssen.html. It's fine to encode the space with %20, but it is technically illegal to encode the Ö with %D6, because the code D6 (decimal 214) falls outside the US-ASCII range.
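A quick Python check of why escapes above 127 are fragile: the escape sequence depends on which character encoding produced the bytes, so a receiver that assumes a different encoding cannot recover the name. This is only a sketch with urllib.parse; the filename is the Sven Ölssen.html example, whose encoded form appears in the notes as %D6lssen.

```python
from urllib.parse import quote, unquote

name = "Sven Ölssen.html"

latin1_form = quote(name, encoding="iso-8859-1")   # Ö becomes the single byte D6
utf8_form = quote(name, encoding="utf-8")          # Ö becomes the two bytes C3 96

# Decoding with the encoding that produced the escapes restores the name.
good = unquote(latin1_form, encoding="iso-8859-1")

# Decoding the same escapes as UTF-8 (the default) cannot interpret byte D6
# and substitutes the replacement character U+FFFD.
bad = unquote(latin1_form)
```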
16.6.1 Headers and Out-of-Spec Data
HTTP headers must consist of characters from the US-ASCII character set.
However, not all clients and servers implement this correctly, so you may on occasion receive illegal characters with code values larger than 127.
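One tolerant way to handle such input, in the spirit of "be liberal in what you accept", is sketched below. The function name is hypothetical, and falling back to iso-8859-1 is an assumption rather than something these notes prescribe; it is chosen because every byte value maps to a character in that charset, so decoding never fails.

```python
def decode_header_value(raw):
    """Decode a raw header value; return (text, was_pure_ascii)."""
    try:
        return raw.decode("ascii"), True
    except UnicodeDecodeError:
        # Out-of-spec bytes above 127: keep the data, but flag it so the
        # caller can log the problem or treat the value with suspicion.
        return raw.decode("iso-8859-1"), False


clean = decode_header_value(b"text/html")
dirty = decode_header_value(b"Sven \xd6lssen")
```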
Chapters 18~21 are all about the technology for publishing and disseminating web content:
Chapter 18 discusses the ways people deploy servers in modern web hosting environments, HTTP support for virtual web hosting, and how to replicate content across geographically distant servers.
Chapter 19 discusses the technology for creating web content and installing it onto web servers.
Chapter 20 surveys the tools and techniques for distributing incoming web traffic among a collection of servers.
Chapter 21 covers log formats and common questions.