HyperText Transfer Protocol - HTTP
The HyperText Transfer Protocol is described in RFC 2616, available at http://www.w3.org/Protocols/rfc2616/rfc2616.txt. It is a long and complex 176-page document with a lot of detail. If you find it interesting, feel free to read it all. But if you take a look around page 36 of RFC 2616, you will find the syntax for the GET request. To request a document from a web server, we make a connection to the www.py4inf.com server on port 80 and then send a line of the form
GET http://www.py4inf.com/code/romeo.txt HTTP/1.0
where the second parameter is the web page we are requesting, followed by a blank line. The web server will respond with some header information about the document, a blank line, and then the document content.
// Makes a connection to a web server and follows the rules of the HTTP protocol to request a document and display what the server sends back.
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print data

mysock.close()
OUTPUT:
HTTP/1.1 200 OK
Date: Sat, 12 Dec 2015 14:22:51 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "e103c2f4-a7-526172f5b5d89"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=604800, public
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, x-requested-with, content-type
Access-Control-Allow-Methods: GET
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Description:
First the program makes a connection to port 80 on the server www.py4inf.com. Since our program is playing the role of the "web browser", the HTTP protocol says we must send the GET command followed by a blank line. Once we send that blank line, we write a loop that receives data in 512-character chunks from the socket and prints the data out until there is no more data to read (i.e., recv() returns an empty string).
The output starts with headers which the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document (text/plain). After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file romeo.txt.
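To make the boundary between the headers and the body concrete, here is a minimal sketch (not part of the original program; the response string is a shortened, made-up example) that splits a raw response at the blank line:

# A shortened, hypothetical response; real header lines end with \r\n
response = 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 167\r\n\r\nBut soft what light through yonder window breaks'

# The blank line (\r\n\r\n) separates the headers from the document content
headers, body = response.split('\r\n\r\n', 1)
for line in headers.split('\r\n')[1:]:   # skip the status line
    name, value = line.split(': ', 1)
    print name, '=', value
print 'Body:', body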
Retrieving an image over HTTP
// Accumulate the data in a string, trim off the headers, and then save the image data to a file.
import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')

count = 0
picture = ""
while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    # time.sleep(0.25)
    count = count + len(data)
    print len(data), count
    picture = picture + data

mysock.close()

# Look for the blank line that marks the end of the header
pos = picture.find("\r\n\r\n")
print 'Header length', pos
print picture[:pos]

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()
OUTPUT:
5120 5120
5120 10240
5120 15360
1920 17280
5120 22400
5120 27520
4160 31680
5120 36800
5120 41920
4160 46080
2880 48960
5120 54080
5120 59200
5120 64320
5120 69440
863 70303
Header length 242
HTTP/1.1 200 OK
Date: Sat, 12 Dec 2015 15:32:56 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "b294001f-111a9-526172f5b7cc9"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg
The image data that follows the headers is saved in the file stuff.jpg.
Description:
You can see that for this URL, the Content-Type header indicates that the body of the document is an image (image/jpeg).
As the program runs, you can see that we don't always get 5120 characters each time we call the recv() method. We get as many characters as have been transferred across the network to us by the web server at the moment we call recv(). In this run, some calls return the full 5120 characters we asked for, while others return smaller chunks such as 1920, 2880, or 4160 characters.
Your results may be different depending on your network speed. Also note that on the last call to recv() we get 863 bytes, which is the end of the stream, and on the next call to recv() we get a zero-length string that tells us that the server has called close() on its end of the socket and there is no more data forthcoming.
We can slow down our successive recv() calls by uncommenting the call to time.sleep(). This way, we wait a quarter of a second after each call so that the server can "get ahead" of us and send more data to us before we call recv() again. With the delay in place, the program executes as follows:
1460 1460
5120 6580
5120 11700
...
5120 62900
5120 68020
2281 70301
Header length 240
HTTP/1.1 200 OK
Date: Sat, 02 Nov 2013 02:22:04 GMT
Server: Apache
Last-Modified: Sat, 02 Nov 2013 02:01:26 GMT
ETag: "19c141-111a9-4ea280f8354b8"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg
Retrieving web pages with urllib
Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details.
import urllib

fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    print line.strip()
OUTPUT:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
We only see the contents of the file itself. The headers are still sent, but the urllib code consumes them and returns only the data to us.
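If you do want to inspect the headers, the file-like object returned by urlopen() still makes them available through its info() method; a short sketch, assuming the same URL:

import urllib

fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
# info() returns the response headers that urllib read and consumed
headers = fhand.info()
print headers.getheader('Content-Type')   # text/plain for this document
for line in fhand:
    print line.strip()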
// Retrieve the data for romeo.txt and compute the frequency of each word.
import urllib

fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
new_dic = dict()
for line in fhand:
    words = line.split()
    for word in words:
        new_dic[word] = new_dic.get(word, 0) + 1
OUTPUT:
>>> print new_dic
{'and': 3, 'envious': 1, 'already': 1, 'fair': 1, 'is': 3, 'through': 1, 'pale': 1, 'yonder': 1, 'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1, 'window': 1, 'sick': 1, 'east': 1, 'breaks': 1, 'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1, 'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}
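As a possible follow-on (not part of the original example), the dictionary can be sorted by count to show the most common words:

# Sort the (word, count) pairs by count, largest first
items = sorted(new_dic.items(), key=lambda pair: pair[1], reverse=True)
for word, count in items[:5]:
    print word, count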
Parsing HTML and scraping the web
// Extract the links from a web page using regular expressions.
import urllib
import re

url = raw_input('Enter-')
html = urllib.urlopen(url).read()
links = re.findall('href="(http://.+?)"', html)
for link in links:
    print link
OUTPUT:
>>> Enter-http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm

>>> Enter-http://www.py4inf.com/book.htm
http://amzn.to/1KkULF3
http://amzn.to/1KkULF3
http://amzn.to/1hLcoBy
http://amzn.to/1KkV42z
http://amzn.to/1fNOnbd
http://amzn.to/1N74xLt
http://do1.dr-chuck.com/py4inf/EN-us/book.pdf
http://do1.dr-chuck.com/py4inf/ES-es/book.pdf
http://www.xwmooc.net/python/
http://fanwscu.gitbooks.io/py4inf-zh-cn/
http://itunes.apple.com/us/book/python-for-informatics/id554638579?mt=13
http://www-personal.umich.edu/~csev/books/py4inf/ibooks//python_for_informatics.ibooks
http://www.py4inf.com/code
http://www.greenteapress.com/thinkpython/thinkCSpy/
http://allendowney.com/
Reading binary files using urllib
// Download an image file.
import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg').read()
fhand = open('cover.jpg', 'wb')
fhand.write(img)
fhand.close()
This program reads all of the data across the network at once and holds it in the img variable in memory before writing it to disk. That works for a small image, but if this is a large audio or video file, the program may crash or at least run extremely slowly when your computer runs out of memory. To avoid running out of memory, we retrieve the data in blocks (or buffers) and then write each block to disk before retrieving the next block. This way the program can read a file of any size without using up all of the memory in your computer.
import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg')
fhand = open('cover.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
print size, 'characters copied.'
fhand.close()
Using Web Services
There are two common formats that we use when exchanging data across the web. The "eXtensible Markup Language", or XML, has been in use for a very long time and is best suited for exchanging document-style data. When programs just want to exchange dictionaries, lists, or other internal information with each other, they use JavaScript Object Notation, or JSON (see www.json.org). We will look at both formats.
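As a quick taste of the JSON side before we turn to XML, here is a minimal sketch using Python's standard json module (the data values are made up):

import json

# A Python dictionary serialized to JSON text and parsed back
data = {'name': 'Chuck', 'phone': '+1 734 303 4456'}
text = json.dumps(data)
print text                       # JSON text ready to exchange over the web
print json.loads(text)['name']   # Chuck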
eXtensible Markup Language - XML
XML looks very similar to HTML, but XML is more structured than HTML. Here is a sample of an XML document:
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes"/>
</person>
Often it is helpful to think of an XML document as a tree structure where there is a top tag person and other tags such as phone are drawn as children of their parent nodes.
// Here is a simple application that parses some XML and extracts some data elements from it:
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)
print 'Name:', tree.find('name').text
print 'Attr:', tree.find('email').get('hide')
OUTPUT:
Name: Chuck
Attr: yes
Calling fromstring converts the string representation of the XML into a “tree” of XML nodes. When the XML is in a tree, we have a series of methods we can call to extract portions of data from the XML.
The find function searches through the XML tree and retrieves a node that matches the specified tag. Each node can have some text, some attributes (like hide), and some “child” nodes. Each node can be the top of a tree of nodes.
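To make this concrete, here is a small sketch using the tree parsed above; the phone node carries both text and a type attribute:

# Each node has .text, attributes reachable with get(), and child nodes
phone = tree.find('phone')
print 'Type:', phone.get('type')      # intl
print 'Number:', phone.text.strip()   # +1 734 303 4456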
Using an XML parser such as ElementTree has the advantage that while the XML in this example is quite simple, it turns out there are many rules regarding valid XML and using ElementTree allows us to extract data from XML without worrying about the rules of XML syntax.
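For example, ElementTree rejects malformed input outright instead of guessing at its meaning; a small sketch (the broken string is a deliberately bad example):

import xml.etree.ElementTree as ET

try:
    ET.fromstring('<person><name>Chuck</person>')   # mismatched closing tag
except ET.ParseError as e:
    print 'Not well-formed XML:', e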
// Often the XML has multiple nodes and we need to write a loop to process all of them.
import xml.etree.ElementTree as ET

data = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(data)
lst = stuff.findall('users/user')
print 'User count:', len(lst)
for item in lst:
    print 'Name', item.find('name').text
    print 'Id', item.find('id').text
    print 'Attribute', item.get('x')
OUTPUT:
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7
The findall method retrieves a Python list of subtrees that represent the user structures in the XML tree. Then we can write a for loop that looks at each of the user nodes and prints the name and id text elements as well as the x attribute from each user node.
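One detail worth noting: findall matches paths relative to the node it is called on, so the parent elements must be included in the path. A small sketch using the stuff tree above:

# 'users/user' finds both user nodes nested under users
print len(stuff.findall('users/user'))   # 2
# A bare 'user' finds nothing because user is not a direct child of stuff
print len(stuff.findall('user'))         # 0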