Python Networked Programs

HyperText Transport Protocol - HTTP
The HyperText Transport Protocol is described in RFC 2616: http://www.w3.org/Protocols/rfc2616/rfc2616.txt. This is a long and complex 176-page document with a lot of detail. If you find it interesting, feel free to read it all. But if you take a look around page 36 of RFC 2616 you will find the syntax for the GET request. To request a document from a web server, we make a connection to the www.py4inf.com server on port 80, and then send a line of the form
GET http://www.py4inf.com/code/romeo.txt HTTP/1.0


where the second parameter is the web page we are requesting. We then send a blank line. The web server will respond with some header information about the document and a blank line, followed by the document content.
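The request can be assembled as a plain string before it goes over the socket; a minimal sketch (shown in Python 3 syntax — note the HTTP specification terminates lines with \r\n, though many servers also accept a bare \n as the examples below send):

```python
# Assemble an HTTP/1.0 GET request by hand.
# The request line names the document; the blank line that follows
# tells the server the request is complete.
url = 'http://www.py4inf.com/code/romeo.txt'
request = 'GET ' + url + ' HTTP/1.0\r\n\r\n'

print(repr(request))
```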

// Make a connection to a web server and follow the rules of the HTTP protocol to request a document and display what the server sends back.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print data
mysock.close()

OUTPUT:

HTTP/1.1 200 OK
Date: Sat, 12 Dec 2015 14:22:51 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "e103c2f4-a7-526172f5b5d89"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=604800, public
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, x-requested-with, content-type
Access-Control-Allow-Methods: GET
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Description:

First the program makes a connection to port 80 on the server www.py4inf.com. Since our program is playing the role of the "web browser", the HTTP protocol says we must send the GET command followed by a blank line. Once we send that blank line, we write a loop that receives data in 512-character chunks from the socket and prints the data out until there is no more data to read (i.e., recv() returns an empty string).

The output starts with headers which the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document (text/plain). After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file romeo.txt.
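The header/body split described above can be tried on a canned response string (a sketch in Python 3 syntax, with hand-written header values rather than a real server reply):

```python
# A small, hand-written HTTP response; headers and body are separated
# by one blank line (\r\n\r\n).
response = ('HTTP/1.1 200 OK\r\n'
            'Content-Type: text/plain\r\n'
            'Content-Length: 19\r\n'
            '\r\n'
            'But soft what light')

pos = response.find('\r\n\r\n')      # locate the blank line
headers = response[:pos]             # everything before it
body = response[pos + 4:]            # skip the four separator characters

print(headers.split('\r\n')[0])      # the status line
print(body)                          # the document content
```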

Retrieving an image over HTTP
// Accumulate the data in a string, trim off the headers, and then save the image data to a file.

import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')

count = 0
picture = ""
while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    # time.sleep(0.25)
    count = count + len(data)
    print len(data), count
    picture = picture + data
mysock.close()

# Look for the end of the header
pos = picture.find("\r\n\r\n")
print 'Header length', pos
print picture[:pos]

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

OUTPUT:

>>> 
5120 5120
5120 10240
5120 15360
1920 17280
5120 22400
5120 27520
4160 31680
5120 36800
5120 41920
4160 46080
2880 48960
5120 54080
5120 59200
5120 64320
5120 69440
863 70303
Header length 242
HTTP/1.1 200 OK
Date: Sat, 12 Dec 2015 15:32:56 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "b294001f-111a9-526172f5b7cc9"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg

The saved file stuff.jpg contains the retrieved cover image.

Description:
You can see that for this URL, the Content-Type header indicates that the body of the document is an image (image/jpeg).
As the program runs, you can see that we don't get 5120 characters each time we call the recv() method. We get as many characters as have been transferred across the network to us by the web server at the moment we call recv(). In this run, we sometimes get as few as 1920 or 2880 characters when we request up to 5120 characters of data.
Your results may be different depending on your network speed. Also note that on the last call to recv() we get 863 bytes, which is the end of the stream, and in the next call to recv() we get a zero-length string that tells us that the server has called close() on its end of the socket and there is no more data forthcoming.
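The receive loop's accumulate-until-empty pattern can be exercised offline against an in-memory byte stream (a sketch in Python 3 syntax; a real socket would return chunks of varying size, whereas this stream fills each request until the data runs out):

```python
import io

# Simulate a stream delivering 70303 bytes, the same total as the run above.
stream = io.BytesIO(b'x' * 70303)

count = 0
picture = b''
while True:
    data = stream.read(5120)    # ask for at most 5120 bytes
    if len(data) < 1:
        break                   # an empty read signals end of stream
    count = count + len(data)
    picture = picture + data

print(count)
```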

We can slow down our successive recv() calls by uncommenting the call to time.sleep(). This way, we wait a quarter of a second after each call so that the server can "get ahead" of us and send more data to us before we call recv() again. With the delay in place, the program executes as follows:

1460 1460
5120 6580
5120 11700
...
5120 62900
5120 68020
2281 70301
Header length 240
HTTP/1.1 200 OK
Date: Sat, 02 Nov 2013 02:22:04 GMT
Server: Apache
Last-Modified: Sat, 02 Nov 2013 02:01:26 GMT
ETag: "19c141-111a9-4ea280f8354b8"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg

 

Retrieving web pages with urllib
Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve, and urllib handles all of the HTTP protocol and header details.

import urllib

fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    print line.strip()

OUTPUT:

>>> 
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
>>> 

Once this program runs, we only see the contents of the file. The headers are still sent, but the urllib code consumes the headers and only returns the data to us.

// Retrieve the data for romeo.txt and compute the frequency of each word.

import urllib

fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
new_dic = dict()
for line in fhand:
    words = line.split()
    for word in words:
        new_dic[word] = new_dic.get(word, 0) + 1

OUTPUT:

>>> print new_dic
{'and': 3, 'envious': 1, 'already': 1, 'fair': 1, 'is': 3, 'through': 1, 'pale': 1, 'yonder': 1, 'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1, 'window': 1, 'sick': 1, 'east': 1, 'breaks': 1, 'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1, 'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}
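The same counting idiom works on any string, which makes it easy to test without a network connection; a sketch (Python 3 syntax) using a local copy of the poem that also ranks the results:

```python
# Count words in a local string using the same dict.get() idiom.
text = '''But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief'''

counts = dict()
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs so the most frequent words come first.
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(ranked[:4])
```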

 

Parsing HTML and scraping the web
// Extract links from a page using regular expressions.

import urllib
import re

url = raw_input('Enter-')
html = urllib.urlopen(url).read()
links = re.findall('href="(http://.+?)"', html)
for link in links:
    print link

OUTPUT:

>>> 
Enter-http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm

>>> 
Enter-http://www.py4inf.com/book.htm
http://amzn.to/1KkULF3
http://amzn.to/1KkULF3
http://amzn.to/1hLcoBy
http://amzn.to/1KkV42z
http://amzn.to/1fNOnbd
http://amzn.to/1N74xLt
http://do1.dr-chuck.com/py4inf/EN-us/book.pdf
http://do1.dr-chuck.com/py4inf/ES-es/book.pdf
http://www.xwmooc.net/python/
http://fanwscu.gitbooks.io/py4inf-zh-cn/
http://itunes.apple.com/us/book/python-for-informatics/id554638579?mt=13
http://www-personal.umich.edu/~csev/books/py4inf/ibooks//python_for_informatics.ibooks
http://www.py4inf.com/code
http://www.greenteapress.com/thinkpython/thinkCSpy/
http://allendowney.com/
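The regular expression can be tested against an in-memory page (a sketch in Python 3 syntax with a made-up example.com page), which also shows why the non-greedy .+? matters: a greedy .+ would swallow everything up to the last quote on the line.

```python
import re

# A tiny hand-written page with two absolute links on one line.
html = ('<p><a href="http://example.com/a.htm">A</a> '
        '<a href="http://example.com/b.htm">B</a></p>')

# .+? matches as little as possible, so each href is captured separately.
links = re.findall('href="(http://.+?)"', html)
print(links)
```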

 

Reading binary files using urllib
// Download an image file.

import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg').read()
fhand = open('cover.jpg', 'wb')
fhand.write(img)
fhand.close()

However, if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. To avoid running out of memory, we retrieve the data in blocks (or buffers) and write each block to disk before retrieving the next block. This way the program can read a file of any size without using up all of the memory in your computer.

import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg')
fhand = open('cover.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)
print size, 'characters copied.'
fhand.close()
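The same block-at-a-time pattern works for copying any large stream; a sketch (Python 3 syntax) using an in-memory source and a temporary file in place of the network connection:

```python
import io
import os
import tempfile

# An in-memory "file" stands in for the urllib stream.
src = io.BytesIO(os.urandom(250000))

dst_path = os.path.join(tempfile.mkdtemp(), 'copy.bin')
size = 0
with open(dst_path, 'wb') as fhand:
    while True:
        info = src.read(100000)    # read at most 100000 bytes
        if len(info) < 1:
            break
        size = size + len(info)
        fhand.write(info)          # write each block before reading more

print(size, 'characters copied.')
```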

 

Using Web Services
There are two common formats that we use when exchanging data across the web. The "eXtensible Markup Language", or XML, has been in use for a very long time and is best suited for exchanging document-style data. When programs just want to exchange dictionaries, lists, or other internal information with each other, they use JavaScript Object Notation, or JSON (see www.json.org). We will look at both formats.
eXtensible Markup Language - XML
XML looks very similar to HTML, but XML is more structured than HTML. Here is a sample of an XML document:


<person>
    <name>Chuck</name>
    <phone type="intl">
        +1 734 303 4456
    </phone>
    <email hide="yes"/>
</person>

Often it is helpful to think of an XML document as a tree structure where there is a top tag person and other tags such as phone are drawn as children of their parent nodes.

//Here is a simple application that parses some XML and extracts some data elements from the XML:

import xml.etree.ElementTree as ET

data = '''<person>
    <name>Chuck</name>
    <phone type="intl">
        +1 734 303 4456
    </phone>
    <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)
print 'Name:', tree.find('name').text
print 'Attr:', tree.find('email').get('hide')

OUTPUT:

>>> 
Name: Chuck
Attr: yes

 Calling fromstring converts the string representation of the XML into a “tree” of XML nodes. When the XML is in a tree, we have a series of methods we can call to extract portions of data from the XML.

 The find function searches through the XML tree and retrieves a node that matches the specified tag. Each node can have some text, some attributes (like hide), and some “child” nodes. Each node can be the top of a tree of nodes.

Using an XML parser such as ElementTree has the advantage that while the XML in this example is quite simple, it turns out there are many rules regarding valid XML and using ElementTree allows us to extract data from XML without worrying about the rules of XML syntax.
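One detail worth knowing when using find this way: it returns None when no matching tag exists, so it pays to check before touching .text (a small sketch in Python 3 syntax):

```python
import xml.etree.ElementTree as ET

tree = ET.fromstring('<person><name>Chuck</name></person>')

node = tree.find('phone')    # there is no <phone> tag in this document
print(node)                  # None, rather than an exception

name = tree.find('name')
if name is not None:         # guard before dereferencing .text
    print('Name:', name.text)
```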

//Often the XML has multiple nodes and we need to write a loop to process all of the nodes.

import xml.etree.ElementTree as ET

data = '''<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Chuck</name>
        </user>
        <user x="7">
            <id>009</id>
            <name>Brent</name>
        </user>
    </users>
</stuff>'''

stuff = ET.fromstring(data)
lst = stuff.findall('users/user')
print 'User count:', len(lst)
for item in lst:
    print 'Name', item.find('name').text
    print 'Id', item.find('id').text
    print 'Attribute', item.get('x')

OUTPUT:

>>> 
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7

The findall method retrieves a Python list of subtrees that represent the user structures in the XML tree. Then we can write a for loop that looks at each of the user nodes, and prints the name and id text elements as well as the x attribute from the user node.
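Note that the path given to findall is relative to the node you call it on: asking the top-level element for 'user' finds nothing, because the user tags are nested inside users. A sketch (Python 3 syntax) with the same data:

```python
import xml.etree.ElementTree as ET

stuff = ET.fromstring(
    '<stuff><users>'
    '<user x="2"><id>001</id><name>Chuck</name></user>'
    '<user x="7"><id>009</id><name>Brent</name></user>'
    '</users></stuff>')

print(len(stuff.findall('users/user')))          # full path from <stuff>
print(len(stuff.findall('user')))                # <user> is not a direct child
print(len(stuff.find('users').findall('user')))  # relative to <users>
```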


Reprinted from: https://www.cnblogs.com/peng-vfx/p/5042015.html
