Python Crawler Development and Project Practice (Python爬虫开发与项目实战) 1: A Review of Python Programming

https://github.com/qiyeboy/SpiderBook

Chapter 1: A Review of Python Programming

      The book uses Python 2.7.

       sudo apt-get install python-pip python-dev

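      Setting up Eclipse + PyDev: once the PyDev plugin is installed, Eclipse can be used to write Python programs.

               Start Eclipse and click Help -> Install New Software ...

               Add:  name: PyDev, location: http://pydev.org/updates

               PyDev interpreter configuration: Window -> Preferences -> PyDev -> Interpreters -> Python Interpreter, then add the path to the Python executable.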
     Reading files

f = None  # defined up front so the finally block is safe if open() fails
try:
    f = open(r'qiye.txt', 'r')
    print f.read()
finally:
    if f:
        f.close()

        The code above is a bit long. A simpler form uses the with statement in place of try ... finally and the explicit close() call:

with open(r'qiye.txt', 'r') as fileReader:
    print fileReader.read()
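To check that the with statement really closes the file, here is a minimal sketch. It first writes a throwaway file so that it runs on its own; the helper name read_text and the temporary file are assumptions made for illustration and are not part of the book's code. It uses print as a function so that it should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import os
import tempfile


def read_text(path):
    """Read a whole text file; the with block closes it even if read() fails."""
    with open(path, 'r') as file_reader:
        content = file_reader.read()
    # Once the with block has exited, the file object is already closed.
    return content, file_reader.closed


# Create a throwaway file so the example is self-contained (hypothetical data).
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(tmp_path, 'w') as writer:
    writer.write('qiye')

text, was_closed = read_text(tmp_path)
os.remove(tmp_path)
print(text, was_closed)
```

The closed attribute is checked after the block on purpose: it shows that no explicit close() call is needed.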

      Serialization: a dict object can be serialized with the cPickle module (written in C, so it is faster) or with the pickle module.

# Prefer cPickle (the C implementation); fall back to pickle if it is missing
try:
    import cPickle as pickle
except ImportError:
    import pickle
d = dict(url='index.html', title='Home', content='Content')
pickle.dumps(d)    # dumps serializes a picklable object to a str
f = open(r'dump.txt', 'wb')
pickle.dump(d, f)  # dump writes the serialized object directly to a file
f.close()
Deserialization: use the loads method or the load method

f = open(r'dump.txt', 'rb')
d = pickle.load(f)
f.close()
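The two pairs of calls (dumps/loads for in-memory data, dump/load for files) can be checked with a small round trip. This is a sketch under stated assumptions: it uses only the plain pickle module so that it also runs on Python 3, and it writes to a temporary file instead of dump.txt to avoid leaving anything behind.

```python
from __future__ import print_function
import os
import pickle
import tempfile

d = dict(url='index.html', title='Home', content='Content')

# In-memory round trip: dumps produces a byte string, loads rebuilds the object.
blob = pickle.dumps(d)
restored_from_bytes = pickle.loads(blob)

# File round trip: dump writes to an open binary file, load reads it back.
fd, tmp_path = tempfile.mkstemp(suffix='.pkl')  # temporary path, an assumption
os.close(fd)
with open(tmp_path, 'wb') as f:
    pickle.dump(d, f)
with open(tmp_path, 'rb') as f:
    restored_from_file = pickle.load(f)
os.remove(tmp_path)

print(restored_from_bytes == d, restored_from_file == d)
```

Both restored objects are equal to d but are new objects, which is why pickled data can be passed between processes or stored for later. Data from untrusted sources should not be unpickled, since loading can execute arbitrary code.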
        Processes and threads:

taskManager.py

# coding: utf-8
import Queue
from multiprocessing.managers import BaseManager

# 1. Create the task queue and the result queue
task_queue = Queue.Queue()
result_queue = Queue.Queue()

class QueueManager(BaseManager):
    pass

# 2. Register both queues on the network; callable ties each name to its queue
QueueManager.register('get_task_queue', callable=lambda: task_queue)
QueueManager.register('get_result_queue', callable=lambda: result_queue)

# 3. Bind to port 8001 and set the authkey to "qiye"
manager = QueueManager(address=('', 8001), authkey='qiye')

# 4. Start the manager so that it listens for connections
manager.start()

# 5. Get the queue objects through the registered accessors
task = manager.get_task_queue()
result = manager.get_result_queue()

# 6. Add the tasks
for url in ['ImageUrl_' + str(i) for i in range(10)]:
    print 'put task %s ...' % url
    task.put(url)

# 7. Wait for the results returned by the workers
print 'try get result...'
for i in range(10):
    print 'result is %s' % result.get(timeout=10)

# 8. Shut down the manager
manager.shutdown()

taskWorker.py

# coding: utf-8
import time
from multiprocessing.managers import BaseManager

# 1. Subclass BaseManager; the worker only needs the registered names
class QueueManager(BaseManager):
    pass

# 2. Register the accessor names used by the manager process (no callable here)
QueueManager.register('get_task_queue')
QueueManager.register('get_result_queue')

# 3. Connect to the server; the port and authkey must match taskManager.py
server_addr = '127.0.0.1'
print('Connect to server %s...' % server_addr)
m = QueueManager(address=(server_addr, 8001), authkey='qiye')
m.connect()

# 4. Get the Queue objects
task = m.get_task_queue()
result = m.get_result_queue()

# 5. Take tasks from the task queue and put each outcome on the result queue
while not task.empty():
    image_url = task.get(True, timeout=5)
    print('run task download %s...' % image_url)
    time.sleep(1)
    result.put('%s--->success' % image_url)
print('worker exit.')
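The task queue / result queue flow used by the two scripts above can be tried in a single process. The sketch below is not the book's distributed version: it swaps BaseManager and the network for two local threads, keeping only the queue protocol (the manager side puts URLs, workers take them and report results). The worker function, the thread count of 2, and the use of get_nowait are assumptions made for illustration.

```python
from __future__ import print_function
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7

task_queue = queue.Queue()
result_queue = queue.Queue()


def worker(tasks, results):
    """Drain the task queue, reporting one result per task."""
    while True:
        try:
            # get_nowait avoids the race between empty() and get()
            image_url = tasks.get_nowait()
        except queue.Empty:
            break  # no tasks left, so this worker exits
        results.put('%s--->success' % image_url)


# Manager side: enqueue every task before the workers start.
urls = ['ImageUrl_' + str(i) for i in range(10)]
for url in urls:
    task_queue.put(url)

# Worker side: two local threads stand in for remote worker processes.
threads = [threading.Thread(target=worker, args=(task_queue, result_queue))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Manager side: collect exactly one result per task.
results = sorted(result_queue.get(timeout=10) for _ in urls)
print(results[0])
```

Results arrive in whatever order the workers finish, so they are sorted before use; the distributed version has the same property.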

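       Network programming

            Python provides two basic socket modules:

                   socket, which provides the standard BSD Sockets API

                   SocketServer, which provides server-centric classes that simplify the development of network servers

 Socket types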
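As an illustration of the BSD-style calls exposed by the socket module (socket, bind, listen, accept, connect, sendall, recv), here is a minimal TCP echo sketch on the loopback interface. The function names echo_once and run_demo, the 'echo: ' prefix, and the choice of port 0 (let the operating system pick a free port) are assumptions for the example, not code from the book.

```python
from __future__ import print_function
import socket
import threading


def echo_once(server_sock):
    """Accept one connection, echo back what it sends, then close it."""
    conn, _addr = server_sock.accept()
    try:
        data = conn.recv(1024)
        conn.sendall(b'echo: ' + data)
    finally:
        conn.close()


def run_demo(message=b'hello'):
    # Server side: create a TCP socket, bind it, and start listening.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(('127.0.0.1', 0))   # port 0 asks the OS for a free port
    server.listen(1)
    port = server.getsockname()[1]

    # Serve a single client in a background thread.
    t = threading.Thread(target=echo_once, args=(server,))
    t.daemon = True
    t.start()

    # Client side: connect, send the message, read the reply.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        client.connect(('127.0.0.1', port))
        client.sendall(message)
        reply = client.recv(1024)
    finally:
        client.close()

    t.join()
    server.close()
    return reply


reply = run_demo()
print(reply)
```

SOCK_STREAM selects TCP; a UDP version would use SOCK_DGRAM with sendto and recvfrom instead of connect and accept.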

[Figure 1]
[Figure 2]


