A Python Web Crawler That Logs In Automatically and Downloads Images

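        I have been learning about web crawlers recently and came across a very simple crawler program on CSDN. It has one problem, though: apart from the first page, every image it downloads is only a small thumbnail, and the post offers no fix. So I decided to try solving it myself; after all, practice is the best teacher.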

        Original post: http://blog.csdn.net/qqzhoufei521/article/details/19570971.

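        The program itself is simple and not technically demanding, so I will only note the two problems that held me up, both to share them and as a reminder to myself.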

        1. authenticity_token。

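        When I first tried to log in automatically, I built the content (body) of the HTTP message from the form as rendered on the page: user name, password and remember_me. That did not log me in. Capturing packets and comparing the body sent by a normal browser login with the one my program sent showed that mine lacked a field called authenticity_token, which is not visible on the page. Some reading revealed that it is a hidden field in the form; see http://stackoverflow.com/questions/941594/understand-rails-authenticity-token. With the cause found, I used a regular expression to extract it from the /account/sign_in page, added it to the content, and POSTed again. The response body showed that the login problem was solved (provided HTTP redirects are handled, of course).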

The relevant code:

SignIn = '/account/sign_in'	# const
GetAuthKeyExp = r''	# const
AuthKeyPattern = re.compile(GetAuthKeyExp)	# const

def getAuthKey():
	global headers
	page = myRequest(url = SignIn, headers = headers)
	authKey = AuthKeyPattern.findall(page)
	return authKey[0]
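
        As an alternative to a regular expression, the hidden field can also be read with an HTML parser. The following is a minimal sketch in Python 3 (the program in this post is Python 2). The sample form, the token value and the names HiddenInputParser and extract_hidden_fields are made up for illustration; the real sign-in page may be laid out differently.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""

def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden

# Hypothetical markup, modeled on a typical Rails sign-in form.
SAMPLE_FORM = '''
<form action="/account/sign_in" method="post">
  <input name="utf8" type="hidden" value="&#x2713;" />
  <input name="authenticity_token" type="hidden" value="EXAMPLE+token/123=" />
  <input name="user[login]" type="text" />
  <input name="user[password]" type="password" />
</form>
'''

fields = extract_hidden_fields(SAMPLE_FORM)
print(fields["authenticity_token"])
```

        Collecting every hidden field, not just the token, means that other invisible fields the form may carry are not missed when the POST body is built.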

        2. Staying logged in while visiting other pages.

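        Looking at the page source showed that, to get the large images for any page other than the first, you have to click an image and enter the next level of the site, a path such as /items/802. When you are not logged in, clicking an image redirects you automatically to /account/sign_in, so the program has to stay logged in while it visits other pages. Before studying crawlers I knew HTTP only superficially, so I searched on Google again and learned that HTTP itself is stateless and its connections are short-lived. To give it state, so that client and server can interact as a conversation, the notion of a session is layered on top of HTTP, and sessions are in turn implemented with cookies.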

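        Packet captures showed that, throughout the client-server exchange, the Cookie header of every HTTP request contains an item named _LouDaTui_session, and the Set-Cookie header of the response carries the same item; each request sends the session value returned by the previous response. This is exactly how the material I read describes sessions, so presumably this site also keeps its sessions through _LouDaTui_session. Once I implemented this, the problem described at the start of this post was solved.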
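
        The round trip described above, taking the value from Set-Cookie and sending it back in the next Cookie header, can be sketched with the standard library's cookie parser. This is a Python 3 illustration only: the function name merge_set_cookie and the cookie values are invented, and the full program in this post does the same job with plain string slicing in setHTTPSession.

```python
from http.cookies import SimpleCookie

def merge_set_cookie(cookie_header, set_cookie_value):
    """Return a new Cookie header that includes the cookie from a Set-Cookie value."""
    jar = SimpleCookie()
    jar.load(cookie_header)      # cookies we are already sending
    jar.load(set_cookie_value)   # adds or replaces; path, HttpOnly etc. are kept as attributes
    # Only name=value pairs go back to the server, never the attributes.
    return "; ".join("%s=%s" % (name, morsel.value) for name, morsel in jar.items())

# Made-up values for illustration.
current = "lang=zh; _LouDaTui_session=old-value"
received = "_LouDaTui_session=new-value; path=/; HttpOnly"
merged = merge_set_cookie(current, received)
print(merged)
```

        Parsing the header instead of slicing it keeps any cookies that happen to come after the session cookie, and drops attributes such as path that must not be echoed back.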

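        There was one more small catch: the site's login session starts with the visit to /account/sign_in. In other words, if the cookies are cleared on the /account/sign_in page before the login button is clicked, the login itself succeeds, but later requests are redirected to /account/sign_in again. I am not sure whether this counts as a small bug in the site, and I hope someone with more experience can explain it. This held me up for quite a while; only after I changed the program to record the session value starting from the first request that fetches authenticity_token did everything work.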
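
        To make the ordering issue concrete, here is a small self-contained Python 3 experiment. It does not talk to the real site: FakeSite is a toy local server whose behavior (an anonymous cookie sid=anon set on GET /account/sign_in, upgraded to sid=user on POST) is only my assumption about what the site does. The client side uses a CookieJar so that the cookie from the very first GET is carried along automatically.

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeSite(BaseHTTPRequestHandler):
    """Toy stand-in for the real site; all behavior here is assumed."""

    def _reply(self, status, body=b"", **headers):
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name.replace("_", "-"), value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        cookie = self.headers.get("Cookie", "")
        if self.path == "/account/sign_in":
            # The anonymous session starts here, before any login.
            self._reply(200, b"sign-in form", Set_Cookie="sid=anon; Path=/")
        elif "sid=user" in cookie:
            self._reply(200, b"large image page")
        else:
            self._reply(302, Location="/account/sign_in")

    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if "sid=anon" in self.headers.get("Cookie", ""):
            # Login only upgrades a session that already exists.
            self._reply(302, Location="/items/802", Set_Cookie="sid=user; Path=/")
        else:
            self._reply(302, Location="/account/sign_in")

    def log_message(self, *args):
        pass  # keep the demo output quiet

def new_opener():
    """Opener with its own empty cookie jar and no proxy."""
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({}),
        urllib.request.HTTPCookieProcessor(CookieJar()))

def fetch(opener, url, data=None):
    """GET (or POST when data is given), follow redirects, return the body."""
    with opener.open(url, data=data, timeout=5) as resp:
        return resp.read().decode()

server = HTTPServer(("127.0.0.1", 0), FakeSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
form = urllib.parse.urlencode({"user[login]": "demo", "user[password]": "demo"}).encode()

# Correct order: the same jar sees the GET of the sign-in page before the POST.
good = new_opener()
fetch(good, base + "/account/sign_in")
fetch(good, base + "/account/sign_in", form)
good_result = fetch(good, base + "/items/802")

# Pitfall: POST with an empty jar, as if the cookies had been cleared first.
bad = new_opener()
fetch(bad, base + "/account/sign_in", form)
bad_result = fetch(bad, base + "/items/802")

server.shutdown()
server.server_close()
print(good_result)
print(bad_result)
```

        With the jar shared from the first GET, the item page comes back; with an empty jar at login time, the client ends up on the sign-in form again, which matches the behavior I ran into.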

The complete code:

# spider
import urllib, httplib
import re
import threading
import os
import sys
import datetime

Pages = 10	# number of pages to download.

Today = datetime.date.today().isoformat()
S = os.sep	# const
Root = "d:" + S + "ludatui" + S	# const
	
Prefix = "/?page="	# const
HOST = "loudatui.com"	# const
SignIn = '/account/sign_in'	# const
GetZoomLinkExp = r''	# const
ZoomLinkPattern = re.compile(GetZoomLinkExp)	# const
GetImageExp = r'.*?'	# const
ImagePattern = re.compile(GetImageExp)	# const
GetAuthKeyExp = r''	# const
AuthKeyPattern = re.compile(GetAuthKeyExp)	# const

## some sites block crawlers. use browser-like headers as a disguise.
## adjacent string literals keep stray tabs out of the User-Agent value.
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) '
							'AppleWebKit/537.36 (KHTML, like Gecko) '
							'Chrome/28.0.1500.72 Safari/537.36',
			'Connection' : 'Keep-Alive',
			'Cookie' : '__utma=233318019.434537196.1394773301.1394773301.1394773301.1; __utmb=233318019.5.6.1394773301; __utmc=233318019; __utmz=233318019.1394773301.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)' }

def getAuthKey():
	global headers
	page = myRequest(url = SignIn, headers = headers)
	authKey = AuthKeyPattern.findall(page)
	return authKey[0]

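## get the path part of a url for use in a request.
def parseUrl(url = ''):
	pos = url.find(HOST)
	if -1 == pos:
		## no host in the url, so it is already a relative path.
		return url
	return url[pos + len(HOST):]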

def setHTTPSession(session = ''):
	global headers
	if not session:
		return
	tmpCookie = str(headers.get('Cookie'))
	sessionName = '_LouDaTui_session'
	pos = tmpCookie.find(sessionName)
	if -1 != pos:
		tmpCookie = tmpCookie[0:pos - 2]
	tmpCookie += '; ' + session
	headers['Cookie'] = tmpCookie

## this method will return body.	
def myRequest(method = 'GET', url = '', body = '', headers = {}):
	## init connection
	connection = httplib.HTTPConnection(HOST)
	## send request
	connection.request(method = method, url = url, body = body, headers = headers)
	## buffer new cookie and other info
	resp = connection.getresponse()
	HTTPSession = ''
	if resp.getheader('Set-Cookie') != None:
		HTTPSession = resp.getheader('Set-Cookie').split(';')[0]
	status = resp.status
	location = resp.getheader('location')
	text = resp.read()
	## set HTTPSession
	setHTTPSession(HTTPSession)
	## connection should be closed before next request.
	## and response will be None when close.
	connection.close()
	## redirect if needed.
	if (httplib.FOUND == status) or (httplib.MOVED_PERMANENTLY == status):
		text = myRequest(method = 'GET', url = parseUrl(location), headers = headers)
	return text
	
def login(signIn = '', data = '', headers = {}):
	## login request.
	myRequest(method = 'POST', url = signIn, body = data, headers = headers)
	
def getIt(url, i, k):
	page = myRequest(url = url, headers = headers)
	picUrl = ImagePattern.findall(page)
	fname = Today + "-" + str(i) + "-" + str(k + 1) + ".jpg"
	if 1 == len(picUrl):
		urllib.urlretrieve(picUrl[0], os.path.join(Root, fname))
	else:
		print 'number of pic url is %d' % len(picUrl)
	
def parsePages(indexOfPage = 0):
	global headers
	page = myRequest(url = Prefix + str(indexOfPage), headers = headers)
	items = ZoomLinkPattern.findall(page)
	tasks = []
	for k in range(len(items)):
		try:
			t = threading.Thread(target = getIt, args = (items[k], indexOfPage, k))
			tasks.append(t)
		except:
			print "some error in %sth download." % k
			continue
	for task in tasks:
		task.start()
	for task in tasks:
		task.join(300)
	return 0
	
def main():
	## forms need to commit when login.
	formsPrev = 'utf8=%E2%9C%93&'
	forms = {
			'authenticity_token' : '',
			'user[login]' : '[email protected]',
			'user[password]' : '123456',
			'user[remember_me]' : '0',
			}
	formsEnd = '&commit=%E7%99%BB%E9%99%86'
	## update authenticity_token from sign_in page.
	forms['authenticity_token'] = getAuthKey()
	data = formsPrev + urllib.urlencode(forms) + formsEnd
	## login. the session was already created by the GET of the sign-in page.
	login(SignIn, data, headers)
	## begin to spider the pics.
	if not os.path.exists(Root):
		os.mkdir(Root)
	for n in range(Pages):
		print "Now page %s" % str(n + 1)
		parsePages(n + 1)
		print "Page %s OK\n" % str(n + 1)

if __name__ == '__main__':
	main()
	

        If you spot any problems in this post, please point them out.

        Thanks in advance, everyone.
