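I have been looking into web crawlers lately.
There are two main versions: C# and Python.
First, some background: our crawler is used as a game client. The game server is web-based, and clicking through it by hand every day gets tedious, so I am writing a web client.
I captured the traffic with HttpWatch and analyzed the packets.
The Python part below is feasibility-study code and has not been encapsulated yet.
# Server request part: feasibility study, not encapsulated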
###########################################################
#
#
# iQSRobots
# Scope: Python3 + T4
#
#
__author__ = "Eagle Zhao([email protected]"
__version__ = "$Revision: 1.0 $"
__date__ = "$Date: 2011/11/15 21:57:19 $"
__copyright__ = "Copyright (c) 2011 Eagle"
__license__ = "iQS"
###########################################################
import urllib.parse
import httplib2
http = httplib2.Http()
url = 'http://ts2.travian.tw/dorf1.php'
body = {'name': '小铃铛','password':'1838888','s1':'登陆','w':'1280:800','login': '1321368625'}
headers = {'Content-type': 'application/x-www-form-urlencoded'}
response, content = http.request(url, 'POST', headers=headers, body=urllib.parse.urlencode(body))
#print(urllib.parse.urlencode(body))
print(response)
headers = {'Cookie': response['set-cookie']}
url = 'http://ts2.travian.tw/dorf1.php'
response, content = http.request(url, 'GET', headers=headers)
#print(content.decode('utf-8'))
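The request code above sends the raw Set-Cookie value back as the Cookie header. A minimal sketch of how the two steps could be wrapped in helpers is shown here, assuming the server sets a single cookie and only needs its name=value pair returned. The function names build_login_body and cookie_header are hypothetical, and the form field names are copied from the captured request in this post.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode


def build_login_body(name, password, login_token):
    """Return the URL-encoded form body for the login POST."""
    fields = {
        "name": name,
        "password": password,
        "s1": "登陆",            # submit button value seen in the capture
        "w": "1280:800",         # screen size field seen in the capture
        "login": login_token,
    }
    return urlencode(fields)


def cookie_header(set_cookie_value):
    """Reduce a Set-Cookie value to the name=value pairs for a Cookie header.

    Assumption: attributes such as path or expires are not needed by the
    server when the cookie is sent back.
    """
    cookie = SimpleCookie()
    cookie.load(set_cookie_value)
    return "; ".join(
        "{}={}".format(key, morsel.value) for key, morsel in cookie.items()
    )


body = build_login_body("alice", "secret", "1321368625")
header = cookie_header("T3E=abc123; path=/")
```

With httplib2 these would be used as body=build_login_body(...) in the POST and headers={'Cookie': cookie_header(response['set-cookie'])} in the follow-up GET. Whether the game server accepts the trimmed header has not been verified here.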
# Parse the HTML. HTMLParser did not work very well, so in the end I decided to use regular expressions
import re
import urllib.parse

# Read the saved page source
with open('re.xml', encoding='utf-8') as file:
    page = file.read()

building_farm = []
building_links = []
id_pattern = re.compile(r'\d+')

# Isolate the image map that holds the resource fields
m = re.search('<map name="rx" id="rx">.+?</map', page, re.S)
m_b = m.group()
buildings = re.findall('<area href="build.php.+?>', m_b)

# Parse each building
for building in buildings:
    # Get building link (strip the leading href=" and the trailing quote)
    m = re.search('href="build.php.+?"', building)
    link = m.group()[6:-1]

    # Get building title, e.g. <b>伐木場</b>||等級
    m = re.search('title=".+?"', building)
    title = m.group()[7:-1]

    # Get level: the number after the last space, or 0 if there is none
    partsLevel = title.split()
    parttitle = title.split(';')
    if len(partsLevel) == 1:
        level = 0
    else:
        title = partsLevel[0]
        level = int(partsLevel[1])
    #print("resource field type", parttitle[2][:-3])
    #print("resource field level", level)

    # Add buildings info into list (the duplicate check is disabled)
    #if not link in building_links:
    building_links.append(link)
    #link = urllib.parse.urljoin(host, link) # Convert to absolute link
    #test code start
    link = urllib.parse.urljoin("http://ts2.travian.tw/", link)  # Convert to absolute link
    #test code end
    # The last number in the link is the field id
    idNum = id_pattern.findall(link)[-1]
    building_farm.append([idNum, parttitle[2][:-3], level, link])
    # Show the entry that was just added
    print(building_farm[-1])
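For comparison with the regular-expression approach above, here is a rough sketch of how the same fields might be pulled out with the standard library's html.parser. It is not the code used in this post. The SAMPLE markup is made up from the title format shown in the comment above (an escaped <b>name</b>||等級 followed by a level), and AreaCollector and parse_fields are hypothetical names.

```python
import re
from html.parser import HTMLParser


class AreaCollector(HTMLParser):
    """Collect (href, title) pairs from <area> tags that link to build.php."""

    def __init__(self):
        super().__init__()
        self.areas = []

    def handle_starttag(self, tag, attrs):
        if tag != "area":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        if href.startswith("build.php"):
            # Attribute values arrive already unescaped (&lt; becomes <).
            self.areas.append((href, attr.get("title") or ""))


# Assumed sample markup; the real page may differ.
SAMPLE = (
    '<map name="rx" id="rx">'
    '<area href="build.php?id=1" title="&lt;b&gt;伐木場&lt;/b&gt;||等級 3" />'
    '<area href="build.php?id=2" title="&lt;b&gt;農場&lt;/b&gt;||等級 0">'
    '<area href="dorf2.php" title="村莊中心">'
    '</map>'
)


def parse_fields(html):
    """Return a list of (field_id, name, level, href) tuples."""
    collector = AreaCollector()
    collector.feed(html)
    fields = []
    for href, title in collector.areas:
        # Name: the part before "||" with the <b> tags removed.
        name = re.sub(r"<[^>]+>", "", title.split("||")[0])
        # Level: trailing number in the title, or 0 if there is none.
        level_match = re.search(r"(\d+)\s*$", title)
        level = int(level_match.group(1)) if level_match else 0
        field_id = re.findall(r"\d+", href)[-1]
        fields.append((field_id, name, level, href))
    return fields


fields = parse_fields(SAMPLE)
```

The parser handles the attribute unescaping itself, so the slicing on ';' is no longer needed, but the title text still has to be split by hand.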
The code is quite messy.
I then came across the coding style section of the Python 3 tutorial.
The most important points:
Use 4-space indentation, and no tabs.
4 spaces are a good compromise between small indentation (allows greater nesting depth) and large indentation (easier to read). Tabs introduce confusion, and are best left out.
Wrap lines so that they don’t exceed 79 characters.
This helps users with small displays and makes it possible to have several code files side-by-side on larger displays.
Use blank lines to separate functions and classes, and larger blocks of code inside functions.
When possible, put comments on a line of their own.
Use docstrings.
Use spaces around operators and after commas, but not directly inside bracketing constructs: a = f(1, 2) + g(3, 4).
Name your classes and functions consistently; the convention is to use CamelCase for classes and lower_case_with_underscores for functions and methods. Always use self as the name for the first method argument (see A First Look at Classes for more on classes and methods).
Don’t use fancy encodings if your code is meant to be used in international environments. Python’s default, UTF-8, or even plain ASCII work best in any case.
Likewise, don’t use non-ASCII characters in identifiers if there is only the slightest chance people speaking a different language will read or maintain the code.
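As a small illustration of the points above, the sketch below applies them to a toy example: 4-space indentation, docstrings, comments on their own line, a CamelCase class name, lower_case_with_underscores function names, and ASCII-only identifiers. The class ResourceField and the function highest_level are made up for this illustration and are not part of the crawler code.

```python
class ResourceField:
    """A resource field with a name and an upgrade level."""

    def __init__(self, name, level=0):
        self.name = name
        self.level = level

    def describe(self):
        """Return a short, human-readable summary of the field."""
        return "{} (level {})".format(self.name, self.level)


def highest_level(fields):
    """Return the highest level among fields, or 0 if there are none."""
    # Spaces follow commas and surround operators, but not the brackets.
    return max((field.level for field in fields), default=0)


fields = [ResourceField("woodcutter", 3), ResourceField("cropland", 1)]
summary = ", ".join(field.describe() for field in fields)
```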