Python mechanize learning notes

1: Basic usage

import mechanize

# Either open the URL directly...
# response = mechanize.urlopen("http://www.hao123.com/")
# ...or build a Request object first and open that.
request = mechanize.Request("http://www.hao123.com/")
response = mechanize.urlopen(request)
print response.geturl()   # final URL, after any redirects
print response.info()     # response headers
# print response.read()   # response body
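
Some sites refuse requests that carry the default urllib-style User-Agent. A minimal sketch of attaching a custom header to the Request before opening it (the User-Agent string here is just an example value):

import mechanize

request = mechanize.Request("http://www.hao123.com/")
# add_header() sets an outgoing request header, as with urllib2.Request
request.add_header("User-Agent", "Mozilla/5.0 (compatible; mechanize-notes)")
response = mechanize.urlopen(request)
print response.geturl()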

 

2: mechanize.urlretrieve

>>> import mechanize
>>> help(mechanize.urlretrieve)
Help on function urlretrieve in module mechanize._opener:

urlretrieve(url, filename=None, reporthook=None, data=None, timeout=<object object>)

 

  • The filename parameter specifies the local path to save to (if it is omitted, urllib creates a temporary file to hold the data).
  • The reporthook parameter is a callback function, triggered when the connection to the server is made and each time a data block finishes transferring; it can be used to display the current download progress.
  • The data parameter is the data to POST to the server. The function returns a two-element tuple (filename, headers): filename is the local path the data was saved to, and headers is the server's response headers.
  • The timeout parameter sets the timeout.

reporthook(block_read, block_size, total_size) is the callback signature: block_read is the number of blocks read so far, block_size is the size of each block, and total_size is the total amount of data, in bytes. reporthook can be used to display download progress (a progress sketch follows the simple example below).

A simple example

import mechanize

def cbk(block_read, block_size, total_size):
    # print the raw progress values passed in by urlretrieve
    print block_read, block_size, total_size

url = 'http://www.hao123.com/'
local = 'd:/hao.html'
mechanize.urlretrieve(url, local, cbk)
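
The same call can be made a little more informative. The sketch below (reusing the Windows path above as an assumption) turns the three reporthook values into a rough percentage and inspects the (filename, headers) return tuple described earlier; total_size may be reported as -1 when the server sends no Content-Length.

import mechanize

def progress(block_read, block_size, total_size):
    # total_size can be -1 / unknown if the server does not send Content-Length
    if total_size > 0:
        percent = min(100.0, block_read * block_size * 100.0 / total_size)
        print "downloaded %.1f%%" % percent
    else:
        print "downloaded %d blocks of %d bytes" % (block_read, block_size)

filename, headers = mechanize.urlretrieve('http://www.hao123.com/',
                                          'd:/hao.html', progress)
print filename   # local path the data was saved to
print headers    # server response headers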

3: Form login

import os
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)      # ignore robots.txt
br.open("http://www.zhaopin.com/")
br.select_form(nr=0)             # select the first form on the page
br['loginname'] = '**'           # just register an account and put its username here
br['password'] = '**'            # ...and its password here
r = br.submit()

path = os.path.join(os.path.dirname(__file__), 'login.html')
print path
h = open(path, "w")
h.write(r.read())
h.close()
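
select_form(nr=0) and the 'loginname' / 'password' field names above assume you already know how the page is laid out. A small exploratory sketch for discovering the right form index and control names on whatever page is currently open:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.zhaopin.com/")

# enumerate the forms mechanize parsed, so the right nr= index and field names can be found
for i, form in enumerate(br.forms()):
    print "form", i, "name:", form.name, "action:", form.action
    for control in form.controls:
        # control.type is e.g. 'text', 'password', 'hidden', 'submit'
        print "    control:", control.type, control.name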

 

4: Browser

Read through the help documentation and you'll pretty much have mastered it.

Help on class Browser in module mechanize._mechanize:

class Browser(mechanize._useragent.UserAgentBase)
 |  Browser-like class with support for history, forms and links.
 |  
 |  BrowserStateError is raised whenever the browser is in the wrong state to
 |  complete the requested operation - e.g., when .back() is called when the
 |  browser history is empty, or when .follow_link() is called when the current
 |  response does not contain HTML data.
 |  
 |  Public attributes:
 |  
 |  request: current request (mechanize.Request)
 |  form: currently selected form (see .select_form())
 |  
 |  Method resolution order:
 |      Browser
 |      mechanize._useragent.UserAgentBase
 |      mechanize._opener.OpenerDirector
 |      mechanize._urllib2_fork.OpenerDirector
 |  
 |  Methods defined here:
 |  
 |  __getattr__(self, name)
 |  
 |  __init__(self, factory=None, history=None, request_class=None)
 |      Only named arguments should be passed to this constructor.
 |      
 |      factory: object implementing the mechanize.Factory interface.
 |      history: object implementing the mechanize.History interface.  Note
 |       this interface is still experimental and may change in future.
 |      request_class: Request class to use.  Defaults to mechanize.Request
 |      
 |      The Factory and History objects passed in are 'owned' by the Browser,
 |      so they should not be shared across Browsers.  In particular,
 |      factory.set_response() should not be called except by the owning
 |      Browser itself.
 |      
 |      Note that the supplied factory's request_class is overridden by this
 |      constructor, to ensure only one Request class is used.
 |  
 |  __str__(self)
 |  
 |  back(self, n=1)
 |      Go back n steps in history, and return response object.
 |      
 |      n: go back this number of steps (default 1 step)
 |  
 |  clear_history(self)
 |  
 |  click(self, *args, **kwds)
 |      See mechanize.HTMLForm.click for documentation.
 |  
 |  click_link(self, link=None, **kwds)
 |      Find a link and return a Request object for it.
 |      
 |      Arguments are as for .find_link(), except that a link may be supplied
 |      as the first argument.
 |  
 |  close(self)
 |  
 |  encoding(self)
 |  
 |  find_link(self, **kwds)
 |      Find a link in current page.
 |      
 |      Links are returned as mechanize.Link objects.
 |      
 |      # Return third link that .search()-matches the regexp "python"
 |      # (by ".search()-matches", I mean that the regular expression method
 |      # .search() is used, rather than .match()).
 |      find_link(text_regex=re.compile("python"), nr=2)
 |      
 |      # Return first http link in the current page that points to somewhere
 |      # on python.org whose link text (after tags have been removed) is
 |      # exactly "monty python".
 |      find_link(text="monty python",
 |                url_regex=re.compile("http.*python.org"))
 |      
 |      # Return first link with exactly three HTML attributes.
 |      find_link(predicate=lambda link: len(link.attrs) == 3)
 |      
 |      Links include anchors (<a>), image maps (<area>), and frames (<frame>,
 |      <iframe>).
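
A short sketch that puts the documented Browser methods together: find_link(), follow_link() and back(). The URL and the link pattern are placeholders; find_link() raises mechanize.LinkNotFoundError if nothing matches.

import re
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.hao123.com/")

# first link whose URL .search()-matches "news" (placeholder pattern)
link = br.find_link(url_regex=re.compile("news"), nr=0)
print link.url, link.text

br.follow_link(link)   # open the link; the Browser records it in its history
print br.geturl()

br.back()              # go back one step in history
print br.geturl()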