Spider (II): cookies and proxies, regular expressions, XPath parsing, XPath expressions

1. Cookie-based operations with requests

Cookie concept: HTTP itself is stateless, so when a user first visits a domain through a browser, the web server sends the client a small piece of data that is echoed back on subsequent requests, allowing the server to recognize the client and keep state between the two. That data is the cookie.
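Concretely, a cookie is one or more name=value pairs carried in a Set-Cookie response header, which the client then sends back on later requests to the same domain. A minimal sketch using only the standard library (the header value here is made up for illustration):

```python
from http import cookies

# Parse a (made-up) Set-Cookie header value the way a client would
jar = cookies.SimpleCookie()
jar.load('sessionid=abc123; Path=/; HttpOnly')

print(jar['sessionid'].value)    # the value the client will send back: abc123
print(jar['sessionid']['path'])  # attributes such as Path are kept alongside: /
```

The Session object from requests, used below, does exactly this bookkeeping automatically: it stores cookies from responses and attaches them to later requests.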

Cookie & proxy examples:

# Log in to renren.com (RenRen)
import requests
# Create a session object; requests sent through it automatically carry the session's cookies
session = requests.session()
# Specify the login URL
url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201883913543'

data = {
        'email': '17701256561',
        'icode': '',
        'origURL': 'http://www.renren.com/home',
        'domain': 'renren.com',
        'key_id': '1',
        'captcha_type': 'web_login',
        'password': '7b456e6c3eb6615b2e122a2942ef3845da1f91e3de075179079a3b84952508e4',
        'rkey': '44fd96c219c593f3c9612360c80310a3',
        'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Dm7m_NSUp5Ri_ZrK5eNIpn_dMs48UAcvT-N_kmysWgYW%26wd%3D%26eqid%3Dba95daf5000065ce000000035b120219',
    
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
# First request sent through the session: the cookies returned by the login response are stored on the session object
response = session.post(url=url, headers=headers, data=data)

# A second request, to a sub-page; the stored login cookies are attached automatically
url_ = 'http://www.renren.com/289676607/profile'
response_ = session.get(url=url_, headers=headers)

with open('./second.html','w',encoding='utf-8') as fp:
    fp.write(response_.text)

# Setting a proxy in code
import requests
import random
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
url = 'http://www.baidu.com/s'
param = {
    'ie':'utf-8',
    'wd':'ip'
}


proxy1 = {
    "http": "http://112.115.57.20:3128"
}
proxy2 = {
    'http': 'http://121.41.171.223:3128'
}
proxy3 = {
    'http': 'http://121.41.171.223:3128'
}
proxys = [proxy1,proxy2,proxy3]

proxy = random.choice(proxys)
response = requests.get(url=url,headers=headers,params=param,proxies=proxy)

print(response.text)

# To go back to your own IP, send the request without a proxy (empty proxies mapping)
requests.get(url, headers=headers, params=param, proxies={})
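Free proxies die frequently, so it is convenient to wrap the random choice in a small helper with a direct-connection fallback. A minimal sketch (pick_proxy is a made-up helper name; the pool reuses the sample addresses from above, which are not guaranteed to be live):

```python
import random

# sample pool; these addresses are illustrative, not guaranteed to be alive
proxy_pool = [
    {'http': 'http://112.115.57.20:3128'},
    {'http': 'http://121.41.171.223:3128'},
]

def pick_proxy(pool):
    """Pick a random proxy dict from the pool; an empty pool means 'go direct'."""
    if not pool:
        return {}  # requests treats an empty proxies mapping as no proxy
    return random.choice(pool)

proxy = pick_proxy(proxy_pool)
# requests.get(url, headers=headers, params=param, proxies=proxy)
```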

XPath parsing example:

Target page: a local test page (soup_text.html) whose body contains a div.song section with poets' names (百里守约, 李清照, 王安石, 苏轼, 柳宗元) and a div.tang section with a ul list of poem lines and poets (杜甫, 杜牧, 杜小月, 度蜜月, ...).

from lxml import etree
# Parse the local test page into an element tree
tree = etree.parse('./soup_text.html', etree.HTMLParser())
# (for a page fetched with requests, use tree = etree.HTML(response.text) instead)
# The third div on the page -- XPath indices start at 1
tree.xpath('//div[3]')
# Text of every <a> inside the li elements under div.tang
tree.xpath('//div[@class="tang"]/ul/li/a/text()')
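The expressions above are easier to follow against a tiny inline document. A minimal sketch (the HTML string is a made-up miniature standing in for soup_text.html):

```python
from lxml import etree

# a made-up miniature of the test page
html = '''
<html><body>
  <div class="song"><a href="#">李清照</a></div>
  <div class="tang">
    <ul>
      <li><a href="#">杜甫</a></li>
      <li><a href="#">杜牧</a></li>
    </ul>
  </div>
</body></html>
'''

tree = etree.HTML(html)
# text of every <a> under the li elements inside div.tang
names = tree.xpath('//div[@class="tang"]/ul/li/a/text()')
# XPath indices are 1-based: li[1] is the first list item
first = tree.xpath('//div[@class="tang"]/ul/li[1]/a/text()')[0]
print(names)  # ['杜甫', '杜牧']
print(first)  # 杜甫
```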

#//div[@id="main"]/div/h3/a/text()    -> titles
#//div[@id="main"]/div[1]/div//text() -> content
from lxml import etree
import requests
url = 'http://www.haoduanzi.com/category-10.html'
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }

page_text = requests.get(url=url,headers=headers).text

# Convert the page source into an etree object
tree = etree.HTML(page_text)

# The list below holds the child divs, each containing one entry's title and content
div_list = tree.xpath('//div[@id="main"]/div[@class="log cate10 auth1"]')
all_list = []
for div in div_list:
    d_l = div.xpath('./div//text()')
    content = ''.join(d_l) # the entry's text content
    title = div.xpath('./h3/a/text()')[0]
    
    all_content = title + ":" + content + '\n\n\n'
    all_list.append(all_content)
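The loop collects everything into all_list but never writes it out; a short sketch of persisting the result to a text file (the file name duanzi.txt and the sample entries are made up):

```python
import os
import tempfile

# stand-in entries shaped like the "title:content" strings built in the loop above
all_list = ['标题一:内容一\n\n\n', '标题二:内容二\n\n\n']

# written under the temp directory here; the real script could use './duanzi.txt'
out_path = os.path.join(tempfile.gettempdir(), 'duanzi.txt')
with open(out_path, 'w', encoding='utf-8') as fp:
    fp.writelines(all_list)
```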
    

 
