**Request Libraries**
1. requests
htmlText = requests.get(url, headers=headers, proxies=proxies).content
# .content and .text are similar; prefer .content, and fall back to .text when .content fails to decode.
response = requests.get(url)
# request headers: response.request.headers
# response headers: response.headers
# status code: response.status_code
# cookies: response.cookies
requests handles URL encoding and query-string assembly for GET parameters by itself:
url = 'http://www.baidu.com/s'
params = {
    'wd': ''
}
response = requests.get(url, headers=headers, params=params)
# verify=False makes requests skip CA certificate verification (e.g. for sites like 12306)
data = response.content.decode()
# POST accepts both form data and JSON payloads:
requests.post(url, data=params)   # form-encoded body
requests.post(url, json=params)   # JSON body
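A minimal runnable sketch pulling the above together (the URL, params, and UA string are placeholders, not from the original notes):

```python
import requests

url = 'http://www.example.com/s'          # placeholder target
headers = {'User-Agent': 'Mozilla/5.0'}    # placeholder UA string
params = {'wd': 'python'}

response = requests.get(url, headers=headers, params=params, timeout=10)
print(response.status_code)       # status code
print(response.request.headers)   # request headers actually sent
print(response.headers)           # response headers
data = response.content.decode()  # prefer .content; fall back to .text if decoding fails

# POST with a form body or a JSON body
requests.post(url, data=params, headers=headers)
requests.post(url, json=params, headers=headers)
```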
2. urllib (Python 3.7 environment)
Reference: http://www.cnblogs.com/zhaof/p/6910871.html
from urllib import request
url = ''
response = request.urlopen(url).read().decode('utf-8')  # or 'gbk', 'gb2312', etc.
urllib URL escaping:
import string
import urllib.parse
urllib.parse.quote(url, safe=string.printable)   # escape non-printable characters in a URL
urllib.parse.urlencode(params)                   # dict -> encoded query string
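For example (the query strings are made up for illustration):

```python
import string
import urllib.parse

# quote() percent-encodes characters outside the safe set
print(urllib.parse.quote('http://www.baidu.com/s?wd=风景', safe=string.printable))

# urlencode() turns a dict into a query string
print(urllib.parse.urlencode({'wd': '风景', 'pn': 10}))
# -> 'wd=%E9%A3%8E%E6%99%AF&pn=10'
```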
urllib with proxy IPs:
proxy = {
    # free proxy IP
    'http': ''
}
proxy_use = {
    # authenticated proxy IP
    'http': 'username:[email protected]:8000'
}
proxy_handler = urllib.request.ProxyHandler(proxy)
opener = urllib.request.build_opener(proxy_handler)
data = opener.open(url).read()
# paid proxy:
user_name = ''
pwd = ''
proxy_money = '111.111.11.1:8000'
password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, proxy_money, user_name, pwd)
handler_auth_proxy = urllib.request.ProxyBasicAuthHandler(password_manager)
opener = urllib.request.build_opener(handler_auth_proxy)
data = opener.open(url).read()
Cookie handling:
from http import cookiejar

login_form_data = {
    'username': '',
    'pwd': '',
    'formhash': '',
    'backurl': ''
}
cook_jar = cookiejar.CookieJar()
cook_handler = urllib.request.HTTPCookieProcessor(cook_jar)
opener = urllib.request.build_opener(cook_handler)
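To actually log in with the cookie-carrying opener, a hedged sketch (login_url, the form fields, and the profile URL are placeholders):

```python
import urllib.parse
import urllib.request
from http import cookiejar

login_url = 'http://example.com/login'   # placeholder
login_form_data = {'username': '', 'pwd': '', 'formhash': '', 'backurl': ''}

cook_jar = cookiejar.CookieJar()
cook_handler = urllib.request.HTTPCookieProcessor(cook_jar)
opener = urllib.request.build_opener(cook_handler)

# POST the login form; the CookieJar stores the session cookie
post_data = urllib.parse.urlencode(login_form_data).encode('utf-8')
opener.open(login_url, data=post_data)

# subsequent requests through the same opener carry the cookie
data = opener.open('http://example.com/profile').read()
```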
3. scrapy
Installation: first install VC++ 14.0 and the Twisted wheel, e.g. Twisted-18.7.0-cp37-cp37m-win_amd64 (that matches my environment; pick the Twisted build matching yours from the URL below):
https://www.lfd.uci.edu/~gohlke/pythonlibs/
Reference:
http://www.cnblogs.com/kongzhagen/p/6549053.html
pip install -U scrapy
Open cmd and run: scrapy startproject <project name>
scrapy genspider <spider name> <start domain>
After the spider is written, run in cmd: scrapy crawl <spider name>
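For reference, a minimal spider of the kind genspider scaffolds (the name, domain, and selectors are illustrative, not from the notes):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                           # used by: scrapy crawl quotes
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # response.css / response.xpath work like lxml selectors
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```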
**Parsing Libraries**
1. Regular expressions
import re
re.compile()  # optional flags: re.S, re.I
re.findall() / re.search() / re.match()
match.group(0)  # group(0) is the whole match; groups 1, 2, 3... map to the ()-groups inside the compiled pattern.
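A worked example of the group numbering (the pattern and HTML string are invented for illustration):

```python
import re

html = '<a href="http://example.com">Example</a>'
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)

match = pattern.search(html)
print(match.group(0))  # whole match: the full <a> tag
print(match.group(1))  # first () group: http://example.com
print(match.group(2))  # second () group: Example
print(pattern.findall(html))  # [('http://example.com', 'Example')]
```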
2. lxml (XPath)
from lxml import etree
selector = etree.HTML(htmlText)
tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')  # from the root
for td in tds:
    href = td.xpath('./td/p/a/@href')
    title = td.xpath('./td/p/a/text()')
3. BeautifulSoup
Usage: https://blog.csdn.net/maverick17/article/details/79610050
Install with: pip install beautifulsoup4 (imported as bs4)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')   # or BeautifulSoup(html, 'xml')
title = soup.select('h2')[0].get_text().replace('，', ',')  # swap the full-width comma for ASCII
tags_message = soup.select('.u-tag i')
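A self-contained sketch with a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = '''
<h2>Title，with a full-width comma</h2>
<div class="u-tag"><i>python</i><i>spider</i></div>
'''
soup = BeautifulSoup(html, 'html.parser')
title = soup.select('h2')[0].get_text().replace('，', ',')
tags = [i.get_text() for i in soup.select('.u-tag i')]
print(title, tags)
```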
4. pyquery
Reference: https://www.cnblogs.com/lei0213/p/7676254.html
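The linked tutorial boils down to roughly this (a minimal sketch; the HTML and selectors are illustrative):

```python
from pyquery import PyQuery as pq

html = '<ul><li class="item"><a href="/a">first</a></li></ul>'
doc = pq(html)                        # also accepts url=... or filename=...
print(doc('li.item a').text())        # 'first'
print(doc('li.item a').attr('href'))  # '/a'
for a in doc('a').items():            # iterate matched elements
    print(a.text())
```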
**Request Headers (User-Agent)**
In Chrome, enter about://version in the address bar and press Enter; copy the User-Agent string shown there.
headers = {'User-Agent': ''}  # paste your UA string here
**Proxy IPs**
URL: http://www.xicidaili.com/nn/
Just scrape the proxy list from that site.
Source code link:
https://blog.csdn.net/fantacy10000/article/details/76724145
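Once proxies are scraped, plugging one into requests looks roughly like this (the IP is a placeholder; httpbin.org/ip just echoes the caller's IP):

```python
import requests

proxies = {
    'http': 'http://111.111.11.1:8000',   # placeholder proxy from the scraped pool
    'https': 'http://111.111.11.1:8000',
}
try:
    r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
    print(r.text)  # the proxy works if this shows the proxy's IP
except requests.exceptions.RequestException:
    print('proxy failed; try the next one in the pool')
```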
**Selenium and Simulated Login**
Docs: https://www.cnblogs.com/zhaof/p/6953241.html
Download the browser driver first (e.g. chromedriver for Chrome).
from selenium import webdriver
Automation:
browser = webdriver.Chrome()
browser.get("http://www.taobao.com")
lis = browser.find_elements_by_css_selector('.service-bd li')
input_str = browser.find_element_by_id('q')
input_str.send_keys("ipad")  # type the search keyword
# executing JS commands:
browser.get("http://www.zhihu.com/explore")
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll the page to the bottom
browser.execute_script('alert("To Bottom")')
# click event: element.click()
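An end-to-end sketch combining the calls above (selenium 3.x API as used in these notes; the site and selectors are illustrative):

```python
from selenium import webdriver

browser = webdriver.Chrome()  # requires chromedriver on PATH
browser.get('http://www.taobao.com')

input_str = browser.find_element_by_id('q')  # search box
input_str.send_keys('ipad')                  # type the keyword
input_str.submit()                           # submit the search form

# scroll to the bottom of the results page
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')

lis = browser.find_elements_by_css_selector('.service-bd li')
for li in lis:
    print(li.text)

browser.quit()
```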
login_form_data = {
    'username': '',
    'pwd': '',
    'formhash': '',
    'backurl': ''
}
session = requests.session()
login_response = session.post(login_url, data=login_form_data, headers=headers)
data = session.get(real_url)  # the session carries the login cookies
## Data storage (a minimal JSON/CSV sketch follows this list):
json
csv
mongodb
redis
mysql
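A minimal sketch for the first two options, JSON and CSV (the item fields are invented for illustration):

```python
import csv
import json

items = [{'title': 'demo', 'url': 'http://example.com'}]

# JSON
with open('items.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

# CSV
with open('items.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url'])
    writer.writeheader()
    writer.writerows(items)
```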