Python is a cross-platform, object-oriented, dynamically typed programming language. Python is free software: the source code of CPython, the reference interpreter, is released under the GPL-compatible PSF License. As the language has gained new versions and features, Python is increasingly used for standalone, large-scale projects.
Fetching a page quickly: use urllib's most basic fetch capability to save the contents of the Baidu home page to a local directory.
import urllib.request

res = urllib.request.urlopen("https://www.baidu.com")
html = res.read()          # read the body once; a second read() would return b""
print(html.decode("utf-8"))

with open("./test.html", "wb") as f:   # save a local copy
    f.write(html)
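The read-decode-save pipeline can be tried without a network connection, because urlopen also understands data: URLs (the URL content and the file name below are illustrative):

```python
import urllib.request

# urlopen handles data: URLs, so the pipeline can be exercised offline
res = urllib.request.urlopen("data:text/html;charset=utf-8,<b>hello</b>")
body = res.read()            # bytes; read() drains the response, so keep the result
print(body.decode("utf-8"))

with open("demo.html", "wb") as f:   # illustrative file name
    f.write(body)
```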
Making a POST request: the example above fetched Baidu with a GET request; the following uses urllib to send a POST request instead.
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({"hello": "lyshark"}), encoding="utf-8")
print(data)
response = urllib.request.urlopen("http://www.baidu.com/post", data=data)
print(response.read())
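urlencode just flattens a dict into the application/x-www-form-urlencoded wire format, so that step can be checked on its own before any request is sent (the field values here are illustrative):

```python
import urllib.parse

# a dict of form fields becomes "key=value" pairs joined by "&"
payload = {"hello": "lyshark", "page": 1}
encoded = urllib.parse.urlencode(payload)
print(encoded)                   # hello=lyshark&page=1

data = encoded.encode("utf-8")   # urlopen's data= parameter expects bytes
print(data)
```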
Setting a TIMEOUT: give the request a timeout instead of letting the program wait indefinitely for a result.
import urllib.request

response = urllib.request.urlopen("http://www.baidu.com", timeout=1)
print(response.read())
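When the deadline passes, urlopen raises urllib.error.URLError (or socket.timeout from a read in progress), so a robust fetch wraps the call. A minimal sketch; the helper name and the non-routable test address are illustrative:

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=1.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as res:
            return res.read()
    except (urllib.error.URLError, socket.timeout):
        return None

# 10.255.255.1 is a non-routable address, so this is expected to print None
print(fetch("http://10.255.255.1/", timeout=0.5))
```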
Getting site status: status, getheaders(), and getheader("server") return the status code and header information.
import urllib.request

res = urllib.request.urlopen("https://www.python.org")
print(type(res))
print(res.status)
print(res.getheaders())
print(res.getheader("server"))
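The numeric code in res.status can be mapped to its standard reason phrase with the stdlib http.HTTPStatus enum, which is handy when logging responses:

```python
from http import HTTPStatus

# translate a numeric status code into its standard reason phrase
print(HTTPStatus(200).phrase)   # OK
print(HTTPStatus(404).phrase)   # Not Found
```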
Disguising the request: add header fields to the request so you can customize the headers sent to the site and avoid being blocked.
from urllib import request, parse

url = "http://www.baidu.com"
headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Host": "mkdirs.org"
}
params = {"name": "LyShark"}    # renamed from "dict" to avoid shadowing the builtin
data = bytes(parse.urlencode(params), encoding="utf8")
req = request.Request(url=url, data=data, headers=headers, method="POST")
response = request.urlopen(req)
print(response.read().decode("utf-8"))
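A Request object can be inspected before it is sent, which is a convenient offline check that the headers were attached; note that urllib stores header names capitalized ("User-agent"):

```python
from urllib import parse, request

headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
data = bytes(parse.urlencode({"name": "LyShark"}), encoding="utf8")
req = request.Request("http://www.baidu.com", data=data,
                      headers=headers, method="POST")

print(req.get_method())              # POST
print(req.has_header("User-agent"))  # True (urllib capitalizes stored names)
print(req.get_header("User-agent"))
```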
Simple URL page concatenation:
def Get_Url(target, start, ends):
    urls = []
    for i in range(start, ends):
        url = target + "/" + str(i)
        urls.append(url)
    return urls

if __name__ == "__main__":
    url = Get_Url("https://jq.qq.com/?_wv=1027&k=sqgP9S9Y", 1, 10)
    print(url)
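String concatenation works for this fixed pattern, but urllib.parse.urljoin handles trailing slashes and relative references more safely. A small sketch with an illustrative base URL:

```python
from urllib.parse import urljoin

base = "https://example.com/articles/"   # illustrative base URL

# urljoin resolves a relative reference against the base, normalizing slashes
print(urljoin(base, "1"))     # https://example.com/articles/1
print(urljoin(base, "./2"))   # https://example.com/articles/2

pages = [urljoin(base, str(i)) for i in range(1, 4)]
print(pages)
```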
Using the requests library:
import re
import requests

head = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}

if __name__ == "__main__":
    ret = requests.get(url="https://jq.qq.com/?_wv=1027&k=sqgP9S9Y", headers=head, timeout=1)
    # the original pattern was garbled in the source; a typical image-src pattern:
    all_pic_link = re.findall('<img src="(.*?)"', ret.text)
    print(all_pic_link)
A simple image scraper:
import re
import urllib.request

def open_url(url):
    ret = urllib.request.Request(url)
    ret.add_header("user-agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36")
    page = urllib.request.urlopen(ret)
    html = page.read().decode("utf-8")
    return html

def get_img(html):
    # the original pattern was garbled in the source; a typical image-src pattern:
    ret = re.findall('<img src="([^"]+)"', html)
    for each in ret:
        filename = each.split("/")[-1]
        print("Full path:", each)
        print("File name:", filename)
        urllib.request.urlretrieve(each, filename, None)

if __name__ == "__main__":
    url = open_url("https://jq.qq.com/?_wv=1027&k=sqgP9S9Y")
    get_img(url)
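Deriving the local file name with split("/") breaks when a URL carries a query string; parsing the URL first and taking the path's basename is more robust. A sketch with illustrative URLs:

```python
import posixpath
from urllib.parse import urlparse

def name_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    path = urlparse(url).path            # e.g. "/img/cat.jpg"
    return posixpath.basename(path)      # URL paths always use "/" separators

print(name_from_url("https://example.com/img/cat.jpg"))           # cat.jpg
print(name_from_url("https://example.com/img/cat.jpg?size=big"))  # cat.jpg
```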
Scraping the daily CVE vulnerability list:
import re
import requests
from bs4 import BeautifulSoup

head = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
def Get_CVE(url):
    new_cve = []
    ret = requests.get(url=url, headers=head, timeout=3)
    bs = BeautifulSoup(ret.text, "html.parser")
    for i in bs.find_all("a"):
        href = i.get("href")
        new_cve.append(href)
    return new_cve
def Get_Number(cve_list):    # renamed from "list" to avoid shadowing the builtin
    new = []
    for i in cve_list:
        temp = re.findall(r"[0-9]{1,}-[0-9]{1,}", str(i))
        if temp:             # findall returns a list; take the match, skip misses
            new.append("CVE-{}".format(temp[0]))
    return new
if __name__ == "__main__":
    url = "https://cassandra.cerias.purdue.edu/CVE_changes/today.html"
    cve = Get_CVE(url)
    print(Get_Number(cve))
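The number-extraction step can be checked offline against sample hrefs of the shape that page links use (the sample strings are illustrative):

```python
import re

# sample href values of the form the scraper collects (illustrative)
links = ["source/2024-1234", "source/2023-99999", "index.html"]

cves = []
for href in links:
    m = re.search(r"[0-9]{1,}-[0-9]{1,}", href)
    if m:                                  # skip hrefs with no year-number pair
        cves.append("CVE-{}".format(m.group(0)))

print(cves)   # ['CVE-2024-1234', 'CVE-2023-99999']
```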