Note: these are personal notes; they are not detailed and are for reference only.
Scraping images, broken down into sub-problems:
Fetch the page content.
The requests module: its main job is to simulate browser behavior, sending HTTP requests and handling HTTP responses.
import requests # widely considered the module whose interface is closest to how a person operates a browser
import urllib   # standard library
import urllib2  # Python 2 only; folded into urllib.request / urllib.error in Python 3
import urllib3  # third-party library that requests itself is built on
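For comparison, the same fetch using only the standard library (a minimal sketch with urllib.request from Python 3; the URL is the lab host used throughout these notes):
from urllib.request import Request, urlopen
req = Request("http://10.9.47.71/pyspider/", headers = {"User-Agent": "Mozilla/5.0"})
with urlopen(req, timeout = 5) as res:
    html = res.read()  # bytes, like res.content in requests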
The basic logic of handling page content with the requests module:
Request method | Description |
---|---|
requests.get() | GET method |
requests.post() | POST method |
requests.head() | Returns only the response headers, with no response body. |
requests.options() | OPTIONS method |
requests.put() | PUT method |
requests.delete() | DELETE method |
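A quick way to see the get()/head() difference (a sketch against the lab URL used later in these notes):
import requests
url = "http://10.9.47.71/pyspider/"
print(len(requests.get(url = url).content))   # length of the full response body
print(len(requests.head(url = url).content))  # 0: HEAD carries headers only, no body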
Parameter | Meaning |
---|---|
url | the request URL |
headers | custom request headers |
params | GET parameters to send |
data | POST parameters to send |
timeout | request timeout, in seconds |
files | file-upload data stream |
Attribute | Description |
---|---|
response.text | response body (as text) |
response.content | response body (as bytes) |
response.status_code | response status code |
response.url | the URL the request was sent to |
response.headers | response headers |
response.request.headers | request headers |
response.cookies | cookie information |
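A minimal sketch tying the three tables together (same lab URL; the printed values depend on the server):
import requests
res = requests.get(url = "http://10.9.47.71/pyspider/")
print(res.status_code)      # e.g. 200
print(res.url)              # the URL the request was sent to
print(res.headers)          # response headers
print(res.request.headers)  # headers actually sent, including requests' default User-Agent
print(res.text[:100])       # first 100 characters of the body as text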
Extract the image addresses from the page content.
That means searching the target string for substrings that match a pattern.
A regular expression (RE) is a string made up of ordinary characters and special symbols that matches, according to some pattern, a family of strings sharing common features.
>>> import re
>>> s = "I say food not Good"
>>> re.findall('ood',s)
['ood', 'ood']
>>> re.findall(r"[fG]ood", s)
['food', 'Good']
>>> re.findall(r"[a-z]ood", s)
['food']
>>> re.findall(r"[A-Z]ood", s)
['Good']
>>> re.findall(r"[0-9a-zA-Z]ood", s)
['food', 'Good']
>>> re.findall(r"[^a-z]ood",s)
['Good']
>>> re.findall('.ood',s)
['food', 'Good']
>>> re.findall(r'food|Good|not',s)
['food', 'not', 'Good']
>>> re.findall(r".o{1,2}.", s)
['food', 'not', 'Good']
>>> re.findall('o*',s)
['', '', '', '', '', '', '', 'oo', '', '', '', 'o', '', '', '', 'oo', '', '']
>>>
>>> s = "How old are you? I'm 24!"
>>> re.findall(r"[0-9][0-9]", s)
>>> s = "How old are you? I'm 24!"
>>> re.findall(r"[0-9]{1,2}", s)
['24']
>>> re.findall(r"\d{1,2}", s)
['24']
>>> re.findall(r"\w", s)
['H', 'o', 'w', 'o', 'l', 'd', 'a', 'r', 'e', 'y', 'o', 'u', 'I', 'm', '2', '4']
>>>
>>> s = 'I like google not ggle goooogle and gogle'
>>> re.findall('o+',s)
['oo', 'o', 'oooo', 'o']
>>> re.findall('go+',s)
['goo', 'goooo', 'go']
>>> re.findall('go+gle',s)
['google', 'goooogle', 'gogle']
>>> re.findall('go?gle',s)
['ggle', 'gogle']
>>> re.findall('go{1,2}gle',s)
['google', 'gogle']
>>>
Token | Description |
---|---|
. | matches any single character except newline; \. matches a literal . |
[…x-y…] | matches any single character in the set |
[^…x-y…] | matches any single character not in the set |
\d | matches any digit; equivalent to [0-9] |
\w | matches any digit, letter, or underscore; equivalent to [0-9a-zA-Z_] |
\s | matches whitespace; equivalent to [ \t\n\r\f\v] |
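\s is not exercised in the examples above; a quick check (my own example):
>>> import re
>>> re.findall(r"\w+\s", "How old are you?")
['How ', 'old ', 'are ']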
Token | Description |
---|---|
string | matches the literal string |
string1|string2 | matches string1 or string2 |
* | the preceding character occurs 0 or more times |
+ | the preceding character occurs 1 or more times |
? | the preceding character occurs 0 or 1 times |
{m,n} | the preceding character occurs at least m and at most n times |
Token | Description |
---|---|
^ | matches the start of the string; inside a character set it negates the set |
$ | matches the end of the string |
\b | matches a word boundary (a word is a run of \w characters) |
() | groups part of the pattern |
\number | matches the saved subgroup with that number |
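^, $, \b, and groups do not appear in the examples above; a few checks (my own examples):
>>> import re
>>> s = 'I like google not ggle goooogle and gogle'
>>> re.findall(r"^I", s)
['I']
>>> re.findall(r"gogle$", s)
['gogle']
>>> re.findall(r"\bgo\w*", s)
['google', 'goooogle', 'gogle']
>>> re.search(r"(\w+) \1", "bye bye world").group()
'bye bye'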
Core function | Description |
---|---|
re.findall() | finds all (non-overlapping) occurrences of the pattern in the string; returns a list of the matched strings |
re.match() | tries to match the pattern at the start of the string; returns a match object on success, otherwise None |
re.search() | finds the first occurrence of the pattern anywhere in the string; returns a match object on success, otherwise None |
m.group() | after match() or search() succeeds, the returned match object's group() method gives the matched text |
re.finditer() | same as findall(), but returns an iterator rather than a list; the iterator yields one match object per match |
re.split() | splits the string into a list using the pattern as the delimiter; strings have a similar split() method, but regular expressions are more flexible |
re.sub() | replaces every substring matching the pattern with a new string |
>>> m = re.match('goo','I like google not ggle goooogle and gogle')
>>> type(m)
<class 'NoneType'>
>>> m = re.match('I','I like google not ggle goooogle and gogle')
>>> type(m)
<class 're.Match'>
>>> m.group()
'I'
>>> m = re.search('go{3,}','I like google not ggle goooogle and gogle')
>>> m.group()
'goooo'
>>> m = re.finditer('go*','I like google not ggle goooogle and gogle')
>>> list(m)
[<re.Match object; span=(7, 10), match='goo'>, <re.Match object; span=(10, 11), match='g'>, <re.Match object; span=(18, 19), match='g'>, <re.Match object; span=(19, 20), match='g'>, <re.Match object; span=(23, 28), match='goooo'>, <re.Match object; span=(28, 29), match='g'>, <re.Match object; span=(36, 38), match='go'>, <re.Match object; span=(38, 39), match='g'>]
>>> m = re.split(r'\.|-','hello-world.zs')
>>> m
['hello', 'world', 'zs']
>>> s = "hi x.Nice to meet you, x."
>>> s = re.sub('x','ZS',s)
>>> s
'hi ZS.Nice to meet you, ZS.'
>>>
Scrape the page's images with a Python script:
Wrap each step in a function.
import requests
url = "http://10.9.47.71/pyspider/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}
def getHtml(url):
    res = requests.get(url = url, headers = headers)
    return res.content
# getHtml() returns bytes; call .decode() on the result to get text
print(getHtml(url = url))
'''
Sample image paths seen in the page source:
style/u1257164168471355846fm170s9A36CD0036AA1F0D5E9CC09C0100E0E3w6.jpg
style/u18825255304088225336fm170sC213CF281D23248E7ED6550F0100A0E1w.jpg
Pattern derived from them:
style/\w*\.jpg
'''
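A quick sanity check of that pattern against one of the sample paths (my own check):
>>> import re
>>> re.findall(r"style/\w*\.jpg", "style/u1257164168471355846fm170s9A36CD0036AA1F0D5E9CC09C0100E0E3w6.jpg")
['style/u1257164168471355846fm170s9A36CD0036AA1F0D5E9CC09C0100E0E3w6.jpg']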
import requests
import re
url = "http://10.9.47.71/pyspider/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}
def getHtml(url):
    res = requests.get(url = url, headers = headers)
    return res.content
def getImgPathList(html):
    imgPathList = re.findall(r"style/\w*\.jpg", html)
    return imgPathList
for imgPath in getImgPathList(getHtml(url = url).decode()):
    print(imgPath)
import requests
url = "http://10.9.47.71/pyspider/"
img_path = "style/u401307265719758014fm173s0068CFB1485C3ECA44B8C5E5030090F3w21.jpg"
img_url = url + img_path
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}
def get_html(url):
    res = requests.get(url = url, headers = headers)
    return res.content
def save_img(img_save_path, img_url):
    # note: the ./images directory must already exist, or open() raises FileNotFoundError
    with open(img_save_path, "wb") as f:
        f.write(get_html(url = img_url))
save_img("./images/1.jpg", img_url)
import requests
import re
import time
import os
url = "http://10.9.47.71/pyspider/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36"
}
def get_html(url):
    res = requests.get(url = url, headers = headers)
    return res.content
def get_img_path_list(html):
    img_path_list = re.findall(r"style/\w*\.jpg", html)
    return img_path_list
def img_download(img_save_path, img_url):
    with open(img_save_path, "wb") as f:
        f.write(get_html(url = img_url))
html = get_html(url = url).decode()
img_path_list = get_img_path_list(html = html)
os.makedirs("./images", exist_ok = True)  # added safeguard: create the save directory if it does not exist
for img_path in img_path_list:
    img_url = url + img_path
    img_save_path = f"./images/{time.time()}.jpg"  # timestamp as a (mostly) unique filename
    img_download(img_save_path = img_save_path, img_url = img_url)
import requests
url = "http://10.9.47.71/php/array/get.php"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"
}
res = requests.get(url = url, headers = headers)
# print(res.text)
# print(res.status_code)  # check the status code
# print(res.headers)
# print(res.url)
print(res.request.headers)
import requests
url = "http://10.9.47.71/php/array/get.php"
# url = "http://10.9.47.71/php/array/get.php?username=ZS&password=123456"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"
}
params = {
    "username": "ZS",
    "password": "123456"
}
res = requests.get(url = url, headers = headers, params = params)
print(res.text)
import requests
url = "http://10.9.47.71/php/array/post.php"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0"
}
data = {
    "username": "ZS",
    "password": "123456"
}
res = requests.post(url = url, headers = headers, data = data)
print(res.text)
import requests
url = "http://10.9.47.71/dvwa_2.0.1/vulnerabilities/upload/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.5195.102 Safari/537.36",
    "Cookie": "security=low; PHPSESSID=378olurk9upvuo9sspecnl46c2"
}
data = {
    "MAX_FILE_SIZE": "100000",
    "Upload": "Upload"
}
files = {
    "uploaded": ("3.php",
                 b"",
                 "image/png")
}
# allow_redirects = False keeps requests from following any redirect, so the raw response can be inspected
res = requests.post(url = url, headers = headers, data = data, files = files, allow_redirects = False)
print(res.status_code)
print(res.headers)
import requests
url = "http://10.9.47.71/php/functions/sleep.php"
def timeout(url):
    try:
        res = requests.get(url = url, timeout = 3)
        return res.text
    except requests.exceptions.ReadTimeout:
        return "timeout"
print(timeout(url))
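A side note (my addition): requests.exceptions.Timeout is the common parent of ConnectTimeout and ReadTimeout, so a broader handler covers both the connect and the read phase:
import requests
def timeout_any(url):
    try:
        res = requests.get(url = url, timeout = 3)
        return res.text
    except requests.exceptions.Timeout:  # parent of ConnectTimeout and ReadTimeout
        return "timeout"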