Image scraping is a common web-crawling technique for downloading images from web pages and saving them to a local folder. When the number of images is large, however, the scraper can run out of memory and crash. This article shows how to perform large-scale image scraping in Python, presents techniques for keeping memory usage under control, and explains how to compute and evaluate image quality metrics.
To implement the scraper, we first import the necessary libraries and modules, such as pickle, logging, and datetime. The pickle module serializes and deserializes objects so that the processing results can be saved to a file; the logging module records the program's run-time log for debugging and monitoring; and the datetime module handles dates and times, which we use for timestamped image file names and log entries.
import concurrent.futures
import logging
import os
import pickle
from datetime import datetime
from io import BytesIO
from random import choice
from time import sleep
from typing import Optional

import cv2
import numpy as np
import requests
from brisque import BRISQUE
from PIL import Image
from requests.exceptions import Timeout
from skimage import color

from headers import HEADERS
MAX_RETRIES = 3 # Number of times the crawler should retry a URL
INITIAL_BACKOFF = 2 # Initial backoff delay in seconds
DEFAULT_SLEEP = 10 # Default sleep time in seconds after a 429 error
# Shared BRISQUE model instance; url=False means score() receives image arrays rather than URLs
brisque = BRISQUE(url=False)
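The process_image function later in this article calls a helper named get_brisque_score that the original listing leaves out. A minimal sketch, assuming the brisque package's score() method accepts an RGB numpy array (as in its documented usage), could look like this:
def get_brisque_score(response):
    """
    Compute the BRISQUE no-reference quality score for the image in a response.
    Lower scores generally indicate better perceptual quality.
    Args:
        response (requests.Response): The response object containing the image data.
    Returns:
        float: The BRISQUE score of the image.
    """
    # Decode the image bytes to an RGB array and score it with the shared model
    image = Image.open(BytesIO(response.content)).convert('RGB')
    score = brisque.score(np.asarray(image))
    # Release the decoded image to keep memory usage down
    del image
    return score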
To record the program's run-time log, we set up a logger that writes to a file. Using the logging module, we create a logger named "image-scraper" and set its level to INFO. We then create a FileHandler that writes to a log file alongside the script, give it a format that includes the log level, thread name, timestamp, and message, and attach it to the logger.
# --- SETUP LOGGER ---
filename = 'image-scraper.log'
filepath = os.path.dirname(os.path.abspath(__file__))
# create file path for log file
log_file = os.path.join(filepath, filename)
# create a FileHandler to log messages to the log file
handler = logging.FileHandler(log_file)
# set the log message format
handler.setFormatter(
    logging.Formatter(
        '%(levelname)s %(threadName)s (%(asctime)s): %(message)s'
    )
)
# create a logger with the given name and log level
logger = logging.getLogger('image-scraper')
# prevent log records from propagating to the root logger (which would also log to the console)
logger.propagate = False
logger.setLevel(logging.INFO)
# add the FileHandler to the logger
logger.addHandler(handler)
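With this configuration in place, a call such as the following writes a formatted record to image-scraper.log rather than to the console (the timestamp in the comment is illustrative):
logger.info('scraper started')
# appends a line like: INFO MainThread (2024-01-01 12:00:00,000): scraper started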
To evaluate image quality, we compute several metrics such as brightness, sharpness, contrast, and colorfulness. We define a function get_image_quality_metrics that takes a response object containing the image data and returns a dict of metrics. Inside the function, we first decode the image bytes into an array with numpy and cv2, then process it with cv2 and skimage to compute each metric. Specifically:
def get_image_quality_metrics(response):
"""
Calculate various image quality metrics for an image.
Args:
response (requests.Response): The response object containing the image data.
Returns:
dict: A dict of image quality metrics including brightness, sharpness, contrast, and colorfulness.
"""
    image_array = np.frombuffer(response.content, np.uint8)
    image = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
    if image is None:
        raise ValueError('response body could not be decoded as an image')
    metrics = dict()
# Calculate brightness
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
metrics['brightness'] = np.mean(gray)
# Calculate sharpness using variance of Laplacian
metrics['sharpness'] = cv2.Laplacian(gray, cv2.CV_64F).var()
# Calculate contrast using root mean squared contrast
metrics['contrast'] = np.sqrt(np.mean((gray - np.mean(gray)) ** 2))
    # Estimate noise with the global pixel variance (a rough proxy;
    # median absolute deviation would be more robust)
    metrics['noise'] = np.var(image)
    # Convert BGR (OpenCV's channel order) to RGB before using skimage
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Calculate saturation as the mean of the HSV saturation channel
    hsv = color.rgb2hsv(rgb)
    metrics['saturation'] = np.mean(hsv[:, :, 1])
    # Calculate colorfulness as the mean chroma in CIELAB space
    lab = color.rgb2lab(rgb)
    a, b = lab[:, :, 1], lab[:, :, 2]
    metrics['colorfulness'] = np.sqrt(np.mean(a ** 2 + b ** 2))
    # Get the dimensions of the image
height, width, _ = image.shape
metrics['height'] = height
metrics['width'] = width
return metrics
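As a quick sanity check, the function can be exercised against a single direct image URL (the URL below is a hypothetical placeholder):
resp = requests.get('https://example.com/sample.jpg', timeout=20)  # hypothetical URL
for name, value in get_image_quality_metrics(resp).items():
    print(f'{name}: {value:.2f}')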
To download an image, we send a GET request to its URL and obtain a response object. We define a function send_request that takes a URL and returns a response object, handling errors such as timeouts, non-200 status codes, and 429 rate-limit responses along the way. To avoid being blocked or throttled by the site, the request is routed through a proxy server with a randomly chosen request header. Specifically:
def send_request(url: str) -> Optional[requests.Response]:
    """
    Sends a GET request to the specified URL through a proxy, retrying with
    exponential backoff on timeouts, and returns a response object.
    Args:
        url (str): The URL to send the GET request to
    Returns:
        requests.Response: The response object, or None if every retry failed
        or the URL was skipped.
    """
retry_count = 0
backoff = INITIAL_BACKOFF
header = choice(HEADERS)
    # Yiniuyun (16yun) enhanced crawler proxy service
proxyHost = "www.16yun.cn"
proxyPort = "31111"
    # proxy authentication credentials
proxyUser = "16YUN"
proxyPass = "16IP"
    # build the proxy URL from the proxy information
    proxyMeta = f"http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }
while retry_count < MAX_RETRIES:
try:
# Send a GET request to the website and return the response object
req = requests.get(url, headers=header, proxies=proxies, timeout=20)
req.raise_for_status()
logger.info(f"Successfully fetched {url}")
return req
        except Timeout:
            # Handle timeout error: log it, wait for the backoff delay,
            # then retry with a doubled delay
            logger.error(f"Timeout error for {url}")
            retry_count += 1
            sleep(backoff)
            backoff *= 2
except requests.exceptions.HTTPError as e:
# Handle HTTP error: log the error and check the status code
logger.error(f"HTTP error for {url}: {e}")
status_code = e.response.status_code
if status_code == 429:
# Handle 429 error: wait for some time and retry
logger.info(f"Waiting for {DEFAULT_SLEEP} seconds after 429 error")
sleep(DEFAULT_SLEEP)
retry_count += 1
            elif status_code in (403, 404):
# Handle 403 or 404 error: break the loop and return None
logger.info(f"Skipping {url} due to {status_code} error")
break
            else:
                # Handle other HTTP errors: log and re-raise
                logger.error(f"Other HTTP error for {url}: {e}")
                raise
# Return None if the loop ends without returning a response object
return None
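send_request depends on a local headers module exporting HEADERS, a list of request-header dicts to rotate through. That module is not shown in the original, so here is a minimal sketch with placeholder User-Agent strings:
# headers.py -- minimal sketch; in practice use a larger pool of real browser headers
HEADERS = [
    {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'},
    {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'},
]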
To extract the image data from a response object and compute its quality metrics and BRISQUE score, we define a function process_image that takes a response object and a URL and returns a dict of image information. In this function, we use a "with" statement to manage the output file and "del" statements to release objects that are no longer needed, which helps keep memory usage down. Specifically:
def process_image(response, url):
"""
Process an image from a response object and calculate its quality metrics and BRISQUE score.
Args:
response (requests.Response): The response object containing the image data.
url (str): The URL of the image.
Returns:
dict: A dict of image information including quality metrics and BRISQUE score.
"""
# Open the image data from the response object and convert it to RGBA format
image = Image.open(BytesIO(response.content)).convert('RGBA')
# Create a folder named "images" to store the downloaded images
os.makedirs('images', exist_ok=True)
    # Use the current timestamp (with microseconds, so that concurrent threads
    # do not collide on the same file name) as the image file name
    date_time = datetime.now().strftime('%Y-%m-%d_%H-%M-%S-%f')
    # Open a file named after the timestamp and write the image data to it
    with open(f'images/{date_time}.png', 'wb') as f:
        image.save(f, 'PNG')
    # Record the source URL and the BRISQUE score of the image
    image_info = dict()
    image_info['url'] = url
    image_info['brisque'] = get_brisque_score(response)
    # Calculate the other quality metrics of the image and add them to the dict
    image_info.update(get_image_quality_metrics(response))
# Delete the response object and the image object to free up memory
del response
del image
# Return the dict of image information
return image_info
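To check that these measures actually keep peak memory down, a call can be wrapped with the standard library's tracemalloc module. This is a sketch; response and url are assumed to come from an earlier send_request call:
import tracemalloc

tracemalloc.start()
info = process_image(response, url)  # response/url assumed from an earlier fetch
current, peak = tracemalloc.get_traced_memory()
logger.info(f'process_image peak memory: {peak / 1024 / 1024:.1f} MiB')
tracemalloc.stop()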
To improve the program's efficiency and concurrency, we use a thread pool to run the fetch task for each website and then save the results to a file. With the concurrent.futures module, we create a ThreadPoolExecutor and submit one task per site. (For brevity, the example submits the gallery home pages themselves; a real crawler would first parse those pages and submit the direct image URLs it extracts.) Specifically:
# Create a list of websites to scrape images from
websites = [
'https://unsplash.com/',
'https://pixabay.com/',
'https://www.pexels.com/',
'https://www.freeimages.com/',
'https://stocksnap.io/',
]
# Map each future to the website it was submitted for
futures = {}
# Create a thread pool with 10 threads and submit a task for each website
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for website in websites:
        # Submit a task to send a request to the website and get a response object
        futures[executor.submit(send_request, website)] = website
# Collect the image information dicts as the tasks complete
results = []
for future in concurrent.futures.as_completed(futures):
    website = futures[future]
    # Get the response object from the future object
    response = future.result()
    # Skip the website if the request failed
    if response is None:
        continue
    try:
        # Process the response object and collect the image information dict
        results.append(process_image(response, website))
    except Exception as e:
        # A non-image response (e.g. an HTML page) cannot be decoded; log and move on
        logger.error(f"Failed to process image from {website}: {e}")
# Serialize and save the results list to a file using pickle module
with open('results.pkl', 'wb') as f:
pickle.dump(results, f)
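The pickled results can later be reloaded for analysis, for example to compare BRISQUE scores across the downloaded images:
with open('results.pkl', 'rb') as f:
    saved = pickle.load(f)
for item in saved:
    print(item['url'], item['brisque'], item['width'], item['height'])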
This article showed how to perform large-scale image scraping in Python, along with techniques for keeping memory usage in check and for computing and evaluating image quality metrics. We used the requests library to send GET requests through a proxy server with randomly chosen request headers, so as to avoid being blocked or throttled; PIL, cv2, and skimage to decode and analyze the image data; and the brisque package together with custom functions to compute BRISQUE scores and other quality metrics. The logging module records the program's run-time log, and the pickle module persists the results to a file. "with" statements manage file handles, "del" statements release objects that are no longer needed, and a concurrent.futures thread pool runs the per-site fetch tasks concurrently. Together, these techniques add up to an efficient, stable, and extensible large-scale image scraper.