Scraping Taobao search-page data with Python + Selenium and multiprocessing

1. Functionality

  • For each given keyword, search Taobao for matching products, then scrape each product's information from the search results (title, price, sales count, seller location, etc.) and store it in MongoDB, using multiple processes to speed up the crawl; see the document sketch below.
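
A hypothetical example of the per-product document this crawler writes to MongoDB: the field names follow the parsing code in section 3, while every value here is invented.

# Example of one stored document (all values are made up)
{
    "_id": "563223696687",            # same value as dataNid
    "dataNid": "563223696687",        # the product's unique id on the page
    "taobaoCategory": "50025705",
    "rank": "3",                      # position within the result page
    "imgSrc": "//g-search1.alicdn.com/img/bao/uploaded/...",
    "title": "动漫周边 手办模型",
    "detailUrl": "//item.taobao.com/item.htm?id=...",
    "shopName": "某某动漫店",
    "shopHref": "//store.taobao.com/...",
    "shopID": "123456789",
    "location": "上海",
    "dealCount": 87,                  # parsed from text like "87人付款"
    "price": 64.32,                   # parsed from text like "¥64.32"
    "item_type": "productMainInfo",
    "keyWord": "动漫",
    "page": 1,
    "markClawerTools": 1,
    "reverse1": 1,
    "categorySearchWords": "动漫周边"
}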

2. Environment

  • OS: Windows 7
  • MongoDB 3.4.6
  • Python 3.6.1
  • IDE: PyCharm
  • Chrome browser installed (63.0.3239.132, official build, 32-bit)
  • selenium 3.7.0
  • chromedriver v2.34 configured on the PATH (a quick version check is sketched below)
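
A minimal snippet to confirm the local versions match this list. It only uses standard attributes of these packages; the capability key for the Chrome version differs between driver generations, which is why the comment notes the newer name.

# Quick environment check: print the interpreter and library versions
import sys
import selenium
import pymongo
from selenium import webdriver

print(f"python   = {sys.version.split()[0]}")   # expect 3.6.1
print(f"selenium = {selenium.__version__}")     # expect 3.7.0
print(f"pymongo  = {pymongo.version}")

browser = webdriver.Chrome()  # requires chromedriver on the PATH
print(f"chrome   = {browser.capabilities.get('version')}")  # 'browserVersion' on newer drivers
browser.quit()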

3. Code


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

import pymongo
import time
import datetime

import re
import multiprocessing

import lxml.html
import lxml.etree

# ---------- 1. Configuration ----------
# Keyword map: display label -> [reserved flag, search phrase typed into Taobao's search box]
keySearchWords = {
    "动漫": [1, "动漫周边"],
    "水果": [1, "水果沙拉"],
}

# Initialize the MongoDB connection
client = pymongo.MongoClient("127.0.0.1:27017")
db = client["taobao"]
db_coll = db["productInfo"]

# Maximum number of retries for a single page
retryMax = 8

chrome_options = webdriver.ChromeOptions()
# Block image loading to speed up page crawling (uncomment the two lines below to enable)
# prefs = {"profile.managed_default_content_settings.images": 2}
# chrome_options.add_experimental_option("prefs", prefs)

# Enable headless mode: no browser window, faster and more stable (uncomment to enable)
# chrome_options.add_argument('--headless')
# chrome_options.add_argument('--disable-gpu')


# ---------- 2. Page parsing ----------
# Extract the main info for each product on the list page
def getProductMainInfo(htmlSource):
    try:
        resultTree = lxml.etree.HTML(htmlSource)
        # fix_html = lxml.html.tostring(resultTree, pretty_print=True)
        # print(f"htmlSource = {htmlSource}")

        productLst = resultTree.xpath("//div[@class='m-itemlist']//div[contains(@class, 'J_MouserOnverReq')]")
        print(f"productLst = {productLst}")
        productInfoLst = []
        for product in productLst:
            productInfo = {}

            # Unique product identifier
            dataNid = product.xpath(".//div[contains(@class,'ctx-box')]//div[contains(@class, 'title')]/a/@data-nid")
            if len(dataNid) > 0:
                productInfo['dataNid'] = dataNid[0]
            else:
                productInfo['dataNid'] = 0
            productInfo['_id'] = productInfo['dataNid']

            taobaoCategory = product.xpath("@data-category")
            if len(taobaoCategory) > 0:
                productInfo['taobaoCategory'] = taobaoCategory[0]
            else:
                productInfo['taobaoCategory'] = 'unknown'

            rank = product.xpath("@data-index")
            if len(rank) > 0:
                productInfo['rank'] = rank[0]
            else:
                productInfo['rank'] = 0

            imgSrc = product.xpath(".//div[@class='pic']/a//img/@src")
            if len(imgSrc) > 0:
                productInfo['imgSrc'] = imgSrc[0]
            else:
                productInfo['imgSrc'] = ''

            title = product.xpath(".//div[contains(@class,'ctx-box')]//div[contains(@class, 'title')]/a/text()")
            productInfo['title'] = ''
            if len(title) > 0:
                for elem in title:
                    productInfo['title'] += elem.strip()

            detailUrl = product.xpath(".//div[contains(@class,'title')]//a/@href")
            if len(detailUrl) > 0:
                productInfo['detailUrl'] = detailUrl[0]
            else:
                productInfo['detailUrl'] = ''

            shopName = product.xpath(".//div[contains(@class,'ctx-box')]//div[@class='shop']/a[contains(@class,'shopname')]/span/text()")
            if len(shopName) > 0:
                productInfo['shopName'] = shopName[-1]
            else:
                productInfo['shopName'] = ''

            shopHref = product.xpath(".//div[contains(@class,'ctx-box')]//div[@class='shop']/a[contains(@class,'shopname')]/@href")
            if len(shopHref) > 0:
                productInfo['shopHref'] = shopHref[0]
            else:
                productInfo['shopHref'] = ''
            print(f"shopHref = {shopHref}")

            shopID = product.xpath(".//div[contains(@class,'ctx-box')]//div[@class='shop']/a[contains(@class,'shopname')]/@data-userid")
            if len(shopID) > 0:
                productInfo['shopID'] = shopID[0]
            else:
                productInfo['shopID'] = ''
            print(f"shopID = {shopID}")

            location = product.xpath(".//div[contains(@class,'clearfix')]//div[@class='location']/text()")
            if len(location) > 0:
                productInfo['location'] = location[0]
            else:
                productInfo['location'] = ''

            # Sales count, e.g. "87人付款" ("87 paid"); default to 0 so the field is always set
            dealCountRes = product.xpath(".//div[contains(@class,'ctx-box')]//div[@class='deal-cnt']/text()")
            productInfo['dealCount'] = 0
            if len(dealCountRes) > 0 and dealCountRes[0] != '':
                dealCountRe = re.search(r'(\d+)', dealCountRes[0])
                if dealCountRe:
                    productInfo['dealCount'] = int(dealCountRe.group(1))

            # Price; an earlier xpath attempt is kept commented out for reference
            # productInfo['price'] = product.xpath(".//span[contains(@class,'priceNum')]//text()")
            price = product.xpath(".//div[contains(@class,'price')]//strong/text()")
            if len(price) > 0:
                price = price[0]
            else:
                price = ''
            if price != '':
                # Parse the price string, e.g. '¥64.32'
                priceRe = re.search(r'([\d\.]+)', price)
                priceFloat = float(priceRe.group(1)) if priceRe else 0.0
                productInfo['price'] = priceFloat
            else:
                productInfo['price'] = 0.0

            print(f"productInfo = {productInfo}")
            productInfoLst.append(productInfo)
        return productInfoLst
    except Exception as e:
        print(f"解析List 商品信息出错,e = {e}")
        return []


# Step 1: open the Taobao home page and search for the keyword
def searchKey(browser, wait, keyWord, categorySearchWords, retryCount):
    print(f"searchKey: enter, keyWord = {keyWord}, categorySearchWords = {categorySearchWords}, retryCount = {retryCount}")
    retryCount += 1
    if retryCount > retryMax:
        return (False, 0, keyWord)
    mainUrl = "https://www.taobao.com/"
    print(f"searchKey: 访问taobao主页, 进行搜索. mainUrl = {mainUrl}")
    browser.get(mainUrl)

    # Try the search
    try:
        # Wait for the search box to appear
        input = wait.until(
            EC.presence_of_element_located((By.XPATH, "//input[@class='search-combobox-input']"))
        )
    except Exception as e:
        # The search box never appeared, so the page did not load properly; retry
        print(f"searchKey: search box not loaded yet, reloading the home page. retryCount = {retryCount}, url = {mainUrl}, e = {e}")
        # Return the recursive call so the retry's result propagates to the caller
        return searchKey(browser, wait, keyWord, categorySearchWords, retryCount)
    else:
        try:
            # Re-locate the search box in case the page changed meanwhile and the element went stale
            input = browser.find_element_by_xpath("//input[@class='search-combobox-input']")
            # Type the search phrase
            time.sleep(3)
            input.clear()
            input.send_keys(categorySearchWords)
            # Press Enter
            input.send_keys(Keys.RETURN)
            print(f"searchKey: pressed the return key.")
            time.sleep(3)

            # Wait for the search results to appear
            searchRes = wait.until(
                EC.presence_of_element_located((By.XPATH, "//div[@class='m-itemlist']"))
            )
            print(f"searchKey: searchSuccess, searchRes = {searchRes}")
        except Exception as e:
            print(f"searchKey: search results not loaded yet, reloading the home page. retryCount = {retryCount}, url = {mainUrl}, e = {e}")
            return searchKey(browser, wait, keyWord, categorySearchWords, retryCount)
        else:
            # The results page loaded; look for the total page count
            try:
                # Read the total number of result pages
                print(f"searchKey: results appeared, looking for the total page count")
                totalPage = 0
                print(f"searchKey: totalPageInit = {totalPage}")
                # The pager text looks like "共 100 页," ("100 pages in total")
                totalRes = wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@class='total']"))
                )
                # print(f"totalRes.text = {totalRes.text}")
                totalRe = re.search(r"(\d+)", totalRes.text)
                # print(f"totalRe = {totalRe}")
                totalPage = int(totalRe.group(1))
                print(f"searchKey: totalPage = {totalPage}")
                return (True, totalPage, keyWord)
            except Exception as e:
                print(f"searchKey: the results fit on a single page. e = {e}")
                return (True, 1, keyWord)
            finally:
                # This block runs before the return statements above complete.
                # Reference (Chinese): "Python :浅析 return 和 finally 共同挖的坑"
                # http://python.jobbole.com/88408/
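                # A minimal illustration of that return/finally interaction (not part of the crawler):
                #   def f():
                #       try:
                #           return 1
                #       finally:
                #           print("finally runs before the return completes")
                #   f()  # prints the message, then returns 1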
                try:
                    print(f"searchKey: 取第一页的数据出来,进行存储")
                    # 解析页面内容:
                    if browser.page_source:
                        productInfoLst = getProductMainInfo(browser.page_source)
                        for product in productInfoLst:
                            product['item_type'] = "productMainInfo"
                            product['keyWord'] = keyWord
                            product['page'] = 1
                            product['markClawerTools'] = 1
                            product['reverse1'] = 1
                            product['categorySearchWords'] = categorySearchWords
                            try:
                                insertRes = db_coll.insert_one(product)
                                print(f"searchKey: insertRes = {insertRes.inserted_id}")
                            except Exception as e:
                                print(f"searchKey: insert_one Exception = {e}")
                except Exception as e:
                    print(f"searchKey: exception while storing the first page of data. Exception = {e}")


# Type a page number and click the jump button to turn the page
def getNextListPage(browser, wait, pageNum, retryCount):
    print(f"getNextListPage: enter, pageNum = {pageNum}, retryCount = {retryCount}")
    retryCount += 1
    if retryCount > retryMax:
        return (False, [])

    # First, check that the list page loaded by waiting for the page-number input box at the bottom
    try:
        input = wait.until(
            EC.presence_of_element_located((By.XPATH, "//div[@class='form']/input[@class='input J_Input']"))
        )
    except Exception as e:
        # Refresh the current page, then check again
        print(f"getNextListPage: page-number input box not loaded yet, refreshing the page and retrying. retryCount = {retryCount}")
        browser.refresh()
        time.sleep(2)
        # Return the recursive call so the retry's result propagates to the caller
        return getNextListPage(browser, wait, pageNum, retryCount)
    else:
        # Type the page number to turn the page
        try:
            # Re-locate the input box in case the previously found element went stale
            input = browser.find_element_by_xpath("//div[@class='form']/input[@class='input J_Input']")
            input.clear()
            input.send_keys(pageNum)
            submit = browser.find_element_by_xpath("//div[@class='form']/span[@class='btn J_Submit']")
            submit.click()
            # Check that the highlighted page number matches the one we typed, i.e. the page turn succeeded
            activePage = wait.until(
                EC.text_to_be_present_in_element((By.XPATH, "//ul[@class='items']/li[@class='item active']/span"), str(pageNum))
            )
            # Check that the product list loaded after the page turn
            allProducts = wait.until(
                EC.presence_of_element_located((By.XPATH, "//div[@class='m-itemlist']"))
            )
            # Parse the page content
            if browser.page_source:
                productInfoLst = getProductMainInfo(browser.page_source)
                return (True, productInfoLst)
            else:
                # No page source means the page failed to load; refresh and retry
                browser.refresh()
                time.sleep(2)
                return getNextListPage(browser, wait, pageNum, retryCount)
        except Exception as e:
            print(f"getNextListPage: could not parse the product list on this page, refreshing. pageNum = {pageNum}, retryCount = {retryCount}, e = {e}")
            browser.refresh()
            time.sleep(2)
            return getNextListPage(browser, wait, pageNum, retryCount)


# Browser worker, the unit of execution for each process: the main Taobao crawling flow
def chromeProcessPer(queue, lock, mark):
    clawerStartTime = datetime.datetime.now()
    print(f"chromeProcessPer enter: clawerMark = {mark}, time = {clawerStartTime}")

    # Start the browser and set up the explicit wait
    browser = webdriver.Chrome(chrome_options = chrome_options)
    browser.set_window_size(900, 900)  # depends on desktop resolution; mainly so captcha screenshots are fully visible
    wait = WebDriverWait(browser, timeout = 30)

    # Pull search keywords off the queue
    while not queue.empty():
        keyWordInfo = queue.get()

        # Lock to avoid scrambled output: if several processes call print at the
        # same moment, their lines get interleaved
        lock.acquire()
        # e.g. keyWordInProcess = ('动漫', [1, '动漫周边']), markProcess = 1
        print(f"keyWordInProcess = {keyWordInfo}, markProcess = {mark}")
        lock.release()
        time.sleep(1)

        # Search using this category's search phrase
        searchRes = searchKey(browser, wait, keyWordInfo[0], keyWordInfo[1][1], 0)
        print(f"main: searchRes = {searchRes}, clawerMark = {mark}")
        if searchRes is not None and searchRes[0]:
            # Store from page 2 onward: page 1 is already handled inside searchKey, and a
            # single-page result has no page-number input box, so the paging flow below would not apply
            if searchRes[1] > 1:
                for page in range(2, searchRes[1] + 1):
                    # time.sleep(5)  # adjust the delay to match Taobao's anti-crawling policy
                    listPageRes = getNextListPage(browser, wait, page, 0)
                    print(f"chromeProcesser: keyWord = {keyWordInfo}, page = {page}, listPageRes = {listPageRes},  mark = {mark}")
                    if listPageRes is not None and listPageRes[0]:
                        for product in listPageRes[1]:
                            product['item_type'] = "productMainInfo"
                            product['keyWord'] = keyWordInfo[0]
                            product['page'] = page
                            product['markClawerTools'] = 1
                            product['reverse1'] = 1
                            product['categorySearchWords'] = keyWordInfo[1][1]
                            try:
                                insertRes = db_coll.insert_one(product)
                                print(f"insertRes = {insertRes.inserted_id}")
                            except Exception as e:
                                print(f"insert_one Exception = {e}")
            else:
                print(f"main: keyWord = {keyWordInfo}, totalPage = {searchRes[1]}, clawerMark = {mark}")
    clawerEndTime = datetime.datetime.now()
    # Always quit the browser
    browser.quit()
    print(f"chromeProcessPer end: clawerMark = {mark}, time = {clawerEndTime}, timeUsed = {clawerEndTime - clawerStartTime}")


if __name__ == "__main__":
    mainStartTime = datetime.datetime.now()
    print(f"main: taobao mainStartTime = {mainStartTime}")

    lock = multiprocessing.Lock()       # process lock
    queue = multiprocessing.Queue(300)  # queue of all initial keywords, capacity 300

    for keyWord, keyWordValue in keySearchWords.items():
        print(f"keyWord = {keyWord}, keyWordValue = {keyWordValue}")
        # If the queue is sized too small, put() blocks until space frees up
        # def put(self, obj, block=True, timeout=None):
        queue.put((keyWord, keyWordValue))
    print(f"queueBefore = {queue}")

    getKeyProcessLst = []
    # Spawn two worker processes and start them
    for i in range(2):
        # args must be types multiprocessing can share across processes (here a Queue, a Lock and an int), not custom objects
        process = multiprocessing.Process(target = chromeProcessPer, args = (queue, lock, i))
        process.start()
        getKeyProcessLst.append(process)

    # join() waits for each worker process to finish. Without these joins the parent
    # process would not block: it would start the children and immediately run the final prints
    for p in getKeyProcessLst:
        p.join()

    print(f"queueAfter = {queue}")
    queue.close()
    print(f"all queue used.")
    mainEndTime = datetime.datetime.now()
    print(f"### timeUsedTotal = {mainEndTime - mainStartTime}, mainStartTime = {mainStartTime}, mainEndTime = {mainEndTime}")


4. Results

[Screenshot 1]

[Screenshot 2]
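
To spot-check what landed in MongoDB, a short query sketch can be run against the same collection; connection settings are copied from the code above, and count() is the pymongo 3.4-era API.

# Quick sanity check of the stored documents
import pymongo

client = pymongo.MongoClient("127.0.0.1:27017")
coll = client["taobao"]["productInfo"]

print(f"total documents: {coll.count()}")  # newer pymongo versions use count_documents({})
for doc in coll.find({"keyWord": "动漫"}).sort("price", pymongo.DESCENDING).limit(3):
    print(doc["title"], doc["price"], doc["dealCount"])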

  • Notes:
  • More processes put more load on the machine; find a count that suits your hardware, and pay close attention to every comment in the code above.
  • At the moment Taobao's anti-crawling measures on this page are mild: no login required, no rate limiting, and no captchas or ad pop-ups. If new countermeasures appear, refer to the earlier related posts and add the corresponding handling; one basic precaution, randomized delays, is sketched below.
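
As one example of such a mechanism, here is a minimal sketch of randomized per-page delays; the bounds are arbitrary assumptions, not values derived from Taobao's actual limits.

# Hypothetical throttling helper: sleep a random interval between page loads so
# requests do not arrive at a fixed, bot-like rhythm (bounds are assumptions)
import random
import time

def politeSleep(minSeconds = 3, maxSeconds = 8):
    delay = random.uniform(minSeconds, maxSeconds)
    print(f"politeSleep: waiting {delay:.1f}s before the next page")
    time.sleep(delay)

# Usage: call politeSleep() inside chromeProcessPer's page loop, in place of the
# commented-out time.sleep(5) before each getNextListPage call.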
