java+selenium

selenium

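  • Preface
  • Introduction
  • 1. Installing the driver
  • 2. A simple example to get started with crawling
  • 3. Selenium API
      • 3-1 Creating a controllable browser object
      • 3-2 Opening a page
      • 3-3 Locating elements
            • Locating by id
            • Locating by name
            • Locating by class
            • Locating by link text
            • Locating by XPath
      • 3-4 Common methods
      • Example 1: Logging in to QQ Mail
      • 3-5 Advanced Selenium
            • Mouse
            • Switching windows
            • Calling JavaScript
            • ChromeOptions: browser creation parameters
            • Browser settings
            • Multithreading example
      • Case study: crawling Fangtianxia (房天下) price trend charts
      • Case study: crawling Lianjia (链家) community prices
  • Notes
  • Afterword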

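Preface

This article is for learning purposes only!
Do not use it for anything illegal or against the rules; you alone bear the responsibility.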

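Introduction

Selenium is one of the most widely used open-source suites for automated Web UI (user interface) testing.
When integrated with Java, it essentially lets Java code call a browser driver to simulate manual user actions.
Selenium supports several browsers; this article uses Google Chrome as the example.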

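1. Installing the driver

The Selenium driver can be downloaded from two places; either one works.
① First check the browser version by entering chrome://settings/ in the browser.
② Then download the matching version from one of the sites below. If no exactly matching version exists, choose the largest available version within the same major version. For example, the browser in this article is 104.0.5112.102, so the candidates are the versions starting with 104, and the best choice is the largest 104 version.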

http://chromedriver.storage.googleapis.com/index.html
http://npm.taobao.org/mirrors/chromedriver/
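
The version-matching rule described above can be sketched in plain Java. This is only an illustration under assumptions: the class name DriverVersionPicker and the list of available driver versions are made up here, and in practice the list would be read off one of the download pages.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Selenium: picks a driver version for a browser version.
final class DriverVersionPicker {

    // Splits a dotted version such as "104.0.5112.102" into its numeric parts.
    static int[] parse(String version) {
        return Arrays.stream(version.trim().split("\\."))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    // Orders versions numerically, part by part (so 5112.79 is larger than 5112.29).
    static int compare(String a, String b) {
        return Arrays.compare(parse(a), parse(b));
    }

    // Returns the exact match if present, otherwise the largest version with the same major number.
    static Optional<String> pick(String browserVersion, List<String> available) {
        if (available.contains(browserVersion)) {
            return Optional.of(browserVersion);
        }
        int major = parse(browserVersion)[0];
        return available.stream()
                .filter(v -> parse(v)[0] == major)
                .max(DriverVersionPicker::compare);
    }

    public static void main(String[] args) {
        // Assumed list of downloadable driver versions, for illustration only.
        List<String> available = List.of(
                "103.0.5060.134", "104.0.5112.29", "104.0.5112.79", "105.0.5195.19");
        System.out.println(pick("104.0.5112.102", available).orElse("no matching driver"));
    }
}
```

With the assumed list, the browser version 104.0.5112.102 has no exact match, so the sketch falls back to 104.0.5112.79, the largest 104 entry.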


2. A simple example to get started with crawling

package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.concurrent.TimeUnit;

public class BaiduDemo {

    public static void main(String[] args) throws Exception {
        // D://chromedriver.exe is an example; use the actual path where your driver is stored
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        try {
            // Maximize the window
            driver.manage().window().maximize();
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            Thread.sleep(1000);
            // Open the Baidu home page
            driver.get("https://www.baidu.com/");
            // Find the search input box
            WebElement text = driver.findElement(By.id("kw"));
            // Find the search button ("百度一下")
            WebElement button = driver.findElement(By.id("su"));
            text.sendKeys("123");
            button.click();
        } finally {
            sleep(10000);
            driver.quit();
        }
    }

    // Sleeps for the given number of milliseconds
    public static void sleep(int time) {
        try {
            Thread.sleep(time);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

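With just a few lines of code we opened a web page and searched for '123'. Next, let's look at the commonly used APIs. A general understanding is enough; look up the details whenever you need them.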

3. Selenium API

3-1 Creating a controllable browser object

// Remember to change this to the actual location of your driver
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
WebDriver driver = new ChromeDriver();

3-2 Opening a page

driver.get("https://www.baidu.com/");

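3-3 Locating elements

Note: when several elements on the page share the same attribute value, findElement returns only the first match, so use an XPath expression to pick out the specific element you want.

Locating by id
driver.findElement(By.id("pnum"));
Locating by name
driver.findElement(By.name("name"));
Locating by class
driver.findElement(By.className("pgo"));
Locating by link text
driver.findElement(By.linkText("link"));
Locating by XPath
driver.findElement(By.xpath("//div[@id='1']/div/div/h3/a[1]"));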
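
To see how a positional XPath predicate singles out one element among several that share the same attribute, the expression can be tried outside the browser. The sketch below uses the JDK's built-in XPath engine on a small made-up XML fragment instead of Selenium; the markup and the class name XPathIndexDemo are assumptions for illustration. With Selenium, the same expression string would be passed to By.xpath.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustration only: evaluates XPath expressions against a tiny in-memory document.
final class XPathIndexDemo {

    // Hypothetical markup: three links that share the same class attribute.
    static final String PAGE =
            "<div id='list'>"
          + "<a class='item'>first</a>"
          + "<a class='item'>second</a>"
          + "<a class='item'>third</a>"
          + "</div>";

    // Returns the text of the first node selected by the expression.
    static String textAt(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // XPath positions are 1-based, so [2] selects the second matching link.
        System.out.println(textAt("//div[@id='list']/a[@class='item'][2]"));
    }
}
```

Note that XPath indexes start at 1, which is why the loops in the case studies later in this article build expressions such as div[1], div[2], and so on.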

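3-4 Common methods

Method            Description
sendKeys()        Simulates typing the given text
clear()           Clears the input content
getText()         Gets the element's text
getAttribute()    Gets the value of the given attribute

OK, once you have mastered this part you can write a simple crawler. If you are interested, try the following example: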

Example 1: Logging in to QQ Mail

Requirement:

Log in to QQ Mail and open the inbox page.

The implementation is shown below.

package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import java.util.Objects;

public class QQEmaIlLoginDemo {
    public static void main(String[] args) throws InterruptedException {
        // Specify which driver to use; remember to replace this with your own path
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        try {
            Thread.sleep(1000);
            driver.get("https://mail.qq.com/");
            driver.switchTo().frame("login_frame");
            WebElement username = driver.findElement(By.id("u"));
            WebElement password = driver.findElement(By.id("p"));
            username.sendKeys("[email protected]");
            password.sendKeys("xxxxxx");
            WebElement submit = driver.findElement(By.id("login_button"));
            submit.click();
            Thread.sleep(1000);
            driver.switchTo().defaultContent();
            WebElement element = validElement("//a[@id='folder_1']", driver);
            if (Objects.nonNull(element)){
                // Reuse the element that was already located instead of looking it up again
                element.click();
            }else{
                System.out.println("Failed to open the inbox");
            }
        } finally {
            Thread.sleep(10000);
            driver.close();
            driver.quit();
        }
    }
    public static WebElement validElement(String str, WebDriver driver) {
        try {
            WebElement element = driver.findElement(By.xpath(str));
            return element;
        } catch (Exception e) {
            System.out.println("Element does not exist: " + str);
        }
        return null;
    }
}

The above are only simple examples. What about mouse actions and navigation across multiple pages? Don't worry, that comes next.

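3-5 Advanced Selenium

Mouse

Note: a chain of mouse actions must end with perform(); without it the actions do not take effect.

Method             Description
click()            Left click
contextClick()     Right click
doubleClick()      Double click
dragAndDrop()      Drag and drop
moveToElement()    Hover the mouse over an element
perform()          Executes all actions queued in the Actions object

Switching windows

When clicking a page element makes the browser open a new window, you need to switch to the newest page.

driver.switchTo().window(frontHandle); // frontHandle is a window handle (a String); obtain it with driver.getWindowHandle() and keep it for later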
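
Since window handles are plain strings, finding the newly opened window comes down to comparing the set of handles before the click with the set after it. A minimal sketch of that comparison, without a browser: the class name NewWindowFinder and the handle values are made up, and with Selenium the two sets would come from driver.getWindowHandles().

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Illustration only: locates the handle that appeared after an action opened a new window.
final class NewWindowFinder {

    // Returns a handle that is present in 'after' but not in 'before', if there is one.
    static Optional<String> findNewHandle(Set<String> before, Set<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        return added.stream().findFirst();
    }

    public static void main(String[] args) {
        // Made-up handle values standing in for real window handles.
        Set<String> before = Set.of("window-A");
        Set<String> after = new LinkedHashSet<>(List.of("window-A", "window-B"));
        findNewHandle(before, after)
                .ifPresent(handle -> System.out.println("switch to " + handle));
    }
}
```

In a real crawler the returned handle would be passed to driver.switchTo().window(...), and the original handle kept so that you can switch back after closing the new window.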

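Calling JavaScript

Simulate scrolling the page:
driver.executeScript("window.scrollTo(0,300)");

When a page element cannot be clicked normally (for example, because of anti-crawling interception):
driver.executeScript("arguments[0].click();", element); // element is the button or other element to click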

ChromeOptions: browser creation parameters
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);  // Eager page load strategy
        chromeOptions.addArguments("--incognito"); // Incognito (private window) mode
        chromeOptions.addArguments("--blink-settings=imagesEnabled=false"); // Do not load images
        chromeOptions.addArguments("--headless"); // Headless mode
        chromeOptions.addArguments("--no-sandbox"); // Disable the sandbox
        chromeOptions.addArguments("--disable-gpu"); // Disable GPU acceleration
        chromeOptions.addArguments("--proxy-server=" + proxy); // Use a proxy
        ChromeDriver driver = new ChromeDriver(chromeOptions);
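Browser settings
// Set the global (implicit) wait time
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// Maximize the window
driver.manage().window().maximize();
// Hide the Selenium flag (navigator.webdriver)
String js1="Object.defineProperties(navigator, {webdriver:{get:()=>undefined}});";
((ChromeDriver) driver).executeScript(js1);
// Set the User-Agent request header (add it to chromeOptions before the driver is created)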
 String[] arr = {"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
                "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"};
chromeOptions.addArguments("--user-agent=" + arr[new Random().nextInt(arr.length)]);
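
The random User-Agent selection can be isolated into a small helper. This is a sketch under assumptions: the class name UserAgentPicker is made up, only two of the User-Agent strings from the list above are repeated to keep it short, and the argument is built with Chrome's --user-agent= command-line switch. The random index is bounded by the list size rather than a hard-coded 7, so the list can change without touching the selection code.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustration only: builds a Chrome argument with a randomly chosen User-Agent.
final class UserAgentPicker {

    // Two of the User-Agent strings listed in this article.
    static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15");

    // Picks one entry at random; the bound follows the list size automatically.
    static String randomUserAgentArgument() {
        int index = ThreadLocalRandom.current().nextInt(USER_AGENTS.size());
        return "--user-agent=" + USER_AGENTS.get(index);
    }

    public static void main(String[] args) {
        System.out.println(randomUserAgentArgument());
    }
}
```

The returned string would be handed to chromeOptions.addArguments(...) before the ChromeDriver is constructed, since options added afterwards have no effect on an already running browser.
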
Multithreading example

While parsing the list pages, each task creates its own browser object to do the parsing.

  private void parsePagePre(SetOperations ops) {
      ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2,
            8, 30L,
            TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
        List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
        for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs) {
            pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
        }
    }
        
  private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
        ChromeDriver driver = getChromeDriver();
        driver.get(buildAreaUrlLj.getAreaUrl());
        // Business logic
    }
private ChromeDriver getChromeDriver() {
        String[] arr = {"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
                "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"};

        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
        chromeOptions.addArguments("--incognito");
        chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
        chromeOptions.addArguments("--headless");
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--disable-gpu");
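        // Use a proxy when one is available; nextProxy comes from the proxy service, as in the full Lianjia example later in this article
        String nextProxy = proxyService.getNextProxy();
        if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
            chromeOptions.addArguments("--proxy-server=" + nextProxy);
        }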
        HashMap<String, Object> map = new HashMap<>();
        map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
        map.put("webrtc.multiple_routes_enabled", false);
        map.put("webrtc.nonproxied_udp_enabled", false);
        chromeOptions.setExperimentalOption("prefs", map);
        Random random = new Random();
        chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        driver.manage().window().maximize();
        return driver;
    }
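
One detail of the pool configuration above is worth checking before relying on it. A ThreadPoolExecutor only creates threads beyond its core size when the work queue refuses a task, and an unbounded LinkedBlockingQueue never refuses one. The sketch below (the class name PoolSizeDemo is made up, and the blocking tasks merely stand in for page parsing) submits 20 tasks to a pool configured like the one in this example and reports how many worker threads were actually created.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustration only: shows how many threads a pool with an unbounded queue really uses.
final class PoolSizeDemo {

    // Submits 'tasks' blocking jobs and returns the number of worker threads the pool created.
    static int threadsCreated(int core, int max, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(core, max, 30L,
                TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // stands in for a long-running page parse
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();     // queued tasks still run, then the workers exit
        }
    }

    public static void main(String[] args) {
        System.out.println("worker threads: " + threadsCreated(2, 8, 20));
    }
}
```

With a core size of 2 the extra tasks wait in the queue, so at most two browsers run at once even though the maximum is 8. If more parallel browsers are wanted, one option is to raise the core size; a bounded queue is another, at the cost of having to handle rejected tasks.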

Case study: crawling Fangtianxia (房天下) price trend charts

package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.config.RestTemplateConfig;
import com.mengkeng.selenium_demo.entity.TkBuildingsPriceAjk;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.http.*;
import org.springframework.util.CollectionUtils;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import java.math.BigDecimal;
import java.util.*;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 *
 * Date: 2022-07-10 13:50
 * Description:
 */
@RestController
@RequestMapping("fang")
@Slf4j
public class FangtianxiaDemo {
    @Autowired
    private RedisTemplate redisTemplate;

    private static LinkedList<String> pages = new LinkedList<>();
    /**
     * Base URL
     */
    public static final String PRICE_URL = "https://pinggun.fang.com/RunChartNew/MakeChartData/";
    /**
     * Redis key that records the pages already processed
     */
    public static final String SKIP_URLS = "SKIP_URLS";
    /**
     * Success flag
     */
    public static String TEMP_FLAG = "fail";


    @RequestMapping("sync")
    public String sync() {
        while (!TEMP_FLAG.equals("success")) {
            System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
            ChromeOptions chromeOptions = new ChromeOptions();
            chromeOptions.addArguments("--headless");
            chromeOptions.addArguments("--no-sandbox");
            chromeOptions.addArguments("--disable-gpu");
            chromeOptions.addArguments("--disable-dev-shm-usage");
            WebDriver driver = new ChromeDriver(chromeOptions);
            driver.manage().window().maximize();
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            driver.get("https://esf.fang.com/housing/");
            sleep(2000);
            try {
                parseFTX(driver);
            } catch (Exception e) {
                try {
                    Thread.sleep(10000);
                } catch (InterruptedException interruptedException) {
                    interruptedException.printStackTrace();
                }
            } finally {
                sleep(10000);
                driver.quit();
            }
        }
        return "ok";
    }

    /**
     * Parse Fangtianxia
     */
    private void parseFTX(WebDriver driver) {
        SetOperations ops = redisTemplate.opsForSet();
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='qxName']/a"));
        // Districts
        for (int i = 2; i <= elements.size() - 3; i++) {

            WebElement element = driver.findElement(By.xpath("//div[@class='qxName']/a[" + i + "]"));
            element.click();
            sleep(800);
            // Business areas
            List<WebElement> elementsShangquan = driver.findElements(By.xpath("//p[@id='shangQuancontain']/a"));
            for (int sq = 2; sq <= elementsShangquan.size(); sq++) {

                WebElement elementsq = driver.findElement(By.xpath("//p[@id='shangQuancontain']/a[" + sq + "]"));
                String tempHref = elementsq.getAttribute("href");

//                if (ops.isMember(SKIP_URLS, tempHref)) {
//                    System.out.println("Skipping current link " + tempHref);
//                    continue;
//                }

                elementsq.click();
                parsePage(driver);
                ops.add(SKIP_URLS, tempHref);
                sleep(800);
            }
        }
        TEMP_FLAG = "success"; // A full pass completed normally; stop
    }

    /**
     * Parse the paginated list
     *
     * @param driver
     */
    private void parsePage(WebDriver driver) {
        // Pagination
        try {
            driver.findElement(By.className("txt")).getText();
        } catch (Exception e) {
            log.info("No data under this category, url: " + driver.getCurrentUrl());
            return;
        }
        String pageTotal = driver.findElement(By.className("txt")).getText().replaceAll("共", "").replaceAll("页", "");
        for (int page = 0; page < Integer.parseInt(pageTotal); page++) {

            List<WebElement> houseList = driver.findElements(By.xpath("//div[@class='houseList']/div"));

            for (int i = 1; i < houseList.size(); i++) {
                String communityName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).getText();
                String communityCode = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[2]")).getAttribute("projcode");
                String areaName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[2]/a[1]")).getText();


                // Open the detail page
                pages.addAll(driver.getWindowHandles());
                driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).click();
                sleepAndCutoverNewPage(800, driver);

                parseDetail(communityCode, communityName, areaName);

                driver.close();
                driver.switchTo().window(pages.getLast());
                sleep(1000);
            }
            if (page + 1 == Integer.parseInt(pageTotal)) {
                break;
            }
            String pageNow = driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).getAttribute("href");
            System.out.println("Next page: ------------" + pageNow + "----" + pageTotal);
            driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).click();
            sleep(600);

        }
    }

    /**
     * Parse the detail data
     *
     * @param communityCode
     * @param communityName
     * @param areaName
     */
    public void parseDetail(String communityCode, String communityName, String areaName) {
        HashMap<String, Object> map = new HashMap<>();
        map.put("newcode", communityCode);
        map.put("city", cnToUnicode("北京"));
        map.put("district", cnToUnicode(areaName));

        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON_UTF8);
        HttpEntity<String> entity = new HttpEntity<>(JSON.toJSONString(map), headers);
        RestTemplate restTemplate = null;
        try {
            restTemplate = new RestTemplate(RestTemplateConfig.generateHttpRequestFactory());
        } catch (Exception e) {
            e.printStackTrace();
        }
        ResponseEntity<String> stringResponseEntity = restTemplate.exchange(PRICE_URL, HttpMethod.POST, entity, String.class);
        Pattern compile = Pattern.compile(",(\\w+)]");
        Matcher matcher = compile.matcher(stringResponseEntity.getBody());

        Pattern compileMonth = Pattern.compile("年(\\w+)月");
        Matcher matcherMonth = compileMonth.matcher(stringResponseEntity.getBody());
        ArrayList<String> list = new ArrayList<>();
        while (matcherMonth.find()) {
            list.add(matcherMonth.group(1));
        }

        Pattern compileYear = Pattern.compile("&(\\w+)年");
        Matcher matcherYear = compileYear.matcher(stringResponseEntity.getBody());
        int year = 2020;
        while (matcherYear.find()) {
            year = Integer.parseInt(matcherYear.group(1));
        }
        ArrayList months = null;
        if (!CollectionUtils.isEmpty(list)) {
            months = getMonths(year, Integer.parseInt(list.get(0)), Integer.parseInt(list.get(1)));
        }

        while (matcher.find()) {
            TkBuildingsPriceAjk ajk = new TkBuildingsPriceAjk();
            ajk.setDataOrigin("fangtianxia");
            ajk.setCommunityCode(communityCode);
            ajk.setCommunity(communityName);
            ajk.setAvgPrice(new BigDecimal(matcher.group(1)));
            System.out.println("Persist=======================================" + ajk);
        }

    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Sleep, then switch to the newly opened window
     *
     * @param millis
     * @param driver
     * @return
     */
    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Convert a string to its Unicode escape form
     *
     * @param cn
     * @return
     */
    private static String cnToUnicode(String cn) {
        char[] chars = cn.toCharArray();
        StringBuilder returnStr = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            returnStr.append("\\u").append(Integer.toString(chars[i], 16));
        }
        return returnStr.toString();
    }
    /**
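     * Build the list of year-month values; only supports a range from the start year into the following year
     *
     * @param year  start year
     * @param start start month
     * @param end   end month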
     * @return
     */
    private static ArrayList getMonths(int year, int start, int end) {
        ArrayList res = new ArrayList();
        for (int i = start; i <= (end == 12 ? 12 : end + 12); i++) {
            if (i > 12) {
                res.add((year + 1) + String.format("%02d", i - 12));
            } else {
                res.add(year + String.format("%02d", i));
            }
        }
        return res;
    }
}
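
The cnToUnicode helper in the class above turns each character into a \uXXXX escape using Integer.toString(c, 16), which does not pad with leading zeros. For common Chinese characters the hex value already has four digits, so nothing changes, but a character below U+1000 would produce a shorter escape. A minimal padded variant is sketched below; the class and method names are made up, and whether the target endpoint requires padding is an assumption that has not been verified.

```java
// Illustration only: a zero-padded variant of the cnToUnicode helper.
final class UnicodeEscapeDemo {

    // Converts each UTF-16 char to a backslash-u escape with exactly four hex digits.
    static String toUnicodeEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(String.format("\\u%04x", (int) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeEscapes("北京")); // the city name used in parseDetail
        System.out.println(toUnicodeEscapes("A"));    // a character that needs zero padding
    }
}
```

For "北京" both versions give \u5317\u4eac; for "A" the padded version gives \u0041 where the original helper would give \u41.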

Case study: crawling Lianjia (链家) community prices

package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.entity.BuildAreaUrlLj;
import com.mengkeng.selenium_demo.entity.IdAndNamePO;
import com.mengkeng.selenium_demo.entity.TkBuildingsAreaInfolj;
import com.mengkeng.selenium_demo.entity.TkBuildingsMonthPriceLj;
import com.mengkeng.selenium_demo.mapper.BuildAreaUrlLjMapper;
import com.mengkeng.selenium_demo.service.ProxyService;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.time.DateFormatUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.HashOperations;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.*;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 *
 * Date: 2022-09-05 13:58
 * Description: residential communities
 */
@RestController
@RequestMapping("areaInfo")
@Slf4j
public class LianjiaAreaInfoDemo {
    @Autowired
    private StringRedisTemplate redisTemplate;
    @Autowired
    private BuildAreaUrlLjMapper buildAreaUrlLjMapper;
    @Autowired
    private ProxyService proxyService;

    public static final String SKIP_URLS = "SKIP_URLS_AREAINFO_LIANJIA";
    public static final String URLS = "URLS_AREAINFO_LIANJIA";
    public static final String AREA_INFO_COMMUNITY_CODE_LJ = "AREA_INFO_COMMUNITY_CODE_LJ";

    private static LinkedList<String> pages = new LinkedList<>();
    ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2,
            10, 30L,
            TimeUnit.SECONDS, new LinkedBlockingQueue<>());

    @RequestMapping("sync")
    public void sync() throws InterruptedException {
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        boolean flag = false;
        while (!flag) {
            try {
                ChromeDriver driver = getChromeDriver();
                SetOperations ops = redisTemplate.opsForSet();
                try {
                    getUrls(driver, ops);

                    parsePagePre(ops);
                 } finally {
                    sleep(1000);
                    driver.quit();
                }

            } catch (Exception e) {
                Thread.sleep(10000);
                continue;
            }
            flag = true;
        }
        System.out.println("Done");
    }

    /**
     * Create the browser object
     * @return
     */
    private ChromeDriver getChromeDriver() {
        String nextProxy = proxyService.getNextProxy();
        System.out.println("Current proxy IP: " + nextProxy);
        String[] arr = {"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
                "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"};

        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
        chromeOptions.addArguments("--incognito");
        chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
        chromeOptions.addArguments("--headless");
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--disable-gpu");
        if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
            chromeOptions.addArguments("--proxy-server=" + nextProxy);
        }
        HashMap<String, Object> map = new HashMap<>();
        map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
        map.put("webrtc.multiple_routes_enabled", false);
        map.put("webrtc.nonproxied_udp_enabled", false);
        chromeOptions.setExperimentalOption("prefs", map);
        Random random = new Random();
        chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        driver.manage().window().maximize();
        return driver;
    }

    private void parsePagePre(SetOperations ops) {
        HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();

        List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
        List<BuildAreaUrlLj> buildAreaUrlLjs1 = buildAreaUrlLjs.subList(1,3500);
        for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs1) {
            if (ops.isMember(SKIP_URLS, buildAreaUrlLj.getAreaUrl())) {
                System.out.println("Skipping current area " + buildAreaUrlLj.getCityName() + "-" + buildAreaUrlLj.getCountyName());
                continue;
            }
            pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
        }
    }

    /**
     * Parse the list pages
     * @param ops
     * @param opsForHash
     * @param buildAreaUrlLj
     */
    private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
        ChromeDriver driver = getChromeDriver();
        try {
            driver.get(buildAreaUrlLj.getAreaUrl());
            String windowHandlePage = driver.getWindowHandle();
            WebElement totalNumStr = validElement("//h2[@class='total fl']/span", driver);
            if (null != totalNumStr) {
                Integer total = Integer.valueOf(totalNumStr.getText());
                // Data is present
                if (total > 1) {
                    String pageData = driver.findElement(By.xpath("//div[@class='page-box house-lst-page-box']")).getAttribute("page-data");
                    Integer pageNumStr = Integer.valueOf(JSON.parseObject(pageData).getString("totalPage"));
                    System.out.println("Pages in current area: " + pageNumStr + "---" + buildAreaUrlLj.getAreaUrl());
                    for (int x = 1; x <= pageNumStr; x++) {
                        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        for (int i = 0; i < elements.size(); i++) {
                            WebElement item = elements.get(i);
                            String code = "";
                            Pattern compile1 = Pattern.compile("xiaoqu/(\\w+)/");
                            Matcher matcher1 = compile1.matcher(item.getAttribute("href"));
                            while (matcher1.find()) {
                                code = matcher1.group(1);
                            }
                            driver.executeScript("arguments[0].click();", item);
                            sleepAndCutoverNewPage(300, driver);

                            // Skip the detail parsing if the code is already recorded in Redis
                            if (!opsForHash.hasKey(AREA_INFO_COMMUNITY_CODE_LJ, code)) {
                                parseDetail(driver, code, buildAreaUrlLj, opsForHash);
                            } else {
                                System.out.println("Code already exists in Redis: " + code);
                                // Update
                                //                                new  TkBuildingsMonthPriceLj();
                            }


                            driver.close();
                            driver.switchTo().window(windowHandlePage);
                            sleep(200);
                            elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        }
                        if (x != pageNumStr) {
                            String nextPage = buildAreaUrlLj.getAreaUrl() + "pg" + (x + 1) + "/";
                            driver.get(nextPage);
                            System.out.println("Next page: " + nextPage);
                            sleep(200);
                        }
                    }
                }
            }
            ops.add(SKIP_URLS, buildAreaUrlLj.getAreaUrl());
        } catch (NumberFormatException e) {
            throw new RuntimeException("Exception in worker thread: " + e.getMessage(), e);
        }finally {
            driver.quit();
        }

    }

    /**
     * Parse the detail page
     * @param driver
     * @param communityCode
     * @param buildAreaUrlLj
     * @param opsForHash
     */
    private void parseDetail(ChromeDriver driver, String communityCode, BuildAreaUrlLj buildAreaUrlLj, HashOperations<String, Object, Object> opsForHash) {
        LocalDateTime now1 = LocalDateTime.now();
        if (null != validElement("//span[@class='xiaoquUnitPrice']", driver)) {
            TkBuildingsMonthPriceLj lj = new TkBuildingsMonthPriceLj();
            lj.setCommunityCode(communityCode);
            String year = String.valueOf(LocalDate.now().getYear());
            if (driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().equals("挂牌均价")){
                lj.setYearmonth(DateFormatUtils.format(new Date(),"yyyyMM"));
            }else{
                String monthStr = driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().replace("月参考均价", "");
                String month = String.format("%02d", Integer.parseInt(monthStr));
                lj.setYearmonth(year + month);
            }
            lj.setAvgPrice(Integer.valueOf(driver.findElement(By.className("xiaoquUnitPrice")).getText()));
            lj.setGenerateType("0");
            lj.setCreateBy("1");
            lj.setCreateDate(new Date());
            lj.setUpdateBy("1");
            lj.setUpdateDate(new Date());
            lj.setDelFlag("0");
            System.out.println("Persist price: "+lj);
        }
        LocalDateTime now2 = LocalDateTime.now();

        TkBuildingsAreaInfolj infolj = new TkBuildingsAreaInfolj();
        infolj.setDataOrigin("lianjia");
        infolj.setGenerateType("0");
        infolj.setProvince(buildAreaUrlLj.getProvinceId());
        infolj.setCity(buildAreaUrlLj.getCityId());
        infolj.setArea(buildAreaUrlLj.getCountyId());
        infolj.setCommunity(validElement("//h1[@class='detailTitle']", driver) == null ?
                "" : driver.findElement(By.xpath("//h1[@class='detailTitle']")).getText());
        infolj.setCommunityCode(communityCode);
        infolj.setBuildingYear(validElement("//span[text()='建筑年代']", driver) == null ?
                "" : driver.findElement(By.xpath("//span[text()='建筑年代']/parent::div/span[2]")).getText());
        infolj.setBuildingType(validElement("//span[text()='建筑类型']", driver) == null ?
                "" : driver.findElement(By.xpath("//span[text()='建筑类型']/parent::div/span[2]")).getText());
        infolj.setManageCost(validElement("//span[text()='物业费用']", driver) == null ?
                "" : driver.findElement(By.xpath("//span[text()='物业费用']/parent::div/span[2]")).getText());
        infolj.setManageCompany(validElement("//span[text()='物业公司']", driver) == null ?
                "" : driver.findElement(By.xpath("//span[text()='物业公司']/parent::div/span[2]")).getText());
        infolj.setManageDevlop(validElement("//span[text()='开发商']", driver) == null ?
                "" : driver.findElement(By.xpath("//span[text()='开发商']/parent::div/span[2]")).getText());
        infolj.setBuildingCount(validElement("//span[text()='楼栋总数']", driver) == null ?
                "" : driver.findElement(By.xpath("//span[text()='楼栋总数']/parent::div/span[2]")).getText());
        infolj.setRoomCount(validElement("//span[text()='房屋总数']", driver) == null ?
                "" : driver.findElement(By.xpath("//span[text()='房屋总数']/parent::div/span[2]")).getText());
        infolj.setCreateBy("1");
        infolj.setCreateDate(new Date());
        infolj.setUpdateBy("1");
        infolj.setUpdateDate(new Date());
        infolj.setDelFlag("0");
        System.out.println("Persisting community: " + infolj);
    }

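    /**
     * Crawls the city list and collects the listing links for each area.
     * @param driver browser driver used to navigate the pages
     * @param ops    set operations used to record areas that have already been crawled
     */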
    private void getUrls(ChromeDriver driver, SetOperations ops) {
        driver.get("https://www.lianjia.com/city/");

        int count = 0;
        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        for (int i = 0; i < elements.size(); i++) {
            WebElement element = elements.get(i);
            String provinceName = element.findElement(By.xpath("./parent::li/parent::ul/parent::div/div")).getText();
            String areaName = element.getText();
            Boolean memberFlag = ops.isMember(URLS, areaName);
            if (memberFlag) {
                System.out.println("Area already crawled, skipping: " + areaName);
                continue;
            }

            driver.executeScript("arguments[0].click();", element);
            String frontPage = driver.getWindowHandle();
            WebElement ershoufang = null;
            try {
                ershoufang = driver.findElement(By.linkText("小区"));
            } catch (Exception e) {
                ops.add(URLS, areaName);

                sleep(200);
                System.out.println(areaName + ": no community link found");
                driver.get("https://www.lianjia.com/city/");
                elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
                continue;
            }
            driver.executeScript("arguments[0].click();", ershoufang);
            sleepAndCutoverNewPage(500, driver);
            List<WebElement> citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            citys.forEach(e -> System.out.println("City level: " + e.getText() + " == " + e.getAttribute("href")));

            for (int j = 0; j < citys.size(); j++) {
                String countyName = citys.get(j).getText();
                driver.executeScript("arguments[0].click();", citys.get(j));
                sleep(200);
                if (validElement("//h2[@class='total fl']/span", driver) != null) {
                    String text = driver.findElement(By.xpath("//h2[@class='total fl']/span")).getText();
                    count += Integer.parseInt(text);
                    System.out.println(countyName + ": " + text + " items");
                    System.out.println("Running total: " + count);
                }

                List<WebElement> areas = null;
                try {
                    areas = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[2]/a"));
                } catch (Exception e) {
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                if (areas.size() == 0) {
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                saveDataCounty(countyName, areaName, provinceName, areas);

                sleep(100);
                citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            }

            ops.add(URLS, areaName);
            driver.close();
            driver.switchTo().window(frontPage);
            driver.get("https://www.lianjia.com/city/");
            sleep(200);
            elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        }
        System.out.println("Total: " + count);
    }

    private void saveDataCounty(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provincepo.getBusinessName());
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areapo.getBusinessName());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());
            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persisting link: " + buildAreaUrlLj);
        }
    }

    private void saveDataCity(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provinceName);
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areaName);
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());

            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persisting link: " + buildAreaUrlLj);
        }
    }

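    /**
     * Looks up province, city, or district information by name.
     * @param type 1 = province, 2 = city, 3 = district
     * @param businessName name to look up
     * @param parentId id of the parent region
     * @return the matching record, or a placeholder with id "-1" if none is found
     */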
    private IdAndNamePO queryProvinceCityArea(Integer type, String businessName, String parentId) {
        if (StringUtils.isNotBlank(parentId)) {
            ArrayList<String> citys = new ArrayList<>(8);
            citys.add("50");
            citys.add("11");
            citys.add("31");
            citys.add("12");
            if (citys.contains(parentId)) {
                businessName = "市辖区";
            }

        }
        IdAndNamePO po = null;
        try {
            if (type == 1) {
//                po = buildingsAvgMapper.queryProvinceIdByName(businessName);
            } else if (type == 2) {
//                po = buildingsAvgMapper.queryCityIdByName(businessName, parentId);
            } else if (type == 3) {
//                po = buildingsAvgMapper.querycountyIdByName(businessName, parentId);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (null == po) {
            po = new IdAndNamePO();
            po.setBusinessId("-1");
            po.setBusinessName(businessName);
        }
        return po;
    }


    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
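        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of silently swallowing it
            Thread.currentThread().interrupt();
        }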
        return null;
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
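        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of silently swallowing it
            Thread.currentThread().interrupt();
        }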
    }

    public static WebElement validElement(String str, WebDriver driver) {
        try {
            WebElement element = driver.findElement(By.xpath(str));
            return element;
        } catch (Exception e) {
            System.out.println("Element not found: " + str);
        }
        return null;
    }
}
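
The crawler above passes element text straight to Integer.valueOf and Integer.parseInt, which throw NumberFormatException if the text contains whitespace, thousands separators, a unit suffix, or a "no data" placeholder. A minimal sketch of a more tolerant parser follows. PriceText and parseIntOrDefault are hypothetical names that are not part of the original source, and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not in the original source): lenient integer parsing for scraped text. */
final class PriceText {
    // First run of ASCII digits, optionally grouped with commas, e.g. "12,345"
    private static final Pattern DIGITS = Pattern.compile("\\d[\\d,]*");

    private PriceText() {
    }

    /** Returns the first integer found in text, or fallback if there is none or it overflows. */
    static int parseIntOrDefault(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = DIGITS.matcher(text);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Integer.parseInt(m.group().replace(",", ""));
        } catch (NumberFormatException e) {
            // The digit run does not fit in an int
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntOrDefault(" 45,678 ", -1)); // prints 45678
        System.out.println(parseIntOrDefault("n/a", -1));      // prints -1
    }
}
```

With a helper like this, a call such as lj.setAvgPrice(...) could receive a fallback value instead of aborting the whole crawl on one malformed page. Whether a fallback or a skipped record is the better choice depends on how the data is used afterwards.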

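Notes

1. driver.close() closes the current window, while driver.quit() ends the whole browser session and its process. When looping over a long list, quit the browser once you are done with it; otherwise the leftover browser processes will eat up all available memory.
2. After navigating to a new page, use an explicit wait so that lookups do not fail because an element has not loaded yet.
3. Do not send requests too frequently; if you have special requirements, use a proxy.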
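
The explicit wait in note 2 comes down to polling until a condition holds or a timeout expires. In Selenium itself the usual tool for this is WebDriverWait combined with ExpectedConditions. The sketch below uses only the standard library to show the underlying idea, so it runs without a browser. PollingWait and waitFor are hypothetical names introduced here, and the timeout and interval values are arbitrary.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical helper (not in the original source): poll until a probe yields a value. */
final class PollingWait {
    private PollingWait() {
    }

    /**
     * Calls probe repeatedly until it returns a non-null value or the timeout elapses.
     * Returns null on timeout or if the thread is interrupted.
     */
    static <T> T waitFor(Supplier<T> probe, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value;
            try {
                value = probe.get();
            } catch (RuntimeException e) {
                // Treat a failed lookup as "not ready yet"
                value = null;
            }
            if (value != null) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated page: the "element" only becomes available on the third attempt
        String result = waitFor(() -> ++calls[0] < 3 ? null : "ready",
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "ready after 3 attempts"
    }
}
```

In the crawler above, the probe could wrap a lookup such as validElement(xpath, driver), which already returns null when the element is missing. That would replace fixed sleep(200) calls with a wait that ends as soon as the element appears.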

Afterword

Source code for the examples above:

https://download.csdn.net/download/DoAsOnePleases/86772623
