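This article is for learning purposes only!
Do not use it for anything illegal or against the rules; you alone bear the responsibility.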
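Selenium is one of the most widely used open-source suites for automated Web UI (user interface) testing.
Integrating it with Java essentially means calling the browser driver from Java code to simulate what a person would do by hand.
Selenium supports several browsers; this article uses Google Chrome as the example.
There are two places to download the Selenium driver; either one works.
① First confirm your browser version: enter chrome://settings/ in the address bar and open the About Chrome page.
② Pick either of the URLs below and download the matching version. (If there is no exact match, choose the closest one: in this article the browser is 104.0.5112.102, so the candidates are the versions starting with 104, and the best choice is the highest 104.x build.)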
http://chromedriver.storage.googleapis.com/index.html
http://npm.taobao.org/mirrors/chromedriver/
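The version-matching rule in step ② can be sketched in plain Java. This is only an illustration of the rule as described above: the class name `DriverVersionPicker` and the list of available versions are assumptions made up for the example, not something read from either download site.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class DriverVersionPicker {

    /** Splits "104.0.5112.102" into {104, 0, 5112, 102}. */
    static int[] parse(String version) {
        return Arrays.stream(version.trim().split("\\."))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    /** Compares two versions component by component, numerically. */
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Exact match if present; otherwise the highest version with the same major number. */
    static Optional<String> pick(String browserVersion, List<String> available) {
        if (available.contains(browserVersion)) {
            return Optional.of(browserVersion);
        }
        int major = parse(browserVersion)[0];
        return available.stream()
                .filter(v -> parse(v)[0] == major)
                .max(DriverVersionPicker::compare);
    }

    public static void main(String[] args) {
        // Illustrative list only; check the download page for what is really available.
        List<String> available = List.of(
                "103.0.5060.134", "104.0.5112.29", "104.0.5112.79", "105.0.5195.19");
        System.out.println(pick("104.0.5112.102", available).orElse("no match"));
    }
}
```

With the sample list, the browser version 104.0.5112.102 has no exact match, so the sketch falls back to the highest 104.x entry.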
package com.mengkeng.selenium_demo.test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import java.util.concurrent.TimeUnit;
public class BaiduDemo {
public static void main(String[] args) throws Exception {
// D://chromedriver.exe is only an example; use the path where you actually saved the driver
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
ChromeOptions chromeOptions = new ChromeOptions();
ChromeDriver driver = new ChromeDriver(chromeOptions);
try {
// Maximize the window
driver.manage().window().maximize();
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
Thread.sleep(1000);
// Open the Baidu home page
driver.get("https://www.baidu.com/");
// Find the search input box
WebElement text = driver.findElement(By.id("kw"));
// Find the Baidu search button
WebElement button = driver.findElement(By.id("su"));
text.sendKeys("123");
button.click();
} finally {
sleep(10000);
driver.quit();
}
}
public static void sleep(int time) {
try {
Thread.sleep(time);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
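Those few lines open a web page and search for '123'. Next, a look at the commonly used APIs; understanding them is enough, since you can look them up whenever you need them.
// Remember to change this to where your driver is actually stored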
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get("https://www.baidu.com/");
Note: when several elements on the page share the same attribute, use an XPath locator to pick out the one you want.
driver.findElement(By.id("pnum"));
driver.findElement(By.name("name"));
driver.findElement(By.className("pgo"));
driver.findElement(By.linkText("link"));
driver.findElement(By.xpath("//div[@id='1']/div/div/h3/a[1]"));
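To see why XPath helps when an attribute is shared, here is a small sketch that uses the JDK's built-in `javax.xml.xpath` package instead of Selenium, so it runs without a browser. The markup in `PAGE` and the class name are hypothetical; the same kind of expression could be passed to `By.xpath(...)`.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathLocatorSketch {

    // Hypothetical markup: three links that share the same class attribute.
    static final String PAGE =
            "<div id='results'>"
          + "<a class='item' href='/a'>first</a>"
          + "<a class='item' href='/b'>second</a>"
          + "<a class='item' href='/c'>third</a>"
          + "</div>";

    /** Number of nodes matched by the expression. */
    static int count(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(PAGE)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Text content of the first node matched by the expression. */
    static String text(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The class attribute alone is ambiguous: it matches all three links.
        System.out.println(count("//a[@class='item']"));
        // A positional XPath narrows the match to exactly one element.
        System.out.println(text("//div[@id='results']/a[2]"));
    }
}
```

Locating by the class alone matches every link, while the positional step `a[2]` selects a single one, which is the situation the note above describes.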
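Method | Description |
---|---|
sendKeys() | Types the given text into the element |
clear() | Clears the entered text |
getText() | Returns the element's visible text |
getAttribute() | Returns the value of the given attribute |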
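OK, once you have this part down you can write a simple crawler. If you're interested, try the following exercise:
Requirement:
Log in to QQ Mail and open the inbox page
Here is the implementation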
package com.mengkeng.selenium_demo.test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import java.util.Objects;
public class QQEmaIlLoginDemo {
public static void main(String[] args) throws InterruptedException {
// Point to the driver to use; remember to replace this with your own path
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
ChromeDriver driver = new ChromeDriver();
driver.manage().window().maximize();
try {
Thread.sleep(1000);
driver.get("https://mail.qq.com/");
driver.switchTo().frame("login_frame");
WebElement username = driver.findElement(By.id("u"));
WebElement password = driver.findElement(By.id("p"));
username.sendKeys("[email protected]");
password.sendKeys("xxxxxx");
WebElement submit = driver.findElement(By.id("login_button"));
submit.click();
Thread.sleep(1000);
driver.switchTo().defaultContent();
WebElement element = validElement("//a[@id='folder_1']", driver);
if (Objects.nonNull(element)){
element.click();
}else{
System.out.println("Failed to open the inbox");
}
} finally {
Thread.sleep(10000);
driver.close();
driver.quit();
}
}
public static WebElement validElement(String str, WebDriver driver) {
try {
WebElement element = driver.findElement(By.xpath(str));
return element;
} catch (Exception e) {
System.out.println("Element not found: " + str);
}
return null;
}
}
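The above is only a simple case. What about mouse actions and jumping between multiple pages? No rush, here they come.
Note: a chain of mouse actions must end with perform(); without it the actions do not take effect.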
Method | Description |
---|---|
click() | Left click |
contextClick() | Right click |
doubleClick() | Double click |
dragAndDrop() | Drag and drop |
moveToElement() | Hover the mouse over an element |
perform() | Executes all the actions queued in the Actions object |
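Why nothing happens without perform() is easier to see with a toy model. The class below is not Selenium's `Actions`; it is a made-up stand-in (`DeferredActionsSketch`, with string targets instead of elements) that only imitates the queue-then-execute behaviour. In real Selenium Java code the equivalent chain would look like `new Actions(driver).moveToElement(el).click().perform()`.

```java
import java.util.ArrayList;
import java.util.List;

final class DeferredActionsSketch {

    private final List<String> queued = new ArrayList<>();
    private final List<String> executed = new ArrayList<>();

    /** Queues a hover; nothing is executed yet. */
    DeferredActionsSketch moveToElement(String target) {
        queued.add("hover " + target);
        return this;
    }

    /** Queues a click; nothing is executed yet. */
    DeferredActionsSketch click(String target) {
        queued.add("click " + target);
        return this;
    }

    /** Runs everything queued so far, in order, then empties the queue. */
    void perform() {
        executed.addAll(queued);
        queued.clear();
    }

    /** Snapshot of the actions that have actually run. */
    List<String> executed() {
        return List.copyOf(executed);
    }

    public static void main(String[] args) {
        DeferredActionsSketch actions = new DeferredActionsSketch();
        actions.moveToElement("menu").click("item");
        System.out.println(actions.executed()); // still empty: perform() was not called
        actions.perform();
        System.out.println(actions.executed());
    }
}
```

The chained calls only record what should happen; the work is done in one go when perform() is called, which is why forgetting it leaves the page untouched.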
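When clicking a page element makes the browser open a new window, you need to switch to the newest page.
driver.switchTo().window(frontHandle); // frontHandle is a window handle (a String); get it with driver.getWindowHandle() and keep it for later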
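Finding the handle of the window that was just opened comes down to a set difference between the handles seen before and after the click. The sketch below shows only that step with plain Java sets; the class name and the handle strings are made up, and in a real script both sets would come from `driver.getWindowHandles()`.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

final class NewWindowHandle {

    /** Returns a handle that is present after the click but was not there before. */
    static Optional<String> find(Set<String> before, Set<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        return added.stream().findFirst();
    }

    public static void main(String[] args) {
        // Made-up handle values standing in for driver.getWindowHandles().
        Set<String> before = Set.of("CDwindow-A");
        Set<String> after = Set.of("CDwindow-A", "CDwindow-B");
        System.out.println(find(before, after).orElse("no new window"));
    }
}
```

The returned handle is what you would pass to `driver.switchTo().window(...)`; an empty result means the click did not open a new window yet.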
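Simulate scrolling the page
driver.executeScript("window.scrollTo(0,300)");
When a page element cannot be clicked (for example, it is blocked by anti-crawler measures)
driver.executeScript("arguments[0].click();", element); // element is the button or other element to click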
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER); // eager loading: return once the DOM is ready
chromeOptions.addArguments("--incognito"); // incognito (private) window
chromeOptions.addArguments("--blink-settings=imagesEnabled=false"); // don't load images
chromeOptions.addArguments("--headless"); // headless mode
chromeOptions.addArguments("--no-sandbox"); // disable the sandbox
chromeOptions.addArguments("--disable-gpu"); // disable GPU acceleration
chromeOptions.addArguments("--proxy-server=" + proxy); // use a proxy
ChromeDriver driver = new ChromeDriver(chromeOptions);
// Set the global implicit wait
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// Maximize the window
driver.manage().window().maximize();
// Remove the Selenium marker (navigator.webdriver)
String js1="Object.defineProperties(navigator, {webdriver:{get:()=>undefined}});";
((ChromeDriver) driver).executeScript(js1);
// Set the User-Agent (the option must be added before the driver is created)
String[] arr = {"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
"Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"};
chromeOptions.addArguments("--user-agent=" + arr[new Random().nextInt(arr.length)]);
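Picking the User-Agent can be pulled into a tiny helper so the array length is never hard-coded. This is a sketch under assumptions: the class name `UserAgentArgument` is made up, the list holds just two of the strings from the array above, and the result is meant to be passed to `chromeOptions.addArguments(...)` before the driver is created. Chrome's command-line switch for this is `--user-agent`.

```java
import java.util.List;
import java.util.Random;

final class UserAgentArgument {

    // Two entries taken from the array above; extend as needed.
    static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15");

    /** Builds the Chrome command-line argument for a randomly chosen User-Agent. */
    static String next(Random random) {
        String userAgent = USER_AGENTS.get(random.nextInt(USER_AGENTS.size()));
        return "--user-agent=" + userAgent;
    }

    public static void main(String[] args) {
        System.out.println(next(new Random()));
    }
}
```

Using `USER_AGENTS.size()` instead of a literal such as 7 keeps the index in range when entries are added or removed.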
When parsing the list pages, create a browser object for each task to do the parsing
private void parsePagePre(SetOperations ops) {
ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2,
8, 30L,
TimeUnit.SECONDS, new LinkedBlockingQueue<>());
HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs) {
pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
}
}
private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
ChromeDriver driver = getChromeDriver();
driver.get(buildAreaUrlLj.getAreaUrl());
// business logic goes here
}
private ChromeDriver getChromeDriver() {
String[] arr = {"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
"Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"};
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
chromeOptions.addArguments("--incognito");
chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--no-sandbox");
chromeOptions.addArguments("--disable-gpu");
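if (nextProxy != null && !nextProxy.isEmpty()) { // only when a proxy is configured; nextProxy is assumed to be supplied by the caller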
chromeOptions.addArguments("--proxy-server=" + nextProxy);
}
HashMap<String, Object> map = new HashMap<>();
map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
map.put("webrtc.multiple_routes_enabled", false);
map.put("webrtc.nonproxied_udp_enabled", false);
chromeOptions.setExperimentalOption("prefs", map);
Random random = new Random();
chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
ChromeDriver driver = new ChromeDriver(chromeOptions);
driver.manage().window().maximize();
return driver;
}
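One detail of the thread pool used above is worth checking for yourself: a `ThreadPoolExecutor` only creates threads beyond its core size when the work queue refuses a task, and a `LinkedBlockingQueue` created without a capacity never does. So a pool declared as (2, 8) with that queue runs at most 2 browsers at a time. The sketch below measures this with dummy tasks; the class and method names are made up for the illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolSizeSketch {

    /** Submits blocking tasks and reports how many threads the pool actually started. */
    static int largestPoolSize(int core, int max, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getLargestPoolSize();
        } finally {
            release.countDown(); // let the tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(largestPoolSize(2, 8, 6)); // extra tasks wait in the queue
        System.out.println(largestPoolSize(8, 8, 6)); // one thread per task, up to the core size
    }
}
```

If the intent is to run up to 8 drivers in parallel, one option is to set the core size to 8 as well (or use a bounded queue); which one fits depends on how much memory the machine can spare for browser processes.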
package com.mengkeng.selenium_demo.test;
import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.config.RestTemplateConfig;
import com.mengkeng.selenium_demo.entity.TkBuildingsPriceAjk;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.http.*;
import org.springframework.util.CollectionUtils;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;
import java.math.BigDecimal;
import java.util.*;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
*
* Date: 2022-07-10 13:50
* Description:
*/
@RestController
@RequestMapping("fang")
@Slf4j
public class FangtianxiaDemo {
@Autowired
private RedisTemplate redisTemplate;
private static LinkedList<String> pages = new LinkedList<>();
/**
* Base URL
*/
public static final String PRICE_URL = "https://pinggun.fang.com/RunChartNew/MakeChartData/";
/**
* Redis key that records the pages already visited
*/
public static final String SKIP_URLS = "SKIP_URLS";
/**
* Success flag
*/
public static String TEMP_FLAG = "fail";
@RequestMapping("sync")
public String sync() {
while (!TEMP_FLAG.equals("success")) {
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--no-sandbox");
chromeOptions.addArguments("--disable-gpu");
chromeOptions.addArguments("--disable-dev-shm-usage");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.manage().window().maximize();
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
driver.get("https://esf.fang.com/housing/");
sleep(2000);
try {
parseFTX(driver);
} catch (Exception e) {
try {
Thread.sleep(10000);
} catch (InterruptedException interruptedException) {
interruptedException.printStackTrace();
}
} finally {
sleep(10000);
driver.quit();
}
}
return "ok";
}
/**
* Parse fang.com (Fangtianxia)
*/
private void parseFTX(WebDriver driver) {
SetOperations ops = redisTemplate.opsForSet();
List<WebElement> elements = driver.findElements(By.xpath("//div[@class='qxName']/a"));
// Districts
for (int i = 2; i <= elements.size() - 3; i++) {
WebElement element = driver.findElement(By.xpath("//div[@class='qxName']/a[" + i + "]"));
element.click();
sleep(800);
// Business areas
List<WebElement> elementsShangquan = driver.findElements(By.xpath("//p[@id='shangQuancontain']/a"));
for (int sq = 2; sq <= elementsShangquan.size(); sq++) {
WebElement elementsq = driver.findElement(By.xpath("//p[@id='shangQuancontain']/a[" + sq + "]"));
String tempHref = elementsq.getAttribute("href");
// if (ops.isMember(SKIP_URLS, tempHref)) {
// System.out.println("Skipped link " + tempHref);
// continue;
// }
elementsq.click();
parsePage(driver);
ops.add(SKIP_URLS, tempHref);
sleep(800);
}
}
TEMP_FLAG = "success"; // a full pass completed normally; stop
}
/**
* Parse the paginated list
*
* @param driver
*/
private void parsePage(WebDriver driver) {
// Pagination
try {
driver.findElement(By.className("txt")).getText();
} catch (Exception e) {
log.info("No data in this category, url: " + driver.getCurrentUrl());
return;
}
String pageTotal = driver.findElement(By.className("txt")).getText().replaceAll("共", "").replaceAll("页", "");
for (int page = 0; page < Integer.parseInt(pageTotal); page++) {
List<WebElement> houseList = driver.findElements(By.xpath("//div[@class='houseList']/div"));
for (int i = 1; i < houseList.size(); i++) {
String communityName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).getText();
String communityCode = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[2]")).getAttribute("projcode");
String areaName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[2]/a[1]")).getText();
// Open the detail page
pages.addAll(driver.getWindowHandles());
driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).click();
sleepAndCutoverNewPage(800, driver);
parseDetail(communityCode, communityName, areaName);
driver.close();
driver.switchTo().window(pages.getLast());
sleep(1000);
}
if (page + 1 == Integer.parseInt(pageTotal)) {
break;
}
String pageNow = driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).getAttribute("href");
System.out.println("Next page: ------------" + pageNow + "----" + pageTotal);
driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).click();
sleep(600);
}
}
/**
* Parse the detail data
*
* @param communityCode
* @param communityName
* @param areaName
*/
public void parseDetail(String communityCode, String communityName, String areaName) {
HashMap<String, Object> map = new HashMap<>();
map.put("newcode", communityCode);
map.put("city", cnToUnicode("北京"));
map.put("district", cnToUnicode(areaName));
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON_UTF8);
HttpEntity<String> entity = new HttpEntity<>(JSON.toJSONString(map), headers);
RestTemplate restTemplate = null;
try {
restTemplate = new RestTemplate(RestTemplateConfig.generateHttpRequestFactory());
} catch (Exception e) {
e.printStackTrace();
}
ResponseEntity<String> stringResponseEntity = restTemplate.exchange(PRICE_URL, HttpMethod.POST, entity, String.class);
Pattern compile = Pattern.compile(",(\\w+)]");
Matcher matcher = compile.matcher(stringResponseEntity.getBody());
Pattern compileMonth = Pattern.compile("年(\\w+)月");
Matcher matcherMonth = compileMonth.matcher(stringResponseEntity.getBody());
ArrayList<String> list = new ArrayList<>();
while (matcherMonth.find()) {
list.add(matcherMonth.group(1));
}
Pattern compileYear = Pattern.compile("&(\\w+)年");
Matcher matcherYear = compileYear.matcher(stringResponseEntity.getBody());
int year = 2020;
while (matcherYear.find()) {
year = Integer.parseInt(matcherYear.group(1));
}
ArrayList months = null;
if (!CollectionUtils.isEmpty(list)) {
months = getMonths(year, Integer.parseInt(list.get(0)), Integer.parseInt(list.get(1)));
}
while (matcher.find()) {
TkBuildingsPriceAjk ajk = new TkBuildingsPriceAjk();
ajk.setDataOrigin("fangtianxia");
ajk.setCommunityCode(communityCode);
ajk.setCommunity(communityName);
ajk.setAvgPrice(new BigDecimal(matcher.group(1)));
System.out.println("Persisting=======================================" + ajk);
}
}
private static void sleep(int millis) {
try {
Thread.sleep(millis);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
/**
* Sleep, then switch to the newly opened page
*
* @param millis
* @param driver
* @return
*/
private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
try {
Thread.sleep(millis);
for (String handle : driver.getWindowHandles()) {
if (!pages.contains(handle)) {
driver.switchTo().window(handle);
}
}
} catch (InterruptedException e) {
e.printStackTrace();
}
return null;
}
/**
* Convert each character of the string to its backslash-u hex escape
*
* @param cn
* @return
*/
private static String cnToUnicode(String cn) {
char[] chars = cn.toCharArray();
StringBuilder returnStr = new StringBuilder();
for (int i = 0; i < chars.length; i++) {
returnStr.append(String.format("\\u%04x", (int) chars[i]));
}
return returnStr.toString();
}
/**
* Build the list of yyyyMM values; only spans the given year and the next one
*
* @param year start year
* @param start start month
* @param end end month
* @return
*/
private static ArrayList getMonths(int year, int start, int end) {
ArrayList res = new ArrayList();
for (int i = start; i <= (end == 12 ? 12 : end + 12); i++) {
if (i > 12) {
res.add((year + 1) + String.format("%02d", i - 12));
} else {
res.add(year + String.format("%02d", i));
}
}
return res;
}
}
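The `cnToUnicode` helper in the class above turns each character into a `\uXXXX` style escape. A standalone sketch of the same idea follows, with one deliberate difference: it pads every code unit to four hex digits, so ASCII characters are escaped consistently too. The class and method names are made up for the example.

```java
final class UnicodeEscapeSketch {

    /** Escapes every UTF-16 code unit as a backslash, the letter u and four hex digits. */
    static String toEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(String.format("\\u%04x", (int) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two characters of the city name Beijing, built from their code points.
        String beijing = new String(new char[] {0x5317, 0x4eac});
        System.out.println(toEscapes(beijing));
        System.out.println(toEscapes("A1"));
    }
}
```

For Chinese characters the output matches what `Integer.toString(c, 16)` would give, because those code units already have four hex digits; the padding only matters for characters below U+1000.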
package com.mengkeng.selenium_demo.test;
import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.entity.BuildAreaUrlLj;
import com.mengkeng.selenium_demo.entity.IdAndNamePO;
import com.mengkeng.selenium_demo.entity.TkBuildingsAreaInfolj;
import com.mengkeng.selenium_demo.entity.TkBuildingsMonthPriceLj;
import com.mengkeng.selenium_demo.mapper.BuildAreaUrlLjMapper;
import com.mengkeng.selenium_demo.service.ProxyService;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.time.DateFormatUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.HashOperations;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.*;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
*
* Date: 2022-09-05 13:58
* Description: residential communities (xiaoqu)
*/
@RestController
@RequestMapping("areaInfo")
@Slf4j
public class LianjiaAreaInfoDemo {
@Autowired
private StringRedisTemplate redisTemplate;
@Autowired
private BuildAreaUrlLjMapper buildAreaUrlLjMapper;
@Autowired
private ProxyService proxyService;
public static final String SKIP_URLS = "SKIP_URLS_AREAINFO_LIANJIA";
public static final String URLS = "URLS_AREAINFO_LIANJIA";
public static final String AREA_INFO_COMMUNITY_CODE_LJ = "AREA_INFO_COMMUNITY_CODE_LJ";
private static LinkedList<String> pages = new LinkedList<>();
ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2,
10, 30L,
TimeUnit.SECONDS, new LinkedBlockingQueue<>());
@RequestMapping("sync")
public void sync() throws InterruptedException {
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
boolean flag = false;
while (!flag) {
try {
ChromeDriver driver = getChromeDriver();
SetOperations ops = redisTemplate.opsForSet();
try {
getUrls(driver, ops);
parsePagePre(ops);
} finally {
sleep(1000);
driver.quit();
}
} catch (Exception e) {
Thread.sleep(10000);
continue;
}
flag = true;
}
System.out.println("Done");
}
/**
* Create the browser (driver) object
* @return
*/
private ChromeDriver getChromeDriver() {
String nextProxy = proxyService.getNextProxy();
System.out.println("Current proxy: " + nextProxy);
String[] arr = {"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
"Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"};
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
chromeOptions.addArguments("--incognito");
chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--no-sandbox");
chromeOptions.addArguments("--disable-gpu");
if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
chromeOptions.addArguments("--proxy-server=" + nextProxy);
}
HashMap<String, Object> map = new HashMap<>();
map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
map.put("webrtc.multiple_routes_enabled", false);
map.put("webrtc.nonproxied_udp_enabled", false);
chromeOptions.setExperimentalOption("prefs", map);
Random random = new Random();
chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
ChromeDriver driver = new ChromeDriver(chromeOptions);
driver.manage().window().maximize();
return driver;
}
private void parsePagePre(SetOperations ops) {
HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
List<BuildAreaUrlLj> buildAreaUrlLjs1 = buildAreaUrlLjs.subList(0, Math.min(3500, buildAreaUrlLjs.size())); // cap at 3500 without failing on shorter lists
for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs1) {
if (ops.isMember(SKIP_URLS, buildAreaUrlLj.getAreaUrl())) {
System.out.println("Skipping area " + buildAreaUrlLj.getCityName() + "-" + buildAreaUrlLj.getCountyName());
continue;
}
pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
}
}
/**
* Parse the list pages
* @param ops
* @param opsForHash
* @param buildAreaUrlLj
*/
private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
ChromeDriver driver = getChromeDriver();
try {
driver.get(buildAreaUrlLj.getAreaUrl());
String windowHandlePage = driver.getWindowHandle();
WebElement totalNumStr = validElement("//h2[@class='total fl']/span", driver);
if (null != totalNumStr) {
Integer total = Integer.valueOf(totalNumStr.getText());
// There is data
if (total > 1) {
String pageData = driver.findElement(By.xpath("//div[@class='page-box house-lst-page-box']")).getAttribute("page-data");
Integer pageNumStr = Integer.valueOf(JSON.parseObject(pageData).getString("totalPage"));
System.out.println("Pages in this area: " + pageNumStr + "---" + buildAreaUrlLj.getAreaUrl());
for (int x = 1; x <= pageNumStr; x++) {
List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
for (int i = 0; i < elements.size(); i++) {
WebElement item = elements.get(i);
String code = "";
Pattern compile1 = Pattern.compile("xiaoqu/(\\w+)/");
Matcher matcher1 = compile1.matcher(item.getAttribute("href"));
while (matcher1.find()) {
code = matcher1.group(1);
}
driver.executeScript("arguments[0].click();", item);
sleepAndCutoverNewPage(300, driver);
// If the code is already recorded, skip parsing the detail page
if (!opsForHash.hasKey(AREA_INFO_COMMUNITY_CODE_LJ, code)) {
parseDetail(driver, code, buildAreaUrlLj, opsForHash);
} else {
System.out.println("Code already exists in Redis: " + code);
// Update
// new TkBuildingsMonthPriceLj();
}
}
driver.close();
driver.switchTo().window(windowHandlePage);
sleep(200);
elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
}
if (x != pageNumStr) {
String nextPage = buildAreaUrlLj.getAreaUrl() + "pg" + (x + 1) + "/";
driver.get(nextPage);
System.out.println("Next page: " + nextPage);
sleep(200);
}
}
}
}
ops.add(SKIP_URLS, buildAreaUrlLj.getAreaUrl());
} catch (NumberFormatException e) {
throw new RuntimeException("Error in worker thread: " + e.getMessage(), e);
}finally {
driver.quit();
}
}
/**
* Parse the detail page
* @param driver
* @param communityCode
* @param buildAreaUrlLj
* @param opsForHash
*/
private void parseDetail(ChromeDriver driver, String communityCode, BuildAreaUrlLj buildAreaUrlLj, HashOperations<String, Object, Object> opsForHash) {
LocalDateTime now1 = LocalDateTime.now();
if (null != validElement("//span[@class='xiaoquUnitPrice']", driver)) {
TkBuildingsMonthPriceLj lj = new TkBuildingsMonthPriceLj();
lj.setCommunityCode(communityCode);
String year = String.valueOf(LocalDate.now().getYear());
if (driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().equals("挂牌均价")){
lj.setYearmonth(DateFormatUtils.format(new Date(),"yyyyMM"));
}else{
String monthStr = driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().replace("月参考均价", "");
String month = String.format("%02d", Integer.parseInt(monthStr));
lj.setYearmonth(year + month);
}
lj.setAvgPrice(Integer.valueOf(driver.findElement(By.className("xiaoquUnitPrice")).getText()));
lj.setGenerateType("0");
lj.setCreateBy("1");
lj.setCreateDate(new Date());
lj.setUpdateBy("1");
lj.setUpdateDate(new Date());
lj.setDelFlag("0");
System.out.println("Persisting price: "+lj);
}
LocalDateTime now2 = LocalDateTime.now();
TkBuildingsAreaInfolj infolj = new TkBuildingsAreaInfolj();
infolj.setDataOrigin("lianjia");
infolj.setGenerateType("0");
infolj.setProvince(buildAreaUrlLj.getProvinceId());
infolj.setCity(buildAreaUrlLj.getCityId());
infolj.setArea(buildAreaUrlLj.getCountyId());
infolj.setCommunity(validElement("//h1[@class='detailTitle']", driver) == null ?
"" : driver.findElement(By.xpath("//h1[@class='detailTitle']")).getText());
infolj.setCommunityCode(communityCode);
infolj.setBuildingYear(validElement("//span[text()='建筑年代']", driver) == null ?
"" : driver.findElement(By.xpath("//span[text()='建筑年代']/parent::div/span[2]")).getText());
infolj.setBuildingType(validElement("//span[text()='建筑类型']", driver) == null ?
"" : driver.findElement(By.xpath("//span[text()='建筑类型']/parent::div/span[2]")).getText());
infolj.setManageCost(validElement("//span[text()='物业费用']", driver) == null ?
"" : driver.findElement(By.xpath("//span[text()='物业费用']/parent::div/span[2]")).getText());
infolj.setManageCompany(validElement("//span[text()='物业公司']", driver) == null ?
"" : driver.findElement(By.xpath("//span[text()='物业公司']/parent::div/span[2]")).getText());
infolj.setManageDevlop(validElement("//span[text()='开发商']", driver) == null ?
"" : driver.findElement(By.xpath("//span[text()='开发商']/parent::div/span[2]")).getText());
infolj.setBuildingCount(validElement("//span[text()='楼栋总数']", driver) == null ?
"" : driver.findElement(By.xpath("//span[text()='楼栋总数']/parent::div/span[2]")).getText());
infolj.setRoomCount(validElement("//span[text()='房屋总数']", driver) == null ?
"" : driver.findElement(By.xpath("//span[text()='房屋总数']/parent::div/span[2]")).getText());
infolj.setCreateBy("1");
infolj.setCreateDate(new Date());
infolj.setUpdateBy("1");
infolj.setUpdateDate(new Date());
infolj.setDelFlag("0");
System.out.println("Persisting community: "+infolj);
}
/**
* Crawl the links
* @param driver
* @param ops
*/
private void getUrls(ChromeDriver driver, SetOperations ops) {
driver.get("https://www.lianjia.com/city/");
int count = 0;
List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
for (int i = 0; i < elements.size(); i++) {
WebElement element = elements.get(i);
String provinceName = element.findElement(By.xpath("./parent::li/parent::ul/parent::div/div")).getText();
String areaName = element.getText();
Boolean memberFlag = ops.isMember(URLS, areaName);
if (memberFlag) {
System.out.println("Area already processed, skipping: " + areaName);
continue;
}
driver.executeScript("arguments[0].click();", element);
String frontPage = driver.getWindowHandle();
WebElement ershoufang = null;
try {
ershoufang = driver.findElement(By.linkText("小区"));
} catch (Exception e) {
ops.add(URLS, areaName);
sleep(200);
System.out.println(areaName + " has no community listing====");
driver.get("https://www.lianjia.com/city/");
elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
continue;
}
driver.executeScript("arguments[0].click();", ershoufang);
sleepAndCutoverNewPage(500, driver);
List<WebElement> citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
citys.forEach(e -> System.out.println("City level============" + e.getText() + "==" + e.getAttribute("href")));
for (int j = 0; j < citys.size(); j++) {
String countyName = citys.get(j).getText();
driver.executeScript("arguments[0].click();", citys.get(j));
sleep(200);
if (validElement("//h2[@class='total fl']/span", driver) != null) {
String text = driver.findElement(By.xpath("//h2[@class='total fl']/span")).getText();
count += Integer.parseInt(text);
System.out.println(countyName + ": " + text + " communities");
System.out.println("Running total: " + count);
}
List<WebElement> areas = null;
try {
areas = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[2]/a"));
} catch (Exception e) {
citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
saveDataCity(countyName, areaName, provinceName, citys);
break;
}
if (areas.size() == 0) {
citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
saveDataCity(countyName, areaName, provinceName, citys);
break;
}
saveDataCounty(countyName, areaName, provinceName, areas);
sleep(100);
citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
}
ops.add(URLS, areaName);
driver.close();
driver.switchTo().window(frontPage);
driver.get("https://www.lianjia.com/city/");
sleep(200);
elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
}
System.out.println("Total: " + count);
}
private void saveDataCounty(String countyName, String areaName, String provinceName, List<WebElement> list) {
for (WebElement element : list) {
String url = element.getAttribute("href");
BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
buildAreaUrlLj.setProvinceName(provincepo.getBusinessName());
buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
buildAreaUrlLj.setCityName(areapo.getBusinessName());
buildAreaUrlLj.setCityId(areapo.getBusinessId());
IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
buildAreaUrlLj.setCountyName(countypo.getBusinessName());
buildAreaUrlLj.setCountyId(countypo.getBusinessId());
buildAreaUrlLj.setAreaUrl(url);
buildAreaUrlLj.setCreateTime(new Date());
buildAreaUrlLj.setUpdateTime(new Date());
System.out.println("Persisting link: "+buildAreaUrlLj);
}
}
private void saveDataCity(String countyName, String areaName, String provinceName, List<WebElement> list) {
for (WebElement element : list) {
String url = element.getAttribute("href");
BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
buildAreaUrlLj.setProvinceName(provinceName);
buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
buildAreaUrlLj.setCityName(areaName);
IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
buildAreaUrlLj.setCityId(areapo.getBusinessId());
IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
buildAreaUrlLj.setCountyName(countypo.getBusinessName());
buildAreaUrlLj.setCountyId(countypo.getBusinessId());
buildAreaUrlLj.setAreaUrl(url);
buildAreaUrlLj.setCreateTime(new Date());
buildAreaUrlLj.setUpdateTime(new Date());
System.out.println("Persisting link: "+buildAreaUrlLj);
}
}
/**
* Look up province / city / county information by name
* @param type 1 = province, 2 = city, 3 = district
* @param businessName name
* @param parentId parent id
* @return
*/
private IdAndNamePO queryProvinceCityArea(Integer type, String businessName, String parentId) {
if (StringUtils.isNotBlank(parentId)) {
ArrayList<String> citys = new ArrayList<>(8);
citys.add("50");
citys.add("11");
citys.add("31");
citys.add("12");
if (citys.contains(parentId)) {
businessName = "市辖区";
}
}
IdAndNamePO po = null;
try {
if (type == 1) {
// po = buildingsAvgMapper.queryProvinceIdByName(businessName);
} else if (type == 2) {
// po = buildingsAvgMapper.queryCityIdByName(businessName, parentId);
} else if (type == 3) {
// po = buildingsAvgMapper.querycountyIdByName(businessName, parentId);
}
} catch (Exception e) {
e.printStackTrace();
}
if (null == po) {
po = new IdAndNamePO();
po.setBusinessId("-1");
po.setBusinessName(businessName);
}
return po;
}
private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
try {
Thread.sleep(millis);
for (String handle : driver.getWindowHandles()) {
if (!pages.contains(handle)) {
driver.switchTo().window(handle);
}
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // preserve the interrupt status
}
return null;
}
private static void sleep(int millis) {
try {
Thread.sleep(millis);
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // preserve the interrupt status
}
}
public static WebElement validElement(String str, WebDriver driver) {
try {
WebElement element = driver.findElement(By.xpath(str));
return element;
} catch (Exception e) {
System.out.println("Element not found: " + str);
}
return null;
}
}
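The `parsePage` method in the class above pulls the community code out of each link with the pattern `xiaoqu/(\w+)/`. Here is that step on its own as a sketch: the class name and the sample URLs are made up, and unlike the loop in the listing (which keeps the last match) this version returns the first match, which is the same thing when the URL contains the segment once.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommunityCodeSketch {

    // Same pattern as in the listing: the path segment that follows "xiaoqu/".
    private static final Pattern CODE = Pattern.compile("xiaoqu/(\\w+)/");

    /** Returns the community code from a link, or empty when the link has none. */
    static Optional<String> extract(String href) {
        if (href == null) {
            return Optional.empty();
        }
        Matcher matcher = CODE.matcher(href);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical URLs used only to exercise the pattern.
        System.out.println(extract("https://example.com/xiaoqu/1111027382209/"));
        System.out.println(extract("https://example.com/city/"));
    }
}
```

Returning an `Optional` makes the "no code in this link" case explicit, instead of carrying an empty string into the Redis lookup.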
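1. driver.close() closes the current page, while driver.quit() exits the browser process. If a loop over a list never quits the driver, the leftover browser processes will eat up all the memory.
2. After navigating to a page, prefer an explicit wait so that lookups don't fail on elements that haven't loaded yet.
3. Don't send requests too frequently; if you have special requirements, use a proxy.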
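The idea behind an explicit wait (tip 2) is to poll for a condition until it holds or a timeout expires, instead of sleeping for a fixed time. Selenium ships this as `WebDriverWait` together with `ExpectedConditions`; the sketch below is not that API but a plain-Java stand-in with made-up names, where the `lookup` supplier plays the role of "find the element, or return null if it isn't there yet".

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class PollingWaitSketch {

    /** Calls lookup repeatedly until it returns non-null or the timeout expires. */
    static <T> Optional<T> waitFor(Supplier<T> lookup, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = lookup.get();
            if (value != null) {
                return Optional.of(value);
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Simulated page: the "element" only shows up on the third lookup.
        AtomicInteger lookups = new AtomicInteger();
        Optional<String> element = waitFor(
                () -> lookups.incrementAndGet() >= 3 ? "element" : null,
                Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println(element.isPresent() + " after " + lookups.get() + " lookups");
    }
}
```

Compared with the fixed `sleep(...)` calls in the listings, this returns as soon as the element appears and gives a clear empty result when it never does.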
Source code for the examples above
https://download.csdn.net/download/DoAsOnePleases/86772623