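Preface: I recently noticed that Spring Boot's dependency management already includes selenium-java, so I decided to experiment with it. A distributed crawler is a large and fairly complex architecture that I have not looked into; what interests me is the client side, so I set out to write a small crawler client in Java. Selenium is really a testing tool, but it works well for a small crawler: it simulates a real user, so asynchronously loaded content and hard-to-reach elements are no longer a problem. It is also worth mentioning that selenium-java makes locating elements very convenient, with lookups by className, tagName, id, and more. Some sites present a captcha while being crawled; tess4j can handle simple captcha recognition, and if that fails you can fall back to waiting for manual input. The sections below cover captcha handling, crawling frame content, pagination, scroll bars, and so on.
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera, and others. Its main uses are browser compatibility testing (checking that your application works well across browsers and operating systems) and functional testing (building regression tests that verify software functionality and user requirements). It supports recording actions automatically and generating test scripts in languages such as .NET, Java, and Perl.
selenium-java is the Java binding of Selenium. Different drivers control different browsers, for example selenium-chrome-driver, selenium-edge-driver, selenium-firefox-driver, selenium-ie-driver, selenium-opera-driver, and phantomjsdriver. I used chromedriver and phantomjsdriver. Both fully simulate real user actions, which makes this a solid testing framework.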
The table below lists the Chrome versions each chromedriver release supports:
ChromeDriver version | Supported Chrome versions |
---|---|
2.37 | v64-66 |
2.36 | v63-65 |
2.35 | v62-64 |
2.34 | v61-63 |
2.33 | v60-62 |
2.32 | v59-61 |
2.31 | v58-60 |
2.30 | v58-60 |
2.29 | v56-58 |
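As a rough illustration of how the table can be used, the sketch below picks the newest listed driver for a given Chrome major version. The class name `ChromeDriverPicker` and its `pick` method are made up for this example; only the version ranges come from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class ChromeDriverPicker {
    // Driver version -> {lowest, highest} supported Chrome major version,
    // copied from the table above, newest driver first.
    private static final Map<String, int[]> SUPPORTED = new LinkedHashMap<>();
    static {
        SUPPORTED.put("2.37", new int[]{64, 66});
        SUPPORTED.put("2.36", new int[]{63, 65});
        SUPPORTED.put("2.35", new int[]{62, 64});
        SUPPORTED.put("2.34", new int[]{61, 63});
        SUPPORTED.put("2.33", new int[]{60, 62});
        SUPPORTED.put("2.32", new int[]{59, 61});
        SUPPORTED.put("2.31", new int[]{58, 60});
        SUPPORTED.put("2.30", new int[]{58, 60});
        SUPPORTED.put("2.29", new int[]{56, 58});
    }

    /** Returns the newest listed driver whose range contains the given Chrome major version. */
    static Optional<String> pick(int chromeMajor) {
        for (Map.Entry<String, int[]> entry : SUPPORTED.entrySet()) {
            int[] range = entry.getValue();
            if (chromeMajor >= range[0] && chromeMajor <= range[1]) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty(); // Chrome version not covered by the table
    }

    public static void main(String[] args) {
        System.out.println(pick(65).orElse("no match")); // prints 2.37
    }
}
```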
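The drivers can be downloaded from:
http://chromedriver.storage.googleapis.com/index.html
Note: a 64-bit system can run the 32-bit build, so just download the 32-bit version. I have tested this and it works.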
ChromeOptions options = new ChromeOptions();
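// Hide the automation infobar and disable same-origin security checks
options.addArguments("disable-infobars","disable-web-security");
// Run without a GUI; leave this out during development so you can watch the browser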
options.addArguments("--headless");
String driverPath = "D:\\crawler-plugin\\chromedriver.exe";
System.setProperty("webdriver.chrome.driver", driverPath);
RemoteWebDriver driver= new ChromeDriver(options);
driver.get("http://www.baidu.com");
System.out.println(driver.findElement(By.tagName("body")).getText());
PhantomJS download page: http://phantomjs.org/download.html
String driverPath = "D:\\crawler-plugin\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe";
System.setProperty("phantomjs.binary.path", driverPath); // path to the PhantomJS executable
DesiredCapabilities desiredCapabilities = DesiredCapabilities.phantomjs();
// Set the user agent reported by the page settings and by the request headers
desiredCapabilities.setCapability("phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
desiredCapabilities.setCapability("phantomjs.page.customHeaders.User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
RemoteWebDriver driver = new PhantomJSDriver(desiredCapabilities);
driver.get("http://www.baidu.com");
System.out.println(driver.findElement(By.tagName("body")).getText());
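Sites sometimes require a login, and a captcha on the login form is a real headache. tess4j can recognize simple captchas; forget about complex ones.
Maven dependency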
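<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>3.4.0</version>
<exclusions>
<exclusion>
<groupId>com.sun.jna</groupId>
<artifactId>jna</artifactId>
</exclusion>
</exclusions>
</dependency>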
tess4j configuration properties
##tess4j config
tess4j.language=chi_sim
tess4j.language.path=D:\\crawler-plugin\\tessdata
tess4j.data.path=D:\\crawler-plugin\\
Reading the configuration
@Configuration
public class Tess4jConfig {
@Value("${tess4j.data.path}")
@Setter
@Getter
private String tess4jDataPath ;
@Value("${tess4j.language.path}")
@Setter
@Getter
private String tess4jLanguagePath ;
@Value("${tess4j.language}")
@Setter
@Getter
private String tess4jLanguage ;
}
Utility class
import com.cdchen.crawler.config.Tess4jConfig;
import lombok.extern.slf4j.Slf4j;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.LoggHelper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;
@Slf4j
public class Tess4jUtil {
private static final Logger logger = LoggerFactory.getLogger(new LoggHelper().toString());
static final double MINIMUM_DESKEW_THRESHOLD = 0.05d;
private static ITesseract instance;
private static String datapath = "D:\\crawler-plugin\\";
private static String testResourcesLanguagePath = "D:\\crawler-plugin\\tessdata";
private static String language = "chi_sim";
private static ITesseract getInstance(){
Tess4jConfig config = SpringBeanUtil.getBean(Tess4jConfig.class);
if(config != null){
datapath = config.getTess4jDataPath();
language = config.getTess4jLanguage();
testResourcesLanguagePath = config.getTess4jLanguagePath();
}
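if(datapath == null){
log.error("tess4j.data.path must be set in the properties file; otherwise captchas cannot be recognized");
return null;
}
if(testResourcesLanguagePath == null){
log.error("tess4j.language.path must be set in the properties file; otherwise captchas cannot be recognized");
return null;
}
if(language == null){
log.error("tess4j.language must be set in the properties file; otherwise captchas cannot be recognized");
return null;
}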
if(instance == null){
instance = new Tesseract();
// Point Tesseract at the directory that holds the language data, then select the language
instance.setDatapath(testResourcesLanguagePath);
instance.setLanguage(language);
}
return instance;
}
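public static String doOcr(File file) throws Exception{
ITesseract tesseract = getInstance();
if(tesseract == null){
throw new IllegalStateException("tess4j is not configured; see the error log");
}
return tesseract.doOCR(file);
}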
}
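OCR output on captchas tends to be noisy, and the preface mentions falling back to manual input when recognition fails. The following is a minimal sketch of that fallback, not part of the original project: `CaptchaResolver`, `clean`, and `resolve` are hypothetical names, and it assumes the captcha consists of a known number of letters and digits.

```java
import java.util.function.Supplier;

class CaptchaResolver {
    /** Drops everything except ASCII letters and digits; OCR output often contains spaces or line breaks. */
    static String clean(String raw) {
        return raw == null ? "" : raw.replaceAll("[^A-Za-z0-9]", "");
    }

    /**
     * Uses the OCR result when it has the expected length after cleaning;
     * otherwise (or when OCR throws) asks the manual-input supplier instead.
     */
    static String resolve(Supplier<String> ocr, Supplier<String> manualInput, int expectedLength) {
        String guess;
        try {
            guess = clean(ocr.get());
        } catch (RuntimeException e) {
            guess = ""; // treat an OCR failure like an unusable result
        }
        return guess.length() == expectedLength ? guess : clean(manualInput.get());
    }
}
```

In the crawler, the `ocr` supplier would wrap `Tess4jUtil.doOcr(file)` (rethrowing its checked exception as an unchecked one), and `manualInput` could simply read a line from the console.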
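Pagination is comparatively simple, and there are many ways to handle it. Two examples:
1 Work out the URL pattern of the paged links and substitute the page number
2 Locate the pagination button and simulate a click
I used the second approach. The following code, from crawling Qiushibaike, handles pagination and is provided for reference: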
private void jumpPageNum(int pageNum){
if(WebElementUtils.doesWebElementExist(driver,By.className("pagination"))){
WebElement pagination = driver.findElement(By.className("pagination"));
String currentText = pagination.findElement(By.className("current")).getText();
int currentPageNum = Integer.parseInt(currentText);
while (currentPageNum != pageNum){
List<WebElement> pageNums = pagination.findElements(By.className("page-numbers"));
for (int i = 0; i < pageNums.size(); i++) {
String pageNumText = pageNums.get(i).getText();
if(pageNumText.equals(pageNum+"")){
pageNums.get(i).click();
scrollBar.toPageEnd();
break;
}else{
if(i == (pageNums.size()-1)){
pageNums.get(i).click();
}
}
}
pagination = driver.findElement(By.className("pagination"));
currentText = pagination.findElement(By.className("current")).getText();
currentPageNum = Integer.parseInt(currentText);
}
}
}
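For comparison, the first approach (substituting the page number into a URL pattern) could look roughly like this. The `PageUrls` class, the `{page}` placeholder, and the example.com URL are all assumptions made for the sketch; a real site's pattern has to be worked out by inspecting its pagination links.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrls {
    /** Replaces the {page} placeholder in the template with the given page number. */
    static String forPage(String template, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return template.replace("{page}", String.valueOf(page));
    }

    /** Builds the URLs for pages from..to (both inclusive). */
    static List<String> range(String template, int from, int to) {
        List<String> urls = new ArrayList<>();
        for (int page = from; page <= to; page++) {
            urls.add(forPage(template, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical URL pattern, for illustration only
        System.out.println(forPage("https://example.com/list/page/{page}/", 3));
        // prints https://example.com/list/page/3/
    }
}
```

Each generated URL would then be passed to `driver.get(url)` in turn.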
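Scroll bars are more troublesome because the driver has no dedicated API for operating them (or perhaps I just did not find one), so I took an indirect route, and the same idea solves many similar problems: operate the scroll bar with JavaScript.
The steps are: inject helper JavaScript functions into the page, scroll down in fixed increments with window.scrollTo, and after each step call the injected function to check whether the bottom has been reached.
Here is the utility class I wrote: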
import com.cdchen.crawler.util.SleepUtil;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.remote.RemoteWebDriver;
/**
*
* @description: tb
*
* @author: cdchen
*
* @create: 2019-04-30 17:08
**/
@Data
@Slf4j
public class ScrollBar {
RemoteWebDriver driver = null;
private static String getScrollTopJs = "function getScrollTop(){"
+ " var scrollTop = 0, bodyScrollTop = 0, documentScrollTop = 0;"
+ " if(document.body){"
+ " bodyScrollTop = document.body.scrollTop;"
+ " }"
+ " if(document.documentElement){"
+ " documentScrollTop = document.documentElement.scrollTop;"
+ " }"
+ " scrollTop = (bodyScrollTop - documentScrollTop > 0) ? bodyScrollTop : documentScrollTop;"
+ " return scrollTop;"
+ "};";
private static String getScrollHeightJs = "function getScrollHeight(){"
+ " var scrollHeight = 0, bodyScrollHeight = 0, documentScrollHeight = 0;"
+ " if(document.body){"
+ " bodyScrollHeight = document.body.scrollHeight;"
+ " }"
+ " if(document.documentElement){"
+ " documentScrollHeight = document.documentElement.scrollHeight;"
+ " }"
+ " scrollHeight = (bodyScrollHeight - documentScrollHeight > 0) ? bodyScrollHeight : documentScrollHeight;"
+ " return scrollHeight;"
+ "};";
private static String getWindowHeightJs = "function getWindowHeight(){"
+ " var windowHeight = 0;"
+ " if(document.compatMode == \"CSS1Compat\"){"
+ " windowHeight = document.documentElement.clientHeight;"
+ " }else{"
+ " windowHeight = document.body.clientHeight;"
+ " }"
+ " return windowHeight;"
+ "};";
private static String scroollIsOverJs = "function scroollIsOver(){"
+ " if(getScrollTop() + getWindowHeight() >= getScrollHeight()){"
+ " return true;"
+ " }else{"
+ " return false;"
+ " }"
+ "};";
private static String insertScriptJs = "var body = document.getElementsByTagName('body')[0];"
+ "var newScript = document.createElement('script');"
+ "newScript.type = 'text/javascript';"
+ "newScript.innerHTML = '"+getScrollTopJs+getScrollHeightJs+getWindowHeightJs+scroollIsOverJs+"';"
+ "body.appendChild(newScript);";
public ScrollBar(RemoteWebDriver dr){
driver = dr;
}
public void toPageEnd() {
getDriver().executeScript(insertScriptJs);
int start = 0;
boolean scroollIsOver = false;
while (!scroollIsOver){
getDriver().executeScript("window.scrollTo(0,"+(start+500)+")");
Boolean res = (Boolean)getDriver().executeScript("return scroollIsOver();");
if(res != null && res){
scroollIsOver = true;
}
start = start+500;
SleepUtil.sleep(1000);
}
}
}
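One thing to watch with a loop like `toPageEnd` is a page that keeps loading content forever, in which case the bottom is never reached. The sketch below shows one possible variant that caps the number of scroll steps. It is an illustration rather than the project's code: `BoundedScroller` is a hypothetical name, and the JavaScript executor is passed in as a plain `Function` so the logic can be exercised without a browser. With Selenium you would pass `script -> driver.executeScript(script)`.

```java
import java.util.function.Function;

class BoundedScroller {
    /**
     * Scrolls down stepPx pixels at a time until the bottom of the page is reached
     * or maxSteps steps have been made. Returns true only if the bottom was reached.
     */
    static boolean scrollToEnd(Function<String, Object> js, int stepPx, int maxSteps, long pauseMillis) {
        for (int step = 1; step <= maxSteps; step++) {
            // x stays 0; only the vertical offset grows
            js.apply("window.scrollTo(0, " + (step * stepPx) + ");");
            // 1px tolerance, because scroll offsets can be fractional
            Object atBottom = js.apply("return window.pageYOffset + window.innerHeight"
                    + " >= document.documentElement.scrollHeight - 1;");
            if (Boolean.TRUE.equals(atBottom)) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis); // give lazily loaded content time to arrive
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // gave up after maxSteps
    }
}
```

A call such as `BoundedScroller.scrollToEnd(script -> driver.executeScript(script), 500, 100, 1000)` mirrors the 500-pixel step and one-second pause used above, while stopping after at most 100 steps.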
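Sometimes a page embeds an iframe, and elements inside it cannot be located from the top-level document. In that case, switch the driver into the iframe, as in the following code: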
WebElement iframe = driver.findElement(By.tagName("iframe"));
driver.switchTo().frame(iframe);
// Now locate elements or perform other actions; switch back when done
driver.switchTo().parentFrame();
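Sometimes you need several tabs open when using chromedriver. Here is how to switch between them:
List<String> tabs = new ArrayList<>(driver.getWindowHandles());
// Switch to the first tab
driver.switchTo().window(tabs.get(0));
// Switch to the tab at index n (zero-based)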
driver.switchTo().window(tabs.get(n));
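A crawler client should be as small as possible, but it sometimes has to record crawl progress, which means using a database or files. I therefore embedded SQLite as the persistence store, so restarting the client does not lose data. The stack is SpringBoot + Mybatis + Tkmapper + SQLite + selenium-java, which is quite convenient to work with. The project is hosted on gitee: https://gitee.com/passionday/crawler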
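To make the idea of recording crawl progress concrete without pulling in the whole stack, here is a deliberately simplified sketch that uses a properties file instead of SQLite. It is not how the project does it; `CrawlCheckpoint` and its methods are hypothetical, and the file-based store is only a stand-in for the database.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class CrawlCheckpoint {
    private final Path file;

    CrawlCheckpoint(Path file) {
        this.file = file;
    }

    /** Returns the last page saved for the site, or 0 if nothing has been recorded yet. */
    int lastPage(String site) {
        return Integer.parseInt(load().getProperty(site, "0"));
    }

    /** Records the page number for the site and writes the file straight away. */
    void save(String site, int page) {
        Properties props = load();
        props.setProperty(site, String.valueOf(page));
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "crawl progress");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the current contents of the file; a missing file means no progress yet.
    private Properties load() {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }
}
```

After a restart, the client would call `lastPage(site)` and resume from the following page.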