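Python is a very popular choice for writing web crawlers these days. Since I am learning Java, I wondered how the same thing could be done in Java.
The example in this post crawls the China Sports Lottery site, http://www.sporttery.cn/
It scrapes the "football match results" page. To crawl other pages of the same site, a small change to the URL rule in the code is enough; to crawl a different site, you need to analyze that site's structure again and adapt the scraping logic.
Down to business. The IDE is IntelliJ IDEA. Here is a rough overview of how the project is put together; its structure is shown in the figure above: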
Jars used:
First, jsoup, a handy library for fetching and parsing pages in Java. See https://segmentfault.com/a/1190000007967145
Second, POI, for saving data locally as Word, Excel and similar files. See http://www.voidcn.com/article/p-odzwqxka-er.html
Code:
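1. Main class
Analyzes the "football match results" page and defines the crawl rule. Example page: http://info.sporttery.cn/football/match_result.php?page=1&search_league=0&start_date=2018-10-21&end_date=2018-10-23&dan= Interested readers can inspect the page structure themselves.
Directs the DownloadWord object to scrape the page data into a list.
Directs the SaveEveryRow object to save the list data to a local Excel file.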
package DownloadWord;

import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        int allRows = 0; // number of rows saved so far
        SaveEveryRow save = new SaveEveryRow();
        HSSFWorkbook workbook = new HSSFWorkbook();
        HSSFSheet spreadsheet = workbook.createSheet("try");
        DownloadWord down = new DownloadWord();
        try {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
            Date d1 = sdf.parse("2018-08-22"); // first day to crawl
            Date d2 = sdf.parse("2018-10-23"); // exclusive: the loop stops before this day
            Calendar dd = Calendar.getInstance();
            dd.setTime(d1);
            while (dd.getTime().before(d2)) {
                String str = sdf.format(dd.getTime());
                String url = "http://info.sporttery.cn/football/match_result.php?page=1&search_league=0&start_date=" + str + "&end_date=" + str
                        + "&dan=";
                // getM_page downloads the page, so call it once per day rather than in the loop condition.
                int pages = down.getM_page(url);
                if (pages <= 0) {
                    // No extra pages: the day's results fit on page 1.
                    List lis = down.getTexts(url);
                    save.Save(url, (ArrayList) lis, allRows, workbook, spreadsheet);
                    allRows = allRows + lis.size();
                } else {
                    for (int j = 1; j <= pages; j++) {
                        String u = "http://info.sporttery.cn/football/match_result.php?page=" + j + "&search_league=0&start_date=" + str + "&end_date=" + str
                                + "&dan=";
                        List lis = down.getTexts(u);
                        save.Save(u, (ArrayList) lis, allRows, workbook, spreadsheet);
                        allRows = allRows + lis.size();
                    }
                }
                dd.add(Calendar.DAY_OF_YEAR, 1); // move on to the next day
            }
            // Report success only when no exception interrupted the crawl.
            System.out.println("Crawl finished, data saved to the local machine!");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
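The URL rule and the single-page versus multi-page branch in Main can be pulled out into a small helper. What follows is only a minimal sketch, not part of the original project: the class ResultUrls and its two methods are hypothetical names, and it uses java.time where the original uses Calendar. It assumes, as Main does, that a page count of zero or less from getM_page means the day fits on page 1.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not in the original project): builds the result-page
// URLs that Main requests for a single day.
public class ResultUrls {
    private static final String BASE =
            "http://info.sporttery.cn/football/match_result.php";

    // URL of one result page for one day; start_date and end_date are the same day.
    public static String pageUrl(LocalDate day, int page) {
        // LocalDate.toString() yields yyyy-MM-dd, the format used in the example URL.
        return BASE + "?page=" + page + "&search_league=0&start_date=" + day
                + "&end_date=" + day + "&dan=";
    }

    // All URLs for one day. rawPageCount stands for the value getM_page returns;
    // a value <= 0 is treated as "page 1 only", mirroring the branch in Main.
    public static List<String> urlsForDay(LocalDate day, int rawPageCount) {
        int pages = Math.max(1, rawPageCount);
        List<String> urls = new ArrayList<>();
        for (int p = 1; p <= pages; p++) {
            urls.add(pageUrl(day, p));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints the three page URLs for 2018-10-21.
        for (String u : urlsForDay(LocalDate.of(2018, 10, 21), 3)) {
            System.out.println(u);
        }
    }
}
```

With a helper like this, changing the crawl target within the same site comes down to editing one string template.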
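2. DownloadWord class. Writing it requires familiarity with the jsoup functions and an analysis of the page source.
It stores the scraped page data in an ArrayList of the custom Match type.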
package DownloadWord;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class DownloadWord {
    private int rows;
    private int page;

    // Downloads one result page and returns its table rows as Match objects.
    public List<Match> getTexts(String u) {
        List<Match> list = new ArrayList<>();
        try {
            Document document = Jsoup.connect(u).get();
            Elements a = document.select(".m-tab").select("tr");
            // The last two <tr> elements of the table are not read as match rows.
            rows = a.size() - 2;
            for (int i = 0; i < rows; i++) {
                Elements b = a.get(i).select("td");
                if (b.size() < 7) {
                    continue; // skip rows that lack the seven expected cells
                }
                Match e = new Match();
                e.setTime(b.get(0).text());
                e.setNum(b.get(1).text());
                e.setMatch(b.get(2).text());
                e.setVs(b.get(3).text());
                e.setMidScore(b.get(4).text());
                e.setFullScore(b.get(5).text());
                e.setConditon(b.get(6).text());
                list.add(e);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    // Returns the number of <li> elements in the .m-page pager minus 3.
    // Main treats a result of 0 or less as "only page 1 exists".
    public int getM_page(String u) throws IOException {
        Document document = Jsoup.connect(u).get();
        Elements a = document.select(".m-page").select("li");
        page = a.size() - 3;
        return page;
    }
}
3. SaveEveryRow class. Saves the list data to an Excel file on the local machine.
package DownloadWord;

import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;

public class SaveEveryRow {
    private String[] STR;
    private static String outputFile = "D:\\try.xls";

    public void Save(String url, ArrayList lis, int allRows, HSSFWorkbook workbook, HSSFSheet spreadsheet) throws IOException {
        for (int i = 0; i
4. Match class: basic information about one match.
It defines the fields together with their getters and setters.
package DownloadWord;

public class Match {
    private String time;
    private String num;
    private String match;
    private String vs;
    private String midScore;
    private String fullScore;
    private String conditon;

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }

    public String getNum() {
        return num;
    }

    public void setNum(String num) {
        this.num = num;
    }

    public String getMatch() {
        return match;
    }

    public void setMatch(String match) {
        this.match = match;
    }

    public String getVs() {
        return vs;
    }

    public void setVs(String vs) {
        this.vs = vs;
    }

    public String getMidScore() {
        return midScore;
    }

    public void setMidScore(String midScore) {
        this.midScore = midScore;
    }

    public String getFullScore() {
        return fullScore;
    }

    public void setFullScore(String fullScore) {
        this.fullScore = fullScore;
    }

    public String getConditon() {
        return conditon;
    }

    public void setConditon(String conditon) {
        this.conditon = conditon;
    }
}
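The SaveEveryRow listing above is cut off, so the step from a Match to a spreadsheet row is not visible. As a rough illustration only, the sketch below flattens one record into seven cell values in the order getTexts reads them (td 0 through td 6). Everything here is hypothetical: MatchRow, MatchData, toCells and rowIndex are made-up names, MatchData is a trimmed stand-in for Match so the example compiles on its own, and the idea that row i of a batch lands at sheet row allRows + i is an assumption based on how Main updates allRows. Writing the cells through POI is left out.

```java
import java.util.Arrays;

// Hypothetical sketch, not the original SaveEveryRow code.
public class MatchRow {
    // Trimmed stand-in for the Match class above, with the same seven fields.
    public static final class MatchData {
        final String time, num, match, vs, midScore, fullScore, conditon;

        public MatchData(String time, String num, String match, String vs,
                         String midScore, String fullScore, String conditon) {
            this.time = time;
            this.num = num;
            this.match = match;
            this.vs = vs;
            this.midScore = midScore;
            this.fullScore = fullScore;
            this.conditon = conditon;
        }
    }

    // Cell values of one row, in the same column order as the scraped <td> cells.
    public static String[] toCells(MatchData m) {
        return new String[] {
            m.time, m.num, m.match, m.vs, m.midScore, m.fullScore, m.conditon
        };
    }

    // Assumption: the i-th record of a batch goes to sheet row allRows + i,
    // where allRows is the number of rows saved by earlier batches.
    public static int rowIndex(int allRows, int i) {
        return allRows + i;
    }

    public static void main(String[] args) {
        // Placeholder values, not real match data.
        MatchData m = new MatchData("time", "num", "match", "vs", "mid", "full", "cond");
        System.out.println(rowIndex(40, 0) + " -> " + Arrays.toString(toCells(m)));
    }
}
```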
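Result: crawling the football lottery results from August 22 to October 22 yields 2292 records in total, saved to D:\try.xls on the local machine. (Note that the check dd.getTime().before(d2) used in the main method means strictly before d2. With d2 set to the 23rd, the loop covers the 22nd, the 21st and so on, so d2 must be set one day later than the last day you want to crawl.)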
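The "one extra day" rule is easier to see when the end date is treated as exclusive on purpose. Here is a small sketch with java.time (the original uses Date and Calendar; DateRange and daysInclusive are hypothetical names) that accepts the last day you actually want and adds the extra day internally.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists every day from first to last, both inclusive.
public class DateRange {
    public static List<LocalDate> daysInclusive(LocalDate first, LocalDate last) {
        // Exclusive upper bound, playing the same role as d2 in Main.
        LocalDate endExclusive = last.plusDays(1);
        List<LocalDate> days = new ArrayList<>();
        for (LocalDate d = first; d.isBefore(endExclusive); d = d.plusDays(1)) {
            days.add(d);
        }
        return days;
    }

    public static void main(String[] args) {
        List<LocalDate> days =
                daysInclusive(LocalDate.of(2018, 8, 22), LocalDate.of(2018, 10, 22));
        // prints "62 days, last = 2018-10-22"
        System.out.println(days.size() + " days, last = " + days.get(days.size() - 1));
    }
}
```

The caller passes October 22 and never has to remember to write October 23.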
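Conclusion: writing a crawler in Java is more work than in Python (my friend Fei's Python code is more concise), but the job can still be done through the rich interfaces that Java's many jar libraries provide, which is one of Java's strengths.
Whenever you need to crawl a site, analyze its structure first; after that, the site and its sub-sites are basically yours.
The code and the jars it depends on have been uploaded to GitHub: https://github.com/Bicycleful/javaCrawler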