Use GET requests to visit blog posts, increasing their view counts and making them easier for search engines to index.
Note: every scripted visit adds one view. The design: find out how many pages the article list spans, scrape the article URLs from each list page, then iterate over those URLs issuing GET requests.
Each post's URL looks like http://blog.csdn.net/never_cxb/article/details/47324459; the trailing number is the article id, unique per post.
The article list lives at http://blog.csdn.net/never_cxb/article/list/1, and from it we scrape each post's URL.
With many posts the list spans several pages: the trailing 1 in http://blog.csdn.net/never_cxb/article/list/1 increments to 2, 3, 4, and so on. The pagination footer at the bottom of that page reads "25条数据 共2页 1 2 下一页 尾页" ("25 entries, 2 pages in total, 1 2 next last"), and from it we scrape the total page count.
// Issue a GET request against a single article URL
public static void accessBolg(String blogUrl) {
    URL url;
    try {
        url = new URL(blogUrl);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.addRequestProperty("Content-Type", "text/html; charset=UTF-8");
        con.addRequestProperty(
                "User-Agent",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36");
        con.setRequestMethod("GET");
        if (con.getResponseCode() == 200) {
            InputStream inputStr = con.getInputStream();
            StreamTool.read(inputStr);
            System.out.println(blogUrl + " has been accessed");
            // pause for a random interval (100-299 seconds) between requests
            int sleepSec = new Random().nextInt(200) + 100;
            Thread.sleep(sleepSec * 1000);
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
// Collect the article URLs found on one list page
public static List<String> getBolgUrls(String listUrl) {
    List<String> result = new ArrayList<String>(20);
    Document doc;
    try {
        doc = Jsoup.connect(listUrl).userAgent("Mozilla").timeout(3000)
                .get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            if (link.text().equals("阅读")) {
                // Prefixing the attribute name with "abs:" resolves it
                // against the base URL, so attr("abs:href") returns an
                // absolute URL including the scheme and host
                result.add(link.attr("abs:href"));
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return result;
}
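The abs: prefix can be tried without touching the network by parsing an HTML fragment with an explicit base URI. A minimal sketch, assuming jsoup is on the classpath; the fragment and base URI below are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefDemo {
    // Resolve the first link's relative href against the given base URI
    public static String firstAbsHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link.attr("abs:href"); // absolute URL, scheme and host included
    }
}
```

Without the baseUri argument, attr("abs:href") would return an empty string for a relative href, since jsoup has nothing to resolve it against.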
// Determine how many pages the article list spans
public static HomePage countPage(String pageUrl) {
    Document doc;
    try {
        doc = Jsoup.connect(pageUrl).userAgent("Mozilla").timeout(3000)
                .get();
        Elements links = doc.select(".pagelist");
        Element link = links.first();
        String text = link.select("span").text();
        System.out.println(text);
        // e.g. " 29条数据 共2页": group(1) = entry count, group(2) = page count
        String pattern = "(\\d+)[^\\d]*(\\d+)[^\\d]*";
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(text);
        if (m.find()) {
            return new HomePage(Integer.valueOf(m.group(2)),
                    Integer.valueOf(m.group(1)));
        }
        // <div id="papelist" class="pagelist">
        // <span> 29条数据 共2页</span><a href="/never_cxb/article/list/1">首页</a>
        // <a href="/never_cxb/article/list/1">上一页</a> <a
        // href="/never_cxb/article/list/1">1</a> <strong>2</strong>
        // </div>
    } catch (IOException e) {
        e.printStackTrace();
    }
    return new HomePage();
}
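The HomePage class itself is not shown above. A minimal sketch of what countPage appears to expect (the field and getter names are assumptions; only the two-arg constructor, page count first, and the no-arg fallback are implied by the call sites):

```java
// Minimal sketch of the HomePage value class used by countPage.
// Field names are assumed; the two-arg constructor takes the page
// count first and the entry count second, matching the call above.
public class HomePage {
    private int page;   // total number of list pages
    private int count;  // total number of articles

    public HomePage() {
        this(0, 0);
    }

    public HomePage(int page, int count) {
        this.page = page;
        this.count = count;
    }

    public int getPage() {
        return page;
    }

    public int getCount() {
        return count;
    }
}
```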
// Visit the list pages in random order
HomePage home = countPage(baseListUrl);
int page = home.getPage();
List<Integer> originPages = new ArrayList<Integer>(page);
for (int i = 0; i < page; i++) {
    originPages.add(i);
}
List<String> blogList = new ArrayList<String>(page);
for (int i = 0; i < page; i++) {
    int random = new Random().nextInt(originPages.size());
    int index = originPages.get(random) + 1;
    String list = baseListUrl + index;
    originPages.remove(random);
    blogList.add(list);
}
// Shuffle the article URLs from each list page
public static List<String> randomUrlList(List<String> oriListUrl) {
    if (oriListUrl != null) {
        int size = oriListUrl.size();
        List<String> randomList = new ArrayList<String>(size);
        int[] oriArray = new int[size];
        for (int i = 0; i < size; i++) {
            oriArray[i] = i;
        }
        for (int i = 0; i < size; i++) {
            int random = new Random().nextInt(size - i);
            int value = oriArray[random];
            randomList.add(oriListUrl.get(value));
            // note: as i grows, the last unpicked slot moves forward
            oriArray[random] = oriArray[size - i - 1];
            oriArray[size - i - 1] = value;
        }
        return randomList;
    }
    return null;
}
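The swap-based loop above is essentially a Fisher-Yates shuffle; the standard library's java.util.Collections.shuffle does the same job in one call. A sketch with the same contract as randomUrlList (returns a new list, leaves the input untouched):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffleDemo {
    // Same contract as randomUrlList, built on Collections.shuffle
    public static List<String> randomUrlList(List<String> oriListUrl) {
        if (oriListUrl == null) {
            return null;
        }
        List<String> copy = new ArrayList<String>(oriListUrl);
        Collections.shuffle(copy); // in-place Fisher-Yates on the copy
        return copy;
    }
}
```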
public class StreamTool {
    // Drain the response body and return its raw bytes
    public static byte[] read(InputStream inputStr) throws Exception {
        ByteArrayOutputStream outStr = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len = 0;
        while ((len = inputStr.read(buffer)) != -1) {
            // decode only the bytes actually read this pass; note that a
            // multi-byte UTF-8 character can still be split across reads
            String searchTitle = new String(buffer, 0, len, "UTF-8");
            if (searchTitle.contains(" <title>Java 数组 声明 定义 传参数 - never_cxb的专栏")) {
                System.out.println(searchTitle);
            }
            outStr.write(buffer, 0, len);
        }
        inputStr.close();
        return outStr.toByteArray();
    }
}
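Because the read loop works on any InputStream, it can be exercised offline with a ByteArrayInputStream. A minimal sketch of the same draining loop without the hard-coded title check, for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamToolDemo {
    // Same draining loop as StreamTool.read, minus the title check
    public static byte[] read(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        in.close();
        return out.toByteArray();
    }
}
```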
Sample run output; the first line, 32条数据 共3页, means "32 entries, 3 pages in total":
32条数据 共3页
http://blog.csdn.net/never_cxb/article/details/47148535 has been accessed
http://blog.csdn.net/never_cxb/article/details/47042831 has been accessed
http://blog.csdn.net/never_cxb/article/details/47324803 has been accessed
http://blog.csdn.net/never_cxb/article/details/47184241 has been accessed
http://blog.csdn.net/never_cxb/article/details/47303109 has been accessed
http://blog.csdn.net/never_cxb/article/details/47323241 has been accessed
http://blog.csdn.net/never_cxb/article/details/47375017 has been accessed
http://blog.csdn.net/never_cxb/article/details/47156935 has been accessed
http://blog.csdn.net/never_cxb/article/details/47324459 has been accessed
http://blog.csdn.net/never_cxb/article/details/47134107 has been accessed
http://blog.csdn.net/never_cxb/article/details/47346891 has been accessed
http://blog.csdn.net/never_cxb/article/details/47359761 has been accessed
http://blog.csdn.net/never_cxb/article/details/47324101 has been accessed
http://blog.csdn.net/never_cxb/article/details/47360253 has been accessed
<div id="papelist" class="pagelist">
<span> 31条数据 共3页</span><a href="/never_cxb/article/list/1">首页</a> <a href="/never_cxb/article/list/1">上一页</a> <a href="/never_cxb/article/list/1">1</a> <strong>2</strong> <a href="/never_cxb/article/list/3">3</a> <a href="/never_cxb/article/list/3">下一页</a> <a href="/never_cxb/article/list/3">尾页</a>
</div>
For this HTML, doc.select(".pagelist") first matches the element with class="pagelist"; link.select("span") then matches <span> 31条数据 共3页</span>; finally a regex extracts the page count 3 from 共3页 ("3 pages in total").
In Java regexes, Matcher.group(0) is the entire match, while group(1) is the text captured by the first pair of parentheses. Always call Matcher.find() before Matcher.group(0), otherwise you get java.lang.IllegalStateException: No match found.
doc = Jsoup.connect(pageUrl).userAgent("Mozilla").timeout(3000)
        .get();
Elements links = doc.select(".pagelist");
Element link = links.first();
String text = link.select("span").text();
// group(1) = entry count, group(2) = page count
String pattern = "[^\\d]*(\\d+)[^\\d]*(\\d+)[^\\d]*";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(text);
if (m.find()) {
    return Integer.valueOf(m.group(2));
}