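Once you move past basic web scraping, the first thing you need is an IP proxy pool: a single proxy that sends requests too frequently gets banned, so keep a pool of proxies on hand and send your requests through them.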
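To make the idea of spreading requests over a pool concrete, here is a minimal round-robin sketch in plain Java. It is not part of the project built below; the class name `ProxyRotator` and the sample addresses are made up for illustration, and a real pool would also drop proxies that stop working.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: hands out proxies from a fixed list in round-robin order.
class ProxyRotator {
    private final List<String> proxies;
    private final AtomicInteger cursor = new AtomicInteger();

    ProxyRotator(List<String> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        // Copy the list so later changes by the caller do not affect the rotation
        this.proxies = new ArrayList<>(proxies);
    }

    // Returns the next proxy; floorMod keeps the index valid even after the counter overflows
    String next() {
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        // Sample "host:port" entries, not real proxies
        ProxyRotator rotator = new ProxyRotator(Arrays.asList("10.0.0.1:8080", "10.0.0.2:3128"));
        for (int i = 0; i < 3; i++) {
            System.out.println(rotator.next());
        }
    }
}
```

Each request asks the rotator for the next entry, so no single proxy carries all the traffic.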
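Go to https://start.spring.io/ and generate a starter project.
After unzipping it, remember to copy the path of the demo folder.
First, open pom.xml in your IDE and add the following code above the part marked with a red box in the figure below (in the complete pom.xml further down, this block sits between the build section and the closing project tag):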
<repositories>
    <repository>
        <id>aliyun</id>
        <name>aliyun</name>
        <url>https://maven.aliyun.com/repository/public</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
    <repository>
        <id>spring-milestones</id>
        <name>Spring Milestones</name>
        <url>https://maven.aliyun.com/repository/spring</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>
<pluginRepositories>
    <pluginRepository>
        <id>spring-plugin</id>
        <name>spring-plugin</name>
        <url>https://maven.aliyun.com/repository/spring-plugin</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </pluginRepository>
</pluginRepositories>
Import steps:
In IDEA, click File -> Open -> paste the project folder path you copied earlier -> find pom.xml and double-click it -> Open as Project -> wait for Maven to finish loading. If anything is unclear, see the figure below.
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.3.1.RELEASE</version>
        <relativePath/>
    </parent>
    <groupId>com.github.gleans</groupId>
    <artifactId>SpringBoot-ProxyPool</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>SpringBoot-ProxyPool</name>
    <description>Demo project for Spring Boot</description>
    <properties>
        <java.version>1.8</java.version>
        <httpclient.version>4.5.12</httpclient.version>
        <jsonp.version>1.13.1</jsonp.version>
        <knife4j.version>2.0.3</knife4j.version>
        <lombok.version>1.18.12</lombok.version>
        <mysql.version>8.0.19</mysql.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>${httpclient.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>${lombok.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.github.xiaoymin</groupId>
            <artifactId>knife4j-spring-boot-starter</artifactId>
            <version>${knife4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>${jsonp.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
    <repositories>
        <repository>
            <id>aliyun</id>
            <name>aliyun</name>
            <url>https://maven.aliyun.com/repository/public</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>spring-milestones</id>
            <name>Spring Milestones</name>
            <url>https://maven.aliyun.com/repository/spring</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>spring-plugin</id>
            <name>spring-plugin</name>
            <url>https://maven.aliyun.com/repository/spring-plugin</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>
IPData.java
package com.github.gleans.ekko.model;

import io.swagger.annotations.ApiModelProperty;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;

import javax.persistence.Entity;
import javax.persistence.Id;

// JPA entity for one scraped proxy; @Accessors(chain = true) makes the setters return this
@Data
@Entity(name = "ip_data")
@NoArgsConstructor
@Accessors(chain = true)
public class IPData {

    @Id
    @ApiModelProperty(value = "ID")
    private Long ipNo;

    @ApiModelProperty(value = "Country")
    private String country;

    @ApiModelProperty(value = "IP address")
    private String ipAddress;

    @ApiModelProperty(value = "Port")
    private Integer port;

    @ApiModelProperty(value = "Server address")
    private String serverAddress;

    @ApiModelProperty(value = "Anonymity")
    private String anonymous;

    @ApiModelProperty(value = "Type")
    private String type;

    @ApiModelProperty(value = "Speed")
    private String speed;

    @ApiModelProperty(value = "Connection time")
    private String connTime;

    @ApiModelProperty(value = "Survival time")
    private String survivalTime;

    @ApiModelProperty(value = "Verification time")
    private String postTime;
}
IPServiceImpl.java
package com.github.gleans.ekko.service.impl;

import com.github.gleans.ekko.model.IPData;
import com.github.gleans.ekko.service.IPService;
import com.github.gleans.ekko.utils.HttpCustom;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

@Slf4j
@Service
public class IPServiceImpl implements IPService {

    @Override
    public List<IPData> getIpList() {
        String html = HttpCustom.getIpStore("https://www.xicidaili.com/nn/1", null, null);
        // A failed request yields null or an empty body; return an empty list instead of parsing it
        if (html == null || html.isEmpty()) {
            log.warn("No HTML returned for the proxy list page");
            return new ArrayList<>();
        }
        // Parse the HTML into a DOM
        Document document = Jsoup.parse(html);
        // Select the rows of the proxy table
        Elements trs = document.select("table[id=ip_list]").select("tbody").select("tr");
        return trs.stream()
                .map(tr -> {
                    Elements tds = tr.select("td");
                    // Skip rows that lack the 7 cells read below, such as a header row
                    if (tds.size() < 7) {
                        return null;
                    }
                    try {
                        String country = tds.get(0).text();
                        String ipAddress = tds.get(1).text();
                        Integer port = Integer.valueOf(tds.get(2).text().trim());
                        String serverAddress = tds.get(3).text();
                        String anonymous = tds.get(4).text();
                        String ipType = tds.get(5).text();
                        String speed = tds.get(6).select("div[class=bar]").attr("title");
                        return new IPData().setIpAddress(ipAddress)
                                .setPort(port).setType(ipType)
                                .setCountry(country).setSpeed(speed)
                                .setAnonymous(anonymous).setServerAddress(serverAddress);
                    } catch (NumberFormatException e) {
                        // A non-numeric port should drop this row, not abort the whole list
                        log.warn("Skipping row with invalid port: {}", tds.get(2).text());
                        return null;
                    }
                }).filter(Objects::nonNull).collect(Collectors.toList());
    }
}
The core of the code above is adapted from: https://github.com/dhengyi/ip-proxy-pools-regularly
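The row-mapping step above is easier to reason about when it is separated from jsoup. The following is a minimal sketch in plain Java, assuming each table row has already been reduced to a list of cell texts in the same column order as the service (index 1 is the IP, 2 the port, 5 the type). The names `RowMapper` and `ProxyRow` are hypothetical and not part of the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: maps rows of cell texts to proxy records, skipping unusable rows.
class RowMapper {

    // Stand-in for IPData that carries only the fields this sketch needs
    static final class ProxyRow {
        final String ipAddress;
        final int port;
        final String type;

        ProxyRow(String ipAddress, int port, String type) {
            this.ipAddress = ipAddress;
            this.port = port;
            this.type = type;
        }
    }

    // Returns null for rows with too few cells or a non-numeric port
    static ProxyRow mapRow(List<String> cells) {
        if (cells.size() < 7) {
            return null;
        }
        try {
            int port = Integer.parseInt(cells.get(2).trim());
            return new ProxyRow(cells.get(1).trim(), port, cells.get(5).trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Maps every row and keeps only the ones that parsed cleanly
    static List<ProxyRow> mapRows(List<List<String>> rows) {
        List<ProxyRow> result = new ArrayList<>();
        for (List<String> row : rows) {
            ProxyRow mapped = mapRow(row);
            if (mapped != null) {
                result.add(mapped);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows: one valid row and one header-like row
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("CN", "1.2.3.4", "8080", "Beijing", "high", "HTTPS", "0.3"),
                Arrays.asList("header"));
        for (ProxyRow row : mapRows(rows)) {
            System.out.println(row.ipAddress + ":" + row.port + " " + row.type);
        }
    }
}
```

Checking the cell count and the port format per row means one malformed row costs a single record instead of the whole page.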
HttpCustom.java
package com.github.gleans.ekko.utils;

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HttpCustom {

    private static final int CONNECT_TIMEOUT = 3000;
    private static final int SOCKET_TIMEOUT = 3000;

    /**
     * Fetches a web page, optionally through a proxy.
     *
     * @param url  page to request
     * @param ip   proxy host, or null to connect directly
     * @param port proxy port, or null to connect directly
     * @return the response body, or an empty string if the status is not 200 or the request fails
     */
    public static String getIpStore(String url, String ip, Integer port) {
        RequestConfig.Builder configBuilder = RequestConfig
                .custom()
                .setConnectTimeout(CONNECT_TIMEOUT)
                .setSocketTimeout(SOCKET_TIMEOUT);
        if (ip != null && port != null) {
            HttpHost proxy = new HttpHost(ip, port);
            configBuilder.setProxy(proxy);
        }
        RequestConfig config = configBuilder.build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        // No Host header is set by hand: HttpClient derives it from the URL
        httpGet.setHeader("Pragma", "no-cache");
        httpGet.setHeader("Connection", "keep-alive");
        httpGet.setHeader("Cache-Control", "no-cache");
        httpGet.setHeader("Upgrade-Insecure-Requests", "1");
        httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
        httpGet.setHeader("Accept-Encoding", "gzip, deflate, sdch");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36");
        // try-with-resources closes the client and the response even when the request throws
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {
            // Only read the body when the server answers 200
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                return EntityUtils.toString(httpResponse.getEntity(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            // Timeouts and connection errors fall through to the empty result
        }
        return "";
    }
}
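The helper above accepts an optional proxy host and port, which makes it worth weeding out dead proxies before sending a real request through them. A minimal sketch, assuming a plain TCP connect is an acceptable first filter: it only shows that something is listening on the port, not that the host actually proxies HTTP. The class name `ProxyProbe` is hypothetical and not part of the project.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: quick liveness check for a proxy endpoint.
class ProxyProbe {

    // Returns true if a TCP connection to host:port opens within timeoutMillis
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused connections and timeouts both count as unreachable
            return false;
        }
    }

    public static void main(String[] args) {
        // The address and port here are placeholders
        System.out.println(isReachable("127.0.0.1", 8080, 3000));
    }
}
```

Proxies that pass this check can then be tried with a real request through getIpStore, which uses the same 3000 ms timeout.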
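spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/big-data?characterEncoding=utf-8
    username: root
    password: root
  jpa:
    open-in-view: true
    # The datasource is MySQL, so use a MySQL dialect rather than H2Dialect (MySQL8Dialect assumes a MySQL 8 server)
    database-platform: org.hibernate.dialect.MySQL8Dialect
    # spring.jpa.show-sql=true prints the executed SQL statements to the log.
    show-sql: true
    # ddl-auto controls what happens to the tables backing the entity classes at startup.
    # create is dangerous: it drops the tables and recreates them. Never use it in production; use it at most once in a test environment to initialize the schema.
    # ddl-auto: create ---- on every run, missing tables are created and existing data is wiped
    # ddl-auto: create-drop ---- the tables are dropped every time the application shuts down
    # ddl-auto: update ---- on every run, missing tables are created and existing data is kept, only the schema is updated (recommended)
    # ddl-auto: validate ---- checks that the entity fields match the database column types and fails on a mismatch
    hibernate.ddl-auto: update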
Tech stack
index.html
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
    <link rel="stylesheet" href="https://unpkg.com/element-ui/lib/theme-chalk/index.css">
</head>
<body>
<div id="app">
    <h1>{{ message }}</h1>
    Re-fetch
    <el-switch
            v-model="isRefresh">
    </el-switch>
    <br>
    <el-table
            :data="tableData"
            border
            style="width: 100%">
        <el-table-column
                fixed
                prop="ipAddress"
                label="IP address"
                width="150">
        </el-table-column>
        <el-table-column
                prop="port"
                label="Port"
                width="120">
        </el-table-column>
        <el-table-column
                prop="serverAddress"
                label="Server address"
                width="120">
        </el-table-column>
        <el-table-column
                prop="speed"
                label="Speed"
                width="120">
        </el-table-column>
        <el-table-column
                prop="type"
                label="Request type"
                width="300">
        </el-table-column>
        <el-table-column
                prop="anonymous"
                label="Anonymity"
                width="120">
        </el-table-column>
        <el-table-column
                label="Actions">
            <!-- slot-scope exposes the current row as scope.row -->
            <template slot-scope="scope">
                <el-button @click="handleClick(scope.row)" type="text" size="small">View</el-button>
                <el-button type="text" size="small">Edit</el-button>
            </template>
        </el-table-column>
    </el-table>
</div>
<script src="https://cdn.jsdelivr.net/npm/vue/dist/vue.js"></script>
<script src="https://unpkg.com/axios/dist/axios.min.js"></script>
<script src="https://unpkg.com/element-ui/lib/index.js"></script>
<script type="text/javascript">
    var app = new Vue({
        el: '#app',
        methods: {
            getTableData() {
                let _this = this;
                // Fetch the proxy list from the backend
                axios.get('ip/list')
                    .then(function (response) {
                        console.log(response);
                        _this.tableData = response.data.data
                    })
                    .catch(function (error) {
                        console.log(error);
                    });
            },
            // Handler for the View button; it only logs the clicked row
            handleClick(row) {
                console.log(row);
            }
        },
        created() {
            this.getTableData()
        },
        data: {
            message: 'IP proxy pool',
            tableData: [],
            isRefresh: true
        }
    });
</script>
</body>
</html>
Once the application has started, visit http://127.0.0.1:8080/index
https://github.com/Gleans/SpringBootLearn/tree/master/SpringBoot-ProxyPool