Spring Boot 2.x: Crawling an IP Proxy Pool

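Overview

Once a crawler moves past the basics, an IP proxy pool is one of the first things it needs. A single proxy that sends requests too frequently gets banned, so you keep a pool of proxies on hand and send requests through them.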

Tech stack

  • HttpClient
  • Spring Boot 2.3.1
  • JDK 1.8

Quickly creating a Spring Boot project

Visit https://start.spring.io/ to generate a starter project.

We will be working with HTTP endpoints, so add the Web dependency.

Click Generate to download the project as a zip archive.

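Importing the Spring Boot project

After unzipping, copy the path of the extracted demo folder; you will paste it during the import.

First open pom.xml in your IDE and add the snippet below inside the project element, after the build element (the full pom.xml further down shows the exact position).

It switches dependency downloads to the Aliyun mirror in China.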

<repositories>
    <repository>
        <id>aliyun</id>
        <name>aliyun</name>
        <url>https://maven.aliyun.com/repository/public</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
    <repository>
        <id>spring-milestones</id>
        <name>Spring Milestones</name>
        <url>https://maven.aliyun.com/repository/spring</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>
<pluginRepositories>
    <pluginRepository>
        <id>spring-plugin</id>
        <name>spring-plugin</name>
        <url>https://maven.aliyun.com/repository/spring-plugin</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </pluginRepository>
</pluginRepositories>

The import steps are:

In IDEA, click File -> Open -> paste the project folder path you copied earlier -> double-click pom.xml
-> Open as Project, then wait for Maven to finish loading.

The pom.xml file


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-parent</artifactId>
		<version>2.3.1.RELEASE</version>
		<relativePath/>
	</parent>
	<groupId>com.github.gleans</groupId>
	<artifactId>SpringBoot-ProxyPool</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>SpringBoot-ProxyPool</name>
	<description>Demo project for Spring Boot</description>

	<properties>
		<java.version>1.8</java.version>
		<httpclient.version>4.5.12</httpclient.version>
		<jsonp.version>1.13.1</jsonp.version>
		<knife4j.version>2.0.3</knife4j.version>
		<lombok.version>1.18.12</lombok.version>
		<mysql.version>8.0.19</mysql.version>
	</properties>

	<dependencies>
		<dependency>
			<groupId>org.apache.httpcomponents</groupId>
			<artifactId>httpclient</artifactId>
			<version>${httpclient.version}</version>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>
		<dependency>
			<groupId>org.projectlombok</groupId>
			<artifactId>lombok</artifactId>
			<version>${lombok.version}</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
			<exclusions>
				<exclusion>
					<groupId>org.junit.vintage</groupId>
					<artifactId>junit-vintage-engine</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		<dependency>
			<groupId>com.github.xiaoymin</groupId>
			<artifactId>knife4j-spring-boot-starter</artifactId>
			<version>${knife4j.version}</version>
		</dependency>
		<dependency>
			<groupId>org.jsoup</groupId>
			<artifactId>jsoup</artifactId>
			<version>${jsonp.version}</version>
		</dependency>
		<dependency>
			<groupId>mysql</groupId>
			<artifactId>mysql-connector-java</artifactId>
			<version>${mysql.version}</version>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-data-jpa</artifactId>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-thymeleaf</artifactId>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
			</plugin>
		</plugins>
	</build>
	<repositories>
		<repository>
			<id>aliyun</id>
			<name>aliyun</name>
			<url>https://maven.aliyun.com/repository/public</url>
			<releases>
				<enabled>true</enabled>
			</releases>
			<snapshots>
				<enabled>false</enabled>
			</snapshots>
		</repository>
		<repository>
			<id>spring-milestones</id>
			<name>Spring Milestones</name>
			<url>https://maven.aliyun.com/repository/spring</url>
			<releases>
				<enabled>true</enabled>
			</releases>
			<snapshots>
				<enabled>false</enabled>
			</snapshots>
		</repository>
	</repositories>
	<pluginRepositories>
		<pluginRepository>
			<id>spring-plugin</id>
			<name>spring-plugin</name>
			<url>https://maven.aliyun.com/repository/spring-plugin</url>
			<releases>
				<enabled>true</enabled>
			</releases>
			<snapshots>
				<enabled>false</enabled>
			</snapshots>
		</pluginRepository>
	</pluginRepositories>
</project>

Creating the IP entity class

package com.github.gleans.ekko.model;

import io.swagger.annotations.ApiModelProperty;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;

import javax.persistence.Entity;
import javax.persistence.Id;

@Data
@Entity(name = "ip_data")
@NoArgsConstructor
@Accessors(chain = true)
public class IPData {

    @Id
    @ApiModelProperty(value = "ID")
    private Long ipNo;

    @ApiModelProperty(value = "Country")
    private String country;

    @ApiModelProperty(value = "IP address")
    private String ipAddress;

    @ApiModelProperty(value = "Port")
    private Integer port;

    @ApiModelProperty(value = "Server location")
    private String serverAddress;

    @ApiModelProperty(value = "Anonymity")
    private String anonymous;

    @ApiModelProperty(value = "Type")
    private String type;

    @ApiModelProperty(value = "Speed")
    private String speed;

    @ApiModelProperty(value = "Connection time")
    private String connTime;

    @ApiModelProperty(value = "Survival time")
    private String survivalTime;

    @ApiModelProperty(value = "Last verified")
    private String postTime;
}

The main service class

IPServiceImpl.java

package com.github.gleans.ekko.service.impl;

import com.github.gleans.ekko.model.IPData;
import com.github.gleans.ekko.service.IPService;
import com.github.gleans.ekko.utils.HttpCustom;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

@Slf4j
@Service
public class IPServiceImpl implements IPService {


    @Override
    public List<IPData> getIpList() {
        String html = HttpCustom.getIpStore("https://www.xicidaili.com/nn/1", null, null);
        // getIpStore yields null on an I/O error and "" on a non-200 response
        if (html == null || html.isEmpty()) {
            return new ArrayList<>();
        }

        // Parse the HTML into a DOM tree
        Document document = Jsoup.parse(html);

        // Select the rows of the proxy table
        Elements trs = document.select("table[id=ip_list]").select("tbody").select("tr");

        return trs.stream()
                .map(tr -> {
                    Elements tds = tr.select("td");
                    // Rows without the seven expected <td> cells (e.g. the header row) are skipped
                    if (tds.size() < 7) {
                        return null;
                    }
                    Integer port;
                    try {
                        port = Integer.valueOf(tds.get(2).text().trim());
                    } catch (NumberFormatException e) {
                        log.warn("Skipping row with a non-numeric port: {}", tds.get(2).text());
                        return null;
                    }
                    // The speed is stored in the title attribute of the bar element
                    String speed = tds.get(6).select("div[class=bar]").attr("title");

                    return new IPData().setIpAddress(tds.get(1).text())
                            .setPort(port).setType(tds.get(5).text())
                            .setCountry(tds.get(0).text()).setSpeed(speed)
                            .setAnonymous(tds.get(4).text()).setServerAddress(tds.get(3).text());
                }).filter(Objects::nonNull).collect(Collectors.toList());
    }
}

The core of the code above is adapted from: https://github.com/dhengyi/ip-proxy-pools-regularly
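The row mapping in getIpList depends on fixed column positions, so a header row, a short row, or a malformed port can break it. Below is a minimal, library-free sketch of that guard, with one table row modelled as a List<String> of cell texts in the same column order as the service (1 = IP address, 2 = port, 5 = type). ProxyRow and RowParser are hypothetical names used only for illustration; they are not part of the project.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical value object holding a few of the columns read from one table row. */
final class ProxyRow {
    final String ipAddress;
    final int port;
    final String type;

    ProxyRow(String ipAddress, int port, String type) {
        this.ipAddress = ipAddress;
        this.port = port;
        this.type = type;
    }
}

/** Hypothetical parser that turns the cell texts of one row into a ProxyRow. */
final class RowParser {

    // The service reads cells 0..6, so a usable row needs at least seven cells
    private static final int MIN_CELLS = 7;

    /** Returns an empty Optional for header rows, short rows, or rows with an invalid port. */
    static Optional<ProxyRow> parse(List<String> cells) {
        if (cells == null || cells.size() < MIN_CELLS) {
            return Optional.empty();
        }
        try {
            int port = Integer.parseInt(cells.get(2).trim());
            // Valid TCP ports are 1..65535
            if (port < 1 || port > 65535) {
                return Optional.empty();
            }
            return Optional.of(new ProxyRow(cells.get(1).trim(), port, cells.get(5).trim()));
        } catch (NumberFormatException e) {
            // A non-numeric port cell means the row is not a proxy entry
            return Optional.empty();
        }
    }
}
```

Returning an Optional instead of null keeps the "skip this row" case explicit; in a stream, the empty results can be dropped with filter(Optional::isPresent) followed by map(Optional::get).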

Wrapping the HTTP request in a utility class

package com.github.gleans.ekko.utils;

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HttpCustom {

    private final static int CONNECT_TIMEOUT = 3000;
    private final static int SOCKET_TIMEOUT = 3000;

    /**
     * Fetches a web page and returns its body as a string.
     *
     * @param url  the page to fetch
     * @param ip   proxy host, or null to connect directly
     * @param port proxy port, or null to connect directly
     * @return the response body; "" if the status is not 200, null on an I/O error
     */
    public static String getIpStore(String url, String ip, Integer port) {

        RequestConfig.Builder configBuilder = RequestConfig
                .custom()
                .setConnectTimeout(CONNECT_TIMEOUT)
                .setSocketTimeout(SOCKET_TIMEOUT);
        if (ip != null && port != null) {
            // Route the request through the given proxy
            HttpHost proxy = new HttpHost(ip, port);
            configBuilder.setProxy(proxy);
        }
        RequestConfig config = configBuilder.build();

        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);

        // Browser-like headers; Host is hard-coded for the target site
        httpGet.setHeader("Pragma", "no-cache");
        httpGet.setHeader("Connection", "keep-alive");
        httpGet.setHeader("Host", "www.xicidaili.com");
        httpGet.setHeader("Cache-Control", "no-cache");
        httpGet.setHeader("Upgrade-Insecure-Requests", "1");
        httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
        httpGet.setHeader("Accept-Encoding", "gzip, deflate, sdch");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36");

        // try-with-resources closes the response and the client even if execute() throws
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {

            // Read the body only when the server answers 200 OK
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                return EntityUtils.toString(httpResponse.getEntity(), StandardCharsets.UTF_8);
            }
            return "";
        } catch (IOException e) {
            return null;
        }
    }

}
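getIpStore accepts an optional proxy host and port, but the article does not show how a caller would choose them. One simple option, sketched here under the assumption that the collected proxies are kept as "host:port" strings, is round-robin rotation. ProxyRotator is a hypothetical helper and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out proxies from a fixed list in round-robin order. */
final class ProxyRotator {

    private final List<String> proxies;                       // entries look like "host:port"
    private final AtomicInteger next = new AtomicInteger();   // thread-safe position counter

    ProxyRotator(List<String> proxies) {
        if (proxies == null || proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        this.proxies = new ArrayList<>(proxies);               // defensive copy
    }

    /** Returns the next proxy, wrapping around at the end of the list. */
    String nextProxy() {
        // floorMod keeps the index non-negative even after the counter overflows
        int index = Math.floorMod(next.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

A caller could split the returned entry on ":" and pass the two parts as the ip and port arguments of HttpCustom.getIpStore, so that consecutive requests leave through different proxies.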

The application.yml configuration file

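spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/big-data?characterEncoding=utf-8
    username: root
    password: root
  jpa:
    open-in-view: true
    # The datasource is MySQL, so use a MySQL dialect rather than H2Dialect.
    # MySQL8Dialect assumes a MySQL 8 server; pick the dialect that matches your server version.
    database-platform: org.hibernate.dialect.MySQL8Dialect
    # show-sql prints the executed SQL statements in the log.
    show-sql: true
    # ddl-auto controls whether the tables mapped to entity classes are created or dropped at startup.
    # create is dangerous: it drops the mapped tables and recreates them, so never use it in production. Use it only once, in a test environment, to initialise the schema.
    # ddl-auto: create ---- on every start, creates missing tables and empties existing ones
    # ddl-auto: create-drop ---- additionally drops the tables when the application shuts down
    # ddl-auto: update ---- on every start, creates missing tables, keeps existing data, and only updates the schema (recommended)
    # ddl-auto: validate ---- checks that the entity fields match the database column types and fails on a mismatch
    hibernate.ddl-auto: update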

Front-end display

Tech stack

  • vue
  • element-ui
  • html5

index.html


<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
    <link rel="stylesheet" href="https://unpkg.com/element-ui/lib/theme-chalk/index.css">
</head>
<body>

<div id="app">
    <h1>{{ message }}</h1>
    Re-crawl
    <el-switch
            v-model="isRefresh">
    </el-switch>
    <br>
    <el-table
            :data="tableData"
            border
            style="width: 100%">
        <el-table-column
                fixed
                prop="ipAddress"
                label="IP address"
                width="150">
        </el-table-column>
        <el-table-column
                prop="port"
                label="Port"
                width="120">
        </el-table-column>
        <el-table-column
                prop="serverAddress"
                label="Server location"
                width="120">
        </el-table-column>
        <el-table-column
                prop="speed"
                label="Speed"
                width="120">
        </el-table-column>
        <el-table-column
                prop="type"
                label="Type"
                width="300">
        </el-table-column>
        <el-table-column
                prop="anonymous"
                label="Anonymity"
                width="120">
        </el-table-column>
        <el-table-column
                label="Actions">
            <!-- slot-scope exposes the current row as scope.row -->
            <template slot-scope="scope">
                <el-button @click="handleClick(scope.row)" type="text" size="small">View</el-button>
                <el-button type="text" size="small">Edit</el-button>
            </template>
        </el-table-column>
    </el-table>
</div>

<!-- Element UI targets Vue 2, so the Vue version is pinned -->
<script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.js"></script>
<script src="https://unpkg.com/axios/dist/axios.min.js"></script>
<script src="https://unpkg.com/element-ui/lib/index.js"></script>
<script type="text/javascript">
    var app = new Vue({
        el: '#app',
        methods: {
            getTableData() {
                let _this = this;
                // Fetch the proxy list from the backend
                axios.get('ip/list')
                    .then(function (response) {
                        console.log(response);
                        _this.tableData = response.data.data
                    })
                    .catch(function (error) {
                        console.log(error);
                    });
            },
            // Placeholder handler for the "View" button
            handleClick(row) {
                console.log(row);
            }
        },
        created() {
            this.getTableData()
        },
        data: {
            message: 'IP proxy pool',
            tableData: [],
            isRefresh: true
        }
    });
</script>
</body>
</html>
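The page reads response.data.data: axios exposes the HTTP body as response.data, so the JSON returned by ip/list is expected to carry the list in a field named data. The controller is not shown in this article, so the sketch below is only one possible shape for such a response envelope. ApiResult, code, and message are hypothetical names; only the data field is implied by the page.

```java
/** Hypothetical response envelope; the page only relies on the "data" field. */
final class ApiResult<T> {

    private final int code;        // assumed status code field, e.g. 200 for success
    private final String message;  // assumed human-readable status text
    private final T data;          // payload; the page binds this to the table

    private ApiResult(int code, String message, T data) {
        this.code = code;
        this.message = message;
        this.data = data;
    }

    /** Wraps a payload in a success envelope. */
    static <T> ApiResult<T> ok(T data) {
        return new ApiResult<>(200, "ok", data);
    }

    // Getters so that a JSON serializer can expose the three fields
    public int getCode() {
        return code;
    }

    public String getMessage() {
        return message;
    }

    public T getData() {
        return data;
    }
}
```

A controller method mapped to ip/list could then return ApiResult.ok(ipService.getIpList()), assuming the service is injected under that name; with a typical JSON serializer the body would contain the code, message, and data fields.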

Result

After starting the application, visit http://127.0.0.1:8080/index

TODO

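  • Persist the data to the database, to avoid calling the source site on every request (not yet implemented)
  • Caching, to avoid querying the database on every request (not yet implemented)
  • Deduplicate entries in the database and remove invalid ones (not yet implemented)
  • Make the page editable and add list querying (not yet implemented)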
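Two of the items above, deduplication and removing invalid entries, can be prototyped without a database. The sketch below assumes proxies are held as "host:port" strings and treats a proxy as alive if a TCP connection to it succeeds within a timeout. ProxyCleaner and both method names are hypothetical; the project does not implement these steps yet.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helpers for deduplicating proxies and dropping unreachable ones. */
final class ProxyCleaner {

    /** Removes duplicate "host:port" entries while keeping first-seen order. */
    static List<String> dedupe(List<String> proxies) {
        Set<String> seen = new LinkedHashSet<>();
        for (String proxy : proxies) {
            seen.add(proxy.trim());   // surrounding whitespace does not make an entry distinct
        }
        return new ArrayList<>(seen);
    }

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable: treat the proxy as dead
            return false;
        }
    }
}
```

An open port only shows that something is listening; a stricter check would send a real request through the proxy, for example with HttpCustom.getIpStore, and look at the result.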

Source code

https://github.com/Gleans/SpringBootLearn/tree/master/SpringBoot-ProxyPool

  • If you have questions, follow the WeChat official account 《Java Pro》 and send me a private message
  • Or message me on CSDN (I may not see it)
