Management of the proxy IP pool: CRUD operations plus marking IPs as available or unavailable.
ClIpPool class
com.heima.model.crawler.pojos.ClIpPool
@Data
public class ClIpPool {
private Integer id;
private String supplier;
private String ip;
/**
 * Port number
 */
private int port;
/**
 * Username
 */
private String username;
/**
 * Password
 */
private String password;
/**
 * Error code
 */
private Integer code;
/**
 * Duration (time taken)
 */
private Integer duration;
/**
 * Error message
 */
private String error;
private Boolean isEnable;
private String ranges;
private Date createdTime;
public Integer getId() {
return id;
}
public void setId(Integer id) {
this.id = id;
}
public String getSupplier() {
return supplier;
}
public void setSupplier(String supplier) {
this.supplier = supplier;
}
public String getIp() {
return ip;
}
public void setIp(String ip) {
this.ip = ip;
}
public int getPort() {
return port;
}
public void setPort(int port) {
this.port = port;
}
public String getUsername() {
return username;
}
public void setUsername(String username) {
this.username = username;
}
public String getPassword() {
return password;
}
public void setPassword(String password) {
this.password = password;
}
public Boolean getEnable() {
return isEnable;
}
public void setEnable(Boolean enable) {
isEnable = enable;
}
public String getRanges() {
return ranges;
}
public void setRanges(String ranges) {
this.ranges = ranges;
}
public Date getCreatedTime() {
return createdTime;
}
public void setCreatedTime(Date createdTime) {
this.createdTime = createdTime;
}
public Integer getCode() {
return code;
}
public void setCode(Integer code) {
this.code = code;
}
public Integer getDuration() {
return duration;
}
public void setDuration(Integer duration) {
this.duration = duration;
}
public String getError() {
return error;
}
public void setError(String error) {
this.error = error;
}
}
ClIpPoolMapper
com.heima.model.mappers.crawerls.ClIpPoolMapper
public interface ClIpPoolMapper {
int deleteByPrimaryKey(Integer id);
int insert(ClIpPool record);
int insertSelective(ClIpPool record);
ClIpPool selectByPrimaryKey(Integer id);
int updateByPrimaryKeySelective(ClIpPool record);
int updateByPrimaryKey(ClIpPool record);
/**
 * Query all records matching the criteria
 *
 * @param record
 * @return
 */
List<ClIpPool> selectList(ClIpPool record);
/**
 * Query the list of available proxies
 *
 * @param record
 * @return
 */
List<ClIpPool> selectAvailableList(ClIpPool record);
}
ClIpPoolMapper.xml
mappers/crawerls/ClIpPoolMapper.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.heima.model.mappers.crawerls.ClIpPoolMapper">
<resultMap id="BaseResultMap" type="com.heima.model.crawler.pojos.ClIpPool">
<id column="id" property="id"/>
<result column="supplier" property="supplier"/>
<result column="ip" property="ip"/>
<result column="port" property="port"/>
<result column="username" property="username"/>
<result column="password" property="password"/>
<result column="code" property="code"/>
<result column="duration" property="duration"/>
<result column="error" property="error"/>
<result column="is_enable" property="isEnable"/>
<result column="ranges" property="ranges"/>
<result column="created_time" property="createdTime"/>
</resultMap>
<sql id="Base_Column_where">
<where>
<if test="supplier!=null and supplier!=''">and supplier = #{supplier}</if>
<if test="ip!=null and ip!=''">and ip = #{ip}</if>
<if test="port!=null and port!=''">and port = #{port}</if>
<if test="username!=null and username!=''">and username = #{username}</if>
<if test="password!=null and password!=''">and password = #{password}</if>
<if test="code!=null and code!=''">and code = #{code}</if>
<if test="duration!=null and duration!=''">and duration = #{duration}</if>
<if test="error!=null and error!=''">and error = #{error}</if>
<if test="isEnable!=null and isEnable!=''">and is_enable = #{isEnable}</if>
<if test="ranges!=null and ranges!=''">and ranges = #{ranges}</if>
</where>
</sql>
<sql id="Base_Column_List">
id, supplier, ip, port, username, password, code, duration, error, is_enable, ranges, created_time
</sql>
<select id="selectList" resultMap="BaseResultMap">
select
<include refid="Base_Column_List"/>
from cl_ip_pool
<include refid="Base_Column_where"/>
</select>
<select id="selectAvailableList" resultMap="BaseResultMap">
select
<include refid="Base_Column_List"/>
from cl_ip_pool
<where>
and is_enable = true
</where>
order by duration
</select>
<select id="selectByPrimaryKey" resultMap="BaseResultMap" parameterType="java.lang.Integer">
select
<include refid="Base_Column_List"/>
from cl_ip_pool
where id = #{id}
</select>
<delete id="deleteByPrimaryKey" parameterType="java.lang.Integer">
delete from cl_ip_pool
where id = #{id}
</delete>
<insert id="insert" parameterType="com.heima.model.crawler.pojos.ClIpPool"
useGeneratedKeys="true" keyProperty="id">
insert into cl_ip_pool (id, supplier, ip, port, username, password, code, duration, error,
is_enable, ranges, created_time
)
values (#{id}, #{supplier}, #{ip}, #{port}, #{username}, #{password}, #{code}, #{duration}, #{error},
#{isEnable}, #{ranges}, #{createdTime}
)
</insert>
<insert id="insertSelective" parameterType="com.heima.model.crawler.pojos.ClIpPool" keyProperty="id" useGeneratedKeys="true">
insert into cl_ip_pool
<trim prefix="(" suffix=")" suffixOverrides=",">
<if test="id != null">id,</if>
<if test="supplier != null">supplier,</if>
<if test="ip != null">ip,</if>
<if test="port != null">port,</if>
<if test="username != null">username,</if>
<if test="password != null">password,</if>
<if test="code != null">code,</if>
<if test="duration != null">duration,</if>
<if test="error != null">error,</if>
<if test="isEnable != null">is_enable,</if>
<if test="ranges != null">ranges,</if>
<if test="createdTime != null">created_time,</if>
</trim>
<trim prefix="values (" suffix=")" suffixOverrides=",">
<if test="id != null">#{id},</if>
<if test="supplier != null">#{supplier},</if>
<if test="ip != null">#{ip},</if>
<if test="port != null">#{port},</if>
<if test="username != null">#{username},</if>
<if test="password != null">#{password},</if>
<if test="code != null">#{code},</if>
<if test="duration != null">#{duration},</if>
<if test="error != null">#{error},</if>
<if test="isEnable != null">#{isEnable},</if>
<if test="ranges != null">#{ranges},</if>
<if test="createdTime != null">#{createdTime},</if>
</trim>
</insert>
<update id="updateByPrimaryKeySelective" parameterType="com.heima.model.crawler.pojos.ClIpPool">
update cl_ip_pool
<set>
<if test="supplier != null">supplier = #{supplier},</if>
<if test="ip != null">ip = #{ip},</if>
<if test="port != null">port = #{port},</if>
<if test="username != null">username = #{username},</if>
<if test="password != null">password = #{password},</if>
<if test="code != null">code = #{code},</if>
<if test="duration != null">duration = #{duration},</if>
<if test="error != null">error = #{error},</if>
<if test="isEnable != null">is_enable = #{isEnable},</if>
<if test="ranges != null">ranges = #{ranges},</if>
<if test="createdTime != null">created_time = #{createdTime},</if>
</set>
where id = #{id}
</update>
<update id="updateByPrimaryKey" parameterType="com.heima.model.crawler.pojos.ClIpPool">
update cl_ip_pool
set supplier = #{supplier},
ip = #{ip},
port = #{port},
username = #{username},
password = #{password},
code = #{code},
duration = #{duration},
error = #{error},
is_enable = #{isEnable},
ranges = #{ranges},
created_time = #{createdTime}
where id = #{id}
</update>
</mapper>
CrawlerIpPoolService
com.heima.crawler.service.CrawlerIpPoolService
public interface CrawlerIpPoolService {
/**
 * Save a proxy record
 *
 * @param clIpPool
 */
public void saveCrawlerIpPool(ClIpPool clIpPool);
/**
 * Check whether the proxy IP already exists
 *
 * @param host
 * @param port
 * @return
 */
public boolean checkExist(String host, int port);
/**
 * Update a proxy record
 *
 * @param clIpPool
 */
public void updateCrawlerIpPool(ClIpPool clIpPool);
/**
 * Query all records matching the criteria
 *
 * @param clIpPool
 */
public List<ClIpPool> queryList(ClIpPool clIpPool);
/**
 * Get the list of available proxies
 *
 * @return
 */
public List<ClIpPool> queryAvailableList(ClIpPool clIpPool);
public void delete(ClIpPool clIpPool);
/**
 * Mark a proxy as unavailable and record the error message
 */
void unvailableProxy(CrawlerProxy proxy, String errorMsg);
}
CrawlerIpPoolServiceImpl
com.heima.crawler.service.impl.CrawlerIpPoolServiceImpl
@Service
public class CrawlerIpPoolServiceImpl implements CrawlerIpPoolService {
@Autowired
private ClIpPoolMapper clIpPoolMapper;
@Override
public void saveCrawlerIpPool(ClIpPool clIpPool) {
clIpPoolMapper.insertSelective(clIpPool);
}
@Override
public boolean checkExist(String host, int port) {
ClIpPool clIpPool = new ClIpPool();
clIpPool.setIp(host);
clIpPool.setPort(port);
List<ClIpPool> clIpPoolList = clIpPoolMapper.selectList(clIpPool);
if (null != clIpPoolList && !clIpPoolList.isEmpty()) {
return true;
}
return false;
}
@Override
public void updateCrawlerIpPool(ClIpPool clIpPool) {
clIpPoolMapper.updateByPrimaryKey(clIpPool);
}
@Override
public List<ClIpPool> queryList(ClIpPool clIpPool) {
return clIpPoolMapper.selectList(clIpPool);
}
@Override
public List<ClIpPool> queryAvailableList(ClIpPool clIpPool) {
return clIpPoolMapper.selectAvailableList(clIpPool);
}
@Override
public void delete(ClIpPool clIpPool) {
clIpPoolMapper.deleteByPrimaryKey(clIpPool.getId());
}
@Override
public void unvailableProxy(CrawlerProxy proxy, String errorMsg) {
ClIpPool clIpPoolQuery = new ClIpPool();
clIpPoolQuery.setIp(proxy.getHost());
clIpPoolQuery.setPort(proxy.getPort());
clIpPoolQuery.setEnable(true);
List<ClIpPool> clIpPoolList = clIpPoolMapper.selectList(clIpPoolQuery);
if (null != clIpPoolList && !clIpPoolList.isEmpty()) {
for (ClIpPool clIpPool : clIpPoolList) {
clIpPool.setEnable(false);
clIpPool.setError(errorMsg);
clIpPoolMapper.updateByPrimaryKey(clIpPool);
}
}
}
}
@SpringBootTest
@RunWith(SpringRunner.class)
public class CrawlerIpPoolServiceTest {
@Autowired
private CrawlerIpPoolService crawlerIpPoolService;
@Test
public void testSaveCrawlerIpPool(){
ClIpPool clIpPool = new ClIpPool();
clIpPool.setIp("2222.3333.444.5555");
clIpPool.setPort(1111);
clIpPool.setEnable(true);
clIpPool.setCreatedTime(new Date());
crawlerIpPoolService.saveCrawlerIpPool(clIpPool);
}
@Test
public void testCheckExist(){
boolean b = crawlerIpPoolService.checkExist("2222.3333.444.555555666555", 1111);
System.out.println(b);
}
}
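The disable path can be smoke-tested the same way. A minimal sketch, assuming CrawlerProxy (not shown in this section) exposes a (host, port) constructor:
@SpringBootTest
@RunWith(SpringRunner.class)
public class CrawlerIpPoolUnavailableTest {
    @Autowired
    private CrawlerIpPoolService crawlerIpPoolService;
    @Test
    public void testUnvailableProxy() {
        // Assumed constructor; adjust to the actual CrawlerProxy class
        CrawlerProxy proxy = new CrawlerProxy("2222.3333.444.5555", 1111);
        crawlerIpPoolService.unvailableProxy(proxy, "connect timeout");
    }
}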
Additional information for crawled articles, such as like, forward, and comment counts, which makes it possible to track the data later and re-crawl it (reverse crawling).
ClNewsAdditional
/**
 * Additional article statistics
 */
@Data
public class ClNewsAdditional {
private Integer id;
private Integer newsId;
private String url;
private Integer readCount;
private Integer likes;
private Integer comment;
private Integer forward;
private Integer unlikes;
private Integer collection;
private Date createdTime;
private Date count;
private Date updatedTime;
private Date nextUpdateTime;
private Integer updateNum;
}
ClNewsAdditionalMapper
com.heima.model.mappers.crawerls.ClNewsAdditionalMapper
public interface ClNewsAdditionalMapper {
int deleteByPrimaryKey(Integer id);
int insert(ClNewsAdditional record);
int insertSelective(ClNewsAdditional record);
ClNewsAdditional selectByPrimaryKey(Integer id);
int updateByPrimaryKeySelective(ClNewsAdditional record);
int updateByPrimaryKey(ClNewsAdditional record);
/**
 * Query all records matching the criteria
 *
 * @param record
 * @return
 */
List<ClNewsAdditional> selectList(ClNewsAdditional record);
/**
 * Get the records that are due for an update
 * @return
 */
List<ClNewsAdditional> selectListByNeedUpdate(Date currentDate);
}
ClNewsAdditionalMapper.xml
mappers/crawerls/ClNewsAdditionalMapper.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.heima.model.mappers.crawerls.ClNewsAdditionalMapper">
<resultMap id="BaseResultMap" type="com.heima.model.crawler.pojos.ClNewsAdditional">
<id column="id" property="id"/>
<result column="news_id" property="newsId"/>
<result column="url" property="url"/>
<result column="read_count" property="readCount"/>
<result column="likes" property="likes"/>
<result column="comment" property="comment"/>
<result column="forward" property="forward"/>
<result column="unlikes" property="unlikes"/>
<result column="collection" property="collection"/>
<result column="created_time" property="createdTime"/>
<result column="count" property="count"/>
<result column="updated_time" property="updatedTime"/>
<result column="update_num" property="updateNum"/>
<result column="next_update_time" property="nextUpdateTime"/>
</resultMap>
<sql id="Base_Column_List">
id, news_id, url, read_count, likes, comment, forward, unlikes, collection, created_time,
count, updated_time, update_num, next_update_time
</sql>
<sql id="Base_Column_where">
<where>
<if test="newsId!=null and newsId!=''">and news_id = #{newsId}</if>
<if test="url!=null and url!=''">and url = #{url}</if>
<if test="readCount!=null and readCount!=''">and read_count = #{readCount}</if>
</where>
</sql>
<select id="selectList" resultMap="BaseResultMap">
select
<include refid="Base_Column_List"/>
from cl_news_additional
<include refid="Base_Column_where"/>
</select>
<select id="selectListByNeedUpdate" resultMap="BaseResultMap" parameterType="java.util.Date">
select
<include refid="Base_Column_List"/>
from cl_news_additional
<where>
next_update_time &lt;= #{currentDate}
</where>
</select>
<select id="selectByPrimaryKey" resultMap="BaseResultMap" parameterType="java.lang.Integer">
select
<include refid="Base_Column_List"/>
from cl_news_additional
where id = #{id}
</select>
<delete id="deleteByPrimaryKey" parameterType="java.lang.Integer">
delete from cl_news_additional
where id = #{id}
</delete>
<insert id="insert" parameterType="com.heima.model.crawler.pojos.ClNewsAdditional"
keyProperty="id" useGeneratedKeys="true">
insert into cl_news_additional (id, news_id, url,
read_count, likes, comment,
forward, unlikes, collection,
created_time, count, updated_time, update_num, next_update_time
)
values (#{id}, #{newsId}, #{url},
#{readCount}, #{likes}, #{comment},
#{forward}, #{unlikes}, #{collection},
#{createdTime}, #{count}, #{updatedTime}, #{updateNum}, #{nextUpdateTime}
)
</insert>
<insert id="insertSelective" parameterType="com.heima.model.crawler.pojos.ClNewsAdditional"
keyProperty="id" useGeneratedKeys="true">
insert into cl_news_additional
<trim prefix="(" suffix=")" suffixOverrides=",">
<if test="id != null">id,</if>
<if test="newsId != null">news_id,</if>
<if test="url != null">url,</if>
<if test="readCount != null">read_count,</if>
<if test="likes != null">likes,</if>
<if test="comment != null">comment,</if>
<if test="forward != null">forward,</if>
<if test="unlikes != null">unlikes,</if>
<if test="collection != null">collection,</if>
<if test="createdTime != null">created_time,</if>
<if test="count != null">count,</if>
<if test="updatedTime != null">updated_time,</if>
<if test="updateNum != null">update_num,</if>
<if test="nextUpdateTime != null">next_update_time,</if>
</trim>
<trim prefix="values (" suffix=")" suffixOverrides=",">
<if test="id != null">#{id},</if>
<if test="newsId != null">#{newsId},</if>
<if test="url != null">#{url},</if>
<if test="readCount != null">#{readCount},</if>
<if test="likes != null">#{likes},</if>
<if test="comment != null">#{comment},</if>
<if test="forward != null">#{forward},</if>
<if test="unlikes != null">#{unlikes},</if>
<if test="collection != null">#{collection},</if>
<if test="createdTime != null">#{createdTime},</if>
<if test="count != null">#{count},</if>
<if test="updatedTime != null">#{updatedTime},</if>
<if test="updateNum != null">#{updateNum},</if>
<if test="nextUpdateTime != null">#{nextUpdateTime},</if>
</trim>
</insert>
<update id="updateByPrimaryKeySelective"
parameterType="com.heima.model.crawler.pojos.ClNewsAdditional">
update cl_news_additional
<set>
<if test="newsId != null">news_id = #{newsId},</if>
<if test="url != null">url = #{url},</if>
<if test="readCount != null">read_count = #{readCount},</if>
<if test="likes != null">likes = #{likes},</if>
<if test="comment != null">comment = #{comment},</if>
<if test="forward != null">forward = #{forward},</if>
<if test="unlikes != null">unlikes = #{unlikes},</if>
<if test="collection != null">collection = #{collection},</if>
<if test="createdTime != null">created_time = #{createdTime},</if>
<if test="count != null">count = #{count},</if>
<if test="updatedTime != null">updated_time = #{updatedTime},</if>
<if test="updateNum != null">update_num = #{updateNum},</if>
<if test="nextUpdateTime != null">next_update_time = #{nextUpdateTime},</if>
</set>
where id = #{id}
</update>
<update id="updateByPrimaryKey" parameterType="com.heima.model.crawler.pojos.ClNewsAdditional">
update cl_news_additional
set news_id = #{newsId},
url = #{url},
read_count = #{readCount},
likes = #{likes},
comment = #{comment},
forward = #{forward},
unlikes = #{unlikes},
collection = #{collection},
created_time = #{createdTime},
count = #{count},
updated_time = #{updatedTime},
update_num = #{updateNum},
next_update_time = #{nextUpdateTime}
where id = #{id}
</update>
</mapper>
CrawlerNewsAdditionalService
com.heima.crawler.service.CrawlerNewsAdditionalService
public interface CrawlerNewsAdditionalService {
void saveAdditional(ClNewsAdditional clNewsAdditional);
public List<ClNewsAdditional> queryListByNeedUpdate(Date currentDate);
List<ClNewsAdditional> queryList(ClNewsAdditional clNewsAdditional);
public boolean checkExist(String url);
public ClNewsAdditional getAdditionalByUrl(String url);
/**
 * Whether the URL already exists
 *
 * @return
 */
public boolean isExistsUrl(String url);
public void updateAdditional(ClNewsAdditional clNewsAdditional);
public List<ParseItem> toParseItem(List<ClNewsAdditional> additionalList);
public List<ParseItem> queryIncrementParseItem(Date currentDate);
}
CrawlerNewsAdditionalServiceImpl
com.heima.crawler.service.impl.CrawlerNewsAdditionalServiceImpl
@Service
public class CrawlerNewsAdditionalServiceImpl implements CrawlerNewsAdditionalService {
@Autowired
private ClNewsAdditionalMapper clNewsAdditionalMapper;
public List<ClNewsAdditional> queryList(ClNewsAdditional clNewsAdditional) {
return clNewsAdditionalMapper.selectList(clNewsAdditional);
}
/**
 * Get the records that are due for an update
 *
 * @return
 */
public List<ClNewsAdditional> queryListByNeedUpdate(Date currentDate) {
return clNewsAdditionalMapper.selectListByNeedUpdate(currentDate);
}
@Override
public ClNewsAdditional getAdditionalByUrl(String url) {
ClNewsAdditional clNewsAdditional = new ClNewsAdditional();
clNewsAdditional.setUrl(url);
List<ClNewsAdditional> additionalList = queryList(clNewsAdditional);
if (null != additionalList && !additionalList.isEmpty()) {
return additionalList.get(0);
}
return null;
}
/**
 * Whether the URL already exists
 *
 * @return
 */
public boolean isExistsUrl(String url) {
boolean isExistsUrl = false;
if (StringUtils.isNotEmpty(url)) {
ClNewsAdditional clNewsAdditional = getAdditionalByUrl(url);
if (null != clNewsAdditional) {
isExistsUrl = true;
}
}
return isExistsUrl;
}
@Override
public boolean checkExist(String url) {
ClNewsAdditional clNewsAdditional = new ClNewsAdditional();
clNewsAdditional.setUrl(url);
List<ClNewsAdditional> clNewsAdditionalList = clNewsAdditionalMapper.selectList(clNewsAdditional);
if (null != clNewsAdditionalList && !clNewsAdditionalList.isEmpty()) {
return true;
}
return false;
}
@Override
public void updateAdditional(ClNewsAdditional clNewsAdditional) {
clNewsAdditionalMapper.updateByPrimaryKeySelective(clNewsAdditional);
}
@Override
public void saveAdditional(ClNewsAdditional clNewsAdditional) {
clNewsAdditionalMapper.insertSelective(clNewsAdditional);
}
/**
 * Convert to a list of ParseItem
 *
 * @param additionalList
 * @return
 */
public List<ParseItem> toParseItem(List<ClNewsAdditional> additionalList) {
List<ParseItem> parseItemList = new ArrayList<ParseItem>();
if (null != additionalList && !additionalList.isEmpty()) {
for (ClNewsAdditional additional : additionalList) {
ParseItem parseItem = toParseItem(additional);
if (null != parseItem) {
parseItemList.add(parseItem);
}
}
}
return parseItemList;
}
private ParseItem toParseItem(ClNewsAdditional additional) {
CrawlerParseItem crawlerParseItem = null;
if (null != additional) {
crawlerParseItem = new CrawlerParseItem();
crawlerParseItem.setUrl(additional.getUrl());
}
return crawlerParseItem;
}
/**
 * Get the incremental statistics items
 * @return
 */
public List<ParseItem> queryIncrementParseItem(Date currentDate) {
List<ClNewsAdditional> clNewsAdditionalList = queryListByNeedUpdate(currentDate);
List<ParseItem> parseItemList = toParseItem(clNewsAdditionalList);
return parseItemList;
}
}
@SpringBootTest
@RunWith(SpringRunner.class)
public class CrawlerNewsAdditionalServiceTest {
@Autowired
private CrawlerNewsAdditionalService crawlerNewsAdditionalService;
@Test
public void testQueryList(){
ClNewsAdditional clNewsAdditional = new ClNewsAdditional();
clNewsAdditional.setUrl("https://blog.csdn.net/weixin_43976602/article/details/96971651");
List<ClNewsAdditional> clNewsAdditionals = crawlerNewsAdditionalService.queryList(clNewsAdditional);
System.out.println(clNewsAdditionals);
}
@Test
public void testCheckExist(){
boolean b = crawlerNewsAdditionalService.checkExist("https://blog.csdn.net/weixin_43976602/article/details/96971651");
System.out.println(b);
}
}
Stores article comment data.
com.heima.model.crawler.pojos.ClNewsComment
/**
 * Article comment
 */
@Data
public class ClNewsComment implements Serializable {
/**
 * Primary key
 */
private Integer id;
/**
 * Article ID
 */
private Integer newsId;
/**
 * Commenter
 */
private String username;
/**
 * Comment content
 */
private String content;
/**
 * Comment date
 */
private Date commentDate;
/**
 * Creation date
 */
private Date createdDate;
}
ClNewsCommentMapper
com.heima.model.mappers.crawerls.ClNewsCommentMapper
public interface ClNewsCommentMapper {
int deleteByPrimaryKey(Integer id);
int insert(ClNewsComment record);
int insertSelective(ClNewsComment record);
ClNewsComment selectByPrimaryKey(Integer id);
int updateByPrimaryKeySelective(ClNewsComment record);
int updateByPrimaryKey(ClNewsComment record);
/**
 * Query all records matching the criteria
 *
 * @param record
 * @return
 */
List<ClNewsComment> selectList(ClNewsComment record);
}
ClNewsCommentMapper.xml
mappers/crawerls/ClNewsCommentMapper.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.heima.model.mappers.crawerls.ClNewsCommentMapper">
<resultMap id="BaseResultMap" type="com.heima.model.crawler.pojos.ClNewsComment">
<id column="id" property="id"/>
<result column="news_id" property="newsId"/>
<result column="username" property="username"/>
<result column="content" property="content"/>
<result column="comment_date" property="commentDate"/>
<result column="created_date" property="createdDate"/>
</resultMap>
<sql id="Base_Column_List">
id, news_id, username, content, comment_date, created_date
</sql>
<sql id="Base_Column_where">
<where>
<if test="newsId!=null and newsId!=''">and news_id = #{newsId}</if>
<if test="username!=null and username!=''">and username = #{username}</if>
<if test="content!=null and content!=''">and content = #{content}</if>
<if test="commentDate!=null">and comment_date = #{commentDate}</if>
<if test="createdDate!=null">and created_date = #{createdDate}</if>
</where>
</sql>
<select id="selectList" resultMap="BaseResultMap">
select
<include refid="Base_Column_List"/>
from cl_news_comments
<include refid="Base_Column_where"/>
</select>
<select id="selectByPrimaryKey" resultMap="BaseResultMap" parameterType="java.lang.Integer">
select
<include refid="Base_Column_List"/>
from cl_news_comments
where id = #{id}
</select>
<delete id="deleteByPrimaryKey" parameterType="java.lang.Integer">
delete from cl_news_comments
where id = #{id}
</delete>
<insert id="insert" parameterType="com.heima.model.crawler.pojos.ClNewsComment">
insert into cl_news_comments (id, news_id, username,
content, comment_date, created_date
)
values (#{id}, #{newsId}, #{username},
#{content}, #{commentDate}, #{createdDate}
)
</insert>
<insert id="insertSelective" parameterType="com.heima.model.crawler.pojos.ClNewsComment">
insert into cl_news_comments
<trim prefix="(" suffix=")" suffixOverrides=",">
<if test="id != null">id,</if>
<if test="newsId != null">news_id,</if>
<if test="username != null">username,</if>
<if test="content != null">content,</if>
<if test="commentDate != null">comment_date,</if>
<if test="createdDate != null">created_date,</if>
</trim>
<trim prefix="values (" suffix=")" suffixOverrides=",">
<if test="id != null">#{id},</if>
<if test="newsId != null">#{newsId},</if>
<if test="username != null">#{username},</if>
<if test="content != null">#{content},</if>
<if test="commentDate != null">#{commentDate},</if>
<if test="createdDate != null">#{createdDate},</if>
</trim>
</insert>
<update id="updateByPrimaryKeySelective"
parameterType="com.heima.model.crawler.pojos.ClNewsComment">
update cl_news_comments
<set>
<if test="newsId != null">news_id = #{newsId},</if>
<if test="username != null">username = #{username},</if>
<if test="content != null">content = #{content},</if>
<if test="commentDate != null">comment_date = #{commentDate},</if>
<if test="createdDate != null">created_date = #{createdDate},</if>
</set>
where id = #{id}
</update>
<update id="updateByPrimaryKey" parameterType="com.heima.model.crawler.pojos.ClNewsComment">
update cl_news_comments
set news_id = #{newsId},
username = #{username},
content = #{content},
comment_date = #{commentDate},
created_date = #{createdDate}
where id = #{id}
</update>
</mapper>
CrawlerNewsCommentService
com.heima.crawler.service.CrawlerNewsCommentService
public interface CrawlerNewsCommentService {
public void saveClNewsComment(ClNewsComment clNewsComment);
}
ClNewsCommentServiceImpl
com.heima.crawler.service.impl.ClNewsCommentServiceImpl
@Service
public class ClNewsCommentServiceImpl implements CrawlerNewsCommentService {
@Autowired
private ClNewsCommentMapper clNewsCommentMapper;
@Override
public void saveClNewsComment(ClNewsComment clNewsComment) {
clNewsCommentMapper.insertSelective(clNewsComment);
}
}
CRUD operations for articles.
ClNews
com.heima.model.crawler.pojos.ClNews
/**
 * Article
 */
public class ClNews {
private Integer id;
private Integer taskId;
private String title;
private String name;
private int type;
private Integer channelId;
private String labels;
private Date originalTime;
private Date createdTime;
private Date submitedTime;
private Byte status;
private Date publishTime;
private String reason;
private Integer articleId;
private Integer no;
private String content;
private String labelIds;
public String getUnCompressContent() {
if (StringUtils.isNotEmpty(content)) {
return ZipUtils.gunzip(content);
}
return content;
}
public Integer getId() {
return id;
}
public void setId(Integer id) {
this.id = id;
}
public Integer getTaskId() {
return taskId;
}
public void setTaskId(Integer taskId) {
this.taskId = taskId;
}
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public int getType() {
return type;
}
public void setType(int type) {
this.type = type;
}
public Integer getChannelId() {
return channelId;
}
public void setChannelId(Integer channelId) {
this.channelId = channelId;
}
public String getLabels() {
return labels;
}
public void setLabels(String labels) {
this.labels = labels;
}
public Date getOriginalTime() {
return originalTime;
}
public void setOriginalTime(Date originalTime) {
this.originalTime = originalTime;
}
public Date getCreatedTime() {
return createdTime;
}
public void setCreatedTime(Date createdTime) {
this.createdTime = createdTime;
}
public Date getSubmitedTime() {
return submitedTime;
}
public void setSubmitedTime(Date submitedTime) {
this.submitedTime = submitedTime;
}
public Byte getStatus() {
return status;
}
public void setStatus(Byte status) {
this.status = status;
}
public Date getPublishTime() {
return publishTime;
}
public void setPublishTime(Date publishTime) {
this.publishTime = publishTime;
}
public String getReason() {
return reason;
}
public void setReason(String reason) {
this.reason = reason;
}
public Integer getArticleId() {
return articleId;
}
public void setArticleId(Integer articleId) {
this.articleId = articleId;
}
public Integer getNo() {
return no;
}
public void setNo(Integer no) {
this.no = no;
}
public String getContent() {
return content;
}
public void setContent(String content) {
this.content = content;
}
public String getLabelIds() {
return labelIds;
}
public void setLabelIds(String labelIds) {
this.labelIds = labelIds;
}
}
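Since content is stored gzip-compressed and Base64-encoded, getUnCompressContent() simply reverses that with ZipUtils.gunzip. A quick round trip using the ZipUtils helper shown later in this section:
ClNews clNews = new ClNews();
// Store the HTML compressed: ZipUtils.gzip returns a Base64-encoded gzip string
clNews.setContent(ZipUtils.gzip("<p>hello world</p>"));
// Reading it back decompresses transparently
String html = clNews.getUnCompressContent(); // "<p>hello world</p>"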
ClNewsMapper
com.heima.model.mappers.crawerls.ClNewsMapper
public interface ClNewsMapper {
int deleteByPrimaryKey(Integer id);
int insert(ClNews record);
int insertSelective(ClNews record);
ClNews selectByPrimaryKey(Integer id);
int updateByPrimaryKeySelective(ClNews record);
int updateStatus(ClNews record);
int updateByPrimaryKeyWithBLOBs(ClNews record);
int updateByPrimaryKey(ClNews record);
/**
 * Query all records matching the criteria
 *
 * @param record
 * @return
 */
List<ClNews> selectList(ClNews record);
void deleteByUrl(String url);
ClNews selectByIdAndStatus(ClNews param);
}
ClNewsMapper.xml
mappers/crawerls/ClNewsMapper.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.heima.model.mappers.crawerls.ClNewsMapper">
<resultMap id="BaseResultMap" type="com.heima.model.crawler.pojos.ClNews">
<id column="id" property="id"/>
<result column="task_id" property="taskId"/>
<result column="title" property="title"/>
<result column="name" property="name"/>
<result column="type" property="type"/>
<result column="channel_id" property="channelId"/>
<result column="labels" property="labels"/>
<result column="label_ids" property="labelIds"/>
<result column="original_time" property="originalTime"/>
<result column="created_time" property="createdTime"/>
<result column="submited_time" property="submitedTime"/>
<result column="status" property="status"/>
<result column="publish_time" property="publishTime"/>
<result column="reason" property="reason"/>
<result column="article_id" property="articleId"/>
<result column="no" property="no"/>
</resultMap>
<resultMap id="ResultMapWithBLOBs" type="com.heima.model.crawler.pojos.ClNews"
extends="BaseResultMap">
<result column="content" property="content" jdbcType="LONGVARCHAR"/>
</resultMap>
<sql id="Base_Column_List">
id, task_id, title, name, type, channel_id, labels, label_ids, original_time, created_time,
submited_time, status, publish_time, reason, article_id, no
</sql>
<sql id="Blob_Column_List">
, content
</sql>
<sql id="Base_Column_where">
<where>
<if test="id!=null and id!=''">and id = #{id}</if>
<if test="title!=null and title!=''">and title = #{title}</if>
<if test="name!=null and name!=''">and name = #{name}</if>
<if test="type!=null and type!=''">and type = #{type}</if>
<if test="status!=null and status!=''">and status = #{status}</if>
</where>
</sql>
<select id="selectList" resultMap="ResultMapWithBLOBs">
select
<include refid="Base_Column_List"/>
<include refid="Blob_Column_List"/>
from cl_news
<include refid="Base_Column_where"/>
</select>
<select id="selectByIdAndStatus" resultMap="ResultMapWithBLOBs">
select
<include refid="Base_Column_List"/>
<include refid="Blob_Column_List"/>
from cl_news
<include refid="Base_Column_where"/>
</select>
<select id="selectByPrimaryKey" resultMap="ResultMapWithBLOBs" parameterType="java.lang.Integer">
select
<include refid="Base_Column_List"/>
<include refid="Blob_Column_List"/>
from cl_news
where id = #{id}
</select>
<delete id="deleteByPrimaryKey" parameterType="java.lang.Integer">
delete from cl_news
where id = #{id}
</delete>
<delete id="deleteByUrl" parameterType="java.lang.String">
delete from cl_news
where url = #{url}
</delete>
<insert id="insert" parameterType="com.heima.model.crawler.pojos.ClNews" useGeneratedKeys="true"
keyProperty="id">
insert into cl_news (id, task_id, title,
name, type, channel_id,
labels, label_ids, original_time, created_time,
submited_time, status, publish_time,
reason, article_id, no,
content)
values (#{id}, #{taskId}, #{title},
#{name}, #{type}, #{channelId},
#{labels}, #{labelIds}, #{originalTime}, #{createdTime},
#{submitedTime}, #{status,jdbcType=TINYINT}, #{publishTime},
#{reason}, #{articleId}, #{no},
#{content,jdbcType=LONGVARCHAR})
</insert>
<insert id="insertSelective" parameterType="com.heima.model.crawler.pojos.ClNews"
keyProperty="id" useGeneratedKeys="true">
insert into cl_news
<trim prefix="(" suffix=")" suffixOverrides=",">
<if test="id != null">id,</if>
<if test="taskId != null">task_id,</if>
<if test="title != null">title,</if>
<if test="name != null">name,</if>
<if test="type != null">type,</if>
<if test="channelId != null">channel_id,</if>
<if test="labels != null">labels,</if>
<if test="labelIds != null">label_ids,</if>
<if test="originalTime != null">original_time,</if>
<if test="createdTime != null">created_time,</if>
<if test="submitedTime != null">submited_time,</if>
<if test="status != null">status,</if>
<if test="publishTime != null">publish_time,</if>
<if test="reason != null">reason,</if>
<if test="articleId != null">article_id,</if>
<if test="no != null">no,</if>
<if test="content != null">content,</if>
</trim>
<trim prefix="values (" suffix=")" suffixOverrides=",">
<if test="id != null">#{id},</if>
<if test="taskId != null">#{taskId},</if>
<if test="title != null">#{title},</if>
<if test="name != null">#{name},</if>
<if test="type != null">#{type},</if>
<if test="channelId != null">#{channelId},</if>
<if test="labels != null">#{labels},</if>
<if test="labelIds != null">#{labelIds},</if>
<if test="originalTime != null">#{originalTime},</if>
<if test="createdTime != null">#{createdTime},</if>
<if test="submitedTime != null">#{submitedTime},</if>
<if test="status != null">#{status,jdbcType=TINYINT},</if>
<if test="publishTime != null">#{publishTime},</if>
<if test="reason != null">#{reason},</if>
<if test="articleId != null">#{articleId},</if>
<if test="no != null">#{no},</if>
<if test="content != null">#{content,jdbcType=LONGVARCHAR},</if>
</trim>
</insert>
<update id="updateStatus" parameterType="com.heima.model.crawler.pojos.ClNews">
update cl_news
<set>
<if test="submitedTime != null">submited_time = #{submitedTime},</if>
<if test="status != null">status = #{status,jdbcType=TINYINT},</if>
<if test="publishTime != null">publish_time = #{publishTime},</if>
<if test="reason != null">reason = #{reason},</if>
<if test="no != null">no = #{no},</if>
</set>
where id = #{id}
</update>
<update id="updateByPrimaryKeySelective" parameterType="com.heima.model.crawler.pojos.ClNews">
update cl_news
<set>
<if test="taskId != null">task_id = #{taskId},</if>
<if test="title != null">title = #{title},</if>
<if test="name != null">name = #{name},</if>
<if test="type != null">type = #{type},</if>
<if test="channelId != null">channel_id = #{channelId},</if>
<if test="labels != null">labels = #{labels},</if>
<if test="labelIds != null">label_ids = #{labelIds},</if>
<if test="originalTime != null">original_time = #{originalTime},</if>
<if test="createdTime != null">created_time = #{createdTime},</if>
<if test="submitedTime != null">submited_time = #{submitedTime},</if>
<if test="status != null">status = #{status,jdbcType=TINYINT},</if>
<if test="publishTime != null">publish_time = #{publishTime},</if>
<if test="reason != null">reason = #{reason},</if>
<if test="articleId != null">article_id = #{articleId},</if>
<if test="no != null">no = #{no},</if>
<if test="content != null">content = #{content,jdbcType=LONGVARCHAR},</if>
</set>
where id = #{id}
</update>
<update id="updateByPrimaryKeyWithBLOBs" parameterType="com.heima.model.crawler.pojos.ClNews">
update cl_news
set task_id = #{taskId},
title = #{title},
name = #{name},
type = #{type},
channel_id = #{channelId},
labels = #{labels},
label_ids = #{labelIds},
original_time = #{originalTime},
created_time = #{createdTime},
submited_time = #{submitedTime},
status = #{status,jdbcType=TINYINT},
publish_time = #{publishTime},
reason = #{reason},
article_id = #{articleId},
no = #{no},
content = #{content,jdbcType=LONGVARCHAR}
where id = #{id}
</update>
<update id="updateByPrimaryKey" parameterType="com.heima.model.crawler.pojos.ClNews">
update cl_news
set task_id = #{taskId},
title = #{title},
name = #{name},
type = #{type},
channel_id = #{channelId},
labels = #{labels},
label_ids = #{labelIds},
original_time = #{originalTime},
created_time = #{createdTime},
submited_time = #{submitedTime},
status = #{status,jdbcType=TINYINT},
publish_time = #{publishTime},
reason = #{reason},
article_id = #{articleId},
no = #{no}
where id = #{id}
</update>
</mapper>
CrawlerNewsService
com.heima.crawler.service.CrawlerNewsService
public interface CrawlerNewsService {
public void saveNews(ClNews clNews);
public void updateNews(ClNews clNews);
public void deleteByUrl(String url);
public List<ClNews> queryList(ClNews clNews);
}
CrawlerNewsServiceImpl
com.heima.crawler.service.impl.CrawlerNewsServiceImpl
@Service
public class CrawlerNewsServiceImpl implements CrawlerNewsService {
@Autowired
private ClNewsMapper clNewsMapper;
@Override
public void saveNews(ClNews clNews) {
clNewsMapper.insertSelective(clNews);
}
@Override
public void deleteByUrl(String url) {
clNewsMapper.deleteByUrl(url);
}
@Override
public List<ClNews> queryList(ClNews clNews) {
return clNewsMapper.selectList(clNews);
}
@Override
public void updateNews(ClNews clNews) {
clNewsMapper.updateByPrimaryKey(clNews);
}
}
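A quick smoke test in the same style as the other service tests above can verify the mapper wiring; a minimal sketch:
@SpringBootTest
@RunWith(SpringRunner.class)
public class CrawlerNewsServiceTest {
    @Autowired
    private CrawlerNewsService crawlerNewsService;
    @Test
    public void testQueryList() {
        ClNews clNews = new ClNews();
        clNews.setTitle("test");
        List<ClNews> newsList = crawlerNewsService.queryList(clNews);
        System.out.println(newsList);
    }
}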
Downloaded content must be de-duplicated so that the same URL is never downloaded twice. Redis is used here, and de-duplication happens in two steps:
- Redis de-duplication
- database de-duplication
(1) Integrate redis into the heima-leadnews-common module
redis.properties
#redis config
spring.redis.host=127.0.0.1
spring.redis.port=6379
spring.redis.password=123456
spring.redis.timeout=90000
spring.redis.lettuce.pool.max-active=8
spring.redis.lettuce.pool.max-idle=8
spring.redis.lettuce.pool.max-wait=-1
spring.redis.lettuce.pool.min-idle=0
Create the configuration class com.heima.common.redis.RedisConfiguration:
@Configuration
@ConfigurationProperties(prefix = "spring.redis")
@PropertySource("classpath:redis.properties")
public class RedisConfiguration extends RedisAutoConfiguration {
}
(2) Import the redis configuration into the heima-leadnews-crawler microservice
Create the configuration class com.heima.crawler.config.RedisConfig:
@Configuration
@ComponentScan("com.heima.common.redis")
public class RedisConfig {
}
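With these two classes in place, RedisAutoConfiguration exposes the standard RedisTemplate/StringRedisTemplate beans, so crawler components can inject them directly. A minimal smoke-test sketch (the RedisSmokeCheck class below is illustrative, not part of the project):
@Component
public class RedisSmokeCheck {
    @Autowired
    private StringRedisTemplate stringRedisTemplate;
    public void ping() {
        // Write and read back a key to verify the settings in redis.properties
        stringRedisTemplate.opsForValue().set("crawler:ping", "pong");
        System.out.println(stringRedisTemplate.opsForValue().get("crawler:ping"));
    }
}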
The main purpose of the following class is to prevent duplicate URL crawls.
/**
 * URL de-duplication
 */
@Log4j2
public class DbAndRedisScheduler extends RedisScheduler implements ProcessFlow {
@Autowired
private CrawlerHelper crawlerHelper;
@Autowired
private CrawlerNewsAdditionalService crawlerNewsAdditionalService;
public DbAndRedisScheduler(String host) {
super(host);
}
public DbAndRedisScheduler(JedisPool pool) {
super(pool);
}
/**
 * Whether the request is a duplicate
 * @param request the request
 * @param task the task
 * @return
 */
@Override
public boolean isDuplicate(Request request, Task task) {
String handelType = crawlerHelper.getHandelType(request);
boolean isExist = false;
//Only forward crawls go through de-duplication
if (CrawlerEnum.HandelType.FORWARD.name().equals(handelType)) {
log.info("URL de-duplication started, URL:{}, handelType:{}", request.getUrl(), handelType);
isExist = super.isDuplicate(request, task);
if (!isExist) {
isExist = crawlerNewsAdditionalService.isExistsUrl(request.getUrl());
}
log.info("URL de-duplication finished, URL:{}, handelType:{}, isExist:{}", request.getUrl(), handelType, isExist);
} else {
log.info("Reverse crawl, skipping URL de-duplication");
}
return isExist;
}
@Override
public void handel(ProcessFlowData processFlowData) {
}
@Override
public CrawlerEnum.ComponentType getComponentType() {
return CrawlerEnum.ComponentType.SCHEDULER;
}
@Override
public int getPriority() {
return 123;
}
}
Hand the scheduler over to Spring: initialize it in the CrawlerConfig configuration class.
@Value("${redis.host}")
private String redisHost;
@Value("${redis.port}")
private int reidsPort;
@Value("${redis.timeout}")
private int reidstimeout;
@Value("${redis.password}")
private String reidsPassword;
@Bean
public DbAndRedisScheduler getDbAndRedisScheduler() {
GenericObjectPoolConfig genericObjectPoolConfig = new GenericObjectPoolConfig();
JedisPool jedisPool = new JedisPool(genericObjectPoolConfig, redisHost, reidsPort, reidstimeout, null, 0);
return new DbAndRedisScheduler(jedisPool);
}
You can test this with the existing ProcessingFlowManagerTest class.
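The test class itself is not listed here. A minimal sketch of what it might look like, assuming ProcessingFlowManager is the Spring bean driving the crawl flow and that it exposes a no-argument handel() entry point (both are assumptions based on the ProcessFlow naming above):
@SpringBootTest
@RunWith(SpringRunner.class)
public class ProcessingFlowManagerTest {
    @Autowired
    private ProcessingFlowManager processingFlowManager;
    @Test
    public void testHandel() {
        // Hypothetical entry point; adjust to the actual ProcessingFlowManager API
        processingFlowManager.handel();
    }
}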
com.heima.model.crawler.core.label.HtmlStyle
public class HtmlStyle {
private Map<String, String> styleMap = new HashMap<>();
public void addStyle(String key, String value) {
styleMap.put(key, value);
}
public void addStyle(Map<String, String> map) {
styleMap.putAll(map);
}
public String getCssStyle() {
StringBuilder sb = new StringBuilder();
for (Map.Entry<String, String> entry : styleMap.entrySet()) {
sb.append(entry.getKey()).append(":'").append(entry.getValue()).append("',");
}
return StringUtils.removeEnd(sb.toString(), ",");
}
public Map<String, String> getStyleMap() {
return styleMap;
}
public void setStyleMap(Map<String, String> styleMap) {
this.styleMap = styleMap;
}
}
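A quick usage example: styles accumulate in the map and getCssStyle renders them in the key:'value' format this project uses (map order is not guaranteed):
HtmlStyle style = new HtmlStyle();
style.addStyle("font-size", "22px");
style.addStyle("line-height", "24px");
// Prints something like: font-size:'22px',line-height:'24px'
System.out.println(style.getCssStyle());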
com.heima.crawler.process.parse.AbstractHtmlParsePipeline
/**
 * Abstract HTML parsing class; defines the shared methods and the abstract template.
 *
 * A Pipeline handles the extracted results: computation, persistence to files or databases, and so on.
 * WebMagic ships with two result handlers out of the box: "print to console" and "save to file".
 *
 * A Pipeline defines how results are saved; to save to a specific database you write your own Pipeline.
 * One Pipeline per kind of requirement is usually enough.
 *
 * @param <T>
 */
@Log4j2
public abstract class AbstractHtmlParsePipeline<T> extends AbstractProcessFlow implements Pipeline {
@Autowired
private CrawlerHelper crawlerHelper;
public void handel(ProcessFlowData processFlowData) {
}
/**
 * Receives the already-processed object,
 * cleans the structured data and stores it.
 * This is the main entry point of a Pipeline.
 *
 * @param resultItems ResultItems holds the extraction result; it is a Map structure
 * @param task
 */
@Override
public void process(ResultItems resultItems, Task task) {
long currentTime = System.currentTimeMillis();
String url = resultItems.getRequest().getUrl();
String documentType = crawlerHelper.getDocumentType(resultItems.getRequest());
String handelType = crawlerHelper.getHandelType(resultItems.getRequest());
log.info("Parsing extracted data started, url:{}, handelType:{}", url, handelType);
if (!CrawlerEnum.DocumentType.PAGE.name().equals(documentType)) {
log.error("Unexpected document type, url:{}, documentType:{}, handelType:{}", url, documentType, handelType);
return;
}
ParseItem parseItem = crawlerHelper.getParseItem(resultItems.getRequest());
if (null != parseItem && StringUtils.isNotEmpty(url)) {
Map<String, Object> grabParameterMap = resultItems.getAll();
preParameterHandel(grabParameterMap);
if (url.equals(parseItem.getInitialUrl())) {
//Set the properties via reflection
ReflectUtils.setPropertie(parseItem, grabParameterMap, true);
//Set the handling type
parseItem.setHandelType(crawlerHelper.getHandelType(resultItems.getRequest()));
handelHtmlData((T) parseItem);
}
}
log.info("Parsing extracted data finished, url:{}, handelType:{}, took:{} ms", url, handelType, System.currentTimeMillis() - currentTime);
}
/**
 * Pre-process the extracted parameters
 */
public abstract void preParameterHandel(Map<String, Object> parameter);
/**
 * Clean and store the html data
 *
 * @param t
 */
public abstract void handelHtmlData(T t);
/**
 * Get the parse expression (CSS selector)
 *
 * @return
 */
public String getParseExpression() {
return "p,pre,h1,h2,h3,h4,h5";
}
/**
 * Get the default html styles
 *
 * @return
 */
public Map<String, HtmlStyle> getDefHtmlStyleMap() {
Map<String, HtmlStyle> styleMap = new HashMap<String, HtmlStyle>();
//h1 style
HtmlStyle h1Style = new HtmlStyle();
h1Style.addStyle("font-size", "22px");
h1Style.addStyle("line-height", "24px");
styleMap.put("h1", h1Style);
//h2 style
HtmlStyle h2Style = new HtmlStyle();
h2Style.addStyle("font-size", "18px");
h2Style.addStyle("line-height", "20px");
styleMap.put("h2", h2Style);
//h3 style
HtmlStyle h3Style = new HtmlStyle();
h3Style.addStyle("font-size", "16px");
h3Style.addStyle("line-height", "18px");
styleMap.put("h3", h3Style);
//h4 style
HtmlStyle h4Style = new HtmlStyle();
h4Style.addStyle("font-size", "14px");
h4Style.addStyle("line-height", "16px");
styleMap.put("h4", h4Style);
//h5 style
HtmlStyle h5Style = new HtmlStyle();
h5Style.addStyle("font-size", "12px");
h5Style.addStyle("line-height", "14px");
styleMap.put("h5", h5Style);
//h6 style
HtmlStyle h6Style = new HtmlStyle();
h6Style.addStyle("font-size", "10px");
h6Style.addStyle("line-height", "12px");
styleMap.put("h6", h6Style);
return styleMap;
}
/**
 * Get the component type
 *
 * @return
 */
@Override
public CrawlerEnum.ComponentType getComponentType() {
return CrawlerEnum.ComponentType.PIPELINE;
}
}
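To make the template concrete, here is a minimal hypothetical subclass. It assumes CrawlerParseItem (used earlier in toParseItem) as the item type and that AbstractProcessFlow supplies the remaining ProcessFlow defaults; the persistence step is just a placeholder:
@Component
public class NewsHtmlParsePipeline extends AbstractHtmlParsePipeline<CrawlerParseItem> {
    @Override
    public void preParameterHandel(Map<String, Object> parameter) {
        // Example pre-processing: drop blank values before reflection copies them onto the item
        parameter.values().removeIf(v -> v == null || "".equals(v));
    }
    @Override
    public void handelHtmlData(CrawlerParseItem item) {
        // Placeholder: clean and persist the parsed item here
        System.out.println("parsed url: " + item.getUrl());
    }
}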
com.heima.crawler.process.thread.CrawlerThreadPool
/**
 * Thread pool helper
 */
@Log4j2
public class CrawlerThreadPool {
/**
 * Maximum pool size. For IO-bound work the usual rule of thumb is 2n+1;
 * here the number of available processors (n) is used.
 */
private static final int threadNum = Runtime.getRuntime().availableProcessors();
/**
 * The blocking queue backing the pool
 */
private static final ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(10000);
/**
 * The thread pool:
 * core pool size 1,
 * maximum pool size threadNum,
 * keep-alive timeout 60 seconds,
 * blocking queue capacity 10000
 */
private static final ExecutorService executorService = new ThreadPoolExecutor(1, threadNum,
60L, TimeUnit.SECONDS, queue) {
@Override
protected void beforeExecute(Thread t, Runnable r) {
log.info("Thread pool task starting, threadName:{}, queued tasks:{}", t.getName(), queue.size());
}
@Override
protected void afterExecute(Runnable r, Throwable t) {
log.info("Thread pool task finished");
if (null != t) {
log.error(t.getLocalizedMessage());
}
}
};
/**
 * Submit a task
 *
 * @param runnable
 */
public static void submit(Runnable runnable) {
log.info("Task added to thread pool, queued tasks:{}, max threads:{}", queue.size(), threadNum);
executorService.execute(runnable);
}
}
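Submitting work is a one-liner; the overridden hooks log the queue depth before and after each task:
CrawlerThreadPool.submit(() -> {
    // Any crawl or parse work; runs on the shared pool
    System.out.println("task running on " + Thread.currentThread().getName());
});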
com.heima.model.crawler.core.parse.ZipUtils
/**
 * String compression utilities (gzip/zip plus Base64 encoding)
 */
public class ZipUtils {
/**
 * Compress with gzip
 */
public static String gzip(String primStr) {
if (primStr == null || primStr.length() == 0) {
return primStr;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPOutputStream gzip = null;
try {
gzip = new GZIPOutputStream(out);
gzip.write(primStr.getBytes());
} catch (IOException e) {
e.printStackTrace();
} finally {
if (gzip != null) {
try {
gzip.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return Base64.getEncoder().encodeToString(out.toByteArray());
}
/**
 * Decompress with gzip
 *
 * @param compressedStr
 * @return
 */
public static String gunzip(String compressedStr) {
if (compressedStr == null) {
return null;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
ByteArrayInputStream in = null;
GZIPInputStream ginzip = null;
byte[] compressed = null;
String decompressed = null;
try {
compressed = Base64.getDecoder().decode(compressedStr);
in = new ByteArrayInputStream(compressed);
ginzip = new GZIPInputStream(in);
byte[] buffer = new byte[1024];
int offset = -1;
while ((offset = ginzip.read(buffer)) != -1) {
out.write(buffer, 0, offset);
}
decompressed = out.toString();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (ginzip != null) {
try {
ginzip.close();
} catch (IOException e) {
}
}
if (in != null) {
try {
in.close();
} catch (IOException e) {
}
}
if (out != null) {
try {
out.close();
} catch (IOException e) {
}
}
}
return decompressed;
}
/**
 * Compress with zip
 *
 * @param str the text to compress
 * @return the compressed text
 */
public static final String zip(String str) {
if (str == null)
return null;
byte[] compressed;
ByteArrayOutputStream out = null;
ZipOutputStream zout = null;
String compressedStr = null;
try {
out = new ByteArrayOutputStream();
zout = new ZipOutputStream(out);
zout.putNextEntry(new ZipEntry("0"));
zout.write(str.getBytes());
zout.closeEntry();
compressed = out.toByteArray();
compressedStr = Base64.getEncoder().encodeToString(compressed);
} catch (IOException e) {
compressed = null;
} finally {
if (zout != null) {
try {
zout.close();
} catch (IOException e) {
}
}
if (out != null) {
try {
out.close();
} catch (IOException e) {
}
}
}
return compressedStr;
}
/**
 * Decompress with zip
 *
 * @param compressedStr the compressed text
 * @return the decompressed string
 */
public static final String unzip(String compressedStr) {
if (compressedStr == null) {
return null;
}
ByteArrayOutputStream out = null;
ByteArrayInputStream in = null;
ZipInputStream zin = null;
String decompressed = null;
try {
byte[] compressed = Base64.getDecoder().decode(compressedStr);
out = new ByteArrayOutputStream();
in = new ByteArrayInputStream(compressed);
zin = new ZipInputStream(in);
zin.getNextEntry();
byte[] buffer = new byte[1024];
int offset = -1;
while ((offset = zin.read(buffer)) != -1) {
out.write(buffer, 0, offset);
}
decompressed = out.toString();
} catch (IOException e) {
decompressed = null;
} finally {
if (zin != null) {
try {
zin.close();
} catch (IOException e) {
}
}
if (in != null) {
try {
in.close();
} catch (IOException e) {
}
}
if (out != null) {
try {
out.close();
} catch (IOException e) {
}
}
}
return decompressed;
}
}
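Round-tripping a string shows the helpers in action; the compressed form is a plain Base64 string, so it can be stored in a text column (this is what ClNews.content holds):
String original = "<p>hello world</p>";
String compressed = ZipUtils.gzip(original);
String restored = ZipUtils.gunzip(compressed);
System.out.println(original.equals(restored)); // true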
com.heima.crawler.utils.HtmlParser
/**
 * HTML parsing utility.
 * Converts HTML into a specific structured format.
 */
@Log4j2
public class HtmlParser {
/**
 * HTML tags that need special handling
 */
private final CrawlerEnum.HtmlType[] specialHtmlTypeArray =
new CrawlerEnum.HtmlType[]{
CrawlerEnum.HtmlType.A_TAG,
CrawlerEnum.HtmlType.CODE_TAG,
CrawlerEnum.HtmlType.H1_TAG,
CrawlerEnum.HtmlType.H2_TAG,
CrawlerEnum.HtmlType.H3_TAG,
CrawlerEnum.HtmlType.H4_TAG,
CrawlerEnum.HtmlType.H5_TAG};
/**
 * Default css settings
 */
private Map<String, HtmlStyle> defaultStyleMap = null;
/**
 * css expression
 */
private String cssExpression = null;
public static HtmlParser getHtmlParser(String cssExpression, Map<String, HtmlStyle> defaultStyleMap) {
return new HtmlParser(cssExpression, defaultStyleMap);
}
public HtmlParser(String cssExpression, Map<String, HtmlStyle> defaultStyleMap) {
this.cssExpression = cssExpression;
this.defaultStyleMap = defaultStyleMap;
}
/**
 * Parse html content
 *
 * @param content
 * @return
 */
public List<HtmlLabel> parseHtml(String content) {
long currentTime = System.currentTimeMillis();
log.info("Parsing article content started");
Document document = Jsoup.parse(content);
Elements elements = document.select(cssExpression);
List<HtmlLabel> htmlLabelList = parseElements(elements);
log.info("Parsing article content finished, took:{} ms", System.currentTimeMillis() - currentTime);
return htmlLabelList;
}
/**
 * Parse html content and convert it to JSON
 *
 * @param content
 * @return
 */
public String parserHtmlToJson(String content) {
List<HtmlLabel> htmlLabelList = parseHtml(content);
return JSON.toJSONString(htmlLabelList);
}
/**
 * Parse the html DOM tree
 *
 * @param elements
 * @return
 */
private List<HtmlLabel> parseElements(Elements elements) {
List<HtmlLabel> htmlLabelList = new ArrayList<HtmlLabel>();
elements.forEach(new Consumer<Element>() {
@Override
public void accept(Element element) {
List<HtmlLabel> labelList = parserElement(element, new ParserCallBack() {
@Override
public void callBack(Elements elements) {
parseElements(elements);
}
});
htmlLabelList.addAll(labelList);
}
});
return htmlLabelList;
}
/**
 * Parse an html element
 *
 * @param element
 * @param callBack
 * @return
 */
private List<HtmlLabel> parserElement(Element element, ParserCallBack callBack) {
List<HtmlLabel> htmlLabelList = new ArrayList<HtmlLabel>();
//Check whether the element itself needs handling
if (isNeedHandel(element)) {
HtmlLabel htmlLabel = parseNodeByByElement(element);
htmlLabelList.add(htmlLabel);
} else {
//Get all child nodes of the element
List<Node> childNodes = element.childNodes();
//Parse the node list
List<HtmlLabel> list = parserNodeList(childNodes, callBack);
htmlLabelList.addAll(list);
}
return htmlLabelList;
}
/**
 * Parse an html node list
 *
 * @param nodeList
 * @param callBack
 * @return
 */
private List<HtmlLabel> parserNodeList(List<Node> nodeList, ParserCallBack callBack) {
List<HtmlLabel> htmlLabelList = new ArrayList<HtmlLabel>();
if (null != nodeList && !nodeList.isEmpty()) {
List<Element> elementList = new ArrayList<Element>();
for (Node node : nodeList) {
//Check whether the node needs handling
if (isNeedHandel(node)) {
//Parse the node
HtmlLabel htmlLabel = parseNode(node);
if (null != htmlLabel) {
htmlLabelList.add(htmlLabel);
}
} else {
//If it does not need handling and is an Element, queue it for recursive processing via the callback
if (node instanceof Element) {
elementList.add((Element) node);
}
}
}
if (!elementList.isEmpty()) {
//Recurse via the callback
callBack.callBack(new Elements(elementList));
}
}
return htmlLabelList;
}
/**
 * Parse an html node
 *
 * @param node
 * @return
 */
private HtmlLabel parseNode(Node node) {
HtmlLabel htmlLabel = null;
if (null != node) {
if (node instanceof TextNode) {
htmlLabel = parseNodeByTextNode((TextNode) node);
} else if (node instanceof Element) {
htmlLabel = parseNodeByByElement((Element) node);
}
}
return htmlLabel;
}
/**
 * Build a label from a text node
 *
 * @param textNode
 * @return
 */
private HtmlLabel parseNodeByTextNode(TextNode textNode) {
HtmlLabel htmlLabel = null;
if (null != textNode) {
String text = textNode.getWholeText();
if (StringUtils.isNotBlank(text)) {
htmlLabel = new HtmlLabel();
htmlLabel.setValue(textNode.getWholeText());
htmlLabel.setType("text");
}
}
return htmlLabel;
}
/**
 * Parse an Element
 *
 * @param element
 * @return
 */
private HtmlLabel parseNodeByByElement(Element element) {
HtmlLabel htmlLabel = null;
if (null != element) {
String tagName = element.tagName();
if (CrawlerEnum.HtmlType.A_TAG.getLabelName().equals(tagName)) {
// explanLabel = getExplanLabelByaLink(element);
} else if (CrawlerEnum.HtmlType.IMG_TAG.getLabelName().equals(tagName)) {
htmlLabel = parseNodeByImage(element);
} else if (CrawlerEnum.HtmlType.CODE_TAG.getLabelName().equals(tagName)) {
htmlLabel = parseNodeByCode(element);
} else {
htmlLabel = parseNodeByOther(element);
}
}
return htmlLabel;
}
/**
 * Get the link of an A tag
 *
 * @param element
 * @return
 */
private HtmlLabel parseNodeByByaLink(Element element) {
HtmlLabel htmlLabel = null;
if (null != element) {
String link = element.attr("href");
String text = element.ownText();
htmlLabel = new HtmlLabel();
htmlLabel.setValue(text);
// explanLabel.setLink(link);
htmlLabel.setType(CrawlerEnum.HtmlType.A_TAG.getDataType());
}
return htmlLabel;
}
/**
 * Get image information
 *
 * @param element
 * @return
 */
private HtmlLabel parseNodeByImage(Element element) {
HtmlLabel htmlLabel = null;
if (null != element) {
String src = element.attr("src");
src = imageUrlHandel(src);
String width = element.attr("width");
String height = element.attr("height");
htmlLabel = new HtmlLabel();
HtmlStyle htmlStyle = new HtmlStyle();
htmlStyle.addStyle("width", width + "px");
htmlStyle.addStyle("height", height + "px");
htmlLabel.setValue(src);
htmlLabel.setStyle(htmlStyle.getCssStyle());
htmlLabel.setType(CrawlerEnum.HtmlType.IMG_TAG.getDataType());
}
return htmlLabel;
}
/**
 * Strip query parameters from an image URL
 *
 * @param src
 * @return
 */
private String imageUrlHandel(String src) {
if (StringUtils.isNotEmpty(src) && src.contains("?")) {
src = src.substring(0, src.indexOf("?"));
}
return src;
}
/**
 * Get code block content
 *
 * @param element
 * @return
 */
private HtmlLabel parseNodeByCode(Element element) {
HtmlLabel htmlLabel = null;
if (null != element) {
String text = element.ownText();
htmlLabel = new HtmlLabel();
htmlLabel.setValue(text);
htmlLabel.setType(CrawlerEnum.HtmlType.CODE_TAG.getDataType());
}
return htmlLabel;
}
/**
 * Handle all other elements
 *
 * @param element
 * @return
 */
private HtmlLabel parseNodeByOther(Element element) {
HtmlLabel htmlLabel = null;
if (null != element) {
HtmlStyle htmlStyle = defaultStyleMap.get(element.tagName());
String text = element.ownText();
htmlLabel = new HtmlLabel();
htmlLabel.setValue(text);
htmlLabel.setType("text");
if (null != htmlStyle) {
htmlLabel.setStyle(htmlStyle.getCssStyle());
}
}
return htmlLabel;
}
/**
 * Check whether a node needs handling
 *
 * @return
 */
private boolean isNeedHandel(Node node) {
boolean flag = false;
if (null != node) {
//No child nodes: handle the node itself
if (node.childNodes().isEmpty()) {
flag = true;
} else if (node instanceof Element) {
flag = isNeedHandel((Element) node);
}
}
return flag;
}
/**
 * Check whether an element needs handling
 *
 * @param element
 * @return
 */
private boolean isNeedHandel(Element element) {
boolean flag = false;
if (null != element) {
String tagName = element.tagName();
for (CrawlerEnum.HtmlType htmlType : specialHtmlTypeArray) {
if (htmlType.getLabelName().toLowerCase().equals(tagName.toLowerCase())) {
flag = true;
break;
}
}
}
return flag;
}
/**
 * Internal callback interface
 */
private interface ParserCallBack {
void callBack(Elements elements);
}
public Map<String, HtmlStyle> getDefaultStyleMap() {
return defaultStyleMap;
}
public void setDefaultStyleMap(Map<String, HtmlStyle> defaultStyleMap) {
this.defaultStyleMap = defaultStyleMap;
}
public String getCssExpression() {
return cssExpression;
}
public void setCssExpression(String cssExpression) {
this.cssExpression = cssExpression;
}
}
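Putting HtmlParser to work: construct it with the CSS selector and style map (compare getParseExpression and getDefHtmlStyleMap above) and feed it a fragment of article HTML:
Map<String, HtmlStyle> styleMap = new HashMap<>();
HtmlStyle h1Style = new HtmlStyle();
h1Style.addStyle("font-size", "22px");
styleMap.put("h1", h1Style);
HtmlParser parser = HtmlParser.getHtmlParser("p,pre,h1,h2,h3,h4,h5", styleMap);
// Emits the parsed label list as JSON
String json = parser.parserHtmlToJson("<h1>Title</h1><p>Body text</p>");
System.out.println(json);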
com.heima.utils.common.ReflectUtils
public class ReflectUtils {
/**
 * Convert a bean to a Map
 *
 * @param bean
 * @return
 */
public static Map<String, Object> beanToMap(Object bean) {
PropertyDescriptor[] propertyDescriptorArray = getPropertyDescriptorArray(bean);
Map<String, Object> parameterMap = new HashMap<String, Object>();
for (PropertyDescriptor propertyDescriptor : propertyDescriptorArray) {
Object value = getPropertyDescriptorValue(bean, propertyDescriptor);
parameterMap.put(propertyDescriptor.getName(), value);
}
return parameterMap;
}
/**
 * Set a property via reflection
 *
 * @param bean
 * @param key
 * @param value
 */
public static void setPropertie(Object bean, String key, Object value) {
if (null != bean && StringUtils.isNotEmpty(key)) {
PropertyDescriptor[] descriptor = getPropertyDescriptorArray(bean);
PropertyDescriptor propertyDescriptor = getPropertyDescriptor(descriptor, key);
setPropertyDescriptorValue(bean, propertyDescriptor, value);
}
}
/**
 * Set a property via reflection
 *
 * @param bean
 * @param key
 * @param value
 * @param skipExist whether to skip properties that already have a value
 */
public static void setPropertie(Object bean, String key, Object value, boolean skipExist) {
if (null != bean && StringUtils.isNotEmpty(key)) {
if (skipExist) {
Object propValue = getPropertie(bean, key);
if (null == propValue) {
setPropertie(bean, key, value);
}
} else {
setPropertie(bean, key, value);
}
}
}
/**
* 通过反射将map的key value 映射到实体类中
*
* @param bean
* @param skipExist 是否跳过已存在的属性
*/
public static void setPropertie(Object bean, Map<String, Object> parameterMap, boolean skipExist) {
if (null != bean && null != parameterMap && !parameterMap.isEmpty()) {
for (Map.Entry<String, Object> entry : parameterMap.entrySet()) {
setPropertie(bean, entry.getKey(), entry.getValue(), skipExist);
}
}
}
/**
* Get a property value via reflection
*
* @param bean
* @param key
* @return
*/
public static Object getPropertie(Object bean, String key) {
Object value = null;
if (null != bean && StringUtils.isNotEmpty(key)) {
PropertyDescriptor[] descriptor = getPropertyDescriptorArray(bean);
PropertyDescriptor propertyDescriptor = getPropertyDescriptor(descriptor, key);
value = getPropertyDescriptorValue(bean, propertyDescriptor);
}
return value;
}
public static Object getPropertyDescriptorValue(Object bean, PropertyDescriptor propertyDescriptor) {
Object value = null;
if (null != propertyDescriptor) {
Method readMethod = propertyDescriptor.getReadMethod();
value = invok(readMethod, bean, propertyDescriptor.getPropertyType(), null);
}
return value;
}
public static void setPropertyDescriptorValue(Object bean, PropertyDescriptor propertyDescriptor, Object value) {
if (null != propertyDescriptor) {
Method writeMethod = propertyDescriptor.getWriteMethod();
invok(writeMethod, bean, propertyDescriptor.getPropertyType(), value);
}
}
/**
* Find a PropertyDescriptor by property name
*
* @param propertyDescriptorArray
* @param key
* @return
*/
public static PropertyDescriptor getPropertyDescriptor(PropertyDescriptor[] propertyDescriptorArray, String key) {
PropertyDescriptor propertyDescriptor = null;
for (PropertyDescriptor descriptor : propertyDescriptorArray) {
String fieldName = descriptor.getName();
if (fieldName.equals(key)) {
propertyDescriptor = descriptor;
break;
}
}
return propertyDescriptor;
}
/**
* Find a PropertyDescriptor on a bean by property name
*
* @param bean
* @param key
* @return
*/
public static PropertyDescriptor getPropertyDescriptor(Object bean, String key) {
PropertyDescriptor[] propertyDescriptorArray = getPropertyDescriptorArray(bean);
return getPropertyDescriptor(propertyDescriptorArray, key);
}
/**
* Invoke a method by name (note: getMethod only resolves no-arg methods)
*
* @param methodName
* @param bean
* @param targetType
* @param value
* @return
*/
public static Object invok(String methodName, Object bean, Class<?> targetType, Object value) {
Object resultValue = null;
if (StringUtils.isNotEmpty(methodName) && null != bean) {
Method method = getMethod(bean.getClass(), methodName);
if (null != method) {
resultValue = invok(method, bean, targetType, value);
}
}
return resultValue;
}
/**
* Invoke a resolved Method; a non-null argument is first converted to the target type
*
* @param method
* @param bean
* @param targetType
* @param value
* @return
*/
public static Object invok(Method method, Object bean, Class<?> targetType, Object value) {
Object resultValue = null;
if (null != method && null != bean) {
try {
int count = method.getParameterCount();
if (count >= 1) {
if (null != value) {
//convert the raw value to the target parameter type (Apache Commons BeanUtils ConvertUtils)
value = ConvertUtils.convert(value, targetType);
}
resultValue = method.invoke(bean, value);
} else {
resultValue = method.invoke(bean);
}
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (InvocationTargetException e) {
e.printStackTrace();
}
}
return resultValue;
}
/**
* Get the bean's PropertyDescriptors via introspection
*
* @param bean
* @return
*/
public static PropertyDescriptor[] getPropertyDescriptorArray(Object bean) {
BeanInfo beanInfo = null;
PropertyDescriptor[] propertyDescriptors = null;
try {
beanInfo = Introspector.getBeanInfo(bean.getClass());
} catch (IntrospectionException e) {
e.printStackTrace();
}
if (null != beanInfo) {
propertyDescriptors = beanInfo.getPropertyDescriptors();
}
return propertyDescriptors;
}
/**
* Resolve a Method by name (only no-arg methods are found)
*
* @param clazz
* @param methodName
* @return
*/
private static Method getMethod(Class clazz, String methodName) {
Method method = null;
if (null != clazz) {
try {
method = clazz.getDeclaredMethod(methodName);
} catch (NoSuchMethodException e) {
e.printStackTrace();
}
}
return method;
}
private static Object getBean(Class clazz) {
Object bean = null;
if (null != clazz) {
try {
bean = clazz.newInstance();
} catch (InstantiationException e) {
e.printStackTrace();
} catch (IllegalAccessException e) {
e.printStackTrace();
}
}
return bean;
}
/**
* Sync bean data: for every property that is null on newBean, copy the value from oldBean
*
* @param oldBean
* @param newBean
*/
public static <T> void syncBeanData(T oldBean, T newBean) {
PropertyDescriptor[] descriptorArray = getPropertyDescriptorArray(newBean);
for (PropertyDescriptor propertyDescriptor : descriptorArray) {
Object newValue = getPropertyDescriptorValue(newBean, propertyDescriptor);
Object oldValue = getPropertyDescriptorValue(oldBean, propertyDescriptor);
if (null == newValue && oldValue != null) {
setPropertyDescriptorValue(newBean, propertyDescriptor, oldValue);
}
}
}
/**
* Load a Class object by its fully qualified name
*
* @param className
* @return
*/
public static Class getClassForName(String className) {
Class clazz = null;
if (StringUtils.isNotEmpty(className)) {
try {
clazz = Class.forName(className);
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
}
return clazz;
}
/**
* Instantiate an object by class name via reflection
*
* @param className
* @return
*/
public static Object getClassForBean(String className) {
Object bean = null;
Class clazz = getClassForName(className);
if (null != clazz) {
try {
bean = clazz.newInstance();
} catch (InstantiationException e) {
e.printStackTrace();
} catch (IllegalAccessException e) {
e.printStackTrace();
}
}
return bean;
}
/**
* Get the declared annotations of the field backing a property
*
* @param bean
* @param propertyDescriptor
* @return
*/
public static Annotation[] getFieldAnnotations(Object bean, PropertyDescriptor propertyDescriptor) {
List<Field> fieldList = Arrays.stream(bean.getClass().getDeclaredFields()).filter(f -> f.getName().equals(propertyDescriptor.getName())).collect(Collectors.toList());
if (!fieldList.isEmpty()) {
return fieldList.get(0).getDeclaredAnnotations();
}
return null;
}
/**
* Get the declared annotations of the field backing a property, looked up by name
*
* @param bean
* @param key
* @return
*/
public static Annotation[] getFieldAnnotations(Object bean, String key) {
PropertyDescriptor propertyDescriptor = getPropertyDescriptor(bean, key);
return getFieldAnnotations(bean, propertyDescriptor);
}
}
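A quick usage sketch of ReflectUtils; the Person bean below is hypothetical and exists only for this demo:
import com.heima.utils.common.ReflectUtils;
public class ReflectUtilsDemo {
//hypothetical bean, for demonstration only
public static class Person {
private String name;
private Integer age;
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public Integer getAge() { return age; }
public void setAge(Integer age) { this.age = age; }
}
public static void main(String[] args) {
Person person = new Person();
//set properties by name; the String "18" is converted to Integer by ConvertUtils
ReflectUtils.setPropertie(person, "name", "tom");
ReflectUtils.setPropertie(person, "age", "18");
System.out.println(ReflectUtils.getPropertie(person, "age")); //18
//contains name=tom and age=18 (plus the synthetic "class" property from introspection)
System.out.println(ReflectUtils.beanToMap(person));
//syncBeanData fills the null properties of the new bean from the old one
Person updated = new Person();
updated.setAge(19);
ReflectUtils.syncBeanData(person, updated);
System.out.println(updated.getName()); //tom (copied); age stays 19
}
}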
com.heima.crawler.utils.DateUtils
com.heima.common.common.util.HMStringUtils
com.heima.common.kafka.KafkaSender;
com.heima.common.kafka.messages.SubmitArticleAuthMessage
com.heima.crawler.process.AbstractProcessFlow
/**
* Fetch the raw JSON response for a request
*
* @param url
* @param parameterMap
* @return
*/
public String getOriginalRequestJsonData(String url, Map<String, String> parameterMap) {
//pick a random proxy
CrawlerProxy proxy = crawlerProxyProvider.getRandomProxy();
//get the cookie list for this url and proxy
List<CrawlerCookie> cookieList = cookieHelper.getCookieEntity(url, proxy);
//first try to fetch the data with HttpClient
String jsonData = getHttpClientRequestData(url, parameterMap, cookieList, proxy);
//if the response is not JSON the fetch failed; fall back to Selenium
if (!isJson(jsonData)) {
CrawlerHtml crawlerHtml = getSeleniumRequestData(url, parameterMap, proxy);
jsonData = seleniumClient.getJsonData(crawlerHtml);
}
return jsonData;
}
/**
* Check whether a string is valid JSON
*
* @param jsonData
* @return
*/
public boolean isJson(String jsonData) {
boolean isJson = false;
try {
isJson = JsonValidator.getJsonValidator().validate(jsonData);
} catch (ExecutionException e) {
e.printStackTrace();
}
return isJson;
}
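JsonValidator is a project-specific helper that is not listed in this document. As a stand-in, a minimal validity check built on fastjson (the JSON library already used by this pipeline) could look like the sketch below; it only tests that the string parses:
import com.alibaba.fastjson.JSON;
public class JsonCheck {
//returns true when the string parses as JSON; a parse failure means "not JSON"
public static boolean isJson(String jsonData) {
if (null == jsonData || jsonData.trim().isEmpty()) {
return false;
}
try {
return null != JSON.parse(jsonData.trim());
} catch (Exception e) {
return false;
}
}
}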
com.heima.crawler.process.parse.impl.CrawlerHtmlParsePipeline
Storage of the parsed data: because the data volume is large, a thread pool is used to control the number of concurrent threads, as sketched below.
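CrawlerThreadPool itself is not listed in this document; a minimal sketch of what such a bounded pool wrapper might look like (core/max sizes and queue capacity are illustrative assumptions):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
public class CrawlerThreadPool {
//bounded pool: at most 10 concurrent workers and 1000 queued tasks (illustrative values);
//CallerRunsPolicy throttles submission instead of dropping tasks when the queue is full
private static final ExecutorService EXECUTOR = new ThreadPoolExecutor(
5, 10, 60L, TimeUnit.SECONDS,
new LinkedBlockingQueue<>(1000),
new ThreadPoolExecutor.CallerRunsPolicy());
private CrawlerThreadPool() {
}
public static void submit(Runnable task) {
EXECUTOR.submit(task);
}
}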
/**
* Implementation of the AbstractHtmlParsePipeline abstract class:
* decides which format the data is converted to,
* how each object is stored,
* and how it is written to the database.
*
* Also crawls and stores the article comments.
*/
@Component
@Log4j2
public class CrawlerHtmlParsePipeline extends AbstractHtmlParsePipeline<CrawlerParseItem> {
private static final ResourceBundle resourceBundle = ResourceBundle.getBundle("crawler");
private static final String csdn_comment_url = resourceBundle.getString("csdn.comment.url");
/**
* Schedule of hours until the next update
*/
private static final String[] next_update_hour_array = resourceBundle.getString("crawler.nextupdatehours").split(",");
@Autowired
private KafkaSender kafkaSender;
@Autowired
private CrawlerNewsService crawlerNewsService;
@Autowired
private CrawlerNewsCommentService clNewsCommentService;
@Autowired
private CrawlerNewsAdditionalService crawlerNewsAdditionalService;
@Autowired
private AdLabelService adLabelService;
@Override
public void preParameterHandel(Map<String, Object> parameter) {
String readCount = HMStringUtils.toString(parameter.get("readCount"));
if (StringUtils.isNotEmpty(readCount)) {
//the raw readCount carries a text prefix before the number; keep the part after the first space
String[] parts = readCount.split(" ");
if (parts.length > 1 && StringUtils.isNotEmpty(parts[1])) {
parameter.put("readCount", parts[1]);
}
}
}
/**
* Entry point for handling the parsed HTML data
*
* @param parseItem
*/
@Override
public void handelHtmlData(final CrawlerParseItem parseItem) {
long currentTime = System.currentTimeMillis();
log.info("submitting data to the thread pool, url:{}, handelType:{}", parseItem.getUrl(), parseItem.getHandelType());
CrawlerThreadPool.submit(() -> {
//forward crawl: save the newly parsed article
if (CrawlerEnum.HandelType.FORWARD.name().equals(parseItem.getHandelType())) {
log.info("start processing message, url:{}, handelType:{}", parseItem.getUrl(), parseItem.getHandelType());
addParseItemMessage(parseItem);
//reverse crawl: refresh the additional data of an existing article
} else if (CrawlerEnum.HandelType.REVERSE.name().equals(parseItem.getHandelType())) {
updateAdditional(parseItem);
}
log.info("article data processed, url:{}, handelType:{}, took:{} ms", parseItem.getUrl(), parseItem.getHandelType(), System.currentTimeMillis() - currentTime);
});
}
/**
* Save the parsed article data
*
* @param parseItem
*/
public void addParseItemMessage(CrawlerParseItem parseItem) {
long currentTime = System.currentTimeMillis();
String url = null;
String handelType = null;
if (null != parseItem) {
url = parseItem.getUrl();
handelType = parseItem.getHandelType();
log.info("start saving data, url:{}, handelType:{}", url, handelType);
//save the article itself
ClNews clNews = addClNewsData(parseItem);
if (null != clNews) {
//save the additional data
addAdditional(parseItem, clNews);
//save the comments, but only when the comment count is greater than 0
if (null != parseItem.getCommentCount() && parseItem.getCommentCount() > 0) {
addCommentData(parseItem, clNews);
}
sendSubmitArticleAutoMessage(clNews.getId());
}
}
log.info("saving data finished, url:{}, handelType:{}, took:{} ms", url, handelType, System.currentTimeMillis() - currentTime);
}
/**
* Kick off the automatic article review
*
* @param clNewId
*/
public void sendSubmitArticleAutoMessage(Integer clNewId) {
log.info("start sending the auto-review message, id:{}", clNewId);
SubmitArticleAuto submitArticleAuto = new SubmitArticleAuto();
submitArticleAuto.setArticleId(clNewId);
submitArticleAuto.setType(SubmitArticleAuto.ArticleType.CRAWLER);
SubmitArticleAuthMessage submitArticleAuthMessage = new SubmitArticleAuthMessage(submitArticleAuto);
kafkaSender.sendSubmitArticleAuthMessage(submitArticleAuthMessage);
log.info("auto-review message sent, id:{}", clNewId);
}
/**
* Process and save the article content
*
* @param parseItem
* @return
*/
private ClNews addClNewsData(CrawlerParseItem parseItem) {
log.info("开始添加文章内容");
ClNews clNews = null;
if (null != parseItem) {
HtmlParser htmlParser = HtmlParser.getHtmlParser(getParseExpression(), getDefHtmlStyleMap());
//将Html内容转换为HtmlLabel 对象类别
List<HtmlLabel> htmlLabelList = htmlParser.parseHtml(parseItem.getContent());
//获取文章类型
int type = getType(htmlLabelList);
parseItem.setDocType(type);
String josnStr = JSON.toJSONString(htmlLabelList);
parseItem.setCompressContent(ZipUtils.gzip(josnStr));
ClNewsAdditional clNewsAdditional = crawlerNewsAdditionalService.getAdditionalByUrl(parseItem.getUrl());
if (null == clNewsAdditional) {
clNews = toClNews(parseItem);
long currentTime = System.currentTimeMillis();
log.info("开始插入新的文章");
crawlerNewsService.saveNews(clNews);
log.info("插入新的文章完成,耗时:{}", System.currentTimeMillis() - currentTime);
} else {
log.info("文章URL已存在不重复添加,URL:{}", clNewsAdditional.getUrl());
}
}
log.info("添加文章内容完成");
return clNews;
}
/**
* Save the article's additional data
*
* @param parseItem
* @param clNews
*/
public void addAdditional(CrawlerParseItem parseItem, ClNews clNews) {
long currentTime = System.currentTimeMillis();
log.info("start saving the article's additional data");
if (null != parseItem && null != clNews) {
ClNewsAdditional clNewsAdditional = toClNewsAdditional(parseItem, clNews);
crawlerNewsAdditionalService.saveAdditional(clNewsAdditional);
}
log.info("saving the article's additional data finished, took:{} ms", System.currentTimeMillis() - currentTime);
}
/**
* Reverse crawl: update the article's additional data
*
* @param parseItem
*/
public void updateAdditional(CrawlerParseItem parseItem) {
long currentTime = System.currentTimeMillis();
log.info("start updating the article's additional data");
if (null != parseItem) {
ClNewsAdditional clNewsAdditional = crawlerNewsAdditionalService.getAdditionalByUrl(parseItem.getUrl());
if (null != clNewsAdditional) {
//null out newsId and url so the selective update leaves those columns untouched
clNewsAdditional.setNewsId(null);
clNewsAdditional.setUrl(null);
//read count
clNewsAdditional.setReadCount(parseItem.getReadCount());
//comment count
clNewsAdditional.setComment(parseItem.getCommentCount());
//like count
clNewsAdditional.setLikes(parseItem.getLikes());
//bump the update time and counter, then schedule the next update
clNewsAdditional.setUpdatedTime(new Date());
clNewsAdditional.setUpdateNum(clNewsAdditional.getUpdateNum() + 1);
int nextUpdateHours = getNextUpdateHours(clNewsAdditional.getUpdateNum());
clNewsAdditional.setNextUpdateTime(DateUtils.addHours(new Date(), nextUpdateHours));
crawlerNewsAdditionalService.updateAdditional(clNewsAdditional);
}
}
log.info("updating the article's additional data finished, took:{} ms", System.currentTimeMillis() - currentTime);
}
/**
* Save the comment data
*
* @param parseItem
* @param clNews
*/
public void addCommentData(CrawlerParseItem parseItem, ClNews clNews) {
long currentTime = System.currentTimeMillis();
log.info("start fetching the article's comment data");
List<ClNewsComment> commentList = getCommentData(parseItem);
if (null != commentList && !commentList.isEmpty()) {
for (ClNewsComment comment : commentList) {
comment.setNewsId(clNews.getId());
clNewsCommentService.saveClNewsComment(comment);
}
}
log.info("fetching the article's comment data finished, took:{} ms", System.currentTimeMillis() - currentTime);
}
/**
* Convert the parse item to the ClNews database entity
*
* @param parseItem
* @return
*/
private ClNews toClNews(CrawlerParseItem parseItem) {
ClNews clNews = new ClNews();
clNews.setName(parseItem.getAuthor());
clNews.setLabels(parseItem.getLabels());
clNews.setContent(parseItem.getCompressContent());
clNews.setLabelIds(adLabelService.getLableIds(parseItem.getLabels()));
Integer channelId = adLabelService.getAdChannelByLabelIds(clNews.getLabelIds());
clNews.setChannelId(channelId);
clNews.setTitle(parseItem.getTitle());
clNews.setType(parseItem.getDocType());
clNews.setStatus((byte) 1);
clNews.setCreatedTime(new Date());
String releaseDate = parseItem.getReleaseDate();
if (StringUtils.isNotEmpty(releaseDate)) {
clNews.setOriginalTime(DateUtils.stringToDate(releaseDate, DateUtils.DATE_TIME_FORMAT_CHINESE));
}
return clNews;
}
/**
* Convert to the ClNewsAdditional entity
*
* @param parseItem
* @param clNews
* @return
*/
private ClNewsAdditional toClNewsAdditional(CrawlerParseItem parseItem, ClNews clNews) {
ClNewsAdditional clNewsAdditional = null;
if (null != parseItem) {
clNewsAdditional = new ClNewsAdditional();
//article id
clNewsAdditional.setNewsId(clNews.getId());
//read count
clNewsAdditional.setReadCount(parseItem.getReadCount());
//comment count
clNewsAdditional.setComment(parseItem.getCommentCount());
//like count
clNewsAdditional.setLikes(parseItem.getLikes());
//source url
clNewsAdditional.setUrl(parseItem.getUrl());
//update and creation time
clNewsAdditional.setUpdatedTime(new Date());
clNewsAdditional.setCreatedTime(new Date());
//the update counter starts at 0
clNewsAdditional.setUpdateNum(0);
//schedule the next update
int nextUpdateHour = getNextUpdateHours(clNewsAdditional.getUpdateNum());
clNewsAdditional.setNextUpdateTime(DateUtils.addHours(new Date(), nextUpdateHour));
}
return clNewsAdditional;
}
/**
* Get the number of hours until the next update
*
* @param count the current update count
* @return
*/
private int getNextUpdateHours(Integer count) {
if (null != next_update_hour_array && next_update_hour_array.length > count) {
return Integer.parseInt(next_update_hour_array[count]);
} else {
return 2 << count;
}
}
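//Worked example, assuming crawler.nextupdatehours=1,3,6,12 in crawler.properties (illustrative values):
//updateNum 0 -> 1h, 1 -> 3h, 2 -> 6h, 3 -> 12h;
//beyond the configured array the fallback applies: count 4 -> 2 << 4 = 32h, count 5 -> 64h.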
/**
* Determine the image type of the article:
*
* 0: no image
* 1: single image
* 2: multiple images
*
* @param htmlLabelList
* @return
*/
public int getType(List<HtmlLabel> htmlLabelList) {
int type = 0;
int num = 0;
if (null != htmlLabelList && !htmlLabelList.isEmpty()) {
for (HtmlLabel htmlLabel : htmlLabelList) {
if (CrawlerEnum.HtmlType.IMG_TAG.getDataType().equals(htmlLabel.getType())) {
num++;
}
}
}
if (num == 0) {
type = 0;
} else if (num == 1) {
type = 1;
} else {
type = 2;
}
return type;
}
/**
* Fetch the comment list
*
* @param parseItem
* @return
*/
private List<ClNewsComment> getCommentData(ParseItem parseItem) {
//build the comment api URL from the template
String buildCommentUrl = buildCommentUrl(parseItem);
//fetch the data via the HttpClient request from the parent class
String jsonData = getOriginalRequestJsonData(buildCommentUrl, null);
//parse the returned JSON
List<ClNewsComment> commentList = analysisCommentJsonData(jsonData);
return commentList;
}
/**
* Parse the comment JSON data
*
* @param jsonData
* @return
*/
public List<ClNewsComment> analysisCommentJsonData(String jsonData) {
if (StringUtils.isEmpty(jsonData)) {
return null;
}
List<ClNewsComment> commentList = new ArrayList<ClNewsComment>();
JSONObject jsonObject = JSON.parseObject(jsonData);
Map<String, Object> map = jsonObject.getObject("data", Map.class);
JSONArray jsonArray = null != map ? (JSONArray) map.get("list") : null;
if (null != jsonArray) {
List<Map> dataInfoList = jsonArray.toJavaList(Map.class);
for (Map<String, Object> dataInfo : dataInfoList) {
JSONObject infoObject = (JSONObject) dataInfo.get("info");
Map<String, Object> infoMap = infoObject.toJavaObject(Map.class);
ClNewsComment comment = new ClNewsComment();
comment.setContent(HMStringUtils.toString(infoMap.get("Content")));
comment.setUsername(HMStringUtils.toString(infoMap.get("UserName")));
Date date = DateUtils.stringToDate(HMStringUtils.toString(infoMap.get("PostTime")), DateUtils.DATE_TIME_FORMAT);
comment.setCommentDate(date);
comment.setCreatedDate(new Date());
commentList.add(comment);
}
}
return commentList;
}
/**
* Build the comment request URL by filling the ${...} placeholders of the template
*
* @param parseItem
* @return
*/
private String buildCommentUrl(ParseItem parseItem) {
String buildCommentUrl = csdn_comment_url;
Map<String, Object> map = ReflectUtils.beanToMap(parseItem);
for (Map.Entry<String, Object> entry : map.entrySet()) {
String key = entry.getKey();
String buildKey = "${" + key + "}";
Object value = entry.getValue();
if (null != value) {
String strValue = value.toString();
buildCommentUrl = buildCommentUrl.replace(buildKey, strValue);
}
}
return buildCommentUrl;
}
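//Example with a hypothetical template csdn.comment.url=https://example.com/comments?articleId=${articleId}&page=${page}:
//a parseItem with articleId=42 and page=1 yields https://example.com/comments?articleId=42&page=1;
//placeholders without a matching non-null property are left in place.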
@Override
public int getPriority() {
return 1000;
}
}
Start the application and test whether the data can be stored in the database.