1) Project structure overview
Below is a detailed walkthrough of each module. Every type of page data that gets persisted needs a corresponding entity class; InstitutionInfo is used as the example here.
/**
* @author 90934
* @create 2020/2/2
* @since 1.0.0
* Only the field names are written out; getters, setters and toString() are generated automatically
*/
@Entity(name = "institution_info")
public class InstitutionInfo {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
private String name;
private String rnumber;
private String oname;
private String cperson;
private String cnumber;
private String pcode;
private String fnumber;
private String weburl;
private String email;
private String address;
private String start;
private String end;
private String abasis;
private String parameter;
private String baseinfoid;
}
Create the repository interface in the DAO layer.
/**
* @author 90934
* Extends JpaRepository to inherit its CRUD methods and cut down on boilerplate
*/
public interface InstitutionInfoDao extends JpaRepository<InstitutionInfo, Long> {
}
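Because the interface extends JpaRepository, CRUD operations such as save(), findAll() and findById() come for free, and Spring Data can derive further queries from method names when needed. A minimal sketch (the derived method below is hypothetical, not part of the project):
public interface InstitutionInfoDao extends JpaRepository<InstitutionInfo, Long> {
//hypothetical derived query: Spring Data generates the implementation from the method name, equivalent to "where name = ?"
List<InstitutionInfo> findByName(String name);
}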
Create the service interface for persistence.
/**
* @author 90934
*/
@Component
public interface InstitutionInfoService {
/**
* Save one institution record, skipping duplicates
*
* @param institutionInfo the record to persist
*/
void save(InstitutionInfo institutionInfo);
/**
* Query records matching the given probe object
*
* @param institutionInfo probe whose non-null fields form the filter
* @return the matching records
*/
List<InstitutionInfo> findInstitutionInfo(InstitutionInfo institutionInfo);
/**
* Query all records in the table
*
* @return every InstitutionInfo row
*/
List<InstitutionInfo> findAll();
}
Create the implementation class of the service interface.
/**
* @author 90934
*/
@Service
public class InstitutionInfoServiceImpl implements InstitutionInfoService {
private InstitutionInfoDao institutionInfoDao;
@Autowired
public void setInstitutionInfoDao(InstitutionInfoDao institutionInfoDao) {
this.institutionInfoDao = institutionInfoDao;
}
@Override
@Transactional(rollbackFor = Exception.class)
public void save(InstitutionInfo institutionInfo) {
//look the record up by institution name first
InstitutionInfo param = new InstitutionInfo();
param.setName(institutionInfo.getName());
//run the query-by-example lookup
List<InstitutionInfo> list = this.findInstitutionInfo(param);
//check whether anything matched
if (list.size() == 0) {
//an empty result means this institution is not stored yet, so insert it
this.institutionInfoDao.save(institutionInfo);
}
//push progress to the web page over WebSocket; note that data arriving too fast can crash the page
try {
ProductWebSocket.sendInfo("已成功采集 1 条数据!");
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public List<InstitutionInfo> findInstitutionInfo(InstitutionInfo institutionInfo) {
//build the query condition from the probe object
Example<InstitutionInfo> example = Example.of(institutionInfo);
//run the query
return this.institutionInfoDao.findAll(example);
}
@Override
public List<InstitutionInfo> findAll() {
return this.institutionInfoDao.findAll();
}
}
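findInstitutionInfo relies on Spring Data's query-by-example: null properties of the probe object are ignored by default, which is why setting only name (in save) or baseinfoid (in the crawler handlers) is enough to build the filter. If fuzzy matching were ever wanted, an ExampleMatcher could be passed along; a hypothetical variant, not part of the project:
//hypothetical: case-insensitive "contains" matching on the name column
ExampleMatcher matcher = ExampleMatcher.matching()
.withMatcher("name", ExampleMatcher.GenericPropertyMatchers.contains().ignoreCase());
return this.institutionInfoDao.findAll(Example.of(institutionInfo, matcher));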
The export code below is adapted from this blog post: https://www.cnblogs.com/wlxslsb/p/10931130.html
Two entity classes are built: one for the export-theme table and one for the field definitions it contains.
The entity for the export-theme table:
/**
* @author 90934
* @date 2020/2/29 23:47
* @description Export theme table
* @since 0.1.0
* Only the field names are written out; getters, setters and toString() are generated automatically
*/
public class ExportBean {
private Integer id;
private String exportCode;
private String exportName;
private List<ExportFieldBean> fieldBeanList;
}
The SQL for the export-theme table:
DROP TABLE IF EXISTS `export`;
CREATE TABLE `export` (
`id` int(32) UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'primary key',
`exportCode` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'export theme code (English)',
`exportName` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'export theme name (Chinese)',
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 4 CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'export theme table' ROW_FORMAT = Dynamic;
The entity for the field definitions of a table:
/**
* @author 90934
* @date 2020/2/29 23:49
* @description Export field table
* @since 0.1.0
* Only the field names are written out; getters, setters and toString() are generated automatically
*/
public class ExportFieldBean {
private Integer id;
private Integer exportId;
private String fieldCode;
private String fieldName;
private Integer sort;
private ExportBean exportBean;
}
The SQL for the export-field table:
DROP TABLE IF EXISTS `export_field`;
CREATE TABLE `export_field` (
`id` int(11) UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'primary key',
`exportId` int(11) UNSIGNED NULL DEFAULT NULL COMMENT 'ID of the parent export row',
`fieldCode` varchar(55) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'field code (English)',
`fieldName` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'field name (Chinese)',
`sort` int(11) UNSIGNED NULL DEFAULT 1 COMMENT 'sort order',
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 40 CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'export field table' ROW_FORMAT = Dynamic;
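For the dynamic export to work, the two tables must be seeded so that exportCode matches the key passed to exportExcelWithDispose and every fieldCode matches a property name of the exported entity. Hypothetical seed rows, shown only to make the linkage concrete:
INSERT INTO `export` (`exportCode`, `exportName`) VALUES ('institution_info', '机构基本信息');
INSERT INTO `export_field` (`exportId`, `fieldCode`, `fieldName`, `sort`) VALUES
(1, 'name', '机构名称', 1),
(1, 'rnumber', '注册号', 2),
(1, 'address', '地址', 3);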
A cascading query fetches a table's metadata and its field definitions together.
The ExportMapper interface queries the export table and, through a nested select, loads the matching field rows at the same time:
/**
* @author 90934
* @date 2020/2/29 23:51
* @description Mapper interface for export themes and their fields
* @since 0.1.0
*/
public interface ExportMapper {
/**
* Load an export theme by its code, together with its field list
* @param exportKey the export theme code (exportCode column)
* @return ExportBean
*/
@Select("select * from export where exportCode = #{exportKey}")
@Results({
//fieldBeanList is a collection, so the nested select is mapped with @Many
@Result(property="fieldBeanList",column="id",many=@Many(select="com.hellof.crawler.mapper.ExportFieldMapper" +
".getExportFieldBeanByExportid"))
})
ExportBean getExportByExportKey(String exportKey);
}
The ExportFieldMapper interface queries the field rows belonging to one export theme:
/**
* @author 90934
* @date 2020/3/1 19:33
* @description Loads the field rows linked to an export theme
* @since 0.1.0
*/
public interface ExportFieldMapper {
/**
* Fetch all field definitions belonging to one export theme
* @param exportid foreign key identifying the parent export row
* @return the matching field rows
*/
@Select("select * from export_field where exportId = #{exportid}")
List<ExportFieldBean> getExportFieldBeanByExportid(String exportid);
}
Create the IExportExcelService interface and its implementation class.
The IExportExcelService interface:
public interface IExportExcelService {
/**
* Look up the fields to export by exportKey, match them against the properties of each list element, and write the result to an Excel file
* @param exportKey the export theme code stored in the database
* @param fileName file name
* @param list the data to export
* @param req request
* @param resp response
*/
public void exportExcelWithDispose(String exportKey, String fileName, List<?> list, HttpServletRequest req,
HttpServletResponse resp);
}
The ExportExcelServiceImpl implementation class:
@Service
public class ExportExcelServiceImpl implements IExportExcelService {
private ExportMapper exportMapper;
@Resource
public void setExportMapper(ExportMapper exportMapper) {
this.exportMapper = exportMapper;
}
@Override
public void exportExcelWithDispose(String exportKey, String fileName, List<?> list, HttpServletRequest req, HttpServletResponse resp) {
List<ExportFieldBean> fieldBeans = this.exportMapper.getExportByExportKey(exportKey).getFieldBeanList();
try {
SXSSFWorkbook sxssfWorkbook = new SXSSFWorkbook();
SXSSFSheet sheet1 = sxssfWorkbook.createSheet(fileName);
SXSSFRow headRow = sheet1.createRow(0);
headRow.createCell(0).setCellValue("序号");
for (ExportFieldBean fieldBean: fieldBeans){
headRow.createCell(headRow.getLastCellNum()).setCellValue(fieldBean.getFieldName());
}
int index = 0;
SXSSFRow bodyRow = null;
//serialize the list to JSON so field values can be read generically by fieldCode
JSONArray jsonArray = JSONArray.fromObject(list);
for (Object obj:jsonArray){
bodyRow = sheet1.createRow(sheet1.getLastRowNum() + 1);
bodyRow.createCell(0).setCellValue(index++);
int flag = 0;
for (ExportFieldBean fieldBean: fieldBeans){
if (flag == 0){
//the first configured field is assumed to be numeric and is written as an Integer
bodyRow.createCell(bodyRow.getLastCellNum()).setCellValue((Integer) ((JSONObject)obj).get(fieldBean.getFieldCode()));
flag = 1;
}else {
//every remaining field is written as a string
bodyRow.createCell(bodyRow.getLastCellNum()).setCellValue((String) ((JSONObject)obj).get(fieldBean.getFieldCode()));
}
}
}
FileOutputStream outputStream = new FileOutputStream(fileName + ".xlsx");
sxssfWorkbook.write(outputStream);
outputStream.close();
sxssfWorkbook.close();
//push the export progress to the web page over WebSocket; note that data arriving too fast can crash the page
ProductWebSocket.sendInfo("已成功导出 " + list.size() + " 条数据!");
}catch (Exception e){
e.printStackTrace();
}
}
}
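Note that although the method receives an HttpServletRequest and an HttpServletResponse, the implementation writes the workbook to a file on the server. If the browser should download the file directly, the workbook could be written to the response stream instead; a minimal sketch of that variant (an alternative, not the original code):
//hypothetical replacement for the FileOutputStream block: stream the workbook to the HTTP response
resp.setContentType("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
resp.setHeader("Content-Disposition", "attachment; filename=\"" + URLEncoder.encode(fileName, "UTF-8") + ".xlsx\"");
sxssfWorkbook.write(resp.getOutputStream());
//SXSSF keeps temporary files on disk; dispose() cleans them up
sxssfWorkbook.dispose();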
WebMagic is a solid, extensible crawler framework, and the crawling logic of this project is built on it. The framework ships ready-made implementations for most of what a crawler needs, so that plumbing does not have to be rewritten. To build a crawler you only need to implement PageProcessor; for extra requirements, overriding the corresponding hook methods is enough to assemble a capable crawler project.
Official Chinese documentation: http://webmagic.io/docs/zh/posts/ch1-overview/
/**
* @author 90934
* @date 2020/3/2 18:25
* @description PageProcessor implementation skeleton
* @since 0.1.0
*/
@Component
public class TestProcessor implements PageProcessor {
//crawler settings; the full configuration is shown in the Site section below
private Site site = Site.me();
@Override
public void process(Page page) {
}
@Override
public Site getSite() {
return site;
}
}
The overridden process() method discovers links and pushes them onto the crawl queue, inspects the type of each queued link, and dispatches to the matching extraction rule. Because most of the site's page content is loaded asynchronously with Ajax, a browser-automation tool has to render each page before links can be discovered by analyzing its structure (Selenium is used here).
@Override
public void process(Page page) {
//determine the type of the url
String queryData = page.getUrl().regex("query\\w+").toString();
String institutionInfo = "queryOrgInfo1";
String domainInfo = "queryPublishSignatory";
String scopeInfo = "queryPublishIBAbilityQuery";
//constant-first equals avoids a NullPointerException when the url has no query segment
if (institutionInfo.equals(queryData)) {
//the url points at queryOrgInfo1: an institution detail page
this.saveInstitutionInfo(page);
} else if (domainInfo.equals(queryData)) {
//the url points at queryPublishSignatory: authorized signatories and fields
this.saveDomainInfo(page);
} else if (scopeInfo.equals(queryData)) {
//the url points at queryPublishIBAbilityQuery: accredited testing scope
this.saveScopeInfo(page);
} else {
List<String> urls = new ArrayList<>();
System.setProperty("webdriver.chrome.driver", "src/main/resources/static/chromedriver.exe");
WebDriver driver = new ChromeDriver();
String url = page.getUrl().toString();
driver.get(url);
List<String> addressList = GetAddress.resultData();
//for quick testing, swap in a short hard-coded list:
// List<String> addressList = new ArrayList<>();
// addressList.add("北京");
// addressList.add("天津");
for (String address : addressList) {
WebElement orgAddress = driver.findElement(By.id("orgAddress"));
orgAddress.clear();
orgAddress.sendKeys(address);
WebElement btn = driver.findElement(By.className("btn"));
btn.click();
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
e.printStackTrace();
}
boolean accept = true;
//keep clicking the confirm button until the captcha overlay disappears (the operator types the captcha by hand)
while (accept) {
try {
WebElement pirbutton1 = driver.findElement(By.xpath("//*[@id=\"pirlbutton1\"]"));
pirbutton1.click();
Thread.sleep(5000);
//check whether the captcha overlay is still displayed
boolean flagStr = driver.findElement(By.id("pirlAuthInterceptDiv_c")).isDisplayed();
if (!flagStr) {
accept = false;
}
} catch (InterruptedException e) {
e.printStackTrace();
}
}
boolean flagStr = driver.findElement(By.xpath("//*[@id=\"pirlAuthInterceptDiv_c\"]")).isDisplayed();
if (!flagStr) {
//read the total number of result pages
int maxPage = Integer.parseInt(driver.findElement(By.id("yui-pg0-0-totalPages-span")).getText());
for (int num = 0; num < maxPage; num++) {
Html html = Html.create(driver.findElement(By.xpath("//*")).getAttribute("outerHTML"));
List<Selectable> list = html.css("div.yui-dt-liner a").nodes();
if (list.size() != 0) {
for (Selectable selectable : list) {
//extract the id parameter
String urlStr = selectable.regex("id\\=\\w+").toString();
//assemble the detail url and add it to the list of pending targets
urls.add("https://las.cnas.org.cn/LAS/publish/queryOrgInfo1.action?" + urlStr);
}
}
//not on the last page yet: click "next"
if (num <= maxPage - 2) {
WebElement pageNext = driver.findElement(By.xpath("/html/body/div[5]/table/tbody/tr/td[4]/a/img"));
pageNext.click();
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
e.printStackTrace();
}
boolean acceptStr = true;
while (acceptStr) {
try {
//locate the confirm button of the captcha dialog
WebElement pirbutton1 = driver.findElement(By.xpath("//*[@id=\"pirlbutton1\"]"));
pirbutton1.click();
Thread.sleep(4000);
//check whether the captcha is still shown
boolean flagStr1 = driver.findElement(By.id("pirlAuthInterceptDiv_c")).isDisplayed();
if (!flagStr1) {
acceptStr = false;
}
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
}
}
driver.close();
driver.quit();
//add the collected urls to the crawl queue
page.addTargetRequests(urls);
}
}
Depending on the link type detected in process(), the matching handler is called. Analysis shows that the links found on this page fall into three types for the crawl that follows: InstitutionInfo, DomainInfo and ScopeInfo. A parsing method is built for each of them.
InstitutionInfo
/**
* Parse the page, extract the institution's basic information and save it
*/
private void saveInstitutionInfo(Page page) {
//create the entity that will hold the institution's basic information
InstitutionInfo institutionInfo = new InstitutionInfo();
//parse the page
Html html = page.getHtml();
//pull each value out of the html and set it on the entity
institutionInfo.setName(html.css("div.T1", "text").toString());
institutionInfo.setRnumber(Jsoup.parse(html.css("span.clabel").nodes().get(0).toString()).text());
institutionInfo.setOname(Jsoup.parse(html.css("span.clabel").nodes().get(1).toString()).text());
institutionInfo.setCperson(Jsoup.parse(html.css("span.clabel").nodes().get(2).toString()).text());
institutionInfo.setCnumber(Jsoup.parse(html.css("span.clabel").nodes().get(3).toString()).text());
institutionInfo.setPcode(Jsoup.parse(html.css("span.clabel").nodes().get(4).toString()).text());
institutionInfo.setFnumber(Jsoup.parse(html.css("span.clabel").nodes().get(5).toString()).text());
institutionInfo.setWeburl(Jsoup.parse(html.css("span.clabel").nodes().get(6).toString()).text());
institutionInfo.setEmail(Jsoup.parse(html.css("span.clabel").nodes().get(7).toString()).text());
institutionInfo.setAddress(Jsoup.parse(html.css("span.clabel").nodes().get(8).toString()).text());
institutionInfo.setStart(Jsoup.parse(html.css("span.clabel").nodes().get(9).toString()).text());
institutionInfo.setEnd(Jsoup.parse(html.css("span.clabel").nodes().get(10).toString()).text());
institutionInfo.setAbasis(Jsoup.parse(html.css("span.clabel").nodes().get(11).toString()).text());
try {
institutionInfo.setParameter(Jsoup.parse(html.css("span.clabel").nodes().get(12).toString()).text());
} catch (Exception e) {
e.printStackTrace();
institutionInfo.setParameter(null);
}
//extract the id used to build the urls of the structured data
String dataUrl = html.xpath("/html/body/div/table[2]/tbody/tr/td/a/@onclick")
.regex("foId\\=\\w+").regex("\\=\\w+").regex("[^\\=]\\w+")
.toString();
institutionInfo.setBaseinfoid(dataUrl);
if (dataUrl != null && dataUrl.length() != 0) {
//accredited authorized signatories and their fields
String domainUrl = "https://las.cnas.org.cn/LAS/publish/queryPublishSignatory.action?baseinfoId=" + dataUrl;
//queue for crawling
page.addTargetRequest(domainUrl);
//accredited testing capability scope
String scopeUrl =
"https://las.cnas.org.cn/LAS/publish/queryPublishIBAbilityQuery.action?baseinfoId=" + dataUrl;
//queue for crawling
page.addTargetRequest(scopeUrl);
}
//hand the result to the pipelines
page.putField("institutionInfo", institutionInfo);
}
DomainInfo is a special case because it parses JSON data: several records are stored at once, and each must be tied back to previously saved data (e.g. "the XXX information belonging to company XXX"). The match is made by querying a field of the already-stored institution record to decide which company a JSON payload belongs to, and the combined result is then stored. To avoid a null-pointer error when these methods are invoked, a helper class is used to fetch the Spring-managed bean whose methods are about to be called: SpringUtil.
SpringUtil
/**
* @author 90934
* @date 2020/2/13 2:03
* @description Utility class for fetching Spring beans where a plain this.method() call would hit an unmanaged instance
* @since 0.1.0
*/
@Component
public class SpringUtil implements ApplicationContextAware {
private static ApplicationContext applicationContext = null;
@Override
public void setApplicationContext(@NonNull ApplicationContext applicationContext) throws BeansException {
SpringUtil.applicationContext = applicationContext;
}
public static <T> T getBean(Class<T> cla) {
return applicationContext.getBean(cla);
}
public static <T> T getBean(String name, Class<T> cal) {
return applicationContext.getBean(name, cal);
}
public static Object getBean(String name){
return applicationContext.getBean(name);
}
public static String getProperty(String key) {
return applicationContext.getBean(Environment.class).getProperty(key);
}
}
DomainInfo
/**
* Wire in the service interface; its query method is used later to link a JSON payload to the matching company record
*/
private InstitutionInfoService institutionInfoService;
@Autowired
public void setInstitutionInfoService(InstitutionInfoService institutionInfoService) {
this.institutionInfoService = institutionInfoService;
}
/**
* Parse the JSON payload and store the accredited authorized signatories and their fields
*/
private void saveDomainInfo(Page page) {
//probe entity used to look up the institution's basic record so the json data can be joined to it
InstitutionInfo param = new InstitutionInfo();
//result list of signatory-and-field entities
List<InstitutionDomain> institutionDomains = new ArrayList<>();
//parse the json payload
Json json = page.getJson();
//extract the baseinfoid value from the url
String baseinfoid = page.getUrl().regex("foId\\=\\w+").regex("\\=\\w+").regex("[^\\=]\\w+").toString();
//all names in the json
List<String> nameCh = json.jsonPath("$.data[*].nameCh").all();
//all assessment descriptions
List<String> content = json.jsonPath("$.data[*].assessmentcontent").all();
//all authorized signing fields
List<String> authorizedFieldCh = json.jsonPath("$.data[*].authorizedFieldCh").all();
//all notes
List<String> note = json.jsonPath("$.data[*].note").all();
//all status values
List<String> status = json.jsonPath("$.data[*].status").all();
param.setBaseinfoid(baseinfoid);
List<InstitutionInfo> institutionInfos = getService().institutionInfoService.findInstitutionInfo(param);
String name;
String date;
if (institutionInfos.size() != 0) {
InstitutionInfo institutionInfo = institutionInfos.get(0);
name = institutionInfo.getName();
date = institutionInfo.getStart() + "-" + institutionInfo.getEnd();
for (int num = 0; num < nameCh.size(); num++) {
//build one signatory-and-field entity
InstitutionDomain institutionDomain = new InstitutionDomain();
institutionDomain.setIname(name);
institutionDomain.setVperiod(date);
institutionDomain.setName(nameCh.get(num));
institutionDomain.setContent(content.get(num));
institutionDomain.setDomain(authorizedFieldCh.get(num));
institutionDomain.setDescription(note.get(num));
if ("0".equals(status.get(num))) {
institutionDomain.setStatus("有效");
} else {
institutionDomain.setStatus("无效");
}
institutionDomains.add(institutionDomain);
}
page.putField("institutionDomains", institutionDomains);
}
}
ScopeInfo
/**
* Parse the JSON payload and store the accredited testing-capability data
* As above, each record is linked back to its company
*/
private void saveScopeInfo(Page page) {
//probe entity used to look up the institution's basic record so the data can be joined to it
InstitutionInfo param = new InstitutionInfo();
List<InstitutionTest> institutionTests = new ArrayList<>();
Json json = page.getJson();
//extract the baseinfoid value from the url
String baseinfoid = page.getUrl().regex("foId\\=\\w+").regex("\\=\\w+").regex("[^\\=]\\w+").toString();
List<String> bigTypeName = json.jsonPath("$.data[*].bigTypeName").all();
List<String> typeName = json.jsonPath("$.data[*].typeName").all();
List<String> num = json.jsonPath("$.data[*].num").all();
List<String> fieldch = json.jsonPath("$.data[*].fieldch").all();
List<String> detnum = json.jsonPath("$.data[*].detnum").all();
List<String> descriptCh = json.jsonPath("$.data[*].descriptCh").all();
List<String> stdNum = json.jsonPath("$.data[*].stdNum").all();
List<String> standardCh = json.jsonPath("$.data[*].standardCh").all();
List<String> order = json.jsonPath("$.data[*].order").all();
List<String> restrictCh = json.jsonPath("$.data[*].restrictCh").all();
List<String> status = json.jsonPath("$.data[*].status").all();
param.setBaseinfoid(baseinfoid);
List<InstitutionInfo> institutionInfos = getService().institutionInfoService.findInstitutionInfo(param);
String name;
String date;
if (institutionInfos.size() != 0) {
InstitutionInfo institutionInfo = institutionInfos.get(0);
name = institutionInfo.getName();
date = institutionInfo.getStart() + "-" + institutionInfo.getEnd();
for (int i = 0; i < fieldch.size(); i++) {
InstitutionTest institutionTest = new InstitutionTest();
institutionTest.setIname(name);
institutionTest.setVperiod(date);
institutionTest.setBigtypename(bigTypeName.get(i));
institutionTest.setTypename(typeName.get(i));
institutionTest.setNum(num.get(i));
institutionTest.setFieldch(fieldch.get(i));
institutionTest.setDetnum(detnum.get(i));
institutionTest.setDescriptch(descriptCh.get(i));
institutionTest.setStdnum(stdNum.get(i));
institutionTest.setStandardchorder(standardCh.get(i) + " " + order.get(i));
institutionTest.setRestrictch(restrictCh.get(i));
if ("0".equals(status.get(i))) {
institutionTest.setStatus("有效");
} else {
institutionTest.setStatus("无效");
}
institutionTests.add(institutionTest);
}
page.putField("institutionTests", institutionTests);
}
}
Site
/**
* Crawler-wide settings
*/
private Site site = Site.me()
//character encoding
.setCharset("utf8")
//timeout in milliseconds
.setTimeOut(10 * 1000)
//sleep between retries, in milliseconds
.setRetrySleepTime(3 * 1000)
//number of retries
.setRetryTimes(3);
@Override
public Site getSite() {
return site;
}
Start
private InstitutionInfoPipeline institutionInfoPipeline;
private InstitutionDomainPipeline institutionDomainPipeline;
private InstitutionTestPipeline institutionTestPipeline;
@Autowired
public void setInstitutionInfoPipeline(InstitutionInfoPipeline institutionInfoPipeline) {
this.institutionInfoPipeline = institutionInfoPipeline;
}
@Autowired
public void setInstitutionDomainPipeline(InstitutionDomainPipeline institutionDomainPipeline) {
this.institutionDomainPipeline = institutionDomainPipeline;
}
@Autowired
public void setInstitutionTestPipeline(InstitutionTestPipeline institutionTestPipeline) {
this.institutionTestPipeline = institutionTestPipeline;
}
/**
* Entry point that starts the crawler
*/
public void start(PageProcessor pageProcessor) {
//build and run the spider
Spider.create(pageProcessor)
//seed url: the public query page
.addUrl("https://las.cnas.org.cn/LAS/publish/externalQueryIB.jsp")
//deduplicate queued urls with a Bloom filter sized for about 100,000 entries
.setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(100000)))
.addPipeline(this.institutionInfoPipeline)
.addPipeline(this.institutionDomainPipeline)
.addPipeline(this.institutionTestPipeline)
//run with ten worker threads
.thread(10)
.run();
try {
ProductWebSocket.sendInfo("爬虫采集已结束,请在数据库中进行查看,或导出为Excel格式进行查看!");
} catch (Exception e) {
e.printStackTrace();
}
}
A helper that goes through SpringUtil to obtain the Spring-managed JobProcessor bean:
/**
* @return the Spring-managed JobProcessor bean
*/
private JobProcessor getService() {
return SpringUtil.getBean(this.getClass());
}
/**
* @author 90934
* @date 2020/2/3 18:31
* @description Custom pipeline for the institution basic information
* @since 0.1.0
*/
@Component
public class InstitutionInfoPipeline implements Pipeline {
private InstitutionInfoService institutionInfoService;
@Autowired
public void setInstitutionInfoService(InstitutionInfoService institutionInfoService) {
this.institutionInfoService = institutionInfoService;
}
@Override
public void process(ResultItems resultItems, Task task) {
//fetch the institution entity stored by the processor
InstitutionInfo institutionInfo = resultItems.get("institutionInfo");
//skip pages that produced no entity
if (institutionInfo != null) {
//persist the record
this.institutionInfoService.save(institutionInfo);
}
}
}
/**
* @author 90934
* @date 2020/2/7 12:27
* @description Custom pipeline for the accredited signatories and fields
* @since 0.1.0
*/
@Component
public class InstitutionDomainPipeline implements Pipeline {
private InstitutionDomainService institutionDomainService;
@Autowired
public void setInstitutionDomainService(InstitutionDomainService institutionDomainService) {
this.institutionDomainService = institutionDomainService;
}
@Override
public void process(ResultItems resultItems, Task task) {
//fetch the pojo list stored by the processor
List<InstitutionDomain> institutionDomains = resultItems.get("institutionDomains");
//skip pages that produced nothing
if (institutionDomains != null) {
this.institutionDomainService.save(institutionDomains);
}
}
}
/**
* @author 90934
* @date 2020/2/9 21:41
* @description Custom pipeline for the accredited testing-capability scope
* @since 0.1.0
*/
@Component
public class InstitutionTestPipeline implements Pipeline {
private InstitutionTestService institutionTestService;
@Autowired
public void setInstitutionTestService(InstitutionTestService institutionTestService) {
this.institutionTestService = institutionTestService;
}
@Override
public void process(ResultItems resultItems, Task task) {
//fetch the pojo list stored by the processor
List<InstitutionTest> institutionTests = resultItems.get("institutionTests");
if (institutionTests != null){
this.institutionTestService.save(institutionTests);
}
}
}
/**
* @author 90934
* @date 2020/2/4 21:36
* @description Reads the region list from a txt file and feeds it to the crawler
* @since 0.1.0
*/
public class GetAddress {
public static List<String> resultData() {
List<String> list = new ArrayList<>();
//path of the region list on the classpath
String pathname = "classpath:static/address.txt";
try {
File file = ResourceUtils.getFile(pathname);
//try-with-resources so the reader is closed even on failure
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
//collect one region name per line
list.add(line);
}
}
} catch (Exception e) {
e.printStackTrace();
}
return list;
}
}
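address.txt is assumed to hold one region name per line, matching the commented-out test list in process(), for example:
北京
天津
上海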
/**
* @author 90934
* @date 2020/2/28 23:13
* @description WebSocket endpoint configurator that delegates endpoint instantiation to Spring
* @since 0.1.0
*/
public class MyEndpointConfigure extends ServerEndpointConfig.Configurator implements ApplicationContextAware {
private static volatile BeanFactory content;
@Override
public <T> T getEndpointInstance(Class<T> clazz) {
return content.getBean(clazz);
}
@Override
public void setApplicationContext(@NonNull ApplicationContext applicationContext) throws BeansException {
MyEndpointConfigure.content = applicationContext;
}
}
/**
* @author 90934
* @date 2020/2/27 13:08
* @description WebSocket configuration class
* @since 0.1.0
*/
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.socket.server.standard.ServerEndpointExporter;
@Configuration
public class WebSocketConfig {
@Bean
public ServerEndpointExporter serverEndpointExporter() {
return new ServerEndpointExporter();
}
@Bean
public MyEndpointConfigure newConfigure() {
return new MyEndpointConfigure();
}
}
/**
* @author 90934
* @date 2020/2/28 12:10
* @description WebSocket server endpoint
* @since 0.1.0
*/
@ServerEndpoint("/")
@Component
@Slf4j
public class ProductWebSocket {
/**
* Static counter of the current number of connections, kept thread-safe with an AtomicInteger
*/
private static final AtomicInteger ONLINE_COUNT = new AtomicInteger(0);
/**
* Thread-safe list from the concurrent package, holding the ProductWebSocket instance of every connected client
*/
private static CopyOnWriteArrayList<ProductWebSocket> webSocketSet = new CopyOnWriteArrayList<>();
/**
* Session of one client connection, used to push data to that client
*/
private Session session;
/**
* Called when a connection has been established
*/
@OnOpen
public void onOpen(@PathParam("userId") String userId, Session session) {
log.info("开始爬虫进程,请稍后");
this.session = session;
webSocketSet.add(this);
addOnlineCount();
if (userId != null) {
List<String> totalPushMsgs = new ArrayList<String>();
totalPushMsgs.add("爬虫启动成功--当前爬虫启动数量为: " + getOnlineCount());
if (!totalPushMsgs.isEmpty()) {
totalPushMsgs.forEach(this::sendMessage);
}
}
}
/**
* Called when the connection closes
*/
@OnClose
public void onClose() {
log.info("爬虫已关闭");
webSocketSet.remove(this);
subOnlineCount();
}
/**
* Called when a message arrives from the client
*
* @param message the client's message
*/
@OnMessage
public void onMessage(String message) {
log.info("当前爬取的数据为: " + message);
}
/**
* Called when an error occurs
*/
@OnError
public void onError(Session session, Throwable error) {
log.info("websocket出现错误!");
error.printStackTrace();
}
public void sendMessage(String message) {
try {
this.session.getBasicRemote().sendText(message);
log.info("数据获取成功,数据为: " + message);
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Broadcast a message to every connected client
*/
public static void sendInfo(String message) throws IOException {
for (ProductWebSocket productWebSocket : webSocketSet) {
productWebSocket.sendMessage(message);
}
}
public static synchronized int getOnlineCount() {
return ONLINE_COUNT.get();
}
/**
* Increment the connection count
*/
public static synchronized void addOnlineCount() {
ONLINE_COUNT.incrementAndGet();
}
/**
* Decrement the connection count
*/
public static synchronized void subOnlineCount() {
ONLINE_COUNT.decrementAndGet();
}
}
/**
* @author 90934
* @date 2020/2/18 14:26
* @description Web controller backing the crawler start page
* @since 0.1.0
*/
@RestController
@ServerEndpoint("/websocket")
public class MyWebController {
private JobProcessor jobProcessor;
private InstitutionInfoService institutionInfoService;
private IExportExcelService iExportExcelService;
private InstitutionDomainService institutionDomainService;
private InstitutionTestService institutionTestService;
@Autowired
public void setInstitutionDomainService(InstitutionDomainService institutionDomainService) {
this.institutionDomainService = institutionDomainService;
}
@Autowired
public void setInstitutionTestService(InstitutionTestService institutionTestService) {
this.institutionTestService = institutionTestService;
}
@Autowired
public void setiExportExcelService(IExportExcelService iExportExcelService) {
this.iExportExcelService = iExportExcelService;
}
@Autowired
public void setInstitutionInfoService(InstitutionInfoService institutionInfoService) {
this.institutionInfoService = institutionInfoService;
}
@Autowired
public void setJobProcessor(JobProcessor jobProcessor) {
this.jobProcessor = jobProcessor;
}
@PostMapping("/crawler/run")
public void runCrawler() {
this.jobProcessor.start(jobProcessor);
}
@PostMapping("/exportExcelInfo")
public void exportExcelInfo(HttpServletRequest req, HttpServletResponse resp) throws Exception {
List<InstitutionInfo> list = this.institutionInfoService.findAll();
this.iExportExcelService.exportExcelWithDispose("institution_info", "机构基本信息" + UUID.randomUUID().toString(),
list, req, resp);
try {
ProductWebSocket.sendInfo("数据导出成功");
} catch (IOException e) {
e.printStackTrace();
}
}
@PostMapping("/exportExcelDomain")
public void exportExcelDomain(HttpServletRequest req, HttpServletResponse resp) throws Exception {
List<InstitutionDomain> list = this.institutionDomainService.findAll();
this.iExportExcelService.exportExcelWithDispose("institution_domain", "机构已正式公布的授权签字人及领域" + UUID.randomUUID().toString(),
list, req, resp);
try {
ProductWebSocket.sendInfo("数据导出成功");
} catch (IOException e) {
e.printStackTrace();
}
}
@PostMapping("/exportExcelTest")
public void exportExcelTest(HttpServletRequest req, HttpServletResponse resp) throws Exception {
List<InstitutionTest> list = this.institutionTestService.findAll();
this.iExportExcelService.exportExcelWithDispose("institution_test", "机构已正式公布的检验能力范围" + UUID.randomUUID().toString(),
list, req, resp);
try {
ProductWebSocket.sendInfo("数据导出成功");
} catch (IOException e) {
e.printStackTrace();
}
}
}
<html lang="en" xmlns:th="http://www.thymeleaf.org">
<head>
<meta charset="utf-8">
<title>爬虫启动页面</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/bootstrap-responsive.min.css" rel="stylesheet">
<link href="css/site.css" rel="stylesheet">
</head>
<body>
<div class="container">
<div class="spanCrawler">
<h1>
crawler Slate
</h1>
<div class="well crawler-slate">
<p>You have not created any projects yet.</p>
<button th:type="button" th:onclick="crawler()" class="btn btn-primary"><i
class="icon-plus icon-white"></i> 开始你的爬虫
</button>
<button th:type="button" th:onclick="exportExcelInfo()" class="btn btn-primary"><i
class="icon-plus icon-white"></i> 导出机构基本信息
</button>
<button th:type="button" th:onclick="exportExcelDomain()" class="btn btn-primary"><i
class="icon-plus icon-white"></i> 导出签字人及领域
</button>
<button th:type="button" th:onclick="exportExcelTest()" class="btn btn-primary"><i
class="icon-plus icon-white"></i> 导出检验能力范围
</button>
</div>
<div class="well crawler-slate" style="overflow-y: scroll; height: 65%">
<table id="message" th:class="ridge">
</table>
</div>
</div>
</div>
<script src="js/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<script src="js/site.js"></script>
</body>
</html>
let websocket = null;
//check whether the browser supports WebSocket
if ('WebSocket' in window) {
//connect to the WebSocket endpoint
websocket = new WebSocket("ws://localhost:8080/");
} else {
alert('Not support websocket')
}
//callback invoked on connection error
websocket.onerror = function () {
setMessageInnerHTML("爬虫通道连接失败");
};
//callback invoked once the connection is established
websocket.onopen = function (event) {
setMessageInnerHTML("爬虫通道链接成功");
}
//callback invoked when a message arrives
websocket.onmessage = function (event) {
setMessageInnerHTML(event.data);
}
//callback invoked when the connection closes
websocket.onclose = function () {
setMessageInnerHTML("爬虫通道已关闭");
}
//close the websocket before the window closes; otherwise the server throws when the page disappears with the connection still open
window.onbeforeunload = function () {
websocket.close();
setMessageInnerHTML("爬虫通道已关闭");
}
//render incoming messages into the page
function setMessageInnerHTML(innerHTML) {
if (innerHTML.toString() === "爬虫采集已结束,请在数据库中进行查看,或导出为Excel格式进行查看!") {
alert(document.getElementById('message').innerHTML = "" + innerHTML + " ")
} else {
document.getElementById('message').innerHTML += "" + innerHTML + " ";
}
}
//start the crawler
function crawler() {
$.ajax({
url: '/crawler/run',
type: 'post',
contentType: 'application/json;charset=utf-8',
success: function () {
console.log("爬取成功!")
},
error: function (error) {
console.log('接口不通' + error);
},
})
}
function exportExcelInfo() {
$.ajax({
url: '/exportExcelInfo',
type: 'post',
contentType: 'application/json;charset=utf-8',
success: function () {
console.log("导出数据成功")
},
error: function (error) {
console.log('接口不通' + error);
},
})
}
function exportExcelDomain() {
$.ajax({
url: '/exportExcelDomain',
type: 'post',
contentType: 'application/json;charset=utf-8',
success: function () {
console.log("导出数据成功")
},
error: function (error) {
console.log('接口不通' + error);
},
})
}
function exportExcelTest() {
$.ajax({
url: '/exportExcelTest',
type: 'post',
contentType: 'application/json;charset=utf-8',
success: function () {
console.log("导出数据成功")
},
error: function (error) {
console.log('接口不通' + error);
},
})
}
table.ridge {
text-align: center;
display: block ruby
}
td.dashed {
border-style: dashed;
}
/**
* @author 90934
*/
@SpringBootApplication
@MapperScan(basePackages = {"com.hellof.crawler.mapper"})
public class CrawlerApplication {
public static void main(String[] args) {
SpringApplication.run(CrawlerApplication.class, args);
}
}
#DB Configuration:
spring.datasource.driverClassName=com.mysql.cj.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/jigou_data?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true&useSSL=false&serverTimezone=UTC&allowPublicKeyRetrieval=true
spring.datasource.username=root
spring.datasource.password=123456
#JPA Configuration:
spring.jpa.database=MySQL
spring.jpa.show-sql=true
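The configuration sets no schema strategy, so the JPA entity tables are assumed to exist already. To let Hibernate create or update them on startup instead, one extra property (an assumption, not part of the original config) would do:
spring.jpa.hibernate.ddl-auto=update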
<configuration scan="false" scanPeriod="60 seconds" debug="false">
<property name="LOG_HOME" value="/hellof/log" />
<property name="appName" value="hellof-springboot"/>
<appender name="stdout" class="ch.qos.logback.core.ConsoleAppender">
<layout class="ch.qos.logback.classic.PatternLayout">
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{50} - %msg%n</pattern>
</layout>
</appender>
<appender name="appLogAppender" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${LOG_HOME}/${appName}.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>${LOG_HOME}/${appName}-%d{yyyy-MM-dd}-%i.log</fileNamePattern>
<MaxHistory>365</MaxHistory>
<timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
<maxFileSize>100MB</maxFileSize>
</timeBasedFileNamingAndTriggeringPolicy>
</rollingPolicy>
<layout class="ch.qos.logback.classic.PatternLayout">
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [ %thread ] - [ %-5level ] [ %logger{50} : %line ] - %msg%n</pattern>
</layout>
</appender>
<logger name="com.hellof.crawler" level="debug" />
<logger name="org.springframework" level="debug" additivity="false"/>
<root level="info">
<appender-ref ref="stdout" />
<appender-ref ref="appLogAppender" />
</root>
</configuration>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.2.5.RELEASE</version>
<relativePath/>
</parent>
<groupId>com.hellof</groupId>
<artifactId>crawler</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>crawler</name>
<description>Demo project for Spring Boot</description>
<properties>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.mybatis.spring.boot</groupId>
<artifactId>mybatis-spring-boot-starter</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-logging</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-thymeleaf</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-websocket</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.141.59</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.7.3</version>
<exclusions>
<exclusion>
<artifactId>log4j</artifactId>
<groupId>log4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>net.sf.json-lib</groupId>
<artifactId>json-lib</artifactId>
<version>2.4</version>
<classifier>jdk15</classifier>
</dependency>
<dependency>
<groupId>cn.afterturn</groupId>
<artifactId>easypoi-base</artifactId>
<version>4.1.3</version>
<exclusions>
<exclusion>
<artifactId>guava</artifactId>
<groupId>com.google.guava</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>cn.afterturn</groupId>
<artifactId>easypoi-web</artifactId>
<version>4.1.3</version>
</dependency>
<dependency>
<groupId>cn.afterturn</groupId>
<artifactId>easypoi-annotation</artifactId>
<version>4.1.3</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>4.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>4.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml-schemas</artifactId>
<version>4.1.2</version>
</dependency>
<dependency>
<groupId>commons-fileupload</groupId>
<artifactId>commons-fileupload</artifactId>
<version>1.4</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.6</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
<exclusions>
<exclusion>
<groupId>org.junit.vintage</groupId>
<artifactId>junit-vintage-engine</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-core</artifactId>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>
Because the site loads its initial information asynchronously, the crawler uses Selenium to render the pages, and a captcha stands in its way. In the interest of short-term efficiency no machine recognition was attempted: the captcha content has to be typed in manually to reach the initial data.
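As an aside, the hand-rolled click-and-sleep polling loops in process() could be expressed with Selenium's explicit waits, blocking until the operator has typed the captcha and the overlay disappears. A hedged sketch (element id taken from the code above, timeout value assumed):
//wait up to five minutes (an assumed limit) for the captcha overlay to disappear after manual entry
WebDriverWait wait = new WebDriverWait(driver, 300);
wait.until(ExpectedConditions.invisibilityOfElementLocated(By.id("pirlAuthInterceptDiv_c")));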
Project source code: https://github.com/masonsxu/Crawler-Job-JiGou