SeimiCrawler: a distributed Java crawler

I've recently been scraping some data. I started out with jsoup, but found that crawling with it alone was neither efficient nor particularly convenient, so after looking at a few crawler frameworks I decided to use SeimiCrawler instead.
Development environment: IntelliJ IDEA + MyBatis + SeimiCrawler.
I won't walk through the environment setup in detail; anyone who has done Java development will recognize it, so let's go straight to the configuration files. Note: SeimiCrawler's own configuration files must have names starting with seimi.
Global configuration, seimi.xml:
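For reference, the Maven dependencies this setup relies on might look like the following. The version numbers are illustrative assumptions (check the latest releases); the coordinates match the libraries used in the configuration below.

```xml
<!-- Illustrative dependency set; version numbers are assumptions -->
<dependency>
    <groupId>cn.wanghaomiao</groupId>
    <artifactId>SeimiCrawler</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>org.mybatis</groupId>
    <artifactId>mybatis-spring</artifactId>
    <version>1.3.2</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-dbcp2</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
```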


<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd">
    
    <bean id="propertyConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
        <property name="locations">
            <list>
                <value>classpath:**/*.properties</value>
            </list>
        </property>
    </bean>
</beans>

Global MyBatis configuration, mybatis-config.xml:



<configuration>
    
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
</configuration>

SeimiCrawler data-source configuration, seimi-mybatis.xml:


<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.springframework.org/schema/context
       http://www.springframework.org/schema/context/spring-context.xsd">
    <context:annotation-config/>
    <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="${jdbc.driver}"/>
        <property name="url" value="${jdbc.url}"/>
        <property name="username" value="${jdbc.username}"/>
        <property name="password" value="${jdbc.password}"/>
    </bean>
    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
    </bean>
    <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
        <property name="dataSource" ref="mybatisDataSource"/>
    </bean>
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="com.morse.seimicrawler.dao"/>
        <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
    </bean>
</beans>
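With MapperScannerConfigurer scanning com.morse.seimicrawler.dao, each mapper interface needs a matching mapper XML (or SQL annotations). The original post doesn't show one, but a hypothetical ProxyIpStoreDao.xml for the insert used later might look like this (table name, column names, and the entity's package are assumptions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.morse.seimicrawler.dao.ProxyIpStoreDao">
    <insert id="insert" parameterType="com.morse.seimicrawler.entity.ProxyIp">
        INSERT INTO proxy_ip (ip, port, speed, addr, time)
        VALUES (#{ip}, #{port}, #{speed}, #{addr}, #{time})
    </insert>
</mapper>
```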

Database connection settings, seimi.properties:

jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://localhost:3360/xiaohuo?useUnicode=true&characterEncoding=utf8&useSSL=false
jdbc.username=root
jdbc.password=123456
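The original doesn't show the table schema. Given the entity fields used later (ip, port, speed, addr, time) and the mapUnderscoreToCamelCase setting, a plausible DDL would be something like the following (table and column names are assumptions):

```sql
-- Hypothetical schema matching the ProxyIp fields used later on
CREATE TABLE proxy_ip (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    ip    VARCHAR(64)  NOT NULL,
    port  VARCHAR(16)  NOT NULL,
    speed VARCHAR(32),
    addr  VARCHAR(128),
    time  VARCHAR(64)
) DEFAULT CHARSET = utf8;
```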

Logging configuration, log4j.properties:

log4j.rootLogger=info, console, log, error

### Console ###
log4j.appender.console = org.apache.log4j.ConsoleAppender
log4j.appender.console.Target = System.out
log4j.appender.console.layout = org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern = %d %p[%C:%L]- %m%n

### log ###
log4j.appender.log = org.apache.log4j.DailyRollingFileAppender
log4j.appender.log.File = ${catalina.base}/logs/debug.log
log4j.appender.log.Append = true
log4j.appender.log.Threshold = DEBUG
log4j.appender.log.DatePattern='.'yyyy-MM-dd
log4j.appender.log.layout = org.apache.log4j.PatternLayout
log4j.appender.log.layout.ConversionPattern = %d %p[%c:%L] - %m%n


### Error ###
log4j.appender.error = org.apache.log4j.DailyRollingFileAppender
log4j.appender.error.File = ${catalina.base}/logs/error.log
log4j.appender.error.Append = true
log4j.appender.error.Threshold = ERROR 
log4j.appender.error.DatePattern='.'yyyy-MM-dd
log4j.appender.error.layout = org.apache.log4j.PatternLayout
log4j.appender.error.layout.ConversionPattern =%d %p[%c:%L] - %m%n

### Output SQL (MyBatis 3 logs SQL under the mapper interface's namespace) ###
log4j.logger.com.morse.seimicrawler.dao=DEBUG
log4j.logger.java.sql.Connection=DEBUG
log4j.logger.java.sql.Statement=DEBUG
log4j.logger.java.sql.PreparedStatement=DEBUG

That completes the basic configuration; all that's left is to implement the crawling logic itself.
SeimiCrawler integrates with Spring and uses XPath, which makes parsing HTML very convenient. Each concrete crawler class must live in a package named xxx.crawlers; SeimiCrawler scans that package automatically, and if a crawler class is placed elsewhere it won't be found and the crawler won't start. Each crawler must extend BaseSeimiCrawler and implement/override startUrls(), start(Response response), and its callback methods.
Below, using a proxy-IP list site as the example, I implement a crawler and put a simple wrapper layer on top of the framework.
The base crawler, BaseCrawler:

public abstract class BaseCrawler extends BaseSeimiCrawler {

    /**
     * URL prefix for paginated crawling
     *
     * @return the part of the page URL before the page number
     */
    protected abstract String getUrlPrefix();

    /**
     * URL suffix for paginated crawling
     *
     * @return the part of the page URL after the page number
     */
    protected abstract String getUrlsuffix();

    /**
     * Determine the total number of pages
     *
     * @param document the parsed start page
     * @return the maximum page number
     */
    protected abstract int getMaxPage(JXDocument document);

    /**
     * Parse the data on a single page
     *
     * @param response the crawled page
     */
    public abstract void operation(Response response);

    /**
     * Request headers; override to supply custom headers
     *
     * @return a header map, or null for none
     */
    protected Map<String, String> setHeader() {
        return null;
    }

    @Override
    public void start(Response response) {
        try {
            JXDocument document = response.document();
            int max = getMaxPage(document);
            for (int i = 1; i <= max; i++) {
                logger.info("Crawling page {}", i);
                // Queue one request per page; "operation" names the callback method
                push(Request.build(getUrlPrefix() + i + getUrlsuffix(), "operation").setHeader(setHeader()));
            }
        } catch (Exception e) {
            logger.error("Failed to schedule page requests", e);
        }
    }
}
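Stripped of the framework types, the scheduling in start() is just string concatenation per page: prefix + page number + suffix. A self-contained sketch of that behavior (the URL here is hypothetical; real crawlers supply prefix and suffix via the abstract methods):

```java
import java.util.ArrayList;
import java.util.List;

public class PaginationSketch {

    // Mirrors the loop in BaseCrawler.start(): one URL per page.
    static List<String> pageUrls(String prefix, String suffix, int maxPage) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= maxPage; i++) {
            urls.add(prefix + i + suffix);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical prefix/suffix for illustration only.
        for (String url : pageUrls("https://example.com/proxy/", ".html", 3)) {
            System.out.println(url);
        }
    }
}
```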

The concrete crawler, SeCrawler:

@Crawler(name = "seCrawler")
public class SeCrawler extends BaseCrawler {

    @Autowired
    private ProxyIpStoreDao dao;

    @Override
    public String[] startUrls() {
        return new String[]{"https://ip.seofangfa.com/"};
    }

    @Override
    protected String getUrlPrefix() {
        return "https://ip.seofangfa.com/proxy/";
    }

    @Override
    protected String getUrlsuffix() {
        return ".html";
    }

    @Override
    protected int getMaxPage(JXDocument document) {
        try {
            List pages = document.sel("//div[@class='page_nav']/ul/li/a/text()");
            return Integer.parseInt((String) pages.get(pages.size() - 1));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 0;
    }

    @Override
    public void operation(Response response) {
        try {
            JXDocument document = response.document();
            List ips = document.sel("//table[@class='table']/tbody/tr/td[1]/text()");
            List ports = document.sel("//table[@class='table']/tbody/tr/td[2]/text()");
            List speeds = document.sel("//table[@class='table']/tbody/tr/td[3]/text()");
            List addres = document.sel("//table[@class='table']/tbody/tr/td[4]/text()");
            List times = document.sel("//table[@class='table']/tbody/tr/td[5]/text()");
            ProxyIp proxyIp = new ProxyIp();
            for (int i = 0; i < ips.size(); i++) {
                proxyIp.setIp((String) ips.get(i));
                proxyIp.setPort((String) ports.get(i));
                proxyIp.setSpeed((String) speeds.get(i));
                proxyIp.setAddr((String) addres.get(i));
                proxyIp.setTime((String) times.get(i));
                dao.insert(proxyIp);
                logger.info("插入代理IP:", proxyIp.toString());
            }
        } catch (Exception e) {

        }
    }
} 
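Neither ProxyIp nor ProxyIpStoreDao is shown in the original. A minimal sketch of what they might look like, inferred from the setters and the insert call above (field names come from the code; the actual SQL would live in a mapper XML or annotation):

```java
// Inferred entity; fields mirror the setters used in SeCrawler.operation().
class ProxyIp {
    private String ip;
    private String port;
    private String speed;
    private String addr;
    private String time;

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }
    public String getPort() { return port; }
    public void setPort(String port) { this.port = port; }
    public String getSpeed() { return speed; }
    public void setSpeed(String speed) { this.speed = speed; }
    public String getAddr() { return addr; }
    public void setAddr(String addr) { this.addr = addr; }
    public String getTime() { return time; }
    public void setTime(String time) { this.time = time; }

    @Override
    public String toString() {
        return ip + ":" + port + " (" + addr + ", " + speed + ", " + time + ")";
    }
}

// Mapper interface picked up by MapperScannerConfigurer; the INSERT statement
// itself would be defined in a matching mapper XML or an @Insert annotation.
interface ProxyIpStoreDao {
    int insert(ProxyIp proxyIp);
}
```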
  

Starting the crawler:

public static void main(String... args) {
    Seimi seimi = new Seimi();
    seimi.goRun("seCrawler");
}

That's all it takes to build a crawler with SeimiCrawler. Have you got the hang of it?
