[Crawler] Crawling and Storing Data with WebMagic and Spring MVC

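I've been working for more than four years and have never written a blog. Last year I moved back to my hometown and joined a state-owned enterprise, where the work is a bit lighter and I'm not as busy as I was in Shenzhen. Lately I've had energy to spare (an easy job does wonders for you), so I want to keep a record of the technologies I study or use. The main reason is to consolidate what I learn: my memory isn't what it used to be, and I've come to appreciate that the palest ink beats the best memory.
Enough preamble; let's start with the crawler I've been using recently. I haven't studied crawlers in depth; I'm purely at the level of knowing how to use one.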

Preface

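Because of work, I recently started dealing with web crawlers. Scraping a complete data set takes the following two steps.
Step 1: choose a crawler framework. Our boss said to just use jsoup, since these sites are all easy to scrape. So I downloaded the jar and tried it out; the API is simple and convenient, and it felt good. jsoup would have been enough for these sites, but I wondered whether there was a better, more convenient framework, and there was. A search online turned up WebMagic, which met my needs, and above all its documentation is in Chinese. So WebMagic it was.
Step 2: analyze the page elements. Which elements to look at depends on the data you need; the details follow below.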

Project Setup

Because I'm using Spring MVC + MyBatis, the Maven POM is as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Spider</groupId>
  <artifactId>Spider</artifactId>
  <packaging>war</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>Spider Maven Webapp</name>
  <url>http://maven.apache.org</url>
  <properties>
    <spring.version>4.1.1.RELEASE</spring.version>
    <jstl>1.2</jstl>
    <mybatis.version>3.3.0</mybatis.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>

    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.9.2</version>
    </dependency>
    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>
    <dependency>
      <groupId>commons-collections</groupId>
      <artifactId>commons-collections</artifactId>
      <version>3.2.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.3.3</version>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>${spring.version}</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-web</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context-support</artifactId>
      <version>${spring.version}</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-webmvc</artifactId>
      <version>${spring.version}</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-test</artifactId>
      <version>${spring.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-jdbc</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-test</artifactId>
      <version>4.1.1.RELEASE</version>
    </dependency>

    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
      <version>2.5</version>
    </dependency>
    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>jstl</artifactId>
      <version>1.2</version>
    </dependency>
    <dependency>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>jsp-api</artifactId>
      <version>2.1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>jstl</groupId>
      <artifactId>jstl</artifactId>
      <version>${jstl}</version>
    </dependency>

    <dependency>
      <groupId>org.mybatis</groupId>
      <artifactId>mybatis</artifactId>
      <version>${mybatis.version}</version>
    </dependency>
    <dependency>
      <groupId>org.mybatis</groupId>
      <artifactId>mybatis-spring</artifactId>
      <version>1.2.3</version>
    </dependency>
    <dependency>
      <groupId>org.mybatis.generator</groupId>
      <artifactId>mybatis-generator-core</artifactId>
      <version>1.3.2</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.32</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-collections4</artifactId>
      <version>4.0</version>
    </dependency>
    <dependency>
      <groupId>commons-dbcp</groupId>
      <artifactId>commons-dbcp</artifactId>
      <version>1.4</version>
    </dependency>
    <dependency>
      <groupId>commons-pool</groupId>
      <artifactId>commons-pool</artifactId>
      <version>1.6</version>
    </dependency>

    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>druid</artifactId>
      <version>1.0.15</version>
    </dependency>

    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-core</artifactId>
      <version>2.3.0</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.3.0</version>
    </dependency>

    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.7</version>
    </dependency>
  </dependencies>
  <build>
    <finalName>GogBuySpider</finalName>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <includes>
            <include>**/*Tests.java</include>
          </includes>
        </configuration>
      </plugin>

      <plugin>
        <groupId>org.mybatis.generator</groupId>
        <artifactId>mybatis-generator-maven-plugin</artifactId>
        <version>1.3.2</version>
        <configuration>
          <verbose>true</verbose>
          <overwrite>true</overwrite>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Configuring spring-servlet.xml

    <context:component-scan base-package="com.xxx.spider"/>
    <mvc:annotation-driven/>
    <mvc:default-servlet-handler/>
    <bean id="viewResolver" class="org.springframework.web.servlet.view.InternalResourceViewResolver">
        <property name="prefix" value="/WEB-INF/view/"/>
        <property name="suffix" value=".jsp" />
    </bean>

Integrating Spring with MyBatis


<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
        <property name="location" value="classpath:jdbc.properties"/>
    </bean>
    <bean id="dataSource" class="com.alibaba.druid.pool.DruidDataSource" destroy-method="close">
        <property name="driverClassName" value="${jdbc_driverClassName}"/>
        <property name="url" value="${jdbc_url}"/>
        <property name="username" value="${jdbc_username}"/>
        <property name="password" value="${jdbc_password}"/>
        
        <property name="filters" value="stat" />

        <property name="maxActive" value="20" />
        <property name="initialSize" value="1" />
        <property name="minIdle" value="1" />
        
        <property name="maxWait" value="60000" />
        
        <property name="timeBetweenEvictionRunsMillis" value="60000" />
        
        <property name="minEvictableIdleTimeMillis" value="300000" />

        <property name="validationQuery" value="SELECT 'x'" />
        <property name="testWhileIdle" value="true" />
        <property name="testOnBorrow" value="false" />
        <property name="testOnReturn" value="false" />

        
        
        <property name="poolPreparedStatements" value="true" />
        <property name="maxPoolPreparedStatementPerConnectionSize" value="50" />
    </bean>
    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean">
        <property name="dataSource" ref="dataSource"/>
        
        <property name="mapperLocations" value="classpath:mapper/*"/>
    </bean>
    
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="com.xxx.spider.dao"/>
        <property name="sqlSessionFactoryBeanName" value="sqlSessionFactory"/>
    </bean>
    
    <bean id="transactionManager"
          class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
        <property name="dataSource" ref="dataSource"/>
    </bean>
</beans>
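
The configuration above reads its connection settings from classpath:jdbc.properties, which the post does not show. A minimal sketch of such a file might look like the following. Only the four key names come from the XML above; every value is a placeholder assumption (a local MySQL instance with a hypothetical database named spider) and needs to be replaced with your own settings.

```properties
# Placeholder values; only the key names are taken from spring-mybatis.xml.
# com.mysql.jdbc.Driver is the driver class shipped with mysql-connector-java 5.1.x.
jdbc_driverClassName=com.mysql.jdbc.Driver
# Hypothetical host, port and database name.
jdbc_url=jdbc:mysql://localhost:3306/spider?useUnicode=true&characterEncoding=UTF-8
jdbc_username=root
jdbc_password=changeme
```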

Configuring the Spring context (it seems to work without this as well)

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:websocket="http://www.springframework.org/schema/websocket"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.springframework.org/schema/websocket
       http://www.springframework.org/schema/websocket/spring-websocket-4.1.xsd">
        <import resource="spring-servlet.xml"/>
        <import resource="spring-mybatis.xml"/>
</beans>

web.xml

<web-app version="3.0" xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd">
  <display-name>Archetype Created Web Application</display-name>
  <context-param>
    <param-name>contextConfigLocation</param-name>
    <param-value>classpath:spring-context.xml</param-value>
  </context-param>
  <listener>
    <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
  </listener>
  <servlet>
    <servlet-name>springMvc</servlet-name>
    <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
    <init-param>
      <param-name>contextConfigLocation</param-name>
      <param-value>classpath:spring-servlet.xml</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
  </servlet>
  <servlet-mapping>
    <servlet-name>springMvc</servlet-name>
    <url-pattern>*.do</url-pattern>
  </servlet-mapping>
  <servlet-mapping>
    <servlet-name>springMvc</servlet-name>
    <url-pattern>/</url-pattern>
  </servlet-mapping>
  <welcome-file-list>
    <welcome-file>index.jsp</welcome-file>
  </welcome-file-list>
</web-app>

With that, the project is set up.
Because we're using WebMagic, add the following to the Maven POM:


    <dependency>
      <groupId>us.codecraft</groupId>
      <artifactId>webmagic-core</artifactId>
      <version>0.5.3</version>
    </dependency>
    <dependency>
      <groupId>us.codecraft</groupId>
      <artifactId>webmagic-extension</artifactId>
      <version>0.5.3</version>
    </dependency>

Everything is in place. What remains is analyzing the page and then parsing it with WebMagic.

Page Analysis

As shown in the figure below:
[Figure 1: screenshot of the page being analyzed]

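We want the category names and their URLs. Inspecting the page shows that they sit inside <a> tags, so we extract them with WebMagic's CSS selectors and XPath.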

 List<String> titles = page.getHtml().xpath("//div[@class='class1']/p/a/text()").all();
 List<String> urls = page.getHtml().css("div.nav_style1_contentBg").links().regex(".*?c1=.*").all();

This gives us all the top-level categories and their corresponding URLs. For usage details, see http://webmagic.io/docs/.
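
To make it clearer which links the pattern ".*?c1=.*" lets through, here is a small JDK-only sketch that applies the same regular expression to a list of plain strings. It does not use WebMagic and only approximates what the regex(...) step does; the class name CategoryLinkFilter and the sample URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// JDK-only illustration; CategoryLinkFilter and the sample URLs are hypothetical.
class CategoryLinkFilter {
    // Same pattern as in the regex(...) call: any link containing "c1=".
    private static final Pattern CATEGORY_LINK = Pattern.compile(".*?c1=.*");

    // Keep only the links that the pattern matches in full.
    static List<String> keepCategoryLinks(List<String> links) {
        List<String> kept = new ArrayList<String>();
        for (String link : links) {
            if (CATEGORY_LINK.matcher(link).matches()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://www.xxx.com/list?c1=10",
                "http://www.xxx.com/about",
                "http://www.xxx.com/list?c1=20&page=2");
        // The /about link carries no "c1=" parameter, so it is filtered out.
        System.out.println(keepCategoryLinks(links));
    }
}
```

In other words, the pattern acts as a filter on the query parameter c1, which on this site appears to identify a top-level category.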

Saving the Data

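To borrow a passage from the documentation:

Okay, the crawler is written. Now we may have another question: if I want to save the scraped results, what should I do? The WebMagic component for saving results is called Pipeline. For example, printing results to the console is also done by a built-in Pipeline, called ConsolePipeline. So what if I now want to save the results in JSON format? I only need to swap the Pipeline implementation for JsonFilePipeline.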

public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // Start crawling from "https://github.com/code4craft"
            .addUrl("https://github.com/code4craft")
            .addPipeline(new JsonFilePipeline("D:\\webmagic\\"))
            // Crawl with 5 threads
            .thread(5)
            // Start the crawler
            .run();
}

With this, the downloaded files are saved in the webmagic directory on drive D.

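By customizing Pipeline, we can also save results to files, databases, and so on. This is covered in Chapter 7, "Processing Extraction Results".

At this point we have written a basic crawler and given it some customization.

We save to the database through a custom pipeline.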

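import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Repository;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Repository
public class DataBasePipeline implements Pipeline {
    // MyBatis mappers, injected by Spring
    @Autowired
    private CategoryMapper categoryMapper;
    @Autowired
    private ShopMapper shopMapper;
    @Autowired
    private ItemMapper itemMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // TODO save categories to the database
        // TODO save items to the database
    }
}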
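
The TODOs above leave the persistence logic open. One step most implementations would probably need is pairing each category title with its URL before inserting rows. The following JDK-only sketch shows that step in isolation. It assumes the titles and URLs arrive as two lists of equal length in the same page order, which the post does not guarantee; CategoryPairs and pair are hypothetical names, not part of WebMagic or the original project.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WebMagic or the original project.
class CategoryPairs {
    // Pair the i-th title with the i-th URL, preserving page order.
    // Assumes both lists are aligned; fails fast when their sizes differ.
    static Map<String, String> pair(List<String> titles, List<String> urls) {
        if (titles.size() != urls.size()) {
            throw new IllegalArgumentException(
                    "titles and urls differ in length: " + titles.size() + " vs " + urls.size());
        }
        Map<String, String> byTitle = new LinkedHashMap<String, String>();
        for (int i = 0; i < titles.size(); i++) {
            byTitle.put(titles.get(i).trim(), urls.get(i));
        }
        return byTitle;
    }

    public static void main(String[] args) {
        Map<String, String> categories = pair(
                Arrays.asList("Books", "Toys"),
                Arrays.asList("http://www.xxx.com/list?c1=1", "http://www.xxx.com/list?c1=2"));
        for (Map.Entry<String, String> e : categories.entrySet()) {
            // A real pipeline would call a mapper insert here instead of printing.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Note that a map keeps only one URL per title; if category names can repeat on the page, a list of title/URL pairs would be the safer structure.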

Crawling

Define a main method, run it, and wait for the crawl to finish.
@Controller
public class Door {
    @Autowired
    private DataBasePipeline dataBasePipeline;

    public static void main(String[] args) {
        ApplicationContext applicationContext = new ClassPathXmlApplicationContext("classpath:spring-context.xml");
        Door door = applicationContext.getBean(Door.class);
        door.goSpider();
    }

    public void goSpider() {
        Spider.create(new QmiaolingPageProcessor())
                .addUrl("http://www.xxx.com/")
                .addPipeline(new ConsolePipeline())
                .addPipeline(dataBasePipeline)
                .thread(5)
                .run();
    }
}

Closing Remarks

I'll put the project on GitHub later.
