工作4年多了,也没写过什么博客,去年回老家入职一家国企,工作稍微轻松些,没有在深圳的时候那么忙。最近感觉精力充沛(轻松的工作还是蛮养人的),想把自己研究或者使用到的相关技术做一个记录。第一、对这些知识做一个总结,因为现在发现脑袋不好使了,体会到了好记性不如烂笔头。
废话不多说,那就从最近用的爬虫说起吧。另外自己对爬虫也没有什么研究,纯粹处于会使用的地步。
最近由于工作需要,接触到了爬虫这一块。抓取完整数据分如下二步。
第一步、选择爬虫框架。我们老总说直接用jsoup抓取就行了,这些网站都好抓。那就用吧,把jar下下来,试用了一下,API挺简单,方便,感觉挺好的。总觉得这些网站是好抓,jsoup能够满足,但是有木有更好的、更方便的框架呢,答案是肯定的。那就到网上的查,果然webMagic能够满足我这个需求,主要是文档是中文的呀。那就是它了。
第二步、页面元素分析,那就得看你需要那些数据,来分析页面了,下面我们细说。
因为我使用的是springmvc + mybatis ,所以Maven如下:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0modelVersion>
<groupId>SpidergroupId>
<artifactId>SpiderartifactId>
<packaging>warpackaging>
<version>1.0-SNAPSHOTversion>
<name>Spider Maven Webappname>
<url>http://maven.apache.orgurl>
<properties>
<spring.version>4.1.1.RELEASEspring.version>
<jstl>1.2jstl>
<mybatis.version>3.3.0mybatis.version>
properties>
<dependencies>
<dependency>
<groupId>junitgroupId>
<artifactId>junitartifactId>
<version>3.8.1version>
<scope>testscope>
dependency>
<dependency>
<groupId>log4jgroupId>
<artifactId>log4jartifactId>
<version>1.2.17version>
dependency>
<dependency>
<groupId>org.jsoupgroupId>
<artifactId>jsoupartifactId>
<version>1.9.2version>
dependency>
<dependency>
<groupId>net.sf.json-libgroupId>
<artifactId>json-libartifactId>
<version>2.4version>
<classifier>jdk15classifier>
dependency>
<dependency>
<groupId>commons-collectionsgroupId>
<artifactId>commons-collectionsartifactId>
<version>3.2.1version>
dependency>
<dependency>
<groupId>org.apache.httpcomponentsgroupId>
<artifactId>httpclientartifactId>
<version>4.3.3version>
dependency>
<dependency>
<groupId>junitgroupId>
<artifactId>junitartifactId>
<version>3.8.1version>
<scope>testscope>
dependency>
<dependency>
<groupId>org.springframeworkgroupId>
<artifactId>spring-coreartifactId>
<version>${spring.version}version>
dependency>
<dependency>
<groupId>org.springframeworkgroupId>
<artifactId>spring-webartifactId>
<version>${spring.version}version>
dependency>
<dependency>
<groupId>org.springframeworkgroupId>
<artifactId>spring-context-supportartifactId>
<version>${spring.version}version>
dependency>
<dependency>
<groupId>org.springframeworkgroupId>
<artifactId>spring-webmvcartifactId>
<version>${spring.version}version>
dependency>
<dependency>
<groupId>org.springframeworkgroupId>
<artifactId>spring-testartifactId>
<version>${spring.version}version>
<scope>testscope>
dependency>
<dependency>
<groupId>org.springframeworkgroupId>
<artifactId>spring-jdbcartifactId>
<version>${spring.version}version>
dependency>
<dependency>
<groupId>org.springframeworkgroupId>
<artifactId>spring-testartifactId>
<version>4.1.1.RELEASEversion>
dependency>
<dependency>
<groupId>javax.servletgroupId>
<artifactId>servlet-apiartifactId>
<version>2.5version>
dependency>
<dependency>
<groupId>javax.servletgroupId>
<artifactId>jstlartifactId>
<version>1.2version>
dependency>
<dependency>
<groupId>javax.servlet.jspgroupId>
<artifactId>jsp-apiartifactId>
<version>2.1version>
<scope>providedscope>
dependency>
<dependency>
<groupId>jstlgroupId>
<artifactId>jstlartifactId>
<version>${jstl}version>
dependency>
<dependency>
<groupId>org.mybatisgroupId>
<artifactId>mybatisartifactId>
<version>${mybatis.version}version>
dependency>
<dependency>
<groupId>org.mybatisgroupId>
<artifactId>mybatis-springartifactId>
<version>1.2.3version>
dependency>
<dependency>
<groupId>org.mybatis.generatorgroupId>
<artifactId>mybatis-generator-coreartifactId>
<version>1.3.2version>
dependency>
<dependency>
<groupId>mysqlgroupId>
<artifactId>mysql-connector-javaartifactId>
<version>5.1.32version>
dependency>
<dependency>
<groupId>org.apache.commonsgroupId>
<artifactId>commons-collections4artifactId>
<version>4.0version>
dependency>
<dependency>
<groupId>commons-dbcpgroupId>
<artifactId>commons-dbcpartifactId>
<version>1.4version>
dependency>
<dependency>
<groupId>commons-poolgroupId>
<artifactId>commons-poolartifactId>
<version>1.6version>
dependency>
<dependency>
<groupId>com.alibabagroupId>
<artifactId>druidartifactId>
<version>1.0.15version>
dependency>
<dependency>
<groupId>com.fasterxml.jackson.coregroupId>
<artifactId>jackson-coreartifactId>
<version>2.3.0version>
dependency>
<dependency>
<groupId>com.fasterxml.jackson.coregroupId>
<artifactId>jackson-databindartifactId>
<version>2.3.0version>
dependency>
<dependency>
<groupId>com.google.code.gsongroupId>
<artifactId>gsonartifactId>
<version>2.7version>
dependency>
dependencies>
<build>
<finalName>GogBuySpiderfinalName>
<plugins>
<plugin>
<artifactId>maven-compiler-pluginartifactId>
<configuration>
<source>1.6source>
<target>1.6target>
configuration>
plugin>
<plugin>
<artifactId>maven-surefire-pluginartifactId>
<configuration>
<includes>
<include>**/*Tests.javainclude>
includes>
configuration>
plugin>
<plugin>
<groupId>org.mybatis.generatorgroupId>
<artifactId>mybatis-generator-maven-pluginartifactId>
<version>1.3.2version>
<configuration>
<verbose>trueverbose>
<overwrite>trueoverwrite>
configuration>
plugin>
plugins>
build>
project>
配置spring-serlvet.xml
<context:component-scan base-package="com.xxx.spider"/>
<mvc:annotation-driven/>
<mvc:default-servlet-handler/>
<bean id="viewResolver" class="org.springframework.web.servlet.view.InternalResourceViewResolver">
<property name="prefix" value="/WEB-INF/view/"/>
<property name="suffix" value=".jsp" />
bean>
整个Spring mybatis
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="location" value="classpath:jdbc.properties"/>
bean>
<bean id="dataSource" class="com.alibaba.druid.pool.DruidDataSource" destroy-method="close">
<property name="driverClassName" value="${jdbc_driverClassName}"/>
<property name="url" value="${jdbc_url}"/>
<property name="username" value="${jdbc_username}"/>
<property name="password" value="${jdbc_password}"/>
<property name="filters" value="stat" />
<property name="maxActive" value="20" />
<property name="initialSize" value="1" />
<property name="minIdle" value="1" />
<property name="maxWait" value="60000" />
<property name="timeBetweenEvictionRunsMillis" value="60000" />
<property name="minEvictableIdleTimeMillis" value="300000" />
<property name="validationQuery" value="SELECT 'x'" />
<property name="testWhileIdle" value="true" />
<property name="testOnBorrow" value="false" />
<property name="testOnReturn" value="false" />
<property name="poolPreparedStatements" value="true" />
<property name="maxPoolPreparedStatementPerConnectionSize" value="50" />
bean>
<bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean">
<property name="dataSource" ref="dataSource"/>
<property name="mapperLocations" value="classpath:mapper/*"/>
bean>
<bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
<property name="basePackage" value="com.xxx.spider.dao"/>
<property name="sqlSessionFactoryBeanName" value="sqlSessionFactory"/>
bean>
<bean id="transactionManager"
class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
<property name="dataSource" ref="dataSource"/>
bean>
beans>
配置spring context (不配置好像也可以)
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:websocket="http://www.springframework.org/schema/websocket"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/websocket
http://www.springframework.org/schema/websocket/spring-websocket-4.1.xsd">
<import resource="spring-servlet.xml"/>
<import resource="spring-mybatis.xml"/>
beans>
web.xml
<web-app version="3.0" xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_1.xsd">
<display-name>Archetype Created Web Application</display-name>
<context-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath:spring-context.xml</param-value>
</context-param>
<listener>
<listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
</listener>
<servlet>
<servlet-name>springMvc</servlet-name>
<servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
<load-on-startup>1</load-on-startup>
<init-param>
<param-name>contextConfigLocation</param-name>
<param-value>classpath:spring-servlet.xml</param-value>
</init-param>
</servlet>
<servlet-mapping>
<servlet-name>springMvc</servlet-name>
<url-pattern>*.do</url-pattern>
</servlet-mapping>
<servlet-mapping>
<servlet-name>springMvc</servlet-name>
<url-pattern>/</url-pattern>
</servlet-mapping>
<welcome-file-list>
<welcome-file>index.jsp</welcome-file>
</welcome-file-list>
</web-app>
然后到这里工程就建好了。
因为我们用到webMagic,在maven中添加
<dependency>
<groupId>us.codecraftgroupId>
<version>0.5.3version>
<artifactId>webmagic-coreartifactId>
dependency>
<dependency>
<groupId>us.codecraftgroupId>
<version>0.5.3version>
<artifactId>webmagic-extensionartifactId>
dependency>
一切OK,剩下的就是分析页面,然后用webMagic解析了。
我们想要类别名称跟URL,分析可知是在标签里面,通过webmagic的css选择器和xpath对页面元素进行抽取。
List titles = page.getHtml().xpath("//div[@class='class1']/p/a/text()").all();
List urls = page.getHtml().css("div.nav_style1_contentBg").links().regex(".*?c1=.*").all();
这样就得到全部的大类别和对应的URL。怎么使用大家可以查看http://webmagic.io/docs/。
借用说明文档上的一段话。
好了,爬虫编写完成,现在我们可能还有一个问题:我如果想把抓取的结果保存下来,要怎么做呢?WebMagic用于保存结果的组件叫做Pipeline。例如我们通过“控制台输出结果”这件事也是通过一个内置的Pipeline完成的,它叫做ConsolePipeline。那么,我现在想要把结果用Json的格式保存下来,怎么做呢?我只需要将Pipeline的实现换成”JsonFilePipeline”就可以了。
public static void main(String[] args) {
Spider.create(new GithubRepoPageProcessor())
//从”https://github.com/code4craft“开始抓
.addUrl(“https://github.com/code4craft“)
.addPipeline(new JsonFilePipeline(“D:\webmagic\”))
//开启5个线程抓取
.thread(5)
//启动爬虫
.run(); } 这样子下载下来的文件就会保存在D盘的webmagic目录中了。通过定制Pipeline,我们还可以实现保存结果到文件、数据库等一系列功能。这个会在第7章“抽取结果的处理”中介绍。
至此为止,我们已经完成了一个基本爬虫的编写,也具有了一些定制功能。
我们通过自定义pipeline来保存到数据库。
@Repository
public class DataBasePipeline implements Pipeline{
@Autowired
private CategoryMapper categoryMapper;
@Autowired
private ShopMapper shopMapper;
@Autowired
private ItemMapper itemMapper;
@Override
public void process(ResultItems resultItems, Task task) {
//TODO 保存类目到数据库
//TODO 保存商品到数据库
}
}
定义一个main方法,run就行了,坐等爬完。
@Controller
public class Door {
@Autowired
private DataBasePipeline dataBasePipeline;
public static void main(String[] args) {
ApplicationContext applicationContext = new ClassPathXmlApplicationContext("classpath:spring-context.xml");
Door door = applicationContext.getBean(Door.class);
door.goSpider();
}
public void goSpider() {
Spider.create(new QmiaolingPageProcessor())
.addUrl("http://www.xxx.com/")
.addPipeline(new ConsolePipeline())
.addPipeline(dataBasePipeline)
.thread(5)
.run();
}
}
后续会把项目放到github上面。