Learning Web Crawlers from Scratch (23) Refactoring: WebMagic Framework Analysis - Pipeline

This series of articles analyzes the WebMagic framework itself; there is no hands-on content. If you have practical questions, feel free to discuss them, and technical support is available.


Welcome to join group 313557283 (newly created), where beginners can learn from each other~


Pipeline

Let's start with the interface itself; it declares just a single process method.

package us.codecraft.webmagic.pipeline;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;

/**
 * Pipeline is the persistent and offline process part of crawler.
 * The interface Pipeline can be implemented to customize ways of persistent.
 *
 * @author [email protected]
 * @since 0.1.0
 * @see ConsolePipeline
 * @see FilePipeline
 */
public interface Pipeline {

    /**
     * Process extracted results.
     *
     * @param resultItems resultItems
     * @param task task
     */
    public void process(ResultItems resultItems, Task task);
}
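To make the contract concrete, here is a minimal sketch of a custom implementation. TitlePipeline and the "title" key are invented for illustration; it assumes the PageProcessor stored that field via page.putField("title", ...).

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class TitlePipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        // "title" is a hypothetical key put into ResultItems by the PageProcessor
        String title = resultItems.get("title");
        if (title != null) {
            System.out.println(task.getUUID() + "\t" + title);
        }
    }
}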

Next, let's look at ConsolePipeline, the implementation that is invoked by default when no pipeline is configured.

It is very simple: it prints out the results stored in ResultItems.

package us.codecraft.webmagic.pipeline;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;

import java.util.Map;

/**
 * Write results in console.
 * Usually used in test.
 *
 * @author [email protected]
 * @since 0.1.0
 */
public class ConsolePipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        System.out.println("get page: " + resultItems.getRequest().getUrl());
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
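For context, a pipeline takes effect once it is registered on the Spider; if none is registered, WebMagic falls back to ConsolePipeline, which is why it is the default. A minimal wiring sketch, where MyPageProcessor is a placeholder for your own PageProcessor:

Spider.create(new MyPageProcessor())
        .addUrl("https://example.com/")
        .addPipeline(new ConsolePipeline())
        .run();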

Other implementations

FilePipeline saves results to files.

package us.codecraft.webmagic.pipeline;

import org.apache.commons.codec.digest.DigestUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.utils.FilePersistentBase;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.Map;

/**
 * Store results in files.
 *
 * @author [email protected]
 * @since 0.1.0
 */
public class FilePipeline extends FilePersistentBase implements Pipeline {

    private Logger logger = LoggerFactory.getLogger(getClass());

    /**
     * create a FilePipeline with default path "/data/webmagic/"
     */
    public FilePipeline() {
        setPath("/data/webmagic/");
    }

    public FilePipeline(String path) {
        setPath(path);
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        String path = this.path + PATH_SEPERATOR + task.getUUID() + PATH_SEPERATOR;
        try {
            PrintWriter printWriter = new PrintWriter(new OutputStreamWriter(
                    new FileOutputStream(getFile(path + DigestUtils.md5Hex(resultItems.getRequest().getUrl()) + ".html")), "UTF-8"));
            printWriter.println("url:\t" + resultItems.getRequest().getUrl());
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                if (entry.getValue() instanceof Iterable) {
                    Iterable value = (Iterable) entry.getValue();
                    printWriter.println(entry.getKey() + ":");
                    for (Object o : value) {
                        printWriter.println(o);
                    }
                } else {
                    printWriter.println(entry.getKey() + ":\t" + entry.getValue());
                }
            }
            printWriter.close();
        } catch (IOException e) {
            logger.warn("write file error", e);
        }
    }
}
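Two details worth noting: each task writes into its own subdirectory named by task.getUUID(), and the file name is the MD5 hex of the URL, which keeps file names legal no matter what characters the URL contains. Swapping it into the wiring sketch above is one line:

spider.addPipeline(new FilePipeline("/data/webmagic/"));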

Result collection

ResultItemsCollectorPipeline: my guess is that it mainly exists so results can be processed in batches, which is more efficient.

package us.codecraft.webmagic.pipeline;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;

import java.util.ArrayList;
import java.util.List;

/**
 * @author [email protected]
 * @since 0.4.0
 */
public class ResultItemsCollectorPipeline implements CollectorPipeline {

    private List<ResultItems> collector = new ArrayList<ResultItems>();

    @Override
    public synchronized void process(ResultItems resultItems, Task task) {
        collector.add(resultItems);
    }

    @Override
    public List<ResultItems> getCollected() {
        return collector;
    }
}
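My reading is that this is the pipeline behind Spider's synchronous get()/getAll() API: getAll() installs a ResultItemsCollectorPipeline, blocks until the given URLs have been processed, and returns the collected ResultItems. A sketch, again with the hypothetical MyPageProcessor:

List<ResultItems> results = Spider.create(new MyPageProcessor())
        .getAll(Arrays.asList("https://example.com/a", "https://example.com/b"));
for (ResultItems items : results) {
    System.out.println(items.getRequest().getUrl());
}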


Extensions

I won't paste the code for these.

Here is a brief overview:

FilePageModelPipeline saves page-model results as .html files.

JsonFilePageModelPipeline saves page-model results as .json files (see the OOSpider sketch after this list).

JsonFilePipeline converts the extracted content to JSON and then saves it as .json files.

MultiPagePipeline is used where results spread across multiple pages need to be stitched together.
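The two PageModelPipeline variants belong to WebMagic's annotation mode and are attached through OOSpider rather than addPipeline. A usage sketch, assuming a hypothetical annotation-based model class GithubRepo:

OOSpider.create(Site.me(),
        new JsonFilePageModelPipeline("/data/webmagic/"),
        GithubRepo.class)
        .addUrl("https://github.com/code4craft/webmagic")
        .run();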

The official site also has an example of integrating with MySQL.
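That example is not reproduced here, but a minimal hypothetical sketch of a JDBC-backed pipeline (the pages table and the title field are invented for illustration) could look like this:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class MysqlPipeline implements Pipeline {

    private final DataSource dataSource;

    public MysqlPipeline(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        // "title" is a hypothetical field set by the PageProcessor
        String title = resultItems.get("title");
        if (title == null) {
            return; // nothing to persist for this page
        }
        String sql = "INSERT INTO pages (url, title) VALUES (?, ?)";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, resultItems.getRequest().getUrl());
            ps.setString(2, title);
            ps.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace(); // in real code, use a logger as FilePipeline does
        }
    }
}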


Summary

Many persistence options were covered above. Personally, I am used to persisting data directly in the process method of the PageProcessor; I am not sure what real difference it makes, so discussion is welcome. One argument for a separate Pipeline (as its javadoc puts it, the "persistent and offline process part") is that extraction and storage stay decoupled, so the same PageProcessor can be reused with different storage backends.




