Java Web Crawler

Here is a crawler framework worth sharing: elves.

Add the dependencies


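    <dependency>
        <groupId>io.github.biezhi</groupId>
        <artifactId>elves</artifactId>
        <version>0.0.2</version>
    </dependency>

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.8</version>
        <scope>compile</scope>
    </dependency>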

Write the code

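import java.io.File;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

import lombok.extern.slf4j.Slf4j;
// The elves classes (Elves, Spider, Config, UserAgent, Pipeline, Parser, Request,
// Response, Result) and the HTML types Elements/Element must also be imported
// from their own packages.

/**
 * @ClassName: MeiziExample
 * @Description:
 * @author lyonardo
 * @Date: 2019/11/11 15:45
 * @version V1.0
 */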
public class MeiziExample {
    @Slf4j
    static class MeiziSpider extends Spider {

        private String storageDir = "/Users/Administrator/Desktop/meizi";

        public MeiziSpider(String name) {
            super(name);
            this.startUrls(
                    "http://www.meizitu.com/a/pure.html",
                    "http://www.meizitu.com/a/cute.html",
                    "http://www.meizitu.com/a/sexy.html",
                    "http://www.meizitu.com/a/fuli.html",
                    "http://www.meizitu.com/a/legs.html");
        }

        @Override
        public void onStart(Config config) {
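            // Pipeline that receives the image URLs collected by PictureParser and downloads each one
            this.addPipeline((Pipeline<List<String>>) (item, request) -> {
                item.forEach(imgUrl -> {
                    log.info("Start downloading: {}", imgUrl);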
                    io.github.biezhi.request.Request.get(imgUrl)
                            .header("Referer", request.getUrl())
                            .header("User-Agent", UserAgent.CHROME_FOR_MAC)
                            .connectTimeout(20_000)
                            .readTimeout(20_000)
                            .receive(new File(storageDir, System.currentTimeMillis() + ".jpg"));
                });
                log.info("[{}] images downloaded OK.", request.getUrl());
            });

            this.requests.forEach(this::resetRequest);
        }
        private Request resetRequest(Request request) {
            request.contentType("text/html; charset=gb2312");
            request.charset("gb2312");
            return request;
        }

        @Override
        protected Result parse(Response response) {
            Result result = new Result<>();
            Elements elements = response.body().css("#maincontent > div.inWrap > ul > li:nth-child(1) > div > div > a");
            log.info("elements size: {}", elements.size());
            List<Request> requests = elements.stream()
                    .map(element -> element.attr("href"))
                    .map(href -> MeiziSpider.this.makeRequest(href, new MeiziSpider.PictureParser()))
                    .map(this::resetRequest)
                    .collect(Collectors.toList());
            result.addRequests(requests);
            // Get the next-page URL: find the pagination link whose text is "下一页" ("next page")
            Optional<Element> nextEl = response.body().css("#wp_page_numbers > ul > li > a").stream()
                    .filter(element -> "下一页".equals(element.text()))
                    .findFirst();
            if (nextEl.isPresent()) {
                String nextPageUrl = "http://www.meizitu.com/a/" + nextEl.get().attr("href");
                Request nextReq = MeiziSpider.this.makeRequest(nextPageUrl, this::parse);
                result.addRequest(this.resetRequest(nextReq));
            }
            return result;
        }
        // Parses a gallery page and returns the src of every image in it
        static class PictureParser implements Parser<List<String>> {
            @Override
            public Result<List<String>> parse(Response response) {
                Elements elements = response.body().css("#picture > p > img");
                List<String> src = elements.stream().map(element -> element.attr("src")).collect(Collectors.toList());
                return new Result<>(src);
            }
        }
    }

    public static void main(String[] args) {
        MeiziSpider meiziSpider = new MeiziSpider("妹子图");
        Elves.me(meiziSpider, Config.me().delay(3000)).start();
    }
}
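
One detail in the pipeline above is worth a second look: each file is named `System.currentTimeMillis() + ".jpg"`, so two downloads that start within the same millisecond get the same name and the later one may overwrite the earlier. Below is a minimal sketch of one alternative, assuming you are happy to derive the name from the image URL's path instead. `ImageFileNames` and `fromUrl` are hypothetical names introduced here for illustration; they are not part of elves.

```java
import java.net.URI;

/** Hypothetical helper (not part of elves): derives a local file name from an image URL. */
final class ImageFileNames {

    private ImageFileNames() {
    }

    /**
     * Flattens the URL path into a single file name, e.g.
     * "/uploads/2019/11/01.jpg" becomes "uploads_2019_11_01.jpg".
     * Falls back to a nanoTime-based name when the URL has no path.
     * Note: URI.create throws IllegalArgumentException for malformed URLs.
     */
    static String fromUrl(String imgUrl) {
        String path = URI.create(imgUrl).getPath();   // query string and fragment are not included
        String flat = (path == null)
                ? ""
                : path.replaceAll("^/+", "").replace('/', '_');
        return flat.isEmpty() ? System.nanoTime() + ".jpg" : flat;
    }

    public static void main(String[] args) {
        // prints "uploads_2019_11_01.jpg"
        System.out.println(fromUrl("http://example.com/uploads/2019/11/01.jpg?v=2"));
    }
}
```

With a helper like this, the `receive(...)` call in the pipeline could take `new File(storageDir, ImageFileNames.fromUrl(imgUrl))`. Images from different galleries then keep distinct names as long as their URL paths differ.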

Output

Connected to the target VM, address: '127.0.0.1:7570', transport: 'socket'
13:01:40.223 [main] INFO io.github.biezhi.elves.ElvesEngine - Spider [妹子图] 启动...
13:01:40.232 [main] INFO io.github.biezhi.elves.ElvesEngine - Spider [妹子图] 配置 [Config(timeout=10000, delay=3000, parallelThreads=12, userAgent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11, queueSize=0)]
13:01:40.279 [task@thread-1] DEBUG io.github.biezhi.elves.download.Downloader - [http://www.meizitu.com/a/pure.html] 开始请求
13:01:43.278 [task@thread-2] DEBUG io.github.biezhi.elves.download.Downloader - [http://www.meizitu.com/a/cute.html] 开始请求
13:01:46.278 [task@thread-3] DEBUG io.github.biezhi.elves.download.Downloader - [http://www.meizitu.com/a/sexy.html] 开始请求
13:01:49.279 [task@thread-4] DEBUG io.github.biezhi.elves.download.Downloader - [http://www.meizitu.com/a/fuli.html] 开始请求
13:01:52.280 [task@thread-5] DEBUG io.github.biezhi.elves.download.Downloader - [http://www.meizitu.com/a/legs.html] 开始请求
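
In the log above, the "开始请求" (request started) lines for the five start URLs are about three seconds apart, which is consistent with the `delay(3000)` passed to `Config`. The sketch below shows one simple way such a pattern can arise: a dispatcher that sleeps for a fixed delay between task submissions to a thread pool. It is only an illustration under that assumption, not the actual elves implementation, and `DelayedDispatchSketch` and `dispatch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch (not elves code): submits one task per URL with a fixed delay in between. */
final class DelayedDispatchSketch {

    private DelayedDispatchSketch() {
    }

    /** Returns, for each URL, how many milliseconds after the start its task was submitted. */
    static List<Long> dispatch(List<String> urls, long delayMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Long> startOffsets = new ArrayList<>();
        long t0 = System.nanoTime();
        try {
            for (String url : urls) {
                long offset = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
                startOffsets.add(offset);
                // Stand-in for the real download: just report when the request would start
                pool.submit(() -> System.out.println(offset + " ms: requesting " + url));
                Thread.sleep(delayMillis);            // throttle: wait before the next submission
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // preserve the interrupt for the caller
        } finally {
            pool.shutdown();                          // idempotent; covers the interrupted path
        }
        return startOffsets;
    }

    public static void main(String[] args) {
        dispatch(Arrays.asList(
                "http://www.meizitu.com/a/pure.html",
                "http://www.meizitu.com/a/cute.html",
                "http://www.meizitu.com/a/sexy.html"), 100);
    }
}
```

The design choice here is to throttle at submission time rather than inside each worker, so the pool can still run slow downloads in parallel while new requests start at a steady pace.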
