Android Crawler (Part 1): Implementing a Web Crawler with OkHttp + Jsoup

Over the past few days I wrote a simple crawler demo for Android.
The scraped data is displayed in a RecyclerView; this article covers the scraping part.

The site I used to test the crawler is SMZDM (什么值得买, smzdm.com).

The data I want to scrape is the featured articles on the home page, mainly each article's title, image, and summary.


Here is the data I scraped:
This project needs the Jsoup and OkHttp libraries. I downloaded the jar files and added them to the project manually, but you can also declare the dependencies directly in your gradle file:

implementation 'org.jsoup:jsoup:1.11.3'
implementation 'com.squareup.okhttp3:okhttp:3.4.1'
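Since the crawler fetches pages over the network, the app also needs the INTERNET permission declared in `AndroidManifest.xml`, or every request will fail:

```xml
<!-- Inside the <manifest> element, alongside the <application> element -->
<uses-permission android:name="android.permission.INTERNET" />
```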

Now we can start writing the crawler code.

The entity class, Article.java:

/*
 *@Author:Swallow
 *@Date:2019/3/21
 * Encapsulates the scraped article data
 */
public class Article {
    private String title;
    private String author;
    private String imgUrl;
    private String context;
    private String articleUrl;
    private String date;
    private String from;

    // Several fields aren't used yet, so the constructor takes only the four that are actually scraped
    public Article(String title, String author, String imgUrl, String context) {
        this.title = title;
        this.author = author;
        this.imgUrl = imgUrl;
        this.context = context;
    }


    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getImgUrl() {
        return imgUrl;
    }

    public void setImgUrl(String imgUrl) {
        this.imgUrl = imgUrl;
    }

    public String getContext() {
        return context;
    }

    public void setContext(String context) {
        this.context = context;
    }

    public String getArticleUrl() {
        return articleUrl;
    }

    public void setArticleUrl(String articleUrl) {
        this.articleUrl = articleUrl;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getFrom() {
        return from;
    }

    public void setFrom(String from) {
        this.from = from;
    }

    @Override
    public String toString() {
        return "Article{" +
                "title='" + title + '\'' +
                ", author='" + author + '\'' +
                ", imgUrl='" + imgUrl + '\'' +
                ", context='" + context + '\'' +
                ", articleUrl='" + articleUrl + '\'' +
                ", date='" + date + '\'' +
                ", from='" + from + '\'' +
                '}';
    }
}

Requesting the page with OkHttp:


import java.io.IOException;

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

/*
 *@Author:Swallow
 *@Date:2019/3/7
 */
public class OkHttpUtils {
    public static String OkGetArt(String url) {
        String html = null;
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder()
                .url(url)
                .build();
        try (Response response = client.newCall(request).execute()) {
            //return
            html = response.body().string();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }
}

The class that extracts the data.
This is where Jsoup comes in: it parses the HTML of the fetched page so we can pull the data out of specific tags.
Open the target site and view the page source to see the page structure and which tags hold the data we want.
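As a minimal, self-contained sketch of how these Jsoup selectors behave: the HTML fragment below is invented, but the class names mirror the SMZDM markup used later in this article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo {
    // Parse a small HTML fragment and pull out a title and an image URL
    public static String[] extract(String html) {
        Document doc = Jsoup.parse(html);
        // Attribute selector: matches elements whose class attribute equals this exact value
        String title = doc.select("h5[class=feed-block-title]").text();
        // attr() returns the attribute value of the first matching element
        String imgUrl = doc.select("img").attr("src");
        return new String[]{title, imgUrl};
    }

    public static void main(String[] args) {
        String html = "<ul class=\"feed-list-hits\">"
                + "<li><h5 class=\"feed-block-title\">Sample title</h5>"
                + "<img src=\"http://example.com/a.jpg\"></li></ul>";
        String[] result = extract(html);
        System.out.println(result[0]); // Sample title
        System.out.println(result[1]); // http://example.com/a.jpg
    }
}
```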


import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import android.util.Log;

/*
 *@Author:Swallow
 *@Date:2019/3/21
 */
public class GetData {
    /**
     * Scrape the featured articles from the SMZDM home page
     * @param html the raw HTML of the home page
     * @return ArrayList<Article> of scraped articles
     */
    public static ArrayList<Article> spiderArticle(String html) {
        ArrayList<Article> articles = new ArrayList<>();
        Document document = Jsoup.parse(html);
        Elements elements = document
                .select("ul[class=feed-list-hits]")
                .select("li[class=feed-row-wide J_feed_za ]");
        for (Element element : elements) {
            String title = element
                    .select("h5[class=feed-block-title]")
                    .text();
            String author = element
                    .select("div[class=feed-block-info]")
                    .select("span")
                    .text();
            String imgurl = element
                    .select("div[class=z-feed-img]")
                    .select("a")
                    .select("img")
                    .attr("src");
            String context = element
                    .select("div[class=feed-block-descripe]")
                    .text();
            String url = element
                    .select("div[class=feed-block z-hor-feed ]")
                    .select("a")
                    .attr("href");
            Article article = new Article(title, author, imgurl, context);
            articles.add(article);
            Log.e("DATA>>", article.toString());
        }
        return articles;
    }
}

After that, we just call these methods.
One thing to note: on Android, network requests must run off the main thread, so we start a new worker thread for the call.

final String url = "https://www.smzdm.com/";
new Thread() {
    public void run() {
        String html = OkHttpUtils.OkGetArt(url);
        ArrayList<Article> articles = GetData.spiderArticle(html);
        // Send the result to the Handler so the UI can be updated on the main thread
        Message message = handler.obtainMessage();
        message.what = 1;
        message.obj = articles;
        handler.sendMessage(message);
    }
}.start();
