Scraping Web Page Data on Android with jsoup (Part 1)

The demo GIFs are too large to embed, so I put them on GitHub. Click the links below to see them:
Demo 1
Demo 2
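First, a caveat: jsoup only parses the HTML the server returns. It does not run JavaScript, so anything generated by scripts is out of reach; for now we scrape only the static HTML. Also, to be clear, I am doing this purely to learn. It is not for commercial use and I am not maliciously stealing data; copyright issues are scary these days.
Their data (the site being scraped)
The first step is to add the dependency: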

 compile 'org.jsoup:jsoup:1.10.1'

Next comes the unpleasant part: the page's HTML source. I formatted it in Sublime and pasted it below. The full snippet was too long, so it is truncated.

<div class="wrap">
      <div class="w clear">
        <div class="space_left">
          <div class="ui_newlist_1 get_num" id="J_list">
            <ul>
              <li data-id="285511">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-285511.html" title="孔雀开屏鱼">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/17/20161217148196906319013.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-285511.html">孔雀开屏鱼</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-9541848.html">零下一度0511</a></p>
                  <p class="subcontent">原料:武昌鱼、姜、豆豉、葱、青红椒、盐、胡椒粉、蒸鱼豉油、花生油、料酒。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304148">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304148.html" title="花开富贵">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/12/20161212148152950212413.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304148.html">花开富贵</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-9014474.html">小厨妞1688</a></p>
                  <p class="subcontent">原料:大白菜菜叶、辣椒、猪肉、盐。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304224">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304224.html" title="千层葱花饼">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/12/20161212148155127748313.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304224.html">千层葱花饼</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-2261565.html">香儿厨房</a></p>
                  <p class="subcontent">原料:馄饨皮、葱花、盐、蛋液、油。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304301">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304301.html" title="猪蹄冻----高逼格新年宴客菜">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/13/20161213148161404254213.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304301.html">猪蹄冻----高逼格新年宴客菜</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-7482619.html">允儿小妞的厨房</a></p>
                  <p class="subcontent">原料:猪蹄儿、料酒、生姜、大蒜、桂皮、香叶、八角、丁香、大葱、小米辣。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="302700">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-302700.html" title="辣酱粉丝">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/13/20161213148159716214013.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-302700.html">辣酱粉丝</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-9764821.html">天国的女儿</a></p>
                  <p class="subcontent">原料:五花肉、海米、粉丝、郫县豆瓣酱、蚝油、蒜蓉辣酱、蒜、葱、白糖、香菜末</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304481">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304481.html" title="【桃李厨艺】正宗祖传灌汤小笼包的做法,鲜美多汁!吃货赶紧来试试!">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/14/20161214148170810381213.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304481.html">【桃李厨艺】正宗祖传灌汤小笼包的做法,鲜美多汁!吃货赶紧来试试!</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-8893689.html">桃李烹饪</a></p>
                  <p class="subcontent">原料:猪肉、盐、细砂糖、色拉油、料酒、鸡精、老抽、葱姜水。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304135">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304135.html" title="千层肉饼">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/12/20161212148152421589213.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304135.html">千层肉饼</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-9118742.html">满宝妈妈</a></p>
                  <p class="subcontent">原料:猪肉馅、清水、食用油、酱油、花椒油、盐、面粉、葱花、料酒、蚝油、香油、鸡精。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304363">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304363.html" title="秘制红烧肉">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/13/20161213148163593838413.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304363.html">秘制红烧肉</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-7194731.html">多幸福多快乐</a></p>
                  <p class="subcontent">原料:猪五花肉、大蒜、葱段、姜片、八角、、盐、食用油、清水、冰糖、黄酒、生抽。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304458">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304458.html" title="麻婆豆腐">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/14/20161214148170177631113.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304458.html">麻婆豆腐</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-6591561.html">梦~桃缘</a></p>
                  <p class="subcontent">原料:肉末、豆腐、郫县豆瓣酱、花椒粉、淀粉、葱、水、辣椒粉、酱油</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
              <li data-id="304129">
                <div class="pic">
                  <a target="_blank" href="http://home.meishichina.com/recipe-304129.html" title="香辣土豆片">
                    <img width="180" height="180" src="http://static.meishichina.com/v6/img/blank.gif" data-src="http://i3.meishichina.com/attachment/recipe/2016/12/12/20161212148153522532813.jpg@!c320" class="imgLoad"></a>
                </div>
                <div class="detail">
                  <h2>
                    <a target="_blank" href="http://home.meishichina.com/recipe-304129.html">香辣土豆片</a></h2>
                  <p class="subline">
                    <a target="_blank" href="http://home.meishichina.com/space-1478694.html">斯佳丽WH</a></p>
                  <p class="subcontent">原料:土豆、青蒜、辣椒粉、干辣椒、蒜、油盐。</p>
                  <div class="substatus clear">
                    <div class="left"></div>
                  </div>
                </div>
              </li>
            </ul>
          </div>

This list is the data we scrape. Now for the jsoup code that does the actual work:


            Document document = Jsoup.connect("http://home.meishichina.com/show-top-type-recipe-page-" + page + ".html").get();

            Log.d("jsoup:", "http://home.meishichina.com/show-top-type-recipe-page-" + page + ".html");

            Elements elements = document.select("div.top-bar");
//            Log.d("jsoup:", elements.select("a").attr("title"));

            Elements titleAndPic = document.select("div.pic");
//            Log.d("jsoup", "数量:" + titleAndPic.size());
//            Log.d("jsoup", "title:" + titleAndPic.get(1).select("a").attr("title") + "pic:" + titleAndPic.get(1).select("a").select("img").attr("data-src"));
            Elements url = document.select("div.detail").select("h2").select("a");
//            Log.d("jsoup", "url:" + url.get(1).attr("href"));
            Elements burden = document.select("p.subcontent");
//            Log.d("jsoup", "burden:" + burden.get(1).text());

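Roughly speaking, select picks the nodes that match a CSS selector, and attr reads an attribute value from them. Scraping is not just a matter of knowing a site's address and going at it; you have to understand how the page's data is structured first. That is how it works with jsoup, at least. I have not used other tools, so I will not go into them.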
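As a minimal sketch of what select and attr do, the following parses a small inline HTML string modeled on the list above and reads the same four fields. The sample values, the example.com URLs, and the class and method names are all made up for illustration. Note that it walks each li and selects inside it, which is a swapped-in approach rather than the three parallel selections used in this post. It reads data-src rather than src because src only holds the blank.gif placeholder; the real image is presumably swapped in by script.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

final class RecipeSelectDemo {

    // Hypothetical sample mirroring the structure of one list item on the page.
    static final String SAMPLE_HTML =
            "<ul>"
            + "<li data-id=\"1\">"
            + "<div class=\"pic\">"
            + "<a href=\"http://example.com/recipe-1.html\" title=\"Steamed fish\">"
            + "<img src=\"http://example.com/blank.gif\""
            + " data-src=\"http://example.com/img/1.jpg@!c320\">"
            + "</a></div>"
            + "<div class=\"detail\">"
            + "<h2><a href=\"http://example.com/recipe-1.html\">Steamed fish</a></h2>"
            + "<p class=\"subcontent\">Ingredients: fish, ginger.</p>"
            + "</div></li></ul>";

    /** Returns one "title | image | url | ingredients" row per list item. */
    static List<String> extract(String html) {
        Document document = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        // Select every <li> that carries a data-id attribute, then query inside it.
        for (Element item : document.select("li[data-id]")) {
            Elements link = item.select("div.pic > a");
            String title = link.attr("title");                     // attribute of the first match
            String image = link.select("img").attr("data-src");    // real image URL, not the placeholder
            String url = item.select("div.detail h2 a").attr("href");
            String ingredients = item.select("p.subcontent").text(); // text content, tags stripped
            rows.add(title + " | " + image + " | " + url + " | " + ingredients);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : extract(SAMPLE_HTML)) {
            System.out.println(row);
        }
    }
}
```

Selecting inside each item keeps the four fields of one recipe together, so a list item that lacks, say, its p.subcontent yields an empty string for that field instead of shifting every later index.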
Now the Android code.
Fragment:

package com.fanyafeng.recreation.fragment;

import android.os.Bundle;
import android.os.Handler;
import android.os.Message;
import android.support.annotation.Nullable;
import android.support.v7.widget.RecyclerView;
import android.support.v7.widget.StaggeredGridLayoutManager;
import android.support.v7.widget.Toolbar;
import android.util.Log;
import android.view.LayoutInflater;
import android.view.View;
import android.view.ViewGroup;

import com.fanyafeng.recreation.R;
import com.fanyafeng.recreation.adapter.MenuAdapter;
import com.fanyafeng.recreation.bean.MenuBean;
import com.fanyafeng.recreation.refreshview.XRefreshView;
import com.fanyafeng.recreation.refreshview.XRefreshViewFooter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class TwoFragment extends BaseFragment {
    private static final String ARG_PARAM1 = "param1";
    private static final String ARG_PARAM2 = "param2";

    private static final int XREFRESH_GET_DATA = 0;
    private static final int XREFRESH_FRESH = 1;
    private static final int XREFRESH_LOAD_MORE = 2;

    private String mParam1;
    private String mParam2;

    private Toolbar toolbar_two;

    private XRefreshView refreshTwo;
    private RecyclerView rvTwo;

    private List<MenuBean> menuBeanList = new ArrayList<>();

    private MenuAdapter menuAdapter;

    private int page = 1;


    public TwoFragment() {
        // Required empty public constructor
    }

    public static TwoFragment newInstance(String param1, String param2) {
        TwoFragment fragment = new TwoFragment();
        Bundle args = new Bundle();
        args.putString(ARG_PARAM1, param1);
        args.putString(ARG_PARAM2, param2);
        fragment.setArguments(args);
        return fragment;
    }

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        if (getArguments() != null) {
            mParam1 = getArguments().getString(ARG_PARAM1);
            mParam2 = getArguments().getString(ARG_PARAM2);
        }
    }

    @Override
    public View onCreateView(LayoutInflater inflater, ViewGroup container, Bundle savedInstanceState) {
        // Inflate the layout for this fragment
        return inflater.inflate(R.layout.fragment_two, container, false);
    }

    @Override
    public void onActivityCreated(@Nullable Bundle savedInstanceState) {
        super.onActivityCreated(savedInstanceState);
        initView();
        Thread thread = new Thread(new LoadThread(XREFRESH_GET_DATA));
        thread.start();
    }

    private void initView() {
        toolbar_two = (Toolbar) getActivity().findViewById(R.id.toolbar_two);
        toolbar_two.setLogo(R.drawable.simle_logo_02);
        toolbar_two.setTitle("美食");

        refreshTwo = (XRefreshView) getActivity().findViewById(R.id.refreshTwo);
        refreshTwo.setAutoLoadMore(true);
        refreshTwo.setPullLoadEnable(true);

        rvTwo = (RecyclerView) getActivity().findViewById(R.id.rvTwo);
        rvTwo.setLayoutManager(new StaggeredGridLayoutManager(2, StaggeredGridLayoutManager.VERTICAL));
        menuAdapter = new MenuAdapter(getActivity(), menuBeanList);
        menuAdapter.setCustomLoadMoreView(new XRefreshViewFooter(getActivity()));
        rvTwo.setAdapter(menuAdapter);

        refreshTwo.setXRefreshViewListener(new XRefreshView.SimpleXRefreshListener() {
            @Override
            public void onRefresh() {
                super.onRefresh();
//                getData(1, XREFRESH_FRESH);
                new Handler().postDelayed(new Runnable() {
                    @Override
                    public void run() {
                        Thread thread = new Thread(new LoadThread(XREFRESH_FRESH));
                        thread.start();
                    }
                }, 1000);
            }

            @Override
            public void onLoadMore(boolean isSilence) {
                super.onLoadMore(isSilence);
                new Handler().postDelayed(new Runnable() {
                    @Override
                    public void run() {
                        Thread thread = new Thread(new LoadThread(XREFRESH_LOAD_MORE));
                        thread.start();
                    }
                }, 1000);
            }
        });
    }

    private void getData(int page, int refreshState) {

        try {
//            Document document = Jsoup.connect("http://home.meishichina.com/show-top-type-recipe.html").get();
//            http://home.meishichina.com/show-top-type-recipe-page-2.html


            Document document = Jsoup.connect("http://home.meishichina.com/show-top-type-recipe-page-" + page + ".html").get();

            Log.d("jsoup:", "http://home.meishichina.com/show-top-type-recipe-page-" + page + ".html");

            Elements elements = document.select("div.top-bar");
//            Log.d("jsoup:", elements.select("a").attr("title"));

            Elements titleAndPic = document.select("div.pic");
//            Log.d("jsoup", "数量:" + titleAndPic.size());
//            Log.d("jsoup", "title:" + titleAndPic.get(1).select("a").attr("title") + "pic:" + titleAndPic.get(1).select("a").select("img").attr("data-src"));
            Elements url = document.select("div.detail").select("h2").select("a");
//            Log.d("jsoup", "url:" + url.get(1).attr("href"));
            Elements burden = document.select("p.subcontent");
//            Log.d("jsoup", "burden:" + burden.get(1).text());


            // The three selections are parallel lists; guard against a size mismatch.
            int count = Math.min(titleAndPic.size(), Math.min(url.size(), burden.size()));
            for (int i = 0; i < count; i++) {
                String img = titleAndPic.get(i).select("a").select("img").attr("data-src");
                int imgLength = img.length();
                String title = titleAndPic.get(i).select("a").attr("title");
                // Swap the trailing "320" size suffix for "640" to request a larger image.
                String pic = imgLength > 3 ? img.substring(0, imgLength - 3) + "640" : img;
                String itemUrl = url.get(i).attr("href");
                String itemBurden = burden.get(i).text();
                MenuBean menuBean = new MenuBean();
                menuBean.setTitle(title);
                menuBean.setPic(pic);
                menuBean.setUrl(itemUrl);
                menuBean.setBurden(itemBurden);
                menuBeanList.add(menuBean);
            }

        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Notify the UI once per load (not once per item), even on failure,
            // so the refresh / load-more indicator always stops.
            Message message = Message.obtain();
            message.what = refreshState;
            handler.sendMessage(message);
        }
    }

    class LoadThread implements Runnable {

        int refreshState;

        public LoadThread(int refreshState) {
            this.refreshState = refreshState;
        }

        @Override
        public void run() {
            if (refreshState == XREFRESH_GET_DATA) {
                menuBeanList.clear();
                page = 1;
            } else if (refreshState == XREFRESH_FRESH) {
                menuBeanList.clear();
                page = 1;
            } else if (refreshState == XREFRESH_LOAD_MORE) {
                page++;
            }
            getData(page, refreshState);
        }
    }

    Handler handler = new Handler() {
        @Override
        public void handleMessage(Message msg) {
            super.handleMessage(msg);
            switch (msg.what) {
                case XREFRESH_FRESH:
                    refreshTwo.stopRefresh();
                    break;
                case XREFRESH_GET_DATA:
                    break;
                case XREFRESH_LOAD_MORE:
                    refreshTwo.stopLoadMore();
                    break;
            }
            menuAdapter.notifyDataSetChanged();
        }
    };
}
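In getData above, the larger image URL is built with img.substring(0, imgLength - 3) + "640", which relies on every data-src ending in the three characters "320". A slightly more defensive variant, sketched here with plain Java strings, only rewrites URLs that really end in the "@!c320" suffix seen in the page source and leaves anything else untouched. The class and method names are hypothetical, and whether the server serves an "@!c640" variant for every image is an assumption carried over from the original code.

```java
final class ThumbnailUrls {

    private static final String SMALL_SUFFIX = "@!c320"; // suffix seen in the page source
    private static final String LARGE_SUFFIX = "@!c640"; // assumed larger variant

    /** Returns the URL with the size suffix upgraded, or the input unchanged if it has no such suffix. */
    static String toLarge(String url) {
        if (url == null) {
            return ""; // treat a missing attribute like jsoup's attr(), which returns an empty string
        }
        if (url.endsWith(SMALL_SUFFIX)) {
            return url.substring(0, url.length() - SMALL_SUFFIX.length()) + LARGE_SUFFIX;
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(toLarge("http://example.com/img/1.jpg@!c320"));
        System.out.println(toLarge("http://example.com/img/2.png"));
    }
}
```

With this helper, an empty or unexpected data-src no longer throws StringIndexOutOfBoundsException; the item simply keeps whatever URL it had.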

XML layout:

<!-- The root tag name was lost from the original post; FrameLayout is an assumption. -->
<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:background="@android:color/white"
    tools:context="com.fanyafeng.recreation.fragment.TwoFragment">

    <android.support.design.widget.CoordinatorLayout xmlns:android="http://schemas.android.com/apk/res/android"
        xmlns:app="http://schemas.android.com/apk/res-auto"
        xmlns:tools="http://schemas.android.com/tools"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:fitsSystemWindows="false"
        tools:context="com.fanyafeng.recreation.activity.MainActivity">

        <android.support.design.widget.AppBarLayout
            android:layout_width="match_parent"
            android:layout_height="wrap_content"
            android:theme="@style/AppTheme.AppBarOverlay">

            <android.support.v7.widget.Toolbar
                android:id="@+id/toolbar_two"
                android:layout_width="match_parent"
                android:layout_height="?attr/actionBarSize"
                android:background="?attr/colorPrimary"
                app:layout_scrollFlags="scroll|exitUntilCollapsed"
                app:popupTheme="@style/AppTheme.PopupOverlay">

                <!-- Tag name lost in the original post; TextView is inferred from the text attributes. -->
                <TextView
                    android:id="@+id/toolbar_center_title"
                    android:layout_width="wrap_content"
                    android:layout_height="wrap_content"
                    android:layout_gravity="center"
                    android:textColor="@android:color/white"
                    android:textSize="@dimen/activity_horizontal_margin" />
            </android.support.v7.widget.Toolbar>

        </android.support.design.widget.AppBarLayout>

        <!-- This container's tag name was also lost; FrameLayout is an assumption. -->
        <FrameLayout
            android:layout_width="match_parent"
            android:layout_height="wrap_content"
            app:layout_behavior="@string/appbar_scrolling_view_behavior">

            <com.fanyafeng.recreation.refreshview.XRefreshView
                android:id="@+id/refreshTwo"
                android:layout_width="match_parent"
                android:layout_height="match_parent">

                <android.support.v7.widget.RecyclerView
                    android:id="@+id/rvTwo"
                    android:layout_width="match_parent"
                    android:layout_height="match_parent" />
            </com.fanyafeng.recreation.refreshview.XRefreshView>
        </FrameLayout>

    </android.support.design.widget.CoordinatorLayout>

</FrameLayout>

And here is a non-GIF screenshot:
[Image 1: screenshot of the app]
