Writing a Web Crawler to Fetch Ele.me Merchant Information (Part 1)

We'll crawl the data with two different tools: HttpClient and Jsoup.

Maven coordinates:

<dependency>
	<groupId>commons-httpclient</groupId>
	<artifactId>commons-httpclient</artifactId>
	<version>3.1</version>
</dependency>

<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.10.2</version>
</dependency>

The page we want to crawl:

[Figure 1: the Ele.me restaurant listing page]


Monitor the network traffic with Chrome DevTools:

[Figure 2: Chrome DevTools Network panel]

[Figure 3: the JSON response shown in DevTools]

The data rendered on the page comes from JSON returned by the backend, so we only need to request the data URL directly.

URL: https://www.ele.me/restapi/shopping/restaurants?extras%5B%5D=activities&geohash=wsb0ujx0pu4&latitude=26.88082&limit=24&longitude=112.68573&offset=0&terminal=web

Opening this URL directly in the browser shows the JSON as garbled text, so let's request it in code instead:
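As an aside, the `%5B%5D` in the query string is just the percent-encoded form of `[]`, so the parameter is really named `extras[]`. A quick sketch with the standard `java.net.URLDecoder`/`URLEncoder` shows the round trip:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlParamDemo {

	public static void main(String[] args) throws Exception {
		// decode the raw query parameter: %5B -> '[' and %5D -> ']'
		String decoded = URLDecoder.decode("extras%5B%5D=activities", "UTF-8");
		System.out.println(decoded); // extras[]=activities
		// encoding "extras[]" produces the %5B%5D seen in the crawled URL
		String encoded = URLEncoder.encode("extras[]", "UTF-8");
		System.out.println(encoded); // extras%5B%5D
	}
}
```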

HttpClient:

package com.yc.elm.utils;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;

public class GetDate {

	public static void main(String[] args) throws Exception {
		String url = "https://www.ele.me/restapi/shopping/restaurants"
				+ "?extras%5B%5D=activities&geohash=wsb0ujx0pu4&latitude=26.88082"
				+ "&limit=24&longitude=112.68573&offset=0&terminal=web";
		// create the client and issue a GET request
		HttpClient client = new HttpClient();
		HttpMethod method = new GetMethod(url);
		client.executeMethod(method);
		byte[] bytes = method.getResponseBody();
		// decode the response bytes as UTF-8 (the API returns UTF-8 JSON)
		String json = new String(bytes, "utf-8");
		System.out.println(json);
		method.releaseConnection();
	}
}
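The `new String(bytes, "utf-8")` step matters: the response body contains Chinese shop names encoded as UTF-8, and decoding those bytes with the wrong charset produces exactly the kind of garbled text we saw in the browser. A minimal sketch of the difference:

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

	public static void main(String[] args) {
		// each of these Chinese characters takes 3 bytes in UTF-8
		byte[] utf8 = "饿了么".getBytes(StandardCharsets.UTF_8);
		// decoding with the right charset recovers the original text
		System.out.println(new String(utf8, StandardCharsets.UTF_8));
		// decoding with ISO-8859-1 turns the 9 bytes into 9 garbage characters
		System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));
	}
}
```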


Result: the JSON string prints to the console.



Jsoup:

package com.yc.elm.utils;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

public class GetDate {

	public static void main(String[] args) throws Exception {
		String url = "https://www.ele.me/restapi/shopping/restaurants?"
				+ "extras%5B%5D=activities&geohash=wsb0ujqse46&latitude=26.88021&limit=24&"
				+ "longitude=112.68484&offset=0&terminal=web";
		Connection con = Jsoup.connect(url);
		Response response = con.execute();
		System.out.println(response.body());
	}
}

This throws an error:

Exception in thread "main" org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/json, URL=https://www.ele.me/restapi/shopping/restaurants?extras%255B%255D=activities&geohash=wsb0ujqse46&latitude=26.88021&limit=24&longitude=112.68484&offset=0&terminal=web
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:689)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
	at com.yc.elm.utils.GetDate.main(GetDate.java:14)

This happens because Jsoup only accepts responses whose MIME type is text/*, application/xml, or application/xhtml+xml; a JSON response triggers UnsupportedMimeTypeException. Calling .ignoreContentType(true) tells Jsoup to skip that check and accept the response anyway.

package com.yc.elm.utils;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

public class GetDate {

	public static void main(String[] args) throws Exception {
		String url = "https://www.ele.me/restapi/shopping/restaurants?"
				+ "extras%5B%5D=activities&geohash=wsb0ujqse46&latitude=26.88021&limit=24&"
				+ "longitude=112.68484&offset=0&terminal=web";
		Connection con = Jsoup.connect(url).ignoreContentType(true);
		Response response = con.execute();
		System.out.println(response.body());
	}
}
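Both approaches need a workaround: commons-httpclient 3.1 is long end-of-life, and Jsoup only accepts JSON with `ignoreContentType(true)`. On Java 11 or newer, the built-in `java.net.http.HttpClient` handles JSON responses with no extra flags; here is a sketch (the class name is mine, and the send is wrapped in a try/catch since it needs live network access):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GetDataJdk {

	public static void main(String[] args) {
		String url = "https://www.ele.me/restapi/shopping/restaurants"
				+ "?extras%5B%5D=activities&geohash=wsb0ujx0pu4&latitude=26.88082"
				+ "&limit=24&longitude=112.68573&offset=0&terminal=web";
		HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
		try {
			// ofString() decodes the body using the charset in the response headers
			HttpResponse<String> response = HttpClient.newHttpClient()
					.send(request, HttpResponse.BodyHandlers.ofString());
			System.out.println(response.body());
		} catch (Exception e) {
			// requires network access; the request itself is built either way
			System.out.println("request failed: " + e.getMessage());
		}
	}
}
```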

Result:

[Figure 4: the JSON response printed to the console]


Now that we have the data, all that's left is to parse the JSON with a JSON library. See the next post for the details.

