I. Two ways to obtain the HTML
1. Way 1: obtain the HTML directly by creating a Connection object
Sample code:
2. Way 2: obtain a Response object first, then get the HTML from it
Sample code:
Output:
II. Setting request headers
1. Setting a single request header
2. Setting multiple request headers
3. The usual practice
Approach:
Common User-Agent strings:
Sample code:
III. Five ways to submit request parameters
1. The five ways
2. Sample code for the first way
3. Sample code for the second way
4. Sample code for the third way
IV. Timeout settings
1. Sample code for case 1
2. Sample code for case 2
3. Note
V. Using proxy servers
1. What a proxy server is
2. Why use a proxy server
Benefit 1:
Benefit 2:
3. Where proxy servers come from
4. Two ways to set a proxy server
Note:
The two methods:
Demo of way 1:
Demo of way 2:
VI. Writing the response to an output stream (downloading images, PDFs, etc.)
1. Overview
2. Code demo
3. Output (download succeeded)
VII. HTTPS request certificates
1. HTTPS overview
2. Sample code
VIII. Fetching large files
1. Explanation
2. Sample code
package com.zb.book.jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Obtain the Document object
        Document document = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html").get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
(The example below also shows how to obtain other information from the Response object.)
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
public class Main {
    public static void main(String[] args) throws IOException {
        // Obtain the Response object first, then get the HTML from it
        Connection.Response response = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html").method(Connection.Method.GET).execute();
        // Get the request URL
        URL url = response.url();
        System.out.println("Request URL: " + url);
        // Get the response status code
        int statusCode = response.statusCode();
        System.out.println("Status code: " + statusCode);
        // Get the response content type
        String contentType = response.contentType();
        System.out.println("Content type: " + contentType);
        // Get the response status message
        String statusMessage = response.statusMessage();
        System.out.println("Status message: " + statusMessage);
        // A status code of 200 means the request succeeded
        if (statusCode == 200) {
            // Get the HTML
            String html = new String(response.bodyAsBytes(), StandardCharsets.UTF_8);
            // Alternatively, get the Document object (same content as the HTML, but pretty-printed)
            // Document document = response.parse();
            System.out.println(html);
        }
    }
}
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html");
        // Set a single request header
        connect.header("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36");
        // Obtain the Document object
        Document document = connect.get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html");
        // Set multiple request headers: put them all in a Map
        Map<String, String> headers = new HashMap<>();
        headers.put("Accept","*/*");
        headers.put("Content-Type","application/x-www-form-urlencoded");
        headers.put("Referer","http://okzyw.com/?m=vod-type-id-1.html");
        headers.put("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36");
        connect.headers(headers);
        // Obtain the Document object
        Document document = connect.get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
Use a static Builder class to encapsulate the various request parameters;
Pick the User-Agent and Referer at random from a list (to keep the crawler from being spotted by the site's anti-crawler checks);
You can check your own browser's User-Agent by evaluating window.navigator.userAgent in the browser console.
1) Chrome
Win7:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1
2) Firefox
Win7:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0
3) Safari
Win7:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
4) Opera
Win7:
Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50
5) IE
Win7+ie9:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)
Win7+ie8:
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)
WinXP+ie8:
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)
WinXP+ie7:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
WinXP+ie6:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
6) Maxthon
Maxthon 3.1.7 on Win7 + IE9, high-speed mode:
Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12
Maxthon 3.1.7 on Win7 + IE9, IE-core compatibility mode:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)
7) Sogou
Sogou 3.0 on Win7 + IE9, IE-core compatibility mode:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)
Sogou 3.0 on Win7 + IE9, high-speed mode:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0
8) 360
360 Browser 3.0 on Win7 + IE9:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)
9) QQ Browser
QQ Browser 6.9 (11079) on Win7 + IE9, turbo mode:
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201
QQ Browser 6.9 (11079) on Win7 + IE9, IE-core compatibility mode:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201
10) Ayun Browser
Ayun Browser 1.3.0.1724 Beta (built 2011-12-05) on Win7 + IE9:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
The static Builder class:
package com.zb.book.jsoup.data;
import java.util.Arrays;
import java.util.List;
public class Builder {
    // Common User-Agent strings
    private static final String[] userAgentStrs = {
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
    };
    // User-Agent pool
    public static List<String> userAgentList = Arrays.asList(userAgentStrs);
    // Size of the User-Agent list
    public static int userAgentSize = userAgentList.size();
    // Referer pool; add more referers as needed
    public static final String[] refererStrs = {
        "https://www.***.com/"
    };
    // Referer list
    public static List<String> refererList = Arrays.asList(refererStrs);
    // Size of the Referer list
    public static int refererSize = refererList.size();
    // Accept, Accept-Language and Accept-Encoding values
    public static String accept = "*/*";
    public static String acceptLanguage = "zh-cn,zh;q=0.5";
    public static String acceptEncoding = "gzip, deflate";
    // Host header value; set per target site
    public static String host;
}
The Main test class:
package com.zb.book.jsoup;
import com.zb.book.jsoup.data.Builder;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html");
        // Set the host; left empty here
        Builder.host = "";
        // Set multiple request headers: put them all in a Map
        Map<String, String> headers = new HashMap<>();
        headers.put("Accept","*/*");
        headers.put("Content-Type","application/x-www-form-urlencoded");
        // Pick a Referer at random
        headers.put("Referer",Builder.refererList.get(new Random().nextInt(Builder.refererSize)));
        // Pick a User-Agent at random
        headers.put("User-Agent",Builder.userAgentList.get(new Random().nextInt(Builder.userAgentSize)));
        headers.put("Accept-Language",Builder.acceptLanguage);
        headers.put("Accept-Encoding",Builder.acceptEncoding);
        connect.headers(headers);
        // Obtain the Document object
        Document document = connect.get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
(The first three are the most commonly used; sample code for them is below, followed by a sketch of the upload-oriented overloads.)
Connection data(String key, String value);
Connection data(String... keyvals);
Connection data(Map<String, String> data);
Connection data(String key, String filename, InputStream inputStream);
Connection data(String key, String filename, InputStream inputStream, String contentType);
Connection data(Collection<Connection.KeyVal> data);
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html");
        // Set the request parameters (the key part): one key/value pair per data() call
        connect.data("key1","value1").data("key2","value2");
        // Obtain the Document object
        Document document = connect.get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html");
        // Set the request parameters (the key part): alternating keys and values
        connect.data("key1","value1","key2","value2");
        // Obtain the Document object
        Document document = connect.get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("http://okzyw.com/?m=vod-type-id-1.html");
        // Set the request parameters (the key part): pass them all in a Map
        Map<String, String> data = new HashMap<>();
        data.put("key1","value1");
        data.put("key2","value2");
        connect.data(data);
        // Obtain the Document object
        Document document = connect.get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
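The fourth and fifth overloads are mainly for uploading a file as part of a multipart request; the original examples do not cover them, so the following is only a minimal sketch of the fourth way, assuming a hypothetical upload endpoint and local file path (both are placeholders):
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.FileInputStream;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Hypothetical upload URL and local file; replace both with real values
        try (FileInputStream inputStream = new FileInputStream("C:\\Users\\ZiBo\\Desktop\\1.gif")) {
            // Fourth overload: key, filename, InputStream (sent as a multipart file part)
            Connection.Response response = Jsoup
                    .connect("http://example.com/upload")
                    .data("file", "1.gif", inputStream)
                    .method(Connection.Method.POST)
                    .ignoreContentType(true)
                    .execute();
            System.out.println("Status code: " + response.statusCode());
        }
    }
}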
package com.zb.book.jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Obtain the Document object with a 3-second timeout
        Document document = Jsoup
                .connect("http://okzyw.com/?m=vod-type-id-1.html")
                .timeout(3000)
                .get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Obtain the Response object with a 3-second timeout
        Connection.Response response = Jsoup
                .connect("http://okzyw.com/?m=vod-type-id-1.html")
                .method(Connection.Method.GET)
                .timeout(3000)
                .execute();
        // Obtain the Document object
        Document document = response.parse();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
If no timeout is set, the default is 30 seconds; if the timeout is exceeded, the request fails with an IOException;
A proxy server is a server that sits between the client and the web server. With a proxy in place, the browser no longer fetches data from the web server directly; instead it sends its request to the proxy server, which retrieves the requested information on the browser's behalf. In other words, the proxy acts as a middleman.
It hides the crawler's real IP address, which helps keep the crawler from being blocked by the server;
A crawler with a fixed IP must insert random pauses between requests, whereas with proxy servers this is unnecessary, so data collection becomes more efficient;
Some websites and web APIs provide free proxies, but these tend to be unstable;
You can also pay for commercial proxies, whose IP addresses have a higher availability rate and are more stable;
The demos below use a single proxy IP address and port. In real use you would normally build a pool of proxy servers and keep switching among them while working through the URL queue (a sketch of this follows the two demos below);
Connection proxy(Proxy proxy);
Connection proxy(String host, int port);
package com.zb.book.jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
public class Main {
    public static void main(String[] args) throws IOException {
        // Configure the proxy
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("171.221.239.11", 808));
        // Obtain the Document object through the proxy
        Document document = Jsoup
                .connect("http://okzyw.com/?m=vod-type-id-1.html")
                .proxy(proxy)
                .get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
package com.zb.book.jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Main {
    public static void main(String[] args) throws IOException {
        // Obtain the Document object through the proxy
        Document document = Jsoup
                .connect("http://okzyw.com/?m=vod-type-id-1.html")
                .proxy("171.221.239.11", 808) // configure the proxy by host and port
                .get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}
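As noted above, a real crawler normally rotates through a pool of proxies rather than reusing a single one. A minimal sketch of that idea, assuming a hard-coded list of proxy addresses (the addresses are placeholders and not part of the original demos):
package com.zb.book.jsoup;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Random;
public class Main {
    // Placeholder proxy pool in "host:port" form; in practice this would be loaded
    // from a file, a database, or a commercial proxy API
    private static final String[] PROXIES = {"171.221.239.11:808", "171.221.239.12:808"};
    public static void main(String[] args) throws IOException {
        // Pick a proxy at random for this request
        String[] proxy = PROXIES[new Random().nextInt(PROXIES.length)].split(":");
        Document document = Jsoup
                .connect("http://okzyw.com/?m=vod-type-id-1.html")
                .proxy(proxy[0], Integer.parseInt(proxy[1]))
                .get();
        // Print the HTML of the document
        System.out.println(document.html());
    }
}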
When using Jsoup to download images, PDFs, or archives, the response has to be written to an output stream, so that the content can be written byte by byte into the target file;
In addition, for images, PDFs and similar binary files, the request that fetches the Response must call the ignoreContentType(boolean ignoreContentType) method to ignore the content type; otherwise an error is thrown;
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.*;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("https://csdnimg.cn/cdn/content-toolbar/magpieFestival-white.gif");
        // Execute the request and get the Response
        Connection.Response response = connect.method(Connection.Method.GET).ignoreContentType(true).execute();
        // Get the response body as an input stream
        BufferedInputStream bufferedInputStream = response.bodyStream();
        // Write the image out byte by byte
        byte[] buffer = new byte[1024];
        int len = 0;
        // Create the buffered output stream
        FileOutputStream fileOutputStream = new FileOutputStream(new File("C:\\Users\\ZiBo\\Desktop\\1.gif"));
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
        while ((len = bufferedInputStream.read(buffer,0,1024)) != -1){
            bufferedOutputStream.write(buffer,0,len);
        }
        // Flush and close the streams
        bufferedOutputStream.flush();
        bufferedOutputStream.close();
        bufferedInputStream.close();
    }
}
URLs that start with https:// use the HTTPS protocol, which is HTTP with SSL (Secure Sockets Layer) layered on top. SSL secures network communication and is widely used to authenticate the client and the server to each other and to encrypt the data exchanged between them.
SSL supports two-way authentication (server authentication and client authentication): the server certificate is sent down to the client, and a client certificate can be sent back to the server. In practice client certificates are rarely used when browsing and most users do not have one, while every HTTPS server must present a server certificate. The most widely used certificate format is X.509.
When a crawler requests a URL that starts with https://, it usually also needs to set up an X.509 certificate trust manager; without one, you may get an error complaining that no valid certificate can be found.
package com.zb.book.jsoup.copy;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import javax.net.ssl.*;
import java.io.IOException;
import java.security.cert.X509Certificate;
public class JsoupConnectSSLInit {
    public static void main(String[] args) throws IOException {
        initUnSecureTSL();
        String url = "https://cn.kompass.com/a/hospitality-tourism-hotel-and-catering-industries/78/";
        // Create the connection
        Connection connect = Jsoup.connect(url);
        // Request the page
        Document document = connect.get();
        // Print the HTML
        System.out.println(document.html());
    }
    private static void initUnSecureTSL() {
        // Create a trust manager that does not validate certificates
        final TrustManager[] trustAllCerts = new TrustManager[]{new X509TrustManager() {
            // Check client certificates
            public void checkClientTrusted(final X509Certificate[] chain, final String authType) {
                // do nothing: accept any client certificate
            }
            // Check server certificates
            public void checkServerTrusted(final X509Certificate[] chain, final String authType) {
                // do nothing: accept any server certificate
            }
            // Return the accepted certificate issuers
            public X509Certificate[] getAcceptedIssuers() {
                return null; // or: return new X509Certificate[0];
            }
        }};
        try {
            // Create an SSLContext and initialize it with the trust manager above
            SSLContext sslContext = SSLContext.getInstance("SSL");
            sslContext.init(null, trustAllCerts, new java.security.SecureRandom());
            // Create an SSL socket factory from the trust manager
            SSLSocketFactory sslSocketFactory = sslContext.getSocketFactory();
            // Install the SSLSocketFactory on HttpsURLConnection
            HttpsURLConnection.setDefaultSSLSocketFactory(sslSocketFactory);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
By default, Jsoup retrieves at most 1 MB of content, so images, archives, and other files larger than 1 MB come back truncated and unusable. The limit can be raised with the maxBodySize(int bytes) method (a value of 0 means unlimited);
package com.zb.book.jsoup;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.io.*;
public class Main {
    public static void main(String[] args) throws IOException {
        // Create the Connection object
        Connection connect = Jsoup.connect("https://gudu2019.oss-cn-beijing.aliyuncs.com/apk/%E5%AD%A4%E7%8B%AC5.1.apk");
        // Execute the request with the body-size limit raised, and get the Response
        Connection.Response response = connect.maxBodySize(Integer.MAX_VALUE).method(Connection.Method.GET).ignoreContentType(true).execute();
        // Get the response body as an input stream
        BufferedInputStream bufferedInputStream = response.bodyStream();
        // Write the file out byte by byte
        byte[] buffer = new byte[1024];
        int len = 0;
        // Create the buffered output stream
        FileOutputStream fileOutputStream = new FileOutputStream(new File("C:\\Users\\ZiBo\\Desktop\\1.apk"));
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
        while ((len = bufferedInputStream.read(buffer,0,1024)) != -1){
            bufferedOutputStream.write(buffer,0,len);
        }
        // Flush and close the streams
        bufferedOutputStream.flush();
        bufferedOutputStream.close();
        bufferedInputStream.close();
    }
}