I had never worked with web crawlers before, so, following examples found online, I put together my first Java crawler, which scrapes the girl pictures from http://www.mmjpg.com/.
Seeing images pulled off the web for the first time was quite satisfying. Enough preamble; let's walk through how it is implemented.
Building this simple crawler boils down to the following steps:
1) Obtain an HttpClient object. This requires the HttpClient jar, which can be downloaded from the official Apache HttpComponents site.
2) Call the client's execute method to get the response. The code works with CloseableHttpResponse, the closeable variant of HttpResponse that CloseableHttpClient returns.
3) Read the response entity from step 2 as an InputStream and convert it to a String with a small utility class (a minimal sketch of steps 2 and 3 follows this list).
4) With that in place, what remains is splitting the HTML string and writing each image to a file:
4.1) First trim the HTML string down to the section of the page that contains the picture list.
4.2) Then use the img tags to extract the URL of each picture.
4.3) Prepare the directory where the image files will be stored.
4.4) Create a java.net.URL object from each image URL obtained above.
4.5) Open an InputStream on that URL and a FileOutputStream on the target file, then copy the content in a loop, using System.arraycopy to trim each read buffer before writing it out.
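The core of steps 2 and 3 is only a few lines. Here is a minimal sketch, assuming HttpClient 4.x is on the classpath; the FetchSketch class is hypothetical, and EntityUtils is used here as a shortcut for the stream-to-String conversion that the Utils class further below does by hand.
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchSketch {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();               // step 1: the client
        CloseableHttpResponse response = client.execute(new HttpGet("http://www.mmjpg.com/")); // step 2: execute the GET
        String html = EntityUtils.toString(response.getEntity(), "UTF-8");      // step 3: entity -> String
        System.out.println(html.length());
        response.close();
        client.close();
    }
}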
The full source code follows.
The main program:
import java.io.InputStream;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class SimpleSpider {

    private static final int meiZiPage = 67;

    public static void main(String[] args) {
        SimpleSpider spider = new SimpleSpider();
        //spider.getJianDanImages();
        spider.getMeiZiImages();
    }

    private void getMeiZiImages() {
        // Global request config: standard cookie policy, 6-second connection timeouts
        RequestConfig globalConfig = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.STANDARD)
                .setConnectionRequestTimeout(6000)
                .setConnectTimeout(6000)
                .build();
        CloseableHttpClient httpClient = HttpClients.custom().setDefaultRequestConfig(globalConfig).build();
        System.out.println("The Java crawler is about to start fetching pictures...");
        for (int i = 1; i <= meiZiPage; i++) {
            // Create a GET request for http://www.mmjpg.com/home/<page number>
            HttpGet httpGet = new HttpGet("http://www.mmjpg.com/home/" + i);
            httpGet.addHeader("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0");
            try {
                Thread.sleep(5000); // pause 5 seconds between page requests
                CloseableHttpResponse response = httpClient.execute(httpGet);
                InputStream in = response.getEntity().getContent();
                String html = Utils.convertStreamToString(in);
                // One thread per page to parse that page's HTML
                new Thread(new MeiZiHtmlParser(html, i)).start();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
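The loop above relies on Utils closing the content stream to release each connection back to the pool. An alternative is to close the response itself with try-with-resources (CloseableHttpResponse has been Closeable since HttpClient 4.3). A sketch of just the response-handling part of the try block:
try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
    String html = Utils.convertStreamToString(response.getEntity().getContent());
    // One thread per page to parse that page's HTML
    new Thread(new MeiZiHtmlParser(html, i)).start();
} catch (Exception e) {
    e.printStackTrace();
}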
Splitting the HTML string to locate the pictures and their URLs:
import java.util.ArrayList;
import java.util.List;

public class MeiZiHtmlParser implements Runnable {

    private String html;
    private int page;

    public MeiZiHtmlParser(String html, int page) {
        this.html = html;
        this.page = page;
    }

    @Override
    public void run() {
        System.out.println("************** Page " + page + " **************");
        List<String> imageUrls = new ArrayList<>();
        // Trim the page down to the block that holds the picture list.
        // NOTE: the HTML markers below are placeholders -- the literals in the original
        // post were lost, so adjust them to the site's actual markup.
        html = html.substring(html.indexOf("<div class=\"pic\">"), html.indexOf("<div id=\"page\""));
        // Split on img tags; every fragment after the first contains one picture.
        String[] images = html.split("<img ");
        //System.out.println("Page " + page + " has " + (images.length - 1) + " pictures");
        for (String image : images) {
            int srcStart = image.indexOf("src=\"");
            if (srcStart >= 0) {
                // Pull out the value of the src attribute.
                String imageUrl = image.substring(srcStart + 5, image.indexOf("\"", srcStart + 5));
                imageUrls.add(imageUrl);
                new Thread(new MeiZiImageCreator(imageUrl, page)).start();
            }
        }
    }
}
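Splitting on hand-picked markers breaks as soon as the page layout changes. A more robust sketch of the same idea uses only java.util.regex to pull every src attribute out of the img tags; the ImgSrcExtractor class and its pattern are assumptions about the markup, not taken from the original post.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {
    // Matches <img ... src="..."> and captures the URL in group 1.
    private static final Pattern IMG_SRC = Pattern.compile("<img[^>]*\\bsrc=\"([^\"]+)\"");

    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}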
Downloading each picture and writing it to a file:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;

public class MeiZiImageCreator implements Runnable {

    private static int count = 1;
    private String imageUrl;
    private int page;
    private StringBuffer basePath;

    public MeiZiImageCreator(String imageUrl, int page) {
        this.imageUrl = imageUrl;
        this.page = page;
        basePath = new StringBuffer("D:/meizitu/page_" + page);
    }

    @Override
    public void run() {
        // Create the per-page directory if it does not exist yet.
        File dir = new File(basePath.toString());
        if (!dir.exists()) {
            dir.mkdirs();
            System.out.println("Pictures are stored under " + basePath);
        }
        // Use the last path segment of the URL as the file name.
        String imageName = imageUrl.substring(imageUrl.lastIndexOf("/") + 1);
        File file = new File(basePath + "/" + page + "--" + imageName);
        try {
            OutputStream os = new FileOutputStream(file);
            URL url = new URL(imageUrl);
            InputStream in = url.openStream();
            byte[] buff = new byte[1024];
            while (true) {
                int readed = in.read(buff); // number of bytes actually read
                if (readed == -1) {
                    break;
                }
                byte[] temp = new byte[readed];
                System.arraycopy(buff, 0, temp, 0, readed); // copy only the bytes that were read
                os.write(temp); // write them to the file
            }
            System.out.println("Picture #" + (count++) + ": " + file.getAbsolutePath());
            in.close();
            os.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
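On Java 7 or later, the whole read/arraycopy/write loop can be replaced with a single java.nio.file.Files.copy call. A minimal sketch; the ImageDownload class and its downloadTo helper are hypothetical and not part of the original program.
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ImageDownload {
    // Stream the URL's content straight into the target file, creating parent directories as needed.
    static void downloadTo(String imageUrl, String targetFile) throws IOException {
        Path target = Paths.get(targetFile);
        Files.createDirectories(target.getParent());
        try (InputStream in = new URL(imageUrl).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
Finally, the utility class that turns the response stream into a String: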
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;

public class Utils {

    public static String convertStreamToString(InputStream in) throws UnsupportedEncodingException {
        //BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line = null;
        try {
            while ((line = reader.readLine()) != null) {
                sb.append(line + "\n"); // append a newline after each line read
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return sb.toString();
    }
}
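On Java 8 or later, the same conversion can also be written with the streams API. A sketch (the StreamToString class is hypothetical) that closes the reader, and with it the underlying stream, via try-with-resources:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class StreamToString {
    static String convert(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // Join all lines back together with newline separators.
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }
}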
Note: this is for learning and fun only. The copyright of all images belongs to the original website; please do not redistribute them.