Java assignment: extracting Douban movie information with regular expressions

Kept for my own records; use with caution.

(1) Use a regular expression to extract site names and URLs from a web page. For example:

The input string is:

265G游戏07073游戏征途

The extracted result is:

265G游戏:http://www.265g.com

07073游戏:http://www.07073.com

征途:http://zt.ztgame.com/url/hao.html

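For this one, it is enough to switch the regular expression in the teacher's code to non-greedy mode:

Regular expression:

String regex = "(.+?)"; // non-greedy mode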
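To show why the non-greedy quantifier matters here, below is a minimal sketch. It assumes the input is a run of anchor tags of the form `<a href="URL">name</a>`. The class name `LinkExtractor`, the pattern, and the sample HTML are my own assumptions, not the teacher's code, and the link texts are ASCII stand-ins for the site names above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Assumed link format: <a href="URL">name</a>.
    // The lazy (.+?) stops at the first closing quote or closing tag,
    // so each link is matched on its own.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"(.+?)\">(.+?)</a>");

    // Returns one "name:URL" entry per link found in the HTML.
    public static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            result.add(m.group(2) + ":" + m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input; the real markup may carry more attributes.
        String html = "<a href=\"http://www.265g.com\">265G</a>"
                + "<a href=\"http://www.07073.com\">07073</a>"
                + "<a href=\"http://zt.ztgame.com/url/hao.html\">ZT</a>";
        for (String entry : extract(html)) {
            System.out.println(entry);
        }
    }
}
```

With a greedy `(.+)` in both groups, the first group would run on to the last `">` in the string, so the three links would collapse into a single match.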

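(2) Use regular expressions to scrape movie information from Douban. Only the movie title, director, cast, release date, and rating need to be scraped.

Reference code for reading a web page:

https://blog.csdn.net/dufufd/article/details/72781248

This reference code cannot be used as-is; a request header has to be set first.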

The steps are:

Open Douban and press F12 to bring up the browser's developer tools.

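Click any request in the list. At the bottom of its Headers panel there is a User-Agent entry; copy its value. Then, in the reference code, after the line connection.setRequestMethod("GET");

add

connection.setRequestProperty("User-Agent", "(the value you just copied)");

This step sets the User-Agent request header, so the request identifies itself the same way the browser does. I don't really understand the details behind it.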

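To scrape the comments, reviews, and so on for a specific movie, open that movie's page and repeat the same steps.

To be lazy, I set up two classes: one holds the User-Agent for the Douban movie chart, the other the User-Agent for an individual movie page.

Code: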

package regular_expression;
import java.net.URL;
import java.net.MalformedURLException;
import java.net.HttpURLConnection;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;

// Fetches the page at the given URL and returns it as a single string
public class GetString {

	public static String httpRequest(String url)
	{
		// StringBuilder avoids re-copying the whole page on every line
		StringBuilder content = new StringBuilder();
		HttpURLConnection connection = null;
		try{
			URL u = new URL(url);
			connection = (HttpURLConnection)u.openConnection();
			connection.setRequestMethod("GET");
			// User-Agent copied from the browser for the movie chart page
			connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36");
			int code = connection.getResponseCode();
			if(code == 200){
				InputStream in = connection.getInputStream();
				InputStreamReader isr = new InputStreamReader(in,"UTF-8");
				// try-with-resources closes the reader and the underlying stream
				try(BufferedReader reader = new BufferedReader(isr)){
					String line = null;
					// Lines are joined without line breaks, so "." in a regex can match across them
					while((line = reader.readLine()) != null){
						content.append(line);
					}
				}
			}
		}catch(MalformedURLException e){
			e.printStackTrace();
		}catch(IOException e){
			e.printStackTrace();
		}finally{
			if(connection != null){
				connection.disconnect();
			}
		}
		return content.toString();
	}
}

package regular_expression;
import java.net.URL;
import java.net.MalformedURLException;
import java.net.HttpURLConnection;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;

// Fetches the page at the given URL and returns it as a single string
public class GetString2 {

	public static String httpRequest(String url)
	{
		// StringBuilder avoids re-copying the whole page on every line
		StringBuilder content = new StringBuilder();
		HttpURLConnection connection = null;
		try{
			URL u = new URL(url);
			connection = (HttpURLConnection)u.openConnection();
			connection.setRequestMethod("GET");
			// User-Agent copied from the browser for an individual movie page
			connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36");
			int code = connection.getResponseCode();
			if(code == 200){
				InputStream in = connection.getInputStream();
				InputStreamReader isr = new InputStreamReader(in,"UTF-8");
				// try-with-resources closes the reader and the underlying stream
				try(BufferedReader reader = new BufferedReader(isr)){
					String line = null;
					// Lines are joined without line breaks, so "." in a regex can match across them
					while((line = reader.readLine()) != null){
						content.append(line);
					}
				}
			}
		}catch(MalformedURLException e){
			e.printStackTrace();
		}catch(IOException e){
			e.printStackTrace();
		}finally{
			if(connection != null){
				connection.disconnect();
			}
		}
		return content.toString();
	}
}

These two classes differ only in their User-Agent; all of this could also be written inside the Test class.
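Since the User-Agent is the only difference, one way to avoid the duplication is to pass it in as a parameter. The following is a rough sketch of that idea, not the code I actually submitted: the class name `PageFetcher` and the extra `userAgent` parameter are hypothetical, and the two timeouts are my own addition so that a stalled connection does not hang forever.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageFetcher {
    // Fetches the page at url, sending the given User-Agent header.
    // Returns "" when the request fails or the status code is not 200.
    public static String httpRequest(String url, String userAgent) {
        StringBuilder content = new StringBuilder();
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("GET");
            connection.setRequestProperty("User-Agent", userAgent);
            connection.setConnectTimeout(5000); // milliseconds, assumed value
            connection.setReadTimeout(5000);    // milliseconds, assumed value
            if (connection.getResponseCode() == 200) {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        connection.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    // Join lines without separators, as GetString does
                    while ((line = reader.readLine()) != null) {
                        content.append(line);
                    }
                }
            }
        } catch (IOException e) {
            // MalformedURLException is a subclass of IOException
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
        return content.toString();
    }

    public static void main(String[] args) {
        String ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36";
        String page = httpRequest("https://movie.douban.com/chart", ua);
        System.out.println("Fetched " + page.length() + " characters");
    }
}
```

The chart page and a movie page can then share this one method, each call passing whichever User-Agent string was copied from the browser.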
The regular expressions were written by looking for patterns in the page source, which you can view with Ctrl+U.
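As an illustration of turning a pattern spotted in the page source into a regular expression, here is a small sketch. The markup in it is hypothetical and is not Douban's real page source; the class name `MovieBlockParser`, the `rating` class attribute, and the named groups are all made up for the example. It relies on the page having been joined into one line, as httpRequest does, so `.` never has to cross a line break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MovieBlockParser {
    // Hypothetical markup for one movie entry (NOT the real Douban source):
    //   <a href="LINK" title="TITLE">...</a><span class="rating">RATING</span>
    // Fixed pieces of the markup anchor the pattern; lazy groups capture
    // the parts that change from movie to movie.
    private static final Pattern MOVIE = Pattern.compile(
            "<a href=\"(?<link>.+?)\" title=\"(?<title>.+?)\">.+?"
            + "<span class=\"rating\">(?<rating>.+?)</span>");

    // Returns one "title | rating | link" line per movie entry found.
    public static List<String> parse(String html) {
        List<String> movies = new ArrayList<>();
        Matcher m = MOVIE.matcher(html);
        while (m.find()) {
            movies.add(m.group("title") + " | " + m.group("rating")
                    + " | " + m.group("link"));
        }
        return movies;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/m/1\" title=\"First\">First</a>"
                + "<span class=\"rating\">9.1</span>"
                + "<a href=\"https://example.com/m/2\" title=\"Second\">Second</a>"
                + "<span class=\"rating\">8.4</span>";
        for (String movie : parse(html)) {
            System.out.println(movie);
        }
    }
}
```

Named groups are optional; numbered groups as in the Test class work the same way, but names make it harder to mix up which group holds which field.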

package regular_expression;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class Test {

	public static void main(String[] args) {
		String content = GetString.httpRequest("https://movie.douban.com/chart");

		// Groups as used below: 1 = movie link, 2 and 3 = movie title, 4 = cast list, 5 = rating.
		// The HTML tags that anchored these patterns are missing from this copy of the code.
		//String regex = "(.+?)(.+?).+?(.+?).+?(.+?)";
		String regex = ".+? (.+?).+?(.+?).+?(.+?)";
		Pattern p = Pattern.compile(regex);
		Matcher m = p.matcher(content);
		while(m.find()) {
			System.out.println("Title: "+m.group(2)+"//"+m.group(3) + " \nCast: " +m.group(4)+"\nRating: "+m.group(5));
			if(m.group(1).length()<80) {
				System.out.println("Link: "+m.group(1));
				// Try to scrape the comments from the movie's own page
				String tmp = GetString2.httpRequest(m.group(1));
				// String regex1="(.+?)"; // this one can scrape reviews, but the output is a bit ugly
				String regex1=" (.+?)"; // scrape comments
				Pattern q = Pattern.compile(regex1);
				Matcher n = q.matcher(tmp);
				while(n.find()) {
					System.out.println("Comment: "+n.group(1));
				}
				System.out.println("\n===================separator==================\n");
			}
		}
	}
}
