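This is an original blog post, intended for technical study only. Without my permission, it may not be copied and uploaded to Baidu Wenku or similar platforms.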
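JSON is a syntax for storing and exchanging text data, similar in purpose to XML. A JSON document is typically more compact than the equivalent XML and simpler to parse. JSON is a text format that is completely independent of any programming language, yet it follows conventions familiar from the C family of languages (C, C++, C#, Java, JavaScript, Perl, Python, and so on). These properties make JSON a good data-interchange language: it is easy for people to read and write, easy for machines to parse and generate, and its compactness helps keep network payloads small.
JSON data is written as name/value pairs, as shown below: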
{
"employees": [
{ "firstName":"Bill" , "lastName":"Gates" },
{ "firstName":"George" , "lastName":"Bush" },
{ "firstName":"Thomas" , "lastName":"Carter" }
]
}
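To make the name/value structure concrete, here is a minimal sketch that builds the same employees document from Java collections and prints it as JSON text. The class EmployeesJsonSketch and its helper methods are hypothetical names used only for illustration, and the hand-written serializer escapes nothing but backslashes and double quotes; a real project would normally rely on a JSON library instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the crawler in this post.
public class EmployeesJsonSketch {

    // Wraps a string in double quotes, escaping backslash and double quote only.
    public static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // A JSON object is a brace-delimited, comma-separated list of name/value pairs.
    public static String toJsonObject(Map<String, String> pairs) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            parts.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return "{" + String.join(",", parts) + "}";
    }

    // A JSON array is a bracket-delimited, comma-separated list of values (objects here).
    public static String toJsonArray(List<Map<String, String>> objects) {
        List<String> parts = new ArrayList<>();
        for (Map<String, String> o : objects) {
            parts.add(toJsonObject(o));
        }
        return "[" + String.join(",", parts) + "]";
    }

    // LinkedHashMap keeps the pairs in insertion order.
    public static Map<String, String> employee(String firstName, String lastName) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("firstName", firstName);
        m.put("lastName", lastName);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, String>> employees = new ArrayList<>();
        employees.add(employee("Bill", "Gates"));
        employees.add(employee("George", "Bush"));
        employees.add(employee("Thomas", "Carter"));
        System.out.println("{" + quote("employees") + ":" + toJsonArray(employees) + "}");
    }
}
```

The mapping is the point here: a JSON object corresponds to a Map of names to values, and a JSON array corresponds to a List.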
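Below I use a simple crawler to show how to parse the JSON data that a crawler runs into. The crawler here is deliberately simple; I still recommend building on the crawler framework from my earlier posts, since the code below exists mainly to explain JSON parsing.
Here is a sample program written to crawl Mtime (时光网):
1. First comes the model from the framework, which encapsulates the data to be crawled.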
package model;
/*
 * Hefei University of Technology, School of Management, qianyang [email protected]
 */
public class MtimeModel {
    private String prmovieId;
    private String url;
    private String movieId;
    private String title;

    public String getPrmovieId() {
        return prmovieId;
    }

    public void setPrmovieId(String prmovieId) {
        this.prmovieId = prmovieId;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getMovieId() {
        return movieId;
    }

    public void setMovieId(String movieId) {
        this.movieId = movieId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
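The parser later in this post also refers to a model.JsonModel class that the post never lists. The sketch below is a guess at what it might look like, inferred only from the methods the parser calls on it (getVideoID, getMovieID, getShortTitle, setPrmovieId, setUrl, setMovieID, setShortTitle). Typing every field as String is an assumption, as is the expectation that fastjson will match the JSON keys VideoID, MovieID and ShortTitle to these setters.

```java
// Hypothetical sketch of the JsonModel class used by the parser; the post does not list it.
// In the post's project layout it would sit in the model package next to MtimeModel.
public class JsonModel {
    // Expected to be filled from the JSON keys VideoID, MovieID and ShortTitle (String is assumed)
    private String videoID;
    private String movieID;
    private String shortTitle;
    // Set by the parser after the JSON has been read
    private String prmovieId;
    private String url;

    public String getVideoID() { return videoID; }
    public void setVideoID(String videoID) { this.videoID = videoID; }

    public String getMovieID() { return movieID; }
    public void setMovieID(String movieID) { this.movieID = movieID; }

    public String getShortTitle() { return shortTitle; }
    public void setShortTitle(String shortTitle) { this.shortTitle = shortTitle; }

    public String getPrmovieId() { return prmovieId; }
    public void setPrmovieId(String prmovieId) { this.prmovieId = prmovieId; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public static void main(String[] args) {
        // Populate the bean by hand, the way a JSON mapper would through the setters
        JsonModel m = new JsonModel();
        m.setVideoID("51655");
        m.setMovieID("212471");
        System.out.println(m.getVideoID() + " belongs to movie " + m.getMovieID());
    }
}
```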
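2. Now look at the address we want to crawl: http://movie.mtime.com/212471/trailer.html. The following program crawls the information about the related trailers. Below is the main method: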
package main;
import java.io.IOException;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import model.MtimeModel;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import parse.MtimeParse;
/*
 * Hefei University of Technology, School of Management, qianyang [email protected]
 */
public class Mtime {
    static final Log logger = LogFactory.getLog(Mtime.class);

    public static void main(String[] args) throws IOException, SQLException {
        // Test program
        String startUrl = "http://movie.mtime.com/212471/trailer.html";
        // Fetch the page; the timeout is given in milliseconds
        Document doc = Jsoup.connect(startUrl).userAgent("bbb").timeout(120000).get();
        System.out.println(doc);
        // Parse the trailer records out of the page
        List<MtimeModel> moviedatas = MtimeParse.getData(doc);
        for (MtimeModel mt : moviedatas) {
            System.out.println("prmovieId:" + mt.getPrmovieId() + " movieId:" + mt.getMovieId()
                    + " Title:" + mt.getTitle() + " url:" + mt.getUrl());
        }
    }
}
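Below is the page source that was fetched. The main task is to parse the JSON embedded in it: locate var videos = in the source, and you can see that what follows it is JSON data.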
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>澳门风云2 视频 预告片 – Mtime时光网</title>
<meta name="Keywords" content="澳门风云2,The Man From Macao II,预告片视频,在线观看王晶,周润发,张家辉">
<meta name="Description" content="澳门风云2 视频 预告片">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link type="image/x-icon" href="http://static1.mtime.cn/favicon.ico" rel="icon">
<link type="image/x-icon" href="http://static1.mtime.cn/favicon.ico" rel="shortcut icon">
<link type="image/x-icon" href="http://static1.mtime.cn/favicon.ico" rel="bookmark">
<link type="application/opensearchdescription+xml" href="http://feed.mtime.com/opensearch.xml" title="Mtime影视搜索" rel="search">
<link rel="alternate" type="application/rss+xml" title="影评" href="http://feed.mtime.com/comment.rss">
<link rel="alternate" type="application/rss+xml" title="日志" href="http://feed.mtime.com/blog.rss">
<link rel="alternate" type="application/rss+xml" title="资讯" href="http://feed.mtime.com/news.rss">
<link rel="alternate" type="application/rss+xml" title="话题" href="http://feed.mtime.com/topic.rss">
<link rel="alternate" type="application/rss+xml" title="周刊" href="http://feed.mtime.com/weekly.rss">
<script type="text/javascript">
var server = "http://static1.mtime.cn/";
var subServer = "http://static1.mtime.cn/library/";
var version = "20160720105244";
var subVersion = "20160623154218";
var jsServer = server + version;
var cssServer = server + version;
var subJsServer = subServer + subVersion;
var subCssServer = subServer + subVersion;
var debug = false;
var mtimeCookieDomain = "mtime.com";
var siteLogUrl = "http://log.mtime.cn";
var siteServiceUrl = "http://service.mtime.com";
var siteLibraryServiceUrl = "http://service.library.mtime.com";
var crossDomainUpload="http://upload3.mtime.com/Upload.ashx";
</script>
<script type="text/javascript">
document.write(unescape("%3Clink href='" + cssServer + "/css/2014/publicpack.css' rel='stylesheet' media='all' type='text/css'%3E%3C/link%3E"));
</script>
<script type="text/javascript">
document.write(unescape("%3Clink href='" + subCssServer + "/css/database.css' rel='stylesheet' media='all' type='text/css'%3E%3C/link%3E"));
</script>
</head>
<body>
<script type="text/javascript">
var navigationBarType = 1;document.writeln( "");var debug = false;var mtimeCookieDomain="mtime.com";var siteLogUrl="http://log.mtime.cn";var siteUrl="http://www.mtime.com";var siteMcUrl="http://my.mtime.com";var siteApiUrl="http://api.mtime.com";var siteBlogUrl="http://i.mtime.com";var siteGroupUrl="http://group.mtime.com";var siteMovieUrl="http://movie.mtime.com";var sitePeopleUrl="http://people.mtime.com";var siteNewsUrl="http://news.mtime.com";var siteServiceUrl="http://service.mtime.com";var siteSearchUrl="http://search.mtime.com";var siteGoodsListUrl="http://list.mall.mtime.com";var theaterService="http://service.theater.mtime.com";var siteLibraryServiceUrl="http://service.library.mtime.com";var siteCommunityServiceUrl="http://service.community.mtime.com";var siteChannelServiceUrl="http://service.channel.mtime.com";var siteGoodsServiceUrl="http://service.mall.mtime.com";var siteTradeServiceUrl="http://trade.mtime.com";var siteFunUrl="";var sitePassportUrl="http://passport.mtime.com";var crossDomainUpload="http://upload3.mtime.com/Upload.ashx";var topMenuValues={"mainNavType":"Detail","footer":"第179期时光周刊 \n "};
</script>
<div id="db_sechead">
<div id="onlineTicketMovieRegion" class="db_ticket none">
</div>
<div class="db_head">
<div class="clearfix">
<h1 property="v:itemreviewed"><a href="http://movie.mtime.com/212471/">澳门风云2</a></h1>
<p class="db_year">(<a href="http://movie.mtime.com/movie/search/section/?year=2015" target="_blank">2015</a>)</p>
<p class="db_enname"><a href="http://movie.mtime.com/212471/">The Man From Macao II</a></p>
</div>
</div>
</div>
<div class="db_nav db_secnav">
<dl id="movieNavigationRegion" class="clearfix">
<dd token="Generalize">
<a href="http://movie.mtime.com/212471/"><span> </span>影片首页</a>
<i> </i>
</dd>
<dd token="Video" _videocount="14">
<a href="http://movie.mtime.com/212471/trailer.html"><span>14</span> 个视频</a>
<i> </i>
</dd>
<dd token="Image" _imagecount="219">
<a href="http://movie.mtime.com/212471/posters_and_images/"><span>219</span> 张图片</a>
<i> </i>
</dd>
<dd token="Person">
<a href="http://movie.mtime.com/212471/fullcredits.html"><span>38</span> 位演职员</a>
<i> </i>
</dd>
<dd token="Review">
<a href="http://movie.mtime.com/212471/comment.html"><span property="v:count" content="11190">999+</span> 条影评</a>
<i> </i>
</dd>
<dd token="RelatedNews">
<a href="http://movie.mtime.com/212471/news.html"><span>50</span> 条新闻</a>
<i> </i>
</dd>
<dt class="more" id="detailMenuRegion">
<a href="###">更多<em id="detailMenuRegionLabel"> </em></a>
<i> </i>
<dl class="db_nav_sel" id="detailSubMenuRegion" style="display:none">
<dt>
</dt>
<dd token="Synopsis">
<a href="http://movie.mtime.com/212471/plots.html">剧情</a>
</dd>
<dd token="Role" class="false">
<a href="###">角色介绍</a>
</dd>
<dd token="Trivia" class="false">
<a href="###">幕后揭秘</a>
</dd>
<dd token="Awards" class="false">
<a href="###">获奖记录</a>
</dd>
<dd token="Details">
<a href="http://movie.mtime.com/212471/details.html">更多资料</a>
</dd>
</dl>
</dt>
</dl>
</div>
<div class="db_videocont" id="allvideos"></div>
<div id="M13_B_DB_Movie_FooterTopTG"></div>
<script type="text/javascript">var videos = {"预告片":[{"VideoID":51655,"MovieID":212471,"Title":"澳门风云2 先行版预告片","ShortTitle":"先行版预告片","TitleSamll":"先行版预告片","Description":"","Length":"02:23","HD":1,"ImagePath":"http://img31.mtime.cn/mg/2014/11/27/184214.14086815_235X132X4.jpg","PlayCount":391607,"VideoType":0,"VideoTypeName":"预告片","Url":"http://video.mtime.com/51655/?mid=212471"},{"VideoID":52533,"MovieID":212471,"Title":"澳门风云2 剧场版预告片","ShortTitle":"剧情预告片“娱众不同”","TitleSamll":"剧情预告片“娱..","Description":"","Length":"02:12","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/01/21/173317.88939215_235X132X4.jpg","PlayCount":37601,"VideoType":0,"VideoTypeName":"预告片","Url":"http://video.mtime.com/52533/?mid=212471"},{"VideoID":52715,"MovieID":212471,"Title":"澳门风云2 剧场版预告片2","ShortTitle":"剧场版预告片2","TitleSamll":"剧场版预告片2","Description":"","Length":"01:29","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/04/102313.99825206_235X132X4.jpg","PlayCount":13877,"VideoType":0,"VideoTypeName":"预告片","Url":"http://video.mtime.com/52715/?mid=212471"},{"VideoID":53100,"MovieID":212471,"Title":"澳门风云 制作特辑之机器人PK海陆空","ShortTitle":"制作特辑之机器人PK海陆空","TitleSamll":"制作特辑之机器..","Description":"","Length":"01:39","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/03/06/111143.14920228_235X132X4.jpg","PlayCount":3372,"VideoType":0,"VideoTypeName":"预告片","Url":"http://video.mtime.com/53100/?mid=212471"}],"拍摄花絮":[{"VideoID":52769,"MovieID":212471,"Title":"澳门风云2 制作特辑之“世纪阵容大联欢”","ShortTitle":"制作特辑之“世纪阵容大联欢”","TitleSamll":"制作特辑之“世..","Description":"","Length":"02:40","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/06/163405.84085824_235X132X4.jpg","PlayCount":3075,"VideoType":2,"VideoTypeName":"拍摄花絮","Url":"http://video.mtime.com/52769/?mid=212471"},{"VideoID":52918,"MovieID":212471,"Title":"澳门风云2 
“五代同堂合家欢”特辑","ShortTitle":"“五代同堂合家欢”特辑","TitleSamll":"“五代同堂合家..","Description":"","Length":"01:50","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/16/111822.61107798_235X132X4.jpg","PlayCount":529,"VideoType":2,"VideoTypeName":"拍摄花絮","Url":"http://video.mtime.com/52918/?mid=212471"},{"VideoID":52926,"MovieID":212471,"Title":"澳门风云2 制作特辑之七招过大年","ShortTitle":"制作特辑之七招过大年","TitleSamll":"制作特辑之七招..","Description":"","Length":"03:34","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/16/195152.85132238_235X132X4.jpg","PlayCount":2822,"VideoType":2,"VideoTypeName":"拍摄花絮","Url":"http://video.mtime.com/52926/?mid=212471"},{"VideoID":52929,"MovieID":212471,"Title":"澳门风云 制作特辑之羊年春节七天乐","ShortTitle":"制作特辑之羊年春节七天乐","TitleSamll":"制作特辑之羊年..","Description":"","Length":"03:33","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/17/093038.24212654_235X132X4.jpg","PlayCount":727,"VideoType":2,"VideoTypeName":"拍摄花絮","Url":"http://video.mtime.com/52929/?mid=212471"}],"更多":[{"VideoID":51667,"MovieID":212471,"Title":"澳门风云2 北京发布会","ShortTitle":"北京发布会","TitleSamll":"北京发布会","Description":"","Length":"02:04","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2014/11/27/233841.25950987_235X132X4.jpg","PlayCount":5278,"VideoType":4,"VideoTypeName":"更多","Url":"http://video.mtime.com/51667/?mid=212471"},{"VideoID":52540,"MovieID":212471,"Title":"澳门风云2 北京发布会","ShortTitle":"北京发布会","TitleSamll":"北京发布会","Description":"","Length":"01:35","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/01/21/225742.13087836_235X132X4.jpg","PlayCount":838,"VideoType":4,"VideoTypeName":"更多","Url":"http://video.mtime.com/52540/?mid=212471"},{"VideoID":52782,"MovieID":212471,"Title":"澳门风云2 北京首映式","ShortTitle":"北京首映式","TitleSamll":"北京首映式","Description":"","Length":"02:23","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/09/122417.54628077_235X132X4.jpg","PlayCount":1127,"VideoType":4,"VideoTypeName":"更多","Url":"http://video.mtime.com/52782/?mid=212471"},{"VideoID":52788,"MovieID":212471,"Title":"澳门风云2 
片尾曲MV《财神到》","ShortTitle":"片尾曲MV《财神到》","TitleSamll":"片尾曲MV《财神..","Description":"","Length":"03:02","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/10/084357.52985713_235X132X4.jpg","PlayCount":919,"VideoType":5,"VideoTypeName":"更多","Url":"http://video.mtime.com/52788/?mid=212471"},{"VideoID":52794,"MovieID":212471,"Title":"澳门风云2 片尾曲“财神到”MV","ShortTitle":"片尾曲“财神到”MV","TitleSamll":"片尾曲“财神到..","Description":"","Length":"03:02","HD":0,"ImagePath":"http://img31.mtime.cn/mg/2015/02/10/110509.18710174_235X132X4.jpg","PlayCount":1698,"VideoType":5,"VideoTypeName":"更多","Url":"http://video.mtime.com/52794/?mid=212471"},{"VideoID":52894,"MovieID":212471,"Title":"澳门风云2 主题曲MV《停格》(演唱:蔡健雅)","ShortTitle":"主题曲MV《停格》(演唱:蔡健雅)","TitleSamll":"主题曲MV《停格..","Description":"","Length":"03:50","HD":1,"ImagePath":"http://img31.mtime.cn/mg/2015/02/14/103516.50781151_235X132X4.jpg","PlayCount":3090,"VideoType":5,"VideoTypeName":"更多","Url":"http://video.mtime.com/52894/?mid=212471"}]};</script>
<script type="text/javascript">
if ( typeof(mtimeStufs) == "undefined" ) {
mtimeStufs = [];
}
mtimeStufs.push( {id:"M13_B_DB_Movie_FooterTopTG",type:"mtime",content:"\n\n\n"} );
mtimeStufs.push( {id:"M13_B_DB_Movie_ImageDetailPage_CommentRightTG",type:"mtime",content:"\n\n\n"} );
mtimeStufs.push( {id:"M13_B_DB_Movie_OverviewHotMovieCommentRightTG1",type:"mtime",content:"\n\n\n\n"} );
</script>
<style type="text/css">
#M13_B_DB_Movie_OverviewHotMovieCommentRightTG2{background:#fff;margin-bottom: -4000px;padding-bottom: 4000px;}
.db_cont .db_inews { width: 290px; float: left; display: inline; padding-right: 26px; border-right: 1px solid #ccc; }
.db_cont .db_inews dt { border-bottom: 1px dotted #dcdcdc; font-size: 12px; color: #666; line-height: 1.6em; padding-bottom: 10px; margin-bottom: 3px; margin-top: 10px; }
.db_cont .db_inews .imgbox { position: relative; zoom: 1; overflow: hidden; }
.db_cont .db_inews .imgbox div { position: absolute; left: 0; top: 95px; overflow: hidden; padding: 7px 12px; margin-right: 30px; }
.db_cont .db_inews .imgbox div .bg { background: #fff; position: absolute; left: 0; top: 0; width: 100%; height: 100px; opacity: .7; filter: alpha(opacity=70); }
.db_cont .db_inews .imgbox h3 { position: relative; font-size: 18px; line-height: 1.4em; }
.db_cont .db_inews .imgbox a { color: #000; text-decoration: none; }
.db_cont .db_inews dd { font-size: 12px; color: #666; padding-top: 6px; }
#externalVideo{ display:none;}
.storeboxer li{height:215px;}
.db_headnews a, .db_headnews{
z-index: 10;
position: relative;
opacity: 0;}
.db_headnews a{font-size:0; line-height:0;}
</style>
<div id="bottom"></div>
<script type="text/javascript"> document.write(unescape("%3Cscript src='" + jsServer + "/js/systemall2014.js' type='text/javascript'%3E%3C/script%3E"));</script>
<script type="text/javascript"> document.write(unescape("%3Cscript src='" + subJsServer + "/js/moviepagepack.js' type='text/javascript'%3E%3C/script%3E"));</script>
<script type="text/javascript">
//页尾 导航
// 静态文件初始化类
new StaticManager({
});
</script>
<div style="display: none">
<script type="text/javascript">
var tracker = new Tracker();
tracker.trackPageView();
</script>
</div>
<script type="text/javascript"> window.moviePageBaseClient = new MoviePageBaseClient({ id: 212471, initializeMovieNavigationToken: "Video" }); </script>
<script type="text/javascript"> $loadSubJs("/movie/MovieVideosPage.js", function () { new MovieVideosPage(); }); </script>
</body>
</html>
Paste the JSON from this page into http://json.cn/ to check that it is well-formed. As the screenshot shows, it validates cleanly.
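Rather than copying the JSON out of the page by hand, the object assigned to var videos can also be cut out programmatically. The sketch below is one possible way to do that with the JDK alone: it finds the assignment and then counts braces until the opening one is balanced, skipping braces that appear inside double-quoted strings. The class and method names are hypothetical, and the sketch assumes that the assigned value is a JSON-style object literal that uses only double-quoted strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original crawler.
public class VideosLiteralSketch {

    // Returns the {...} literal assigned to "var <varName> =", or null if it is absent or unbalanced.
    public static String extractObjectLiteral(String text, String varName) {
        Matcher m = Pattern.compile("var\\s+" + Pattern.quote(varName) + "\\s*=\\s*\\{").matcher(text);
        if (!m.find()) {
            return null;
        }
        int start = m.end() - 1;      // index of the opening brace
        int depth = 0;
        boolean inString = false;     // true while inside a double-quoted string
        boolean escaped = false;      // true right after a backslash inside a string
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return text.substring(start, i + 1);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the real page source
        String page = "<script type=\"text/javascript\">var videos = "
                + "{\"A\":[{\"VideoID\":51655}],\"B\":[]};</script>";
        System.out.println(extractObjectLiteral(page, "videos"));
    }
}
```

Compared with a lazy regular expression, brace counting does not depend on which key happens to follow the part of interest, at the cost of a few more lines of code.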
Next comes parsing the JSON in the page. The parser below offers two approaches: regular expressions and fastjson. I recommend fastjson, which is quick and efficient.
package parse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import model.JsonModel;
import model.MtimeModel;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.jsoup.nodes.Document;
import com.alibaba.fastjson.JSON;
/*
 * Hefei University of Technology, School of Management, qianyang [email protected]
 */
public class MtimeParse {
    static final Log logger = LogFactory.getLog(MtimeParse.class);

    public static List<MtimeModel> getData(Document doc) {
        List<MtimeModel> mtimeData = new ArrayList<MtimeModel>();
        // Get the HTML to be parsed
        String html = doc.html();
        // System.out.println(html);
        // Use a regular expression to grab the JSON to parse:
        // only the trailers ("预告片"), not the behind-the-scenes clips ("拍摄花絮")
        Pattern data = Pattern.compile("预告片\":(.*?)\\,\"拍摄花絮");
        Matcher dataMatcher = data.matcher(html);
        String da = "";
        while (dataMatcher.find()) {
            // The JSON string to be parsed
            da = dataMatcher.group(1);
        }
        // Use jsoup to get movieId (the movie's ID): keep only the digits of the title link
        String movieDigits = doc.select("h1[property=v:itemreviewed]").select("a").attr("href")
                .replaceAll("\\D", "").trim();
        String movieId = "mtime" + movieDigits;
        // Regex that captures the text right after each VideoID key (the trailer ID)
        Pattern videoID = Pattern.compile("VideoID\"(.*?)\"");
        // Regex that captures each ShortTitle value (the trailer title)
        Pattern titlePattern = Pattern.compile("ShortTitle\":\"(.*?)\"");
        Matcher videoIDMatcher = videoID.matcher(da);
        Matcher titleMatcher = titlePattern.matcher(da);
        ArrayList<String> urldatas = new ArrayList<String>();
        while (videoIDMatcher.find()) {
            urldatas.add(videoIDMatcher.group(1));
        }
        ArrayList<String> titles = new ArrayList<String>();
        while (titleMatcher.find()) {
            titles.add(titleMatcher.group(1));
        }
        for (int i = 0; i < titles.size(); i++) {
            MtimeModel mtimeModel = new MtimeModel();
            // Strip everything but the digits from the captured VideoID text
            String videoDigits = urldatas.get(i).replaceAll("\\D", "").trim();
            String prmovieId = "mtime" + videoDigits;
            String url = "http://video.mtime.com/" + videoDigits + "/?mid=" + movieDigits;
            String title = titles.get(i);
            mtimeModel.setPrmovieId(prmovieId);
            mtimeModel.setUrl(url);
            mtimeModel.setMovieId(movieId);
            mtimeModel.setTitle(title);
            mtimeData.add(mtimeModel);
        }
        // fastjson test
        // Only the trailers are kept
        List<JsonModel> mtimeJsonData = new ArrayList<JsonModel>();
        Pattern data1 = Pattern.compile("预告片\":(.*?)\\,(\"拍摄花絮|\"精彩片段)");
        Matcher dataMatcher1 = data1.matcher(html);
        String da1 = "";
        while (dataMatcher1.find()) {
            // The JSON string to be parsed
            da1 = dataMatcher1.group(1);
        }
        if (da1.length() != 0) {
            // Map each element of the JSON array onto a JsonModel
            List<JsonModel> jsonmodel1 = JSON.parseArray(da1, JsonModel.class);
            for (JsonModel jso : jsonmodel1) {
                JsonModel mtimeModel = new JsonModel();
                String VideoID = "mtime" + jso.getVideoID();
                String MovieID = "mtime" + jso.getMovieID();
                String ShortTitle = jso.getShortTitle();
                // Rebuild the video URL; the movie ID goes into the mid query parameter
                String url = "http://video.mtime.com/" + jso.getVideoID() + "/?mid=" + jso.getMovieID();
                mtimeModel.setPrmovieId(VideoID);
                mtimeModel.setUrl(url);
                mtimeModel.setMovieID(MovieID);
                mtimeModel.setShortTitle(ShortTitle);
                mtimeJsonData.add(mtimeModel);
                logger.info("VideoID: " + VideoID + " MovieID:" + MovieID + " ShortTitle:" + ShortTitle + " url:" + url);
            }
        }
        // mtimeJsonData is only logged here; the regex-based list is what gets returned
        return mtimeData;
    }
}
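One weakness of the regular-expression approach above is that IDs and titles are collected in two separate lists and paired by index, so a single missed match would shift every later pair. A possible variation, sketched here with the JDK only, captures both values with one pattern so that each match yields a complete record. The class name, the pattern and the sample titles are illustrative assumptions rather than part of the original program, and the pattern assumes that ShortTitle comes after VideoID inside each object and that titles contain no escaped double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation on the regex approach; not part of the original crawler.
public class TrailerRegexSketch {

    // One match per video object: group 1 is the numeric VideoID, group 2 the ShortTitle.
    // [^{}]*? cannot cross a brace, so both groups come from the same {...} object.
    private static final Pattern VIDEO = Pattern.compile(
            "\"VideoID\":(\\d+),[^{}]*?\"ShortTitle\":\"(.*?)\"");

    // Returns one {videoId, shortTitle} pair per object found in the JSON array text.
    public static List<String[]> extract(String trailerArrayJson) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = VIDEO.matcher(trailerArrayJson);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2) });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Real IDs from the page above; the titles are ASCII placeholders
        String sample = "[{\"VideoID\":51655,\"MovieID\":212471,\"ShortTitle\":\"teaser\"},"
                + "{\"VideoID\":52533,\"MovieID\":212471,\"ShortTitle\":\"theatrical\"}]";
        for (String[] row : extract(sample)) {
            System.out.println("http://video.mtime.com/" + row[0] + "/?mid=212471 " + row[1]);
        }
    }
}
```

This only narrows the gap: for anything beyond a quick extraction, mapping the array onto a model class with fastjson, as in the second half of the parser, remains the sturdier choice.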