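Scraping Bilibili's dynamic data with Scala + Jsoup
This example is only a record of a practice exercise and involves no malicious scraping; if it turns out to break any law, it will be deleted immediately. The code is too simple to be worth posting.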
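This post uses Scala + Jsoup to scrape data from Bilibili. Most of the data is loaded dynamically, so requesting a page directly returns nothing useful.
Take, for example, scraping the video list of the "Life > Funny" category.
The corresponding URL is https://www.bilibili.com/v/life/funny/?spm_id_from=333.334.b_7072696d6172795f6d656e75.60#/all/click/0/1/2019-09-01,2019-09-30
Fetching this link directly returns a Document whose elements do not contain the video list.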
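Open Chrome DevTools, switch to the Network panel, press Ctrl+F to open the Search pane, and search for a keyword that appears in the video ranking. The first video, for example, shows the UP (uploader) 敬汉卿, 1.405 million plays (140.5万), and 9806 danmaku. Note that the UP name is stored in the response as Unicode escape sequences (\uXXXX), so searching for it directly only matches the Document; the play count is displayed rounded, so searching for it finds nothing; only the danmaku count, 9806, is an exact number.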
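The reason the name search misses can be seen by decoding the escapes by hand. The following is a minimal sketch, not part of the original code: `UnicodeEscapes` and `decode` are hypothetical names, and the input is the `author` value as it appears in the raw response shown further down.

```scala
import scala.util.matching.Regex

object UnicodeEscapes {
  // Matches a literal backslash, the letter u, and exactly four hex digits.
  private val Escape: Regex = """\\u([0-9a-fA-F]{4})""".r

  /** Replaces every four-digit backslash-u escape in `s` with the character it encodes. */
  def decode(s: String): String =
    Escape.replaceAllIn(s, m =>
      // quoteReplacement keeps '$' and '\' in the result from being treated as group references
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))

  def main(args: Array[String]): Unit = {
    // The "author" field exactly as it is transmitted in the response body
    val escapedAuthor = "\\u656c\\u6c49\\u537f"
    println(decode(escapedAuthor)) // prints 敬汉卿
  }
}
```

A JSON parser performs this decoding automatically; the sketch only illustrates why a plain-text search for the Chinese name does not hit the raw response.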
Searching for a danmaku count, here 257, turns up a Response, and its Headers tab gives the request URL https://s.search.bilibili.com/cate/search?callback=jqueryCallback_bili_9358062349811913&main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20190901&time_to=20190930&_=1567654136524
Opening this URL directly in the browser returns the following JSON data
jqueryCallback_bili_9358062349811913({"exp_list":null,"code":0,"cost_time":{"params_check":"0.000288","illegal_handler":"0.000004","as_response_format":"0.001452","as_request":"0.027562","deserialize_response":"0.000241","as_request_format":"0.000405","total":"0.030284","main_handler":"0.029876"},"show_module_list":["activity","web_game","card","media_bangumi","media_ft","bili_user","user","star","video"],"result":[{"senddate":1567506835,"rank_offset":1,"tag":"\u5947\u8469,\u656c\u6c49\u537f,\u641e\u7b11,\u5e7d\u9ed8,\u4f5c\u6b7b,\u5976\u8336,100\u5757","duration":277,"id":66509570,"rank_score":1830174,"badgepay":false,"pubdate":"2019-09-03 18:33:55","title":"\u53bb\u5b9e\u4f53\u5e97\u70b9\u4e00\u4efd\u5976\u8336\u52a0\u4e00\u767e\u5143\u7684\u6599\uff01\u6ee1\u5c4f\u5e55\u7684\u7f9e\u803b\u548c\u5c34\u5c2c\uff01","review":4288,"mid":9824766,"is_union_video":0,"rank_index":0,"type":"video","arcrank":"0","play":"1830174","pic":"\/\/i0.hdslb.com\/bfs\/archive\/6ab1284e75b2183d6b4216c324976b82b9f8052a.jpg","description":"youtube\u9891\u9053\uff1a\u656c\u6c49\u537f\u3010\u5b98\u65b9\u9891\u9053\u3011\n\u521b\u610f\u6295\u7a3f\u90ae\[email protected]\n\u6dd8\u5b9d\u5e97\uff1a\u656c\u6c49\u537f","video_review":11434,"is_pay":0,"favorites":13577,"arcurl":"http:\/\/www.bilibili.com\/video\/av66509570","author":"\u656c\u6c49\u537f"},
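On closer inspection, this is not JSON that can be parsed directly: the payload is wrapped in a function call (JSONP). Both the URL and the data contain
jqueryCallback_bili_9358062349811913, which is presumably the name of a jQuery callback function. What happens if that parameter is removed from the URL? The new URL is
https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20190901&time_to=20190930&_=1567654136524
and the data now looks like this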
{"exp_list":null,"code":0,"cost_time":{"params_check":"0.000348","illegal_handler":"0.000005","as_response_format":"0.001420","as_request":"0.027466","deserialize_response":"0.000218","as_request_format":"0.000496","total":"0.030270","main_handler":"0.029795"},"show_module_list":["activity","web_game","card","media_bangumi","media_ft","bili_user","user","star","video"],"result":[{"senddate":1567506835,"rank_offset":1,"tag":"\u5947\u8469,\u656c\u6c49\u537f,\u641e\u7b11,\u5e7d\u9ed8,\u4f5c\u6b7b,\u5976\u8336,100\u5757","duration":277,"id":66509570,"rank_score":1830795,"badgepay":false,"pubdate":"2019-09-03 18:33:55","title":"\u53bb\u5b9e\u4f53\u5e97\u70b9\u4e00\u4efd\u5976\u8336\u52a0\u4e00\u767e\u5143\u7684\u6599\uff01\u6ee1\u5c4f\u5e55\u7684\u7f9e\u803b\u548c\u5c34\u5c2c\uff01","review":4288,"mid":9824766,"is_union_video":0,"rank_index":0,"type":"video","arcrank":"0","play":"1830795","pic":"\/\/i0.hdslb.com\/bfs\/archive\/6ab1284e75b2183d6b4216c324976b82b9f8052a.jpg","description":"youtube\u9891\u9053\uff1a\u656c\u6c49\u537f\u3010\u5b98\u65b9\u9891\u9053\u3011\n\u521b\u610f\u6295\u7a3f\u90ae\[email protected]\n\u6dd8\u5b9d\u5e97\uff1a\u656c\u6c49\u537f","video_review":11435,"is_pay":0,"favorites":13581,"arcurl":"http:\/\/www.bilibili.com\/video\/av66509570","author":"\u656c\u6c49\u537f"},
This is JSON that can be parsed directly.
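Removing the callback parameter by hand works for one URL; for URLs copied out of DevTools in bulk, the same step can be sketched in code. `CallbackParam` and `dropParam` are hypothetical names, and the sketch assumes the URL has no `#fragment` part.

```scala
object CallbackParam {
  /** Removes every query parameter called `name` from `url`, keeping the others in order.
    * Assumption: `url` contains no fragment, so everything after '?' is the query string.
    */
  def dropParam(url: String, name: String): String = {
    val queryStart = url.indexOf('?')
    if (queryStart < 0) url // no query string, nothing to remove
    else {
      val base = url.substring(0, queryStart)
      val kept = url.substring(queryStart + 1)
        .split("&")
        .filter(pair => pair.nonEmpty && pair.takeWhile(_ != '=') != name)
      if (kept.isEmpty) base else base + "?" + kept.mkString("&")
    }
  }

  def main(args: Array[String]): Unit = {
    val captured =
      "https://s.search.bilibili.com/cate/search?callback=jqueryCallback_bili_9358062349811913&main_ver=v3&page=1"
    println(dropParam(captured, "callback"))
  }
}
```

With Jsoup, the body of the resulting URL can then be read with `Jsoup.connect(url).ignoreContentType(true).execute().body()`; `ignoreContentType(true)` is needed because Jsoup rejects non-HTML/XML content types such as JSON by default.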
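Take another look at this URL: https://s.search.bilibili.com/cate/search?main_ver=v3&search_type=video&view_type=hot_rank&order=click&copy_right=-1&cate_id=138&page=1&pagesize=20&jsonp=jsonp&time_from=20190901&time_to=20190930&_=1567654136524
The parameters worth attention are page (the page number), pagesize (the number of items per page; in testing, 100 was the most that could be fetched in one request), and time_from/time_to, which are dates in yyyyMMdd form; here they cover one month, starting on September 1. What if time_from is changed to 20190801? In testing, data for the most recent three months could be retrieved, that is, 20190701 to 20190930.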
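To page through the list, the URL can be assembled from these parameters. This is a sketch under assumptions: `CateSearchUrl` and `build` are hypothetical names, the fixed parameters are copied unchanged from the URL above, the trailing `_` value appears to be a cache-busting timestamp (its exact role is not confirmed here), and the 100-item limit is the one observed in testing rather than a documented one.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object CateSearchUrl {
  private val Base = "https://s.search.bilibili.com/cate/search"
  // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20190901
  private val Day = DateTimeFormatter.BASIC_ISO_DATE

  /** Builds the category search URL without the callback parameter. */
  def build(cateId: Int, page: Int, pageSize: Int,
            from: LocalDate, to: LocalDate, cacheBuster: Long): String = {
    require(page >= 1, "page starts at 1")
    // 100 is the largest page size that worked in the author's tests
    require(pageSize >= 1 && pageSize <= 100, "pagesize must be between 1 and 100")
    require(!from.isAfter(to), "time_from must not be after time_to")
    val params = Seq(
      "main_ver"    -> "v3",
      "search_type" -> "video",
      "view_type"   -> "hot_rank",
      "order"       -> "click",
      "copy_right"  -> "-1",
      "cate_id"     -> cateId.toString,
      "page"        -> page.toString,
      "pagesize"    -> pageSize.toString,
      "jsonp"       -> "jsonp",
      "time_from"   -> Day.format(from),
      "time_to"     -> Day.format(to),
      "_"           -> cacheBuster.toString
    )
    Base + "?" + params.map { case (k, v) => s"$k=$v" }.mkString("&")
  }

  def main(args: Array[String]): Unit = {
    // cate_id 138 is the value seen in the captured request
    println(build(138, 1, 20, LocalDate.of(2019, 9, 1), LocalDate.of(2019, 9, 30), 1567654136524L))
  }
}
```

Iterating `page` from 1 upward with a fixed `pageSize` walks the ranking; when to stop depends on fields of the response that are not examined here.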
Dynamic data on other pages can be handled the same way: wherever a callback parameter shows up in the request URL, simply drop it.
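Should an endpoint insist on the callback parameter, an alternative is to keep it and strip the wrapper from the response body. This is a different technique from the one used above, sketched here with hypothetical names (`Jsonp`, `unwrap`); it assumes the wrapper has the shape `name(...)` seen in the first response.

```scala
object Jsonp {
  /** Returns the text between the outermost parentheses when `body` looks like
    * `callbackName(...)`; otherwise returns the trimmed body unchanged.
    */
  def unwrap(body: String): String = {
    val text = body.trim
    val open = text.indexOf('(')
    val close = text.lastIndexOf(')')
    // Everything before the first '(' must look like a function name,
    // so plain JSON that merely contains parentheses is left alone.
    val looksWrapped = open > 0 && close > open &&
      text.substring(0, open).forall(c => c.isLetterOrDigit || c == '_' || c == '$' || c == '.')
    if (looksWrapped) text.substring(open + 1, close) else text
  }

  def main(args: Array[String]): Unit = {
    println(unwrap("""jqueryCallback_bili_9358062349811913({"code":0})"""))
  }
}
```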
Also attached are a few URLs for scraping information from an UP's space page
// follower count
https://api.bilibili.com/x/relation/stat?vmid=13354765
{"code":0,"message":"0","ttl":1,"data":{"mid":13354765,"following":49,"whisper":0,"black":0,"follower":2843378}}
// total plays
https://api.bilibili.com/x/space/upstat?mid=13354765
{"code":0,"message":"0","ttl":1,"data":{"archive":{"view":187858995},"article":{"view":0}}}
// total number of videos
https://api.bilibili.com/x/space/navnum?mid=13354765
{"code":0,"message":"0","ttl":1,"data":{"video":169,"bangumi":0,"cinema":1,"channel":{"master":3,"guest":3},"favourite":{"master":0,"guest":0},"tag":0,"article":0,"playlist":0,"album":44,"audio":0}}
// the author's other videos
https://space.bilibili.com/ajax/member/getSubmitVideos?mid=203708804&pagesize=30&tid=0&page=1&keyword=&order=pubdate
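For responses as small as the ones above, a single number can be pulled out without a JSON library. The sketch below is a crude stand-in for real parsing: `UpStats` and `longField` are hypothetical names, and the regex simply takes the first occurrence of the field regardless of nesting. The sample strings are the responses quoted above.

```scala
import java.util.regex.Pattern

object UpStats {
  /** Finds the first `"field": <integer>` pair in `json` and returns the integer, if any. */
  def longField(json: String, field: String): Option[Long] = {
    val pattern = ("\"" + Pattern.quote(field) + "\"\\s*:\\s*(-?\\d+)").r
    pattern.findFirstMatchIn(json).map(_.group(1).toLong)
  }

  def main(args: Array[String]): Unit = {
    val relationStat =
      """{"code":0,"message":"0","ttl":1,"data":{"mid":13354765,"following":49,"whisper":0,"black":0,"follower":2843378}}"""
    val upStat =
      """{"code":0,"message":"0","ttl":1,"data":{"archive":{"view":187858995},"article":{"view":0}}}"""
    println(longField(relationStat, "follower")) // prints Some(2843378)
    println(longField(upStat, "view"))           // prints Some(187858995), the archive view count
  }
}
```

Because only the first match is returned, `view` yields the archive count and never the article count; anything beyond a quick check is better served by a proper JSON parser.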