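*This walkthrough is for learning and technical exchange only; please credit the source when reposting.
*This article sat in my drafts folder for a long time. Other work interrupted the writing, so I am publishing it as is for now and will fill in the rest later (20190410).
This project grew out of an idea of mine: collect song lyrics, plus the popular and more poetic listener comments on songs, as a base dataset, and then combine machine learning and deep learning to build a bot that can write poems on its own.
NetEase Cloud Music's data has long been coveted by independent developers and teams, so the service has put considerable thought into its anti-crawler design. Still, no program can avoid exchanging data with its backend, so however well hidden they are, the corresponding API endpoints can still be worked out. Here I rely directly on endpoint information that others have already reverse-engineered; for details see: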
http://moonlib.com/606.html
http://www.jianshu.com/p/07ebbb142c73
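Notes:
This project sends simulated requests directly to the already known data API endpoints, which is fundamentally different from a conventional crawler that parses HTML: the data needs no extraction or cleaning and can be used as returned. Compared with parsing HTML, the approach is technically less demanding, but its clear advantages, and the work of initializing the parameters that the simulated requests require, are still worth highlighting.
From hands-on testing and my own digging, the main NetEase Cloud Music endpoints are: search, song detail, artist albums, album info, playlist, comments, and songs. In the program, an ApiUtil class holds the static endpoint information. The class is as follows: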
package com.wangcui.wangyiyun.util;
public class ApiUtil {
public static String search = "http://music.163.com/api/search/pc"; // search
public static String songdetail = "http://music.163.com/api/song/detail"; // song detail
public static String artistalbums = "http://music.163.com/api/artist/albums"; // artist albums
public static String songalbum = "http://music.163.com/api/album"; // album info
public static String playlistdetail = "http://music.163.com/api/playlist/detail?id="; // playlist
public static String songpage = "http://music.163.com/#/song?id="; // song page
public static String comments = "http://music.163.com/weapi/v1/resource/comments/"; // + commentThreadId + "?csrf_token=": latest comments
public static String lyric = "http://music.163.com/api/song/lyric?os=pc&id="; // lyrics
public static void setSearch(){
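/*
 * POST http://music.163.com/api/search/pc
 * s: search keywords
 * offset: offset (used for paging)
 * limit: number of results to return
 * type: search type
 *   song 1
 *   album 10
 *   artist 100
 *   playlist 1000
 *   user 1002
 *   mv 1004
 *   lyrics 1006
 *   radio station 1009
 */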
}
public static void setSongdetail(String id) {
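/*
 * demo: http://music.163.com/api/song/detail/?id=28377211&ids=%5B28377211%5D
 * URL: GET http://music.163.com/api/song/detail/
 * Required parameters:
 * id: song ID
 * ids: purpose unknown; the song ID wrapped in [] (URL-encoded as %5B and %5D)
 */
// Build from the fixed base URL so that repeated calls do not keep appending to the previous value.
ApiUtil.songdetail = "http://music.163.com/api/song/detail/?id=" + id + "&ids=%5B" + id + "%5D";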
}
public static void setComments(String threadId){
/*
* URL:POST
* Content-Type:application/x-www-form-urlencoded
* Cookie:usertrack=c+5+hVkB/WOqTQsRBGnpAg==; _ntes_nnid=0ab9650029b73d3e38528cbd34e9cdbb,1493302291675; _ntes_nuid=0ab9650029b73d3e38528cbd34e9cdbb; _ga=GA1.2.211059730.1493302294; Province=0; City=0; JSESSIONID-WYYY=miFgUZkw8jN6JSSg4DIVjStzCII6BHepYQ%5CzmtYY7GHneTzwWP7n58yb0qpd9EAXYMhyVh5iCgc7mZiDdmoimFi1kpwWWz7QvK9vQHZtAEyzk1JTn%2BVzuilNQ%2BkgB9gcjNR9odRYxAWmCC2rTz5r6ewAqXafNC%5CpY%2BfzC4UqN2QAlGOb%3A1493794989514; _iuqxldmzr_=32; __utma=94650624.211059730.1493302294.1493780140.1493793145.2; __utmb=94650624.13.10.1493793145; __utmc=94650624; __utmz=94650624.1493780140.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
* Referer:http://music.163.com/song?id=
* User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
*
* Form Data:
* params:CAgy17LPzKBHVDh663y5zyAi9JY8Z489UpccnopUmdc8r27aGvewb/1FYSqi/Idzz48iwDmggk+SZJuXuE1Yll2tmp7Qhcfic0KTHezf60PI7JACnM5kKhSF+GmjY1bYARrEb5vV6L0RIWgvoDNNXPpnKhJGNxWccfZlOF64V7r1qfb2DURNrDNYgCZnpOHf
* encSecKey:17d9ebdb13099dd2a5e321ffdf573325bd4ade30cb7aa5174fd042aa0cce1c1f29133dc97890daddbd825b24492da0ccaa9a60bc6bdff6903b2c889bb634e7e8dd0dda3f3acd41249e0a41dc6d0f793d09e5011fbd33bb5dfa6623cff1d4f0f0947099a7b4aed00f42f6ac8a25ddc9b9635c5da0b7e0325d07eaa378cbc41786
*
*/
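// Build from the fixed base URL so that repeated calls do not keep appending to the previous value.
ApiUtil.comments = "http://music.163.com/weapi/v1/resource/comments/" + threadId + "?csrf_token=";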
}
public static void setLyric(String id){
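/*
 * GET http://music.163.com/api/song/lyric
 * id: song ID
 * lv: set to -1; whether to fetch the lyric format
 * kv: set to -1; lyrics carrying special stamps
 * tv: set to -1; whether to fetch the tlyric format (translated lyrics)
 * Full request URI: http://music.163.com/api/song/lyric?os=pc&id=93920&lv=-1&kv=-1&tv=-1
 */
// Build from the fixed base URL so that repeated calls do not keep appending to the previous value.
// Only lv and tv are sent here; kv is left out.
ApiUtil.lyric = "http://music.163.com/api/song/lyric?os=pc&id=" + id + "&lv=-1&tv=-1";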
}
public static void setPlaylist(String id){
/*
* GET http://music.163.com/api/playlist/detail?id=
*/
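// Build from the fixed base URL so that repeated calls do not keep appending to the previous value.
ApiUtil.playlistdetail = "http://music.163.com/api/playlist/detail?id=" + id;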
}
}
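To make the search parameters documented in `setSearch` more concrete, here is a minimal sketch that builds the `application/x-www-form-urlencoded` body for a POST to the search endpoint. The class name `SearchFormSketch`, the method `buildSearchBody`, and the constant names are made up for this example and are not part of the original project. The sketch only prepares the body; it does not send a request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original project.
final class SearchFormSketch {
    // Type codes copied from the comment in ApiUtil.setSearch().
    static final int TYPE_SONG = 1;
    static final int TYPE_ALBUM = 10;
    static final int TYPE_ARTIST = 100;
    static final int TYPE_PLAYLIST = 1000;

    // Builds the form body from s (keywords), offset, limit and type.
    static String buildSearchBody(String keywords, int offset, int limit, int type)
            throws UnsupportedEncodingException {
        return "s=" + URLEncoder.encode(keywords, "UTF-8")
                + "&offset=" + offset
                + "&limit=" + limit
                + "&type=" + type;
    }

    public static void main(String[] args) throws Exception {
        // Prints: s=hello+world&offset=0&limit=20&type=1
        System.out.println(buildSearchBody("hello world", 0, 20, TYPE_SONG));
    }
}
```

Paging would then amount to calling `buildSearchBody` again with `offset` increased by `limit`.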
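Note: fetching a song's comments requires simulating a POST request. Two parameters in that request are indispensable, namely params and encSecKey. Both are dynamically generated keys, and how they are generated will be explained later. Every song has its own song ID, and the comments of each song also share one overall ID, the threadId.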
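Since the comment request is a form-encoded POST, the two values have to be URL-encoded before they go into the body: Base64 text can contain `+`, `/` and `=`, which would otherwise be misread by the server. The sketch below shows one way to assemble the URL and the body. `CommentFormSketch` and its methods are hypothetical names, the sample values are placeholders, and generating the real `params` and `encSecKey` is not covered here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original project.
final class CommentFormSketch {
    static final String COMMENTS_BASE = "http://music.163.com/weapi/v1/resource/comments/";

    // Builds the request URL for one comment thread, with an empty csrf_token as in ApiUtil.
    static String commentUrl(String threadId) {
        return COMMENTS_BASE + threadId + "?csrf_token=";
    }

    // URL-encodes both form fields so that '+', '/' and '=' survive transport.
    static String buildCommentBody(String params, String encSecKey)
            throws UnsupportedEncodingException {
        return "params=" + URLEncoder.encode(params, "UTF-8")
                + "&encSecKey=" + URLEncoder.encode(encSecKey, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Placeholder values only; real ones must be generated for the request.
        System.out.println(commentUrl("THREAD_ID"));
        System.out.println(buildCommentBody("ab+c/d=", "0123abcd"));
    }
}
```

Returning a fresh string from `commentUrl` instead of updating a shared static field keeps requests for different songs from interfering with each other.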
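##Program design
The crawler's requirement is to download resources from NetEase Cloud Music to the local machine and store them in a database. The requirement is simple and direct, so the program is split into three main modules: entity and utility classes, ORM classes, and download classes, each in its own package, as shown in the figure below: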
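The entity and utility package contains the entity classes for song-related information and the utility classes the program needs, such as AES encryption, Base64 conversion, and emoji handling. (My MySQL version is fairly old and cannot be configured to store emoji, so emoji need special handling in code; newer MySQL versions can store emoji directly with CHARSET=utf8mb4.)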
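One possible form of that code-level handling is sketched below. It assumes the problem characters are the ones above U+FFFF, which take 4 bytes in UTF-8 and are what a 3-byte MySQL `utf8` column rejects. `EmojiFilterSketch` and `replaceSupplementary` are hypothetical names, not the emoji utility from the original project.

```java
// Hypothetical helper for illustration; not the emoji utility from the original project.
final class EmojiFilterSketch {
    // Replaces every code point above U+FFFF (4 bytes in UTF-8) with the given text.
    static String replaceSupplementary(String input, String replacement) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            if (Character.isSupplementaryCodePoint(codePoint)) {
                out.append(replacement);
            } else {
                out.appendCodePoint(codePoint);
            }
            // Advance by 2 chars for a surrogate pair, otherwise by 1.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String comment = "nice song \uD83D\uDE00!"; // U+1F600 written as a surrogate pair
        System.out.println(replaceSupplementary(comment, "")); // prints: nice song !
    }
}
```

Symbols that live in the Basic Multilingual Plane, such as U+263A, are left alone because they fit in 3 bytes. Passing a marker such as `"[emoji]"` as the replacement keeps a trace of where a character was removed.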
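###Detailed description
####Entity and utility classes
Song entity classes > AlbumUtil/ArtistsUtil/CommentUtil/LyricUtil/SongUtil
These entity classes hold the fields for the corresponding song information together with their set/get methods. As an example, the SongUtil code is: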
package com.wangcui.wangyiyun.util;
public class SongUtil {
public String name;
public String id;
public String position;
public String alias;
public String status;
public String fee;
public ArtistsUtil artists[];
public AlbumUtil album;
public String popularity;
public String score;
public String commentThreadId;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getPosition() {
return position;
}
public void setPosition(String position) {
this.position = position;
}
public String getAlias() {
return alias;
}
public void setAlias(String alias) {
this.alias = alias;
}
public String getStatus() {
return status;
}
public void setStatus(String status) {
this.status = status;
}
public String getFee() {
return fee;
}
public void setFee(String fee) {
this.fee = fee;
}
public ArtistsUtil[] getArtists() {
return artists;
}
public void setArtists(ArtistsUtil[] artists) {
this.artists = artists;
}
public AlbumUtil getAlbum() {
return album;
}
public void setAlbum(AlbumUtil album) {
this.album = album;
}
public String getPopularity() {
return popularity;
}
public void setPopularity(String popularity) {
this.popularity = popularity;
}
public String getScore() {
return score;
}
public void setScore(String score) {
this.score = score;
}
public String getCommentThreadId() {
return commentThreadId;
}
public void setCommentThreadId(String commentThreadId) {
this.commentThreadId = commentThreadId;
}
}
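Utility class AESUtil
Encryption uses AES in CBC mode with PKCS5Padding as the padding scheme. The resulting encrypted byte array must then be converted to Base64. The JDK only introduced a built-in Base64 class (java.util.Base64) in 1.8, so on earlier versions you have to do the Base64 conversion yourself. The source of AESUtil and of the Base64 utility class follows.
AESUtil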
package com.wangcui.wangyiyun.util;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
public class AESUtil {
// AES encryption
public static String encrypt(String text, String secKey) throws Exception {
// Use an explicit charset so the result does not depend on the platform default
byte[] raw = secKey.getBytes("UTF-8");
SecretKeySpec skeySpec = new SecretKeySpec(raw, "AES");
// "algorithm/mode/padding"
Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
// CBC mode requires an initialization vector (IV); a fixed value is used here
IvParameterSpec iv = new IvParameterSpec("0102030405060708".getBytes("UTF-8"));
cipher.init(Cipher.ENCRYPT_MODE, skeySpec, iv);
byte[] encrypted = cipher.doFinal(text.getBytes("UTF-8"));
// return Base64.getEncoder().encodeToString(encrypted); // java.util.Base64 is only available from JDK 1.8
return Base64.encode(encrypted);
}
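// Pad on the left with '0' up to length n; if the input is longer than n, keep only its last n characters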
public static String zfill(String result, int n) {
if (result.length() >= n) {
result = result.substring(result.length() - n, result.length());
} else {
StringBuilder stringBuilder = new StringBuilder();
for (int i = n; i > result.length(); i--) {
stringBuilder.append("0");
}
stringBuilder.append(result);
result = stringBuilder.toString();
}
return result;
}
}
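A quick way to check an AES/CBC/PKCS5Padding setup like the one above is a round trip: encrypt, Base64-encode, decode, decrypt, and compare. The sketch below does that on JDK 8+ using `java.util.Base64` in place of the hand-written class. `AesRoundTripSketch`, the 16-byte key, and the sample text are all made up for illustration; only the transformation string and the IV text come from AESUtil.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical round-trip check (JDK 8+); not part of the original project.
final class AesRoundTripSketch {
    private static final String TRANSFORMATION = "AES/CBC/PKCS5Padding";
    private static final String IV_TEXT = "0102030405060708"; // same IV text as AESUtil

    // Creates a cipher in the given mode with the key and the fixed IV.
    private static Cipher newCipher(int mode, String key) throws Exception {
        Cipher cipher = Cipher.getInstance(TRANSFORMATION);
        SecretKeySpec keySpec = new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES");
        IvParameterSpec iv = new IvParameterSpec(IV_TEXT.getBytes(StandardCharsets.UTF_8));
        cipher.init(mode, keySpec, iv);
        return cipher;
    }

    // Encrypts the text and returns the ciphertext as Base64.
    static String encrypt(String text, String key) throws Exception {
        byte[] encrypted = newCipher(Cipher.ENCRYPT_MODE, key)
                .doFinal(text.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(encrypted);
    }

    // Reverses encrypt(): Base64-decodes, then decrypts with the same key and IV.
    static String decrypt(String base64, String key) throws Exception {
        byte[] decrypted = newCipher(Cipher.DECRYPT_MODE, key)
                .doFinal(Base64.getDecoder().decode(base64));
        return new String(decrypted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String key = "0123456789abcdef"; // arbitrary 16-byte key (AES-128), illustration only
        String cipherText = encrypt("hello netease", key);
        System.out.println(cipherText);
        System.out.println(decrypt(cipherText, key)); // prints: hello netease
    }
}
```

The key must be 16, 24 or 32 bytes long for AES. With PKCS5Padding, a plaintext shorter than 16 bytes always produces one 16-byte block, which is 24 characters of Base64.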
Base64 (adapted from code found online)
package com.wangcui.wangyiyun.util;
public class Base64 {
private static final int BASELENGTH = 128;
private static final int LOOKUPLENGTH = 64;
private static final int TWENTYFOURBITGROUP = 24;
private static final int EIGHTBIT = 8;
private static final int SIXTEENBIT = 16;
private static final int FOURBYTE = 4;
private static final int SIGN = -128;
private static char PAD = '=';
private static byte[] base64Alphabet = new byte[BASELENGTH];
private static char[] lookUpBase64Alphabet = new char[LOOKUPLENGTH];
static {
for (int i = 0; i < BASELENGTH; ++i) {
base64Alphabet[i] = -1;
}
for (int i = 'Z'; i >= 'A'; i--) {
base64Alphabet[i] = (byte) (i - 'A');
}
for (int i = 'z'; i >= 'a'; i--) {
base64Alphabet[i] = (byte) (i - 'a' + 26);
}
for (int i = '9'; i >= '0'; i--) {
base64Alphabet[i] = (byte) (i - '0' + 52);
}
base64Alphabet['+'] = 62;
base64Alphabet['/'] = 63;
for (int i = 0; i <= 25; i++) {
lookUpBase64Alphabet[i] = (char) ('A' + i);
}
for (int i = 26, j = 0; i <= 51; i++, j++) {
lookUpBase64Alphabet[i] = (char) ('a' + j);
}
for (int i = 52, j = 0; i <= 61; i++, j++) {
lookUpBase64Alphabet[i] = (char) ('0' + j);
}
lookUpBase64Alphabet[62] = (char) '+';
lookUpBase64Alphabet[63] = (char) '/';
}
private static boolean isWhiteSpace(char octect) {
return (octect == 0x20 || octect == 0xd || octect == 0xa || octect == 0x9);
}
private static boolean isPad(char octect) {
return (octect == PAD);
}
private static boolean isData(char octect) {
return (octect < BASELENGTH && base64Alphabet[octect] != -1);
}
/**
* Encodes binary octets into Base64
*
* @param binaryData
* Array containing binaryData
* @return Encoded Base64 array
*/
public static String encode(byte[] binaryData) {
if (binaryData == null) {
return null;
}
int lengthDataBits = binaryData.length * EIGHTBIT;
if (lengthDataBits == 0) {
return "";
}
int fewerThan24bits = lengthDataBits % TWENTYFOURBITGROUP;
int numberTriplets = lengthDataBits / TWENTYFOURBITGROUP;
int numberQuartet = fewerThan24bits != 0 ? numberTriplets + 1
: numberTriplets;
char encodedData[] = null;
encodedData = new char[numberQuartet * 4];
byte k = 0, l = 0, b1 = 0, b2 = 0, b3 = 0;
int encodedIndex = 0;
int dataIndex = 0;
for (int i = 0; i < numberTriplets; i++) {
b1 = binaryData[dataIndex++];
b2 = binaryData[dataIndex++];
b3 = binaryData[dataIndex++];
l = (byte) (b2 & 0x0f);
k = (byte) (b1 & 0x03);
byte val1 = ((b1 & SIGN) == 0) ? (byte) (b1 >> 2)
: (byte) ((b1) >> 2 ^ 0xc0);
byte val2 = ((b2 & SIGN) == 0) ? (byte) (b2 >> 4)
: (byte) ((b2) >> 4 ^ 0xf0);
byte val3 = ((b3 & SIGN) == 0) ? (byte) (b3 >> 6)
: (byte) ((b3) >> 6 ^ 0xfc);
encodedData[encodedIndex++] = lookUpBase64Alphabet[val1];
encodedData[encodedIndex++] = lookUpBase64Alphabet[val2 | (k << 4)];
encodedData[encodedIndex++] = lookUpBase64Alphabet[(l << 2) | val3];
encodedData[encodedIndex++] = lookUpBase64Alphabet[b3 & 0x3f];
}
// form integral number of 6-bit groups
if (fewerThan24bits == EIGHTBIT) {
b1 = binaryData[dataIndex];
k = (byte) (b1 & 0x03);
byte val1 = ((b1 & SIGN) == 0) ? (byte) (b1 >> 2)
: (byte) ((b1) >> 2 ^ 0xc0);
encodedData[encodedIndex++] = lookUpBase64Alphabet[val1];
encodedData[encodedIndex++] = lookUpBase64Alphabet[k << 4];
encodedData[encodedIndex++] = PAD;
encodedData[encodedIndex++] = PAD;
} else if (fewerThan24bits == SIXTEENBIT) {
b1 = binaryData[dataIndex];
b2 = binaryData[dataIndex + 1];
l = (byte) (b2 & 0x0f);
k = (byte) (b1 & 0x03);
byte val1 = ((b1 & SIGN) == 0) ? (byte) (b1 >> 2)
: (byte) ((b1) >> 2 ^ 0xc0);
byte val2 = ((b2 & SIGN) == 0) ? (byte) (b2 >> 4)
: (byte) ((b2) >> 4 ^ 0xf0);
encodedData[encodedIndex++] = lookUpBase64Alphabet[val1];
encodedData[encodedIndex++] = lookUpBase64Alphabet[val2 | (k << 4)];
encodedData[encodedIndex++] = lookUpBase64Alphabet[l << 2];
encodedData[encodedIndex++] = PAD;
}
return new String(encodedData);
}
/**
* Decodes Base64 data into octets
*
* @param encoded
* string containing Base64 data
* @return Array containing decoded data, or null if the input is not valid Base64.
*/
public static byte[] decode(String encoded) {
if (encoded == null) {
return null;
}
char[] base64Data = encoded.toCharArray();
// remove white spaces
int len = removeWhiteSpace(base64Data);
if (len % FOURBYTE != 0) {
return null;// should be divisible by four
}
int numberQuadruple = (len / FOURBYTE);
if (numberQuadruple == 0) {
return new byte[0];
}
byte decodedData[] = null;
byte b1 = 0, b2 = 0, b3 = 0, b4 = 0;
char d1 = 0, d2 = 0, d3 = 0, d4 = 0;
int i = 0;
int encodedIndex = 0;
int dataIndex = 0;
decodedData = new byte[(numberQuadruple) * 3];
for (; i < numberQuadruple - 1; i++) {
if (!isData((d1 = base64Data[dataIndex++]))
|| !isData((d2 = base64Data[dataIndex++]))
|| !isData((d3 = base64Data[dataIndex++]))
|| !isData((d4 = base64Data[dataIndex++]))) {
return null;
}// if found "no data" just return null
b1 = base64Alphabet[d1];
b2 = base64Alphabet[d2];
b3 = base64Alphabet[d3];
b4 = base64Alphabet[d4];
decodedData[encodedIndex++] = (byte) (b1 << 2 | b2 >> 4);
decodedData[encodedIndex++] = (byte) (((b2 & 0xf) << 4) | ((b3 >> 2) & 0xf));
decodedData[encodedIndex++] = (byte) (b3 << 6 | b4);
}
if (!isData((d1 = base64Data[dataIndex++]))
|| !isData((d2 = base64Data[dataIndex++]))) {
return null;// if found "no data" just return null
}
b1 = base64Alphabet[d1];
b2 = base64Alphabet[d2];
d3 = base64Data[dataIndex++];
d4 = base64Data[dataIndex++];
if (!isData((d3)) || !isData((d4))) {// Check if they are PAD characters
if (isPad(d3) && isPad(d4)) {
if ((b2 & 0xf) != 0)// last 4 bits should be zero
{
return null;
}
byte[] tmp = new byte[i * 3 + 1];
System.arraycopy(decodedData, 0, tmp, 0, i * 3);
tmp[encodedIndex] = (byte) (b1 << 2 | b2 >> 4);
return tmp;
} else if (!isPad(d3) && isPad(d4)) {
b3 = base64Alphabet[d3];
if ((b3 & 0x3) != 0)// last 2 bits should be zero
{
return null;
}
byte[] tmp = new byte[i * 3 + 2];
System.arraycopy(decodedData, 0, tmp, 0, i * 3);
tmp[encodedIndex++] = (byte) (b1 << 2 | b2 >> 4);
tmp[encodedIndex] = (byte) (((b2 & 0xf) << 4) | ((b3 >> 2) & 0xf));
return tmp;
} else {
return null;
}
} else { // No PAD e.g 3cQl
b3 = base64Alphabet[d3];
b4 = base64Alphabet[d4];
decodedData[encodedIndex++] = (byte) (b1 << 2 | b2 >> 4);
decodedData[encodedIndex++] = (byte) (((b2 & 0xf) << 4) | ((b3 >> 2) & 0xf));
decodedData[encodedIndex++] = (byte) (b3 << 6 | b4);
}
return decodedData;
}
/**
* Removes whitespace in place from MIME text containing encoded Base64 data.
*
* @param data
* the char array of Base64 data (may contain whitespace)
* @return the new length
*/
private static int removeWhiteSpace(char[] data) {
if (data == null) {
return 0;
}
// compact the non-whitespace characters to the front and count them
int newSize = 0;
int len = data.length;
for (int i = 0; i < len; i++) {
if (!isWhiteSpace(data[i])) {
data[newSize++] = data[i];
}
}
return newSize;
}
}
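On JDK 8 or later, the built-in `java.util.Base64` covers the same encode and decode operations, so the hand-written class is only needed on older runtimes. A minimal sketch follows; `JdkBase64Sketch` is a hypothetical wrapper name used only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical wrapper around the JDK 8+ Base64 API, for illustration only.
final class JdkBase64Sketch {
    // Encodes raw bytes with the standard Base64 alphabet and '=' padding.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Decodes standard Base64 text back into bytes.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        String encoded = encode("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded); // prints: aGVsbG8=
        System.out.println(new String(decode(encoded), StandardCharsets.UTF_8)); // prints: hello
    }
}
```

The error handling differs from the class above: the hand-written `decode` strips whitespace and returns null for invalid input, while `Base64.getDecoder()` throws `IllegalArgumentException` on any character outside the alphabet, whitespace included. `Base64.getMimeDecoder()` is the variant that tolerates line breaks.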
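I will update the remaining content when I find the time; work is busy.
For technical discussion, feel free to contact me by email.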