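When scraping with R in practice, the right approach depends on the page. For static pages, the rvest package is enough. For data that a page loads dynamically, rvest alone is usually no longer suitable; you need a package such as RCurl or httr that exposes rich request parameters. This post focuses on httr. Although httr is much leaner than RCurl, it still contains many functions; in everyday scraping, the two used most are GET and POST.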
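The example below is a typical POST-based scraper, so let's first look at the general usage of POST: POST(url = NULL, config = list(), ..., body = NULL, encode = c("multipart", "form", "json", "raw"), handle = NULL). The most important arguments are config (which sets the request headers and cookies) and body (which carries the query parameters). The arguments are explained below: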
- url: the url of the page to retrieve
- config: Additional configuration settings such as http authentication (authenticate), additional headers (add_headers), cookies (set_cookies) etc. See config for full details and list of helpers.
- ...: Further named parameters, such as query, path, etc, passed on to modify_url. Unnamed parameters will be combined with config.
- body: One of the following:
  - FALSE: No body. This is typically not used with POST, PUT, or PATCH, but can be useful if you need to send a bodyless request (like GET) with VERB().
  - NULL: An empty body
  - "": A length 0 body
  - upload_file("path/"): The contents of a file. The mime type will be guessed from the extension, or can be supplied explicitly as the second argument to upload_file()
  - A character or raw vector: sent as is in body. Use content_type to tell the server what sort of data you are sending.
  - A named list: See details for encode. (This is the most commonly used form.)
- encode: If the body is a named list, how should it be encoded? Can be one of form (application/x-www-form-urlencoded), multipart (multipart/form-data), or json (application/json). For "multipart", list elements can be strings or objects created by upload_file. For "form", elements are coerced to strings and escaped, use I() to prevent double-escaping. For "json", parameters are automatically "unboxed" (i.e. length 1 vectors are converted to scalars). To preserve a length 1 vector as a vector, wrap in I(). For "raw", either a character or raw vector. You'll need to make sure to set the content_type() yourself.
- handle: The handle to use with this request. If not supplied, will be retrieved and reused from the handle_pool based on the scheme, hostname and port of the url. By default httr automatically reuses the same http connection (aka handle) for multiple requests to the same scheme/host/port combo. This substantially reduces connection time, and ensures that cookies are maintained over multiple requests to the same host. See handle_pool for more details.
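The note on "unboxing" for encode = "json" is easier to see with a small offline example. The sketch below calls jsonlite directly, which should approximate what is sent when a named list is encoded as JSON; the `tags` field is hypothetical and is there only to show the effect of I().

```r
library(jsonlite)

# A named list like the one passed to POST(body = ..., encode = "json").
# `tags` is a hypothetical field used only to demonstrate I().
body <- list(pageIndex = 1, keyword = "", tags = I("r"))

# auto_unbox = TRUE turns length-1 vectors into JSON scalars,
# except those protected with I(), which remain JSON arrays.
json_body <- toJSON(body, auto_unbox = TRUE)
print(json_body)
```

Without I(), `tags` would be serialized as a plain string instead of a one-element array, which matters when the server expects an array.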
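To better understand how httr::POST scrapes pages whose data is loaded asynchronously, here is a short walkthrough using NetEase Cloud Classroom (网易云课堂) as the example.
NetEase Cloud Classroom example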
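Open NetEase Cloud Classroom and go to the programming courses page at https://study.163.com/category/480000003131009. The browser's developer tools show the request and response for this URL, which means the content of this page can be fetched from this URL.
For the second page, a site would commonly append a page number to the URL, and requesting that URL would return the second page. Here, however, clicking page 2, 3, ... does append a page number to the address bar, but the developer tools (remember to refresh) show no new page request with a page number; the requested URL stays the same, as shown in the screenshot below. This means that URLs of the form 【https://study.163.com/category/480000003131009#/?p=<number>】 cannot be used to fetch the course information.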
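One way to see why these URLs all lead to the same request: everything after "#" is a URL fragment, which the browser keeps on the client side and does not send to the server. A minimal sketch in base R:

```r
# Page URLs as they appear in the address bar for pages 1 to 3.
page_urls <- paste0("https://study.163.com/category/480000003131009#/?p=", 1:3)

# Strip the fragment (everything from "#" onward) to get the URL
# that is actually requested from the server.
requested <- sub("#.*$", "", page_urls)
print(unique(requested))
```

All three page URLs collapse to the same requested URL, so the page number has to reach the server some other way.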
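At this point, suspect asynchronous loading. Go back to the first page, open the developer tools, and switch to the XHR panel. Then click through pages 2, 3, 4, ...: each click produces new entries in the XHR panel, among them studycourse.json, whose response is the course information. Comparing the different studycourse.json requests shows that they differ in their request payload. Each studycourse.json holds 50 courses, and there are 13 of them in total. So, to get the courses, send a POST request with these parameters to https://study.163.com/p/search/studycourse.json and read the course information from the response.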
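Before scraping, it helps to go over the items in the developer tools that the scraper relies on: Request URL, Request Method and Status Code under General; Content-Type under Response Headers; Accept, Content-Type, Cookie, Referer, User-Agent and so on under Request Headers; and finally every parameter under Form Data / Request Payload.
- Request URL and Request Method under General determine which resource is accessed and which HTTP method is used.
- Content-Type under Response Headers tells you the format in which the data is returned.
- Accept, Content-Type, Cookie, Referer, User-Agent and so on under Request Headers describe your client (the browser). Cookie holds the login state that the browser stores locally after you log in; sending it can keep the scraper from being rejected repeatedly. Not all of these headers have to be submitted.
- Form Data / Request Payload is the most important part. It carries the information a POST request needs to locate the data: the course listing has many pages, but they are all served from the same address (the Request URL under General), and what actually switches pages is the form information in Form Data / Request Payload.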
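When working with the Cookie header copied from the developer tools, it can help to split the string into individual name/value pairs for inspection. The helper below is a hypothetical sketch in base R, and the sample cookie string is made up.

```r
# Parse a Cookie header ("name=value; name2=value2") into a named
# character vector. `parse_cookie_string` is a hypothetical helper.
parse_cookie_string <- function(cookie_string) {
  pairs <- strsplit(cookie_string, ";\\s*")[[1]]
  pairs <- pairs[nzchar(pairs)]
  # Split on the first "=" only, because values may contain "=" themselves.
  eq_pos <- regexpr("=", pairs, fixed = TRUE)
  values <- substring(pairs, eq_pos + 1)
  names(values) <- substring(pairs, 1, eq_pos - 1)
  values
}

cookies <- parse_cookie_string("sessionid=abc123; token=eHl6PT0=; theme=dark")
print(cookies)
```

A named vector like this could in principle be handed to httr's set_cookies(.cookies = ...), although this article sends the raw string as a cookie header instead.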
How to do it in practice?
1. Load the required R packages; install any that are missing first.
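rm(list = ls())
library("httr")      # HTTP requests: POST, add_headers, content
library("dplyr")     # select(); also provides the %>% pipe
library("jsonlite")  # toJSON / fromJSON
library("magrittr")
# curl, rlist, pipeR and plyr are not used below; loading plyr after dplyr also masks several dplyr functions.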
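2. Build the URL, taken from the XHR panel, as shown in the screenshot below.
url <- 'https://study.163.com/p/search/studycourse.json'
3. Build the request headers, filled in from the Request Headers section, as shown in the screenshot below.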
# Paste the Cookie value from your own browser's Request Headers here, as a single line without line breaks.
# The author's original cookie is omitted: it belonged to a personal login session and has expired.
mycookie <- '<your Cookie string>'
# Compare several requests and keep the headers that do not change.
# In the author's session, edu-script-token was identical to the NTESSTUDYSI cookie value, so replace it with your own as well.
myheaders <- c('accept' ='application/json',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'zh-CN,zh;q=0.9',
'content-type' = 'application/json',
'edu-script-token' = '858bed85fc9541a18c43cd5a6934f38e',
'origin' = 'https://study.163.com',
'referer' = 'https://study.163.com/category/480000003131009',
'sec-fetch-dest' = 'empty',
'sec-fetch-mode' = 'cors',
'sec-fetch-site' = 'same-origin',
'user-agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
'cookie' = mycookie)
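4. Build the request body from the Request Payload, as shown in the screenshot below. Comparing requests carefully shows that pageIndex and relativeOffset change in a regular way: for page i, "pageIndex" = i and "relativeOffset" = 50*(i-1). The code below is the Request Payload for the first page.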
mypayload <- list("pageIndex"= 1,
"pageSize"= 50,
"relativeOffset"= 0,
"frontCategoryId"= "480000003131009",
"searchTimeType"= -1,
"orderType"= 50,
"priceType"= -1,
"activityId"= 0,
"keyword"= "")
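The paging rule described above can be wrapped in a small function so that the payload for any page is built the same way. `make_payload` is a hypothetical helper name; the field values are the ones from the payload above.

```r
# Build the request payload for a given page.
# pageIndex and relativeOffset follow the rule observed in the XHR panel.
make_payload <- function(page, page_size = 50) {
  list(
    pageIndex = page,
    pageSize = page_size,
    relativeOffset = page_size * (page - 1),
    frontCategoryId = "480000003131009",
    searchTimeType = -1,
    orderType = 50,
    priceType = -1,
    activityId = 0,
    keyword = ""
  )
}

# Offsets for the 13 pages: 0, 50, 100, ..., 600.
offsets <- vapply(1:13, function(i) make_payload(i)$relativeOffset, numeric(1))
print(offsets)
```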
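5. Request the first page with httr's POST function. Which function to use is determined by Request Method under General, as shown in the screenshot below.
# General template (the <...> parts are placeholders, so this is not runnable as written):
# POST(url,
#      add_headers(.headers = <request headers of the target page>),
#      set_cookies(.cookies = <your own cookies>),
#      body = <parameters from Form Data / Request Payload>,
#      encode = "multipart" / "form" / "json" / "raw",
#      timeout(<maximum request time in seconds>),
#      use_proxy(<proxy IP>), ...)
# A POST request is more involved than a GET request: its query parameters must be placed in the request body,
# and they must be encoded in the format announced by content-type in the request headers before they are sent.
Four encodings are common; the first two are seen most often:
application/x-www-form-urlencoded -> encode = "form"
application/json -> encode = "json"
multipart/form-data -> encode = "multipart"
text/xml -> encode = "raw" (you need to set content_type() yourself)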
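To make the difference between the encodings more concrete, the sketch below keeps the encode-to-Content-Type mapping in a named vector and builds a form-encoded body by hand in base R. `to_form_body` is a hypothetical helper that only approximates what a form-encoded body looks like; it is not how httr implements it.

```r
# Mapping between the `encode` argument and the Content-Type it corresponds to.
encode_to_content_type <- c(
  form      = "application/x-www-form-urlencoded",
  json      = "application/json",
  multipart = "multipart/form-data"
)

# Rough illustration of form encoding: name=value pairs joined by "&",
# with reserved characters percent-escaped.
to_form_body <- function(params) {
  values <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste(names(params), values, sep = "=", collapse = "&")
}

form_body <- to_form_body(list(pageIndex = 2, keyword = "data science"))
print(form_body)
```

The same parameters sent with encode = "json" would instead travel as a JSON object, which is what the studycourse.json endpoint in this article expects.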
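Following the POST usage above, send the request; the server responds with the course data in JSON format.
response <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
# Extract the parsed body: it contains 4 lists. Take the 3rd, then its 2nd element, which holds the 50 course records (still a list at this point).
# Then round-trip through toJSON and fromJSON to turn the records into a data frame.
# auto_unbox and null = "null" keep scalar fields as scalars and turn NULL into NA.
result <- response %>% content() %>% `[[`(3) %>% `[[`(2) %>% toJSON(auto_unbox = TRUE, null = "null") %>% fromJSON(simplifyDataFrame = TRUE)
colnames(result)  # see which fields are available
usefulname <- c("productId","courseId","productName","lectorName","provider","score","scoreLevel","learnerCount","originalPrice","discountPrice","discountRate","description")
result <- result %>% select(all_of(usefulname))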
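The positional indexing above is easier to follow on a small offline example. The JSON below is made up to mimic the nesting described (four top-level elements, with the course list as the second element of the third); the top-level field names are assumptions, not the real response.

```r
library(jsonlite)

# A made-up response with the same nesting as described in the text.
# The names "code", "message", "result", "query", "list" and "uuid" are assumptions.
mock_json <- '{
  "code": 0,
  "message": "ok",
  "result": {
    "query": {"pageIndex": 1},
    "list": [
      {"productId": 1, "productName": "Course A", "score": 4.8},
      {"productId": 2, "productName": "Course B", "score": null}
    ]
  },
  "uuid": "demo"
}'

parsed <- fromJSON(mock_json)
courses <- parsed[[3]][[2]]   # same positions as `[[`(3) %>% `[[`(2)
courses <- courses[, c("productId", "productName", "score")]
print(courses)
```

Parsing the JSON text directly yields a data frame in which JSON null becomes NA. Indexing by name instead of position is more robust if the field names of the real response are known.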
The result for the first page is shown below, 50 course records in total:
6. Fetch all pages, 13 in total, with a loop.
myfullresult <- list()
for (i in 1:13) {
  print(paste0("Fetching page ", i))
  mypayload <- list("pageIndex" = i,
                    "pageSize" = 50,
                    "relativeOffset" = 50 * (i - 1),
                    "frontCategoryId" = "480000003131009",
                    "searchTimeType" = -1,
                    "orderType" = 50,
                    "priceType" = -1,
                    "activityId" = 0,
                    "keyword" = "")
  web <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
  myresult <- web %>% content() %>% `[[`(3) %>% `[[`(2)
  myfullresult <- c(myfullresult, myresult)
}
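The loop structure can be tried out without touching the network by passing the page-fetching step in as a function. In the sketch below, `collect_pages` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the POST call and simply returns made-up records.

```r
# Collect records from n_pages pages; fetch_page(i) must return
# a list of records for page i.
collect_pages <- function(n_pages, fetch_page) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    records <- fetch_page(i)
    if (length(records) == 0) break   # stop early when a page comes back empty
    pages[[i]] <- records
  }
  # Flatten the per-page lists into one list of course records.
  do.call(c, pages)
}

# Stand-in for the real request: two full pages of 50 and a last page of 27.
fake_fetch <- function(i) {
  n <- c(50, 50, 27)[i]
  lapply(seq_len(n), function(k) list(productId = (i - 1) * 50 + k))
}

all_records <- collect_pages(3, fake_fetch)
print(length(all_records))
```

In real use, fetch_page would wrap the POST call and the two `[[` steps from the loop above, and could also pause briefly between requests.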
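7. Then tidy the data and write it to disk.
# The data collected above is a list; convert it to a data frame and keep only the columns of interest.
mydata <- do.call(rbind, myfullresult) %>% as.data.frame() %>% select(all_of(usefulname))
# mydata is a data frame, but each column is still a list (because the raw data contains many NULL values).
# Replace every NULL with NA so that each column of mydata can be converted to a vector.
# Replace NULL values
for (j in seq_along(mydata)) {
  for (i in seq_len(nrow(mydata))) {
    if (is.null(mydata[i, j][[1]])) {
      mydata[i, j][[1]] <- NA
    }
  }
}
# Convert all list columns to vectors
for (i in usefulname) {
  mydata[[i]] <- mydata[[i]] %>% unlist()
}
# Deduplicate and save
mydata <- unique(mydata)
write.csv(mydata, file = "course.csv")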
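The NULL-to-NA step can also be written without nested loops. The sketch below shows the idea on two small made-up list columns; `null_to_na` is a hypothetical helper, and it assumes every non-NULL element has length 1.

```r
# Replace NULL elements of a list with NA, then flatten to a vector.
null_to_na <- function(col) {
  col[vapply(col, is.null, logical(1))] <- NA
  unlist(col)
}

# Made-up list columns containing NULL, as they might come out of content().
scores <- list(4.5, NULL, 3.9)
product_names <- list("A", "B", NULL)

clean <- data.frame(
  productName = null_to_na(product_names),
  score = null_to_na(scores),
  stringsAsFactors = FALSE
)
print(clean)
```

Applied to mydata, the same helper could be used column by column, for example with lapply over usefulname.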
All course records are shown below, 627 in total.
Finally, the complete code:
rm(list = ls())
library("httr")      # HTTP requests: POST, add_headers, content
library("dplyr")     # select(); also provides the %>% pipe
library("jsonlite")  # toJSON / fromJSON
library("magrittr")
# Build the URL, taken from the XHR panel
url <- 'https://study.163.com/p/search/studycourse.json'
# Build the request headers
# Paste the Cookie value from your own browser's Request Headers here, as a single line without line breaks.
mycookie <- '<your Cookie string>'
# Compare several requests and keep the headers that do not change.
# Replace edu-script-token with the value from your own session as well.
myheaders <- c('accept' ='application/json',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'zh-CN,zh;q=0.9',
'content-type' = 'application/json',
'edu-script-token' = '858bed85fc9541a18c43cd5a6934f38e',
'origin' = 'https://study.163.com',
'referer' = 'https://study.163.com/category/480000003131009',
'sec-fetch-dest' = 'empty',
'sec-fetch-mode' = 'cors',
'sec-fetch-site' = 'same-origin',
'user-agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
'cookie' = mycookie)
# Build the request body (payload) as a list; pageIndex and relativeOffset are the parameters that change from page to page
mypayload <- list("pageIndex"= 1,
"pageSize"= 50,
"relativeOffset"= 0,
"frontCategoryId"= "480000003131009",
"searchTimeType"= -1,
"orderType"= 50,
"priceType"= -1,
"activityId"= 0,
"keyword"= "")
# Request the first page with httr's POST function
response <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
# Extract the parsed body: it contains 4 lists. Take the 3rd, then its 2nd element, which holds the 50 course records (still a list at this point).
# Then round-trip through toJSON and fromJSON to turn the records into a data frame.
result <- response %>% content() %>% `[[`(3) %>% `[[`(2) %>% toJSON(auto_unbox = TRUE, null = "null") %>% fromJSON(simplifyDataFrame = TRUE)
colnames(result)  # see which fields are available
usefulname <- c("productId","courseId","productName","lectorName","provider","score","scoreLevel","learnerCount","originalPrice","discountPrice","discountRate","description")
result <- result %>% select(all_of(usefulname))
# Fetch all pages, 13 in total, with a loop
myfullresult <- list()
for (i in 1:13) {
  print(paste0("Fetching page ", i))
  mypayload <- list("pageIndex" = i,
                    "pageSize" = 50,
                    "relativeOffset" = 50 * (i - 1),
                    "frontCategoryId" = "480000003131009",
                    "searchTimeType" = -1,
                    "orderType" = 50,
                    "priceType" = -1,
                    "activityId" = 0,
                    "keyword" = "")
  web <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
  myresult <- web %>% content() %>% `[[`(3) %>% `[[`(2)
  myfullresult <- c(myfullresult, myresult)
}
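# The data collected above is a list; convert it to a data frame and keep only the columns of interest.
mydata <- do.call(rbind, myfullresult) %>% as.data.frame() %>% select(all_of(usefulname))
# mydata is a data frame, but each column is still a list (because the raw data contains many NULL values).
# Replace every NULL with NA so that each column of mydata can be converted to a vector.
# Replace NULL values
for (j in seq_along(mydata)) {
  for (i in seq_len(nrow(mydata))) {
    if (is.null(mydata[i, j][[1]])) {
      mydata[i, j][[1]] <- NA
    }
  }
}
# Convert all list columns to vectors
for (i in usefulname) {
  mydata[[i]] <- mydata[[i]] %>% unlist()
}
# Deduplicate and save
mydata <- unique(mydata)
write.csv(mydata, file = "course.csv")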
This post covered how to scrape with POST requests using httr; the next one will cover GET requests with httr.
Reference: https://cloud.tencent.com/developer/article/1092893