R Web Scraping Essentials: httr and POST-Request Scraping (NetEase Cloud Classroom)

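In practice, the right scraping approach in R depends on the page. For static pages the rvest package is enough. For data that a page loads dynamically, rvest alone is usually not a good fit; you need a package such as RCurl or httr that exposes the full set of request parameters. This post covers httr. It is considerably leaner than RCurl but still has many functions; in routine scraping the two used most are GET and POST.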

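The case study below is a typical POST-request scraper, so let's first look at the general usage of POST: POST(url = NULL, config = list(), ..., body = NULL, encode = c("multipart", "form", "json", "raw"), handle = NULL). The most important arguments are config (which sets request headers and cookies) and body (the data sent in the request body). The arguments are documented as follows: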

  • url: The URL of the page to retrieve.
  • config: Additional configuration settings such as HTTP authentication (authenticate), additional headers (add_headers), cookies (set_cookies), etc. See config for full details and the list of helpers.
  • ...: Further named parameters, such as query, path, etc., passed on to modify_url. Unnamed parameters will be combined with config.
  • body: One of the following:
      • FALSE: No body. This is typically not used with POST, PUT, or PATCH, but can be useful if you need to send a bodyless request (like GET) with VERB().
      • NULL: An empty body.
      • "": A length 0 body.
      • upload_file("path/"): The contents of a file. The MIME type will be guessed from the extension, or can be supplied explicitly as the second argument to upload_file().
      • A character or raw vector: sent as is in the body. Use content_type to tell the server what sort of data you are sending.
      • A named list: See the details for encode. (This is the most common case.)
  • encode: If the body is a named list, how should it be encoded? Can be one of form (application/x-www-form-urlencoded), multipart (multipart/form-data), or json (application/json). For "multipart", list elements can be strings or objects created by upload_file. For "form", elements are coerced to strings and escaped; use I() to prevent double-escaping. For "json", parameters are automatically "unboxed" (i.e. length 1 vectors are converted to scalars). To preserve a length 1 vector as a vector, wrap it in I(). For "raw", either a character or raw vector; you'll need to make sure to set the content_type() yourself.
  • handle: The handle to use with this request. If not supplied, it will be retrieved and reused from the handle_pool based on the scheme, hostname and port of the URL. By default httr reuses the same handle for requests to the same scheme/host/port combo. This substantially reduces connection time, and ensures that cookies are maintained over multiple requests to the same host. See handle_pool for more details.
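As a rough illustration of what the encode argument does, the sketch below builds by hand approximately what encode = "form" would place in the request body for a small named list. form_encode is a hypothetical helper written for this post, not part of httr, and it only percent-encodes the values; httr performs its own escaping internally.

```r
# Hypothetical helper: approximate an application/x-www-form-urlencoded body
# for a named list (base R only; names are assumed to need no escaping).
form_encode <- function(body) {
  stopifnot(is.list(body), !is.null(names(body)))
  values <- vapply(
    body,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(names(body), "=", values, collapse = "&")
}

encoded <- form_encode(list(pageIndex = 2, keyword = "data science"))
print(encoded)
```

With encode = "json", the same named list would instead be serialised as a JSON object, which is what the case study below relies on.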

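To make it clearer how httr::POST can scrape a page that loads its data asynchronously, the rest of this post walks through NetEase Cloud Classroom as an example.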

NetEase Cloud Classroom case study

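Open NetEase Cloud Classroom and go to the programming courses page at https://study.163.com/category/480000003131009. The browser's developer tools show a request and a response for this URL, which means the information on this page can be obtained from the URL directly.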


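Now for the second page. Usually a site appends the page number to the URL, and requesting that URL returns the second page. On NetEase Cloud Classroom, clicking page 2, 3, ... does add a page number to the address bar, but the developer tools (remember to refresh) show no request for a URL containing the page number; the requested URL stays the same. The page number sits after the #, in the URL fragment, which the browser never sends to the server. So URLs of the form https://study.163.com/category/480000003131009#/?p=N cannot be used to fetch the course information.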
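A small sketch of why the page number never reaches the server: everything from the # onward is the URL fragment, which stays in the browser. The snippet splits one such address with base R; the variable names are only illustrative.

```r
# Address bar content after clicking page 2 (pattern taken from the text above)
page_url <- "https://study.163.com/category/480000003131009#/?p=2"

# Part actually requested from the server: everything before the first '#'
request_url <- sub("#.*$", "", page_url)

# Fragment kept client-side and used only by the page's JavaScript
fragment <- sub("^[^#]*#?", "", page_url)

print(request_url)
print(fragment)
```

Because request_url is identical for every page, the paging information has to travel some other way, which is what the next step looks for.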


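At this point, suspect asynchronous loading. Go back to the first page, open the developer tools, and switch to the XHR panel. Then click through the pages (2, 3, 4, ...): each click adds new entries to the XHR panel, among them studycourse.json, whose response body is the course information. Comparing the individual studycourse.json requests shows that they differ in their request payload parameters. Each response holds one batch of 50 courses, and there are 13 of them in total. So to get the courses, POST the payload parameters to https://study.163.com/p/search/studycourse.json and read the course information from the response.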


Before scraping, here is a quick tour of the fields that matter: Request URL, Request Method, and Status Code under General; Content-Type under Response Headers; Accept, Content-Type, Cookie, Referer, User-Agent, and so on under Request Headers; and finally everything under Form Data / Request Payload.

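  • Request URL and Request Method under General determine which resource you access and which HTTP method you use.
  • Content-Type under Response Headers tells you the format and encoding of the returned data.
  • Accept, Content-Type, Cookie, Referer, User-Agent, and the other Request Headers describe your client (the browser). Cookie holds the login state that the browser stored locally after you signed in; sending it helps keep the scraper from being rejected repeatedly. Not all of these headers necessarily need to be sent.
  • Form Data / Request Payload is the most important part: it carries the parameters a POST request needs to locate the data. The course listing spans many pages in the browser, yet every page hits the same address (the Request URL under General); what actually switches pages is the data in Form Data / Request Payload.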

How to do it in practice

1. Load the required R packages; install any that are missing first.

# Only the packages the code below actually uses
library("httr")      # HTTP requests: POST(), add_headers(), content()
library("jsonlite")  # toJSON() / fromJSON()
library("dplyr")     # select(), all_of() and the %>% pipe

2. Build the URL, taken from the request observed in the XHR panel.

url <- 'https://study.163.com/p/search/studycourse.json'

3. Build the request headers, filled in from the Request Headers section of the developer tools.

# Paste the Cookie header from your own logged-in browser session as ONE line
# (no line breaks inside the string). Session cookies are account-specific and
# expire, so no real value is shown here.
mycookie <- '<your Cookie header value>'
# Compare several requests and keep the headers that stay the same between them
myheaders <- c('accept' ='application/json',
               'accept-encoding' = 'gzip, deflate, br',
               'accept-language' = 'zh-CN,zh;q=0.9',
               'content-type' = 'application/json',
               # session-specific; in the original capture this equals the NTESSTUDYSI cookie value
               'edu-script-token' = '858bed85fc9541a18c43cd5a6934f38e',
               'origin' = 'https://study.163.com',
               'referer' = 'https://study.163.com/category/480000003131009',
               'sec-fetch-dest' = 'empty',
               'sec-fetch-mode' = 'cors',
               'sec-fetch-site' = 'same-origin',
               'user-agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
               'cookie' = mycookie)

4. Build the request body from the Request Payload section. Comparing requests shows that pageIndex and relativeOffset change in a regular way: for page i, "pageIndex" = i and "relativeOffset" = 50*(i-1). The code below is the Request Payload for the first page.

mypayload <- list("pageIndex"= 1,
                  "pageSize"= 50,
                  "relativeOffset"= 0,
                  "frontCategoryId"= "480000003131009",
                  "searchTimeType"= -1,
                  "orderType"= 50,
                  "priceType"= -1,
                  "activityId"= 0,
                  "keyword"= "")
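The paging rule can be wrapped in a small function so that the payload for any page is built the same way. This is a sketch: build_payload and its argument names are hypothetical, while the field names and constant values are the ones from the first-page payload above.

```r
# Hypothetical helper: payload for page `page`, following
# pageIndex = i and relativeOffset = pageSize * (i - 1)
build_payload <- function(page, page_size = 50,
                          category_id = "480000003131009") {
  stopifnot(page >= 1)
  list(pageIndex = page,
       pageSize = page_size,
       relativeOffset = page_size * (page - 1),
       frontCategoryId = category_id,
       searchTimeType = -1,
       orderType = 50,
       priceType = -1,
       activityId = 0,
       keyword = "")
}

# Offsets for the first three pages
offsets <- vapply(1:3, function(p) build_payload(p)$relativeOffset, numeric(1))
print(offsets)
```

build_payload(1) reproduces the first-page payload shown above; only pageIndex and relativeOffset differ for later pages.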

5. Request the first page with httr's POST function. Which function to use is determined by the Request Method shown under General.

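# General shape of a POST call (angle brackets mark placeholders, not runnable code):
# POST(url,
#      add_headers(.headers = <named character vector of request headers>),
#      set_cookies(.cookies = <named character vector of your cookies>),
#      body = <parameters from Form Data / Request Payload>,
#      encode = <"multipart", "form", "json" or "raw">,
#      timeout(<maximum request time in seconds>),
#      use_proxy(<proxy host>, <proxy port>), ...)

A POST request is more involved than a GET: its parameters must be carried in the request body, and the body must be encoded the way the request's Content-Type header declares.
There are four common encodings (the first two are seen most often); each maps to a value of encode:
application/x-www-form-urlencoded   ->  form
application/json                    ->  json
multipart/form-data                 ->  multipart
text/xml                            ->  raw (you must set content_type() yourself)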
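The mapping above can be written down as a tiny lookup, which is handy when you copy the Content-Type from the developer tools and want the matching encode value. encode_for is a hypothetical helper, not an httr function, and falling back to "raw" for anything unrecognised is an assumption of this sketch.

```r
# Hypothetical helper: map a request Content-Type to httr's encode argument
encode_for <- function(content_type) {
  # Drop parameters such as ";charset=UTF-8" and normalise case
  media_type <- tolower(trimws(strsplit(content_type, ";", fixed = TRUE)[[1]][1]))
  switch(media_type,
         "application/x-www-form-urlencoded" = "form",
         "application/json" = "json",
         "multipart/form-data" = "multipart",
         "raw")  # e.g. text/xml: send as-is and set content_type() yourself
}

print(encode_for("application/json;charset=UTF-8"))
print(encode_for("text/xml"))
```

For the request in this post the Content-Type header is application/json, so the body is sent with encode = "json".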

With the usage of POST in hand, send the actual request. The server responds with JSON that contains the course data.

response <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
# The parsed body is a list of 4 elements: take the 3rd, then its 2nd element,
# which holds the 50 course records (still a list at this point)
# Round-trip through toJSON() and fromJSON() to turn that list into a data frame
result <- response %>% content() %>% `[[`(3) %>% `[[`(2) %>% toJSON() %>% fromJSON(simplifyDataFrame = TRUE)
colnames(result)  # see which fields are available
usefulname <- c("productId","courseId","productName","lectorName","provider","score","scoreLevel","learnerCount","originalPrice","discountPrice","discountRate","description")
# all_of() makes it explicit that usefulname is a vector of column names
result <- result %>% select(all_of(usefulname))
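Positional indexing breaks silently if the response layout changes, so it can help to check the shape before digging in. The sketch below is one possible way to do that: fetch_courses and extract_courses are hypothetical helpers, the httr calls are written with the httr:: prefix so nothing runs until you call fetch_courses yourself, and the field names in mock_parsed are invented purely to mimic the "4 elements, take the 3rd, then its 2nd" layout described above.

```r
# Hypothetical helper: pull the course list out of an already parsed response,
# failing loudly if the layout is not the one this post relies on
extract_courses <- function(parsed) {
  if (length(parsed) < 3 || length(parsed[[3]]) < 2) {
    stop("Unexpected response layout: cannot find the course list")
  }
  parsed[[3]][[2]]
}

# Hypothetical wrapper around the request; not executed in this sketch
fetch_courses <- function(url, headers, payload) {
  resp <- httr::POST(url, httr::add_headers(.headers = headers),
                     body = payload, encode = "json")
  httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx
  extract_courses(httr::content(resp))
}

# Mock of a parsed response; all field names here are made up for illustration
mock_parsed <- list(code = 0, message = "ok",
                    result = list(query = list(),
                                  list = list(list(productId = 1),
                                              list(productId = 2))),
                    uuid = "x")
courses <- extract_courses(mock_parsed)
print(length(courses))
```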

The first page yields 50 course records.

6. Fetch all pages (13 in total) in a loop.

myfullresult <- list()
for (i in 1:13) {
  print(paste0("Fetching page ", i))
  # Same payload as before; only pageIndex and relativeOffset depend on the page
  mypayload <- list("pageIndex"= i,
                    "pageSize"= 50,
                    "relativeOffset"= 50*(i-1),
                    "frontCategoryId"= "480000003131009",
                    "searchTimeType"= -1,
                    "orderType"= 50,
                    "priceType"= -1,
                    "activityId"= 0,
                    "keyword"= "")
  web <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
  # Course records of this page (a list of lists)
  myresult <- web %>% content() %>% `[[`(3) %>% `[[`(2)
  # Append this page's records to the overall list
  myfullresult <- c(myfullresult, myresult)
}
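Hard-coding 13 pages works for this snapshot of the site, but the number of courses changes over time. One alternative, sketched below, is to keep requesting pages until one comes back with fewer than pageSize records. fetch_all is a hypothetical helper, and fake_fetch merely stands in for the real POST call so the logic can be tried without touching the network.

```r
# Hypothetical helper: call fetch_page(1), fetch_page(2), ... until a short
# (or empty) page signals that there is nothing more to fetch
fetch_all <- function(fetch_page, page_size = 50, max_pages = 100) {
  all_courses <- list()
  for (page in seq_len(max_pages)) {
    courses <- fetch_page(page)
    all_courses <- c(all_courses, courses)
    if (length(courses) < page_size) break
  }
  all_courses
}

# Stand-in for the real request: three pages holding 50, 50 and 27 records
fake_sizes <- c(50, 50, 27)
fake_fetch <- function(page) {
  if (page > length(fake_sizes)) return(list())
  lapply(seq_len(fake_sizes[page]),
         function(k) list(productId = (page - 1) * 50 + k))
}

courses <- fetch_all(fake_fetch)
print(length(courses))
```

In real use, fetch_page would wrap the POST call from the loop above and return the course list of one page; max_pages is only a safety cap.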

7. Then tidy the data and write it to a local file.

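# The collected data is a list; convert it to a data frame and keep the wanted columns.
mydata <- do.call(rbind, myfullresult) %>% as.data.frame() %>% select(all_of(usefulname))
# mydata is a data frame, but each column is still a list (because the raw data
# contains many NULL values). Replace every NULL with NA first, so that each
# column of mydata can then be turned into a plain vector.

# Replace NULL values with NA
for (j in seq_along(mydata)) {
  for (i in seq_len(nrow(mydata))) {
    if (is.null(mydata[i, j][[1]])) {
      mydata[i, j][[1]] <- NA
    }
  }
}

# Convert every list column to a vector
for (i in usefulname) {
  mydata[[i]] <- mydata[[i]] %>% unlist()
}

# Remove duplicate rows and save
mydata <- unique(mydata)
write.csv(mydata, file = "course.csv")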
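The NULL-to-NA step can also be done column by column without the nested loop. The sketch below works directly on a list of records like the one the loop collects; flatten_column and records_to_df are hypothetical helpers, and the two sample records are made up for illustration.

```r
# Hypothetical helper: turn a list of length-1 values (possibly NULL) into a vector
flatten_column <- function(col) {
  col[vapply(col, is.null, logical(1))] <- list(NA)  # NULL -> NA keeps the length
  unlist(col)
}

# Hypothetical helper: build a data frame from a list of records,
# keeping only `fields`; absent or NULL fields become NA
records_to_df <- function(records, fields) {
  cols <- lapply(fields, function(f) {
    flatten_column(lapply(records, function(r) r[[f]]))
  })
  names(cols) <- fields
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up records; the second one has no lectorName
records <- list(list(productId = 1, lectorName = "a"),
                list(productId = 2, lectorName = NULL))
df <- records_to_df(records, c("productId", "lectorName"))
print(df)
```

Replacing NULL before unlist() matters because unlist() silently drops NULL elements, which would leave the column shorter than the number of rows.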

In total there are 627 course records.

Finally, here is the complete code in one place:

# Only the packages the code below actually uses
library("httr")      # HTTP requests: POST(), add_headers(), content()
library("jsonlite")  # toJSON() / fromJSON()
library("dplyr")     # select(), all_of() and the %>% pipe

# Build the URL, taken from the request observed in the XHR panel
url <- 'https://study.163.com/p/search/studycourse.json'
# Build the request headers
# Paste the Cookie header from your own logged-in browser session as ONE line
# (no line breaks inside the string). Session cookies are account-specific and
# expire, so no real value is shown here.
mycookie <- '<your Cookie header value>'
# Compare several requests and keep the headers that stay the same between them
myheaders <- c('accept' ='application/json',
               'accept-encoding' = 'gzip, deflate, br',
               'accept-language' = 'zh-CN,zh;q=0.9',
               'content-type' = 'application/json',
               # session-specific; in the original capture this equals the NTESSTUDYSI cookie value
               'edu-script-token' = '858bed85fc9541a18c43cd5a6934f38e',
               'origin' = 'https://study.163.com',
               'referer' = 'https://study.163.com/category/480000003131009',
               'sec-fetch-dest' = 'empty',
               'sec-fetch-mode' = 'cors',
               'sec-fetch-site' = 'same-origin',
               'user-agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
               'cookie' = mycookie)
# Build the request body (payload) as a list; note the parameters that change regularly
mypayload <- list("pageIndex"= 1,
                  "pageSize"= 50,
                  "relativeOffset"= 0,
                  "frontCategoryId"= "480000003131009",
                  "searchTimeType"= -1,
                  "orderType"= 50,
                  "priceType"= -1,
                  "activityId"= 0,
                  "keyword"= "")

# Request the first page with httr's POST function
response <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
# The parsed body is a list of 4 elements: take the 3rd, then its 2nd element,
# which holds the 50 course records (still a list at this point)
# Round-trip through toJSON() and fromJSON() to turn that list into a data frame
result <- response %>% content() %>% `[[`(3) %>% `[[`(2) %>% toJSON() %>% fromJSON(simplifyDataFrame = TRUE)
colnames(result)  # see which fields are available
usefulname <- c("productId","courseId","productName","lectorName","provider","score","scoreLevel","learnerCount","originalPrice","discountPrice","discountRate","description")
result <- result %>% select(all_of(usefulname))

# Fetch all pages (13 in total) in a loop
myfullresult <- list()
for (i in 1:13) {
  print(paste0("Fetching page ", i))
  mypayload <- list("pageIndex"= i,
                    "pageSize"= 50,
                    "relativeOffset"= 50*(i-1),
                    "frontCategoryId"= "480000003131009",
                    "searchTimeType"= -1,
                    "orderType"= 50,
                    "priceType"= -1,
                    "activityId"= 0,
                    "keyword"= "")
  web <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
  myresult <- web %>% content() %>% `[[`(3) %>% `[[`(2)
  myfullresult <- c(myfullresult, myresult)
}

# The collected data is a list; convert it to a data frame and keep the wanted columns.
mydata <- do.call(rbind, myfullresult) %>% as.data.frame() %>% select(all_of(usefulname))
# mydata is a data frame, but each column is still a list (because the raw data
# contains many NULL values). Replace every NULL with NA first, so that each
# column of mydata can then be turned into a plain vector.

# Replace NULL values with NA
for (j in seq_along(mydata)) {
  for (i in seq_len(nrow(mydata))) {
    if (is.null(mydata[i, j][[1]])) {
      mydata[i, j][[1]] <- NA
    }
  }
}

# Convert every list column to a vector
for (i in usefulname) {
  mydata[[i]] <- mydata[[i]] %>% unlist()
}

# Remove duplicate rows and save
mydata <- unique(mydata)
write.csv(mydata, file = "course.csv")

This post covered POST-request scraping with httr; the next one will cover GET-request scraping with httr.

Reference: https://cloud.tencent.com/developer/article/1092893
