RCurl and rvest

I wrote this a long time ago while learning web scraping in R, and I'm moving it here.

Encoding conversion

iconv(text, from = "UTF-8")  # convert text from UTF-8 to the native encoding
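As a small illustration of what the conversion does at the byte level, here is a sketch using only base R. The sample word and the variable names are made up for the example. It uses Latin-1 as the source encoding because that conversion is available everywhere; for Chinese pages one would presumably pass from = "GBK" or from = "GB2312" instead, assuming the local iconv build supports them.

```r
# Build the Latin-1 bytes for "café" by hand (0xe9 is "é" in Latin-1),
# so the example does not depend on the encoding of this script file.
raw_latin1 <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
text_latin1 <- rawToChar(raw_latin1)

# Convert from Latin-1 to UTF-8.
text_utf8 <- iconv(text_latin1, from = "latin1", to = "UTF-8")

# In UTF-8, "é" takes two bytes (0xc3 0xa9), so the string grows by one byte.
utf8_bytes <- charToRaw(text_utf8)
print(utf8_bytes)
```

If the declared source encoding is wrong, iconv() tends to return NA or mangled characters, so it is worth checking the page's actual charset first.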

Method 1: using RCurl

Prerequisites: regular expressions / XML

install.packages("RCurl")
install.packages("XML")
library(RCurl)
library(XML)

myHttpheader <- c(
"User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
"Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language"="en-us",
"Connection"="keep-alive",
"Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7")

url <- "https://book.douban.com/top250?icn=index-book250-all"

webpage <- getURL(url,httpheader=myHttpheader,.encoding="UTF-8")

pagetree <- htmlTreeParse(webpage,encoding="UTF-8", error=function(...){}, useInternalNodes = TRUE,trim=TRUE)

node<-getNodeSet(pagetree, "//p[@class='pl']/text()")
info<-sapply(node,xmlValue)
info

node
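To try the XPath step without hitting the network, the same kind of query can be run against an inline HTML string. This is only a sketch: the HTML fragment, its contents, and the variable names are invented for the example and are not taken from the real page.

```r
library(XML)

# A tiny stand-in for the downloaded page (hypothetical content).
sample_html <- '<html><body>
<p class="pl">Author A / Press A / 2001</p>
<p class="pl">Author B / Press B / 2002</p>
<p class="other">not selected</p>
</body></html>'

# asText = TRUE tells htmlParse that the input is markup, not a file name.
sample_tree <- htmlParse(sample_html, asText = TRUE, encoding = "UTF-8")

# xpathSApply runs the XPath query and applies xmlValue to every match,
# which combines getNodeSet() and sapply() into one call.
sample_info <- xpathSApply(sample_tree, "//p[@class='pl']", xmlValue)
print(sample_info)
```

Only the two paragraphs whose class attribute is exactly "pl" are selected; the third one is skipped.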

Method 2: using rvest

Prerequisites: CSS selectors / XPath

install.packages("rvest")
library(rvest)
web<-read_html("https://book.douban.com/top250?icn=index-book250-all",encoding="UTF-8")

position<-web %>% html_nodes("p.pl") %>% html_text()

position
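Since both CSS selectors and XPath are listed as prerequisites, here is a minimal sketch showing that html_nodes() accepts either one. The inline HTML is a made-up stand-in for the real page, and the variable names are hypothetical.

```r
library(rvest)

# Parse a small inline document instead of downloading the page.
sample_page <- read_html('<html><body>
  <p class="pl">Author A / Press A</p>
  <div><span class="pl">(100 ratings)</span></div>
  <p class="pl">Author B / Press B</p>
</body></html>')

# CSS selector: <p> elements with class "pl".
by_css <- sample_page %>% html_nodes("p.pl") %>% html_text()

# The same selection written as XPath, passed through the xpath argument.
by_xpath <- sample_page %>% html_nodes(xpath = "//p[@class='pl']") %>% html_text()

print(by_css)
print(identical(by_css, by_xpath))
```

One difference to keep in mind: the CSS form p.pl also matches elements that carry additional classes, while the XPath test @class='pl' requires the attribute to be exactly "pl".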

Book ratings

Select all ratings

position2<-web %>% html_nodes("span.pl") %>% html_text()

position2<-web %>% html_nodes("div span.pl") %>% html_text()
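The scraped rating strings usually still need cleaning before they can be used as numbers, which is where regular expressions come in. The sketch below assumes the text has a shape like "(12345 ratings)" with surrounding whitespace; the sample strings are invented, and the real page text is Chinese and may be formatted differently.

```r
# Hypothetical scraped strings, including stray whitespace and newlines.
raw_ratings <- c("(\n  12345 ratings\n)", " (678 ratings) ")

# Drop every character that is not a digit, then convert to integer.
rating_counts <- as.integer(gsub("[^0-9]", "", raw_ratings))
print(rating_counts)
```

This simple pattern only works when each string contains a single number; a string with several numbers would have its digits run together.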

Select all blurbs (two ways to write it)

position3<-web %>% html_nodes("p.quote") %>% html_text()

position3<-web %>% html_nodes("span.inq") %>% html_text()

Select all book titles

position4<-web %>% html_nodes("a[title]") %>% html_text()

position4
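Because the selector a[title] picks links by their title attribute, it may be cleaner to read that attribute directly with html_attr() instead of taking the link text, which can carry extra whitespace or nested elements. A minimal sketch on made-up HTML (the markup and variable names are assumptions, not the real page structure):

```r
library(rvest)

# Hypothetical markup: one link with a title attribute, one without.
sample_links_page <- read_html('<html><body>
  <a href="#" title="Book One">
    Book One
    <span>: a subtitle</span>
  </a>
  <a href="#">link without a title attribute</a>
</body></html>')

title_links <- sample_links_page %>% html_nodes("a[title]")

# html_attr() returns the attribute value itself.
titles_from_attr <- title_links %>% html_attr("title")

# html_text() returns the visible text, here including the nested <span>.
titles_from_text <- title_links %>% html_text(trim = TRUE)

print(titles_from_attr)
print(titles_from_text)
```

Only the first link is matched, and the attribute gives the bare title while the text also includes the subtitle fragment.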
