A small RCurl application: scraping JD.com comments

A small crawler built with the RCurl package that scrapes the comments on an electric water heater from JD.com.

# Use RCurl to scrape the comments on an electric water heater from JD.com pages
library(RCurl)
library(XML)
library(plyr)


# URLs of the JD.com pages to scrape; there are 56 pages in total
page <- 1:56
urlist <- paste("http://club.jd.com/allconsultations/1121567-",page,"-1.html",sep="")
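A quick sanity check on the generated URLs (the printed output below is simply what these paste() calls produce):

head(urlist, 2)
# [1] "http://club.jd.com/allconsultations/1121567-1-1.html"
# [2] "http://club.jd.com/allconsultations/1121567-2-1.html"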

# Fake the request headers so the requests look like they come from a browser
myheader=c("User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
           "Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Language"="en-us",
           "Connection"="keep-alive",
           "Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7"
)



# Download the pages (getURL returns one HTML string per URL)
webpage = getURL(urlist,httpheader=myheader,.encoding='utf-8')
# Parse the downloaded HTML
pagetree = htmlParse(webpage,encoding='utf-8')
# Use XPath expressions to pull the text out of the relevant nodes
time = xpathSApply(pagetree,"//div[@class='r_info']",xmlValue)
ask = xpathSApply(pagetree,"//dl[@class='ask']/dd/a",xmlValue)
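Note that getURL() is called on all 56 URLs at once, so webpage is a character vector with one HTML document per page, and everything is then parsed in one go. A per-page alternative (a minimal sketch, not the original approach; it assumes the two XPath queries match the same number of nodes on every page) would parse each page separately and stack the results with plyr:

crawl_page <- function(u) {
  # download and parse a single page, then extract the two fields
  doc <- htmlParse(getURL(u, httpheader = myheader, .encoding = 'utf-8'),
                   encoding = 'utf-8')
  data.frame(time = xpathSApply(doc, "//div[@class='r_info']", xmlValue),
             ask  = xpathSApply(doc, "//dl[@class='ask']/dd/a", xmlValue),
             stringsAsFactors = FALSE)
}
# raw <- ldply(urlist, crawl_page)   # one data frame, 56 pages stacked row-wise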


# Work around garbled Chinese characters
ask = iconv(ask,"utf-8","LATIN1")
# Minor data cleanup
# keep only the last 19 characters of each r_info string, i.e. "YYYY-MM-DD HH:MM:SS"
time = laply(time,function(x){
  unlist(substring(x,nchar(x)-18,nchar(x)))
})
# take the second line of the node text, which holds the question itself
ask = laply(ask,function(x){
  unlist(strsplit(x,'\n'))[2]
})
ask = gsub(" ","",ask)
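As a tiny illustration of the timestamp trimming above (the sample string is made up): substring(x, nchar(x) - 18, nchar(x)) keeps the last 19 characters, which is exactly the length of a "YYYY-MM-DD HH:MM:SS" timestamp.

x <- "some prefix text 2014-05-27 11:09:20"   # hypothetical raw r_info value
substring(x, nchar(x) - 18, nchar(x))         # "2014-05-27 11:09:20"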

# Assemble the two columns into a data frame
data = data.frame("时间"=time,"内容"=ask)
# Export the data
write.csv(data,"评论数据.csv")
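On Windows the CSV is written in the native locale encoding by default; if UTF-8 output is wanted, write.csv also accepts a fileEncoding argument (an optional tweak, same file name as above):

write.csv(data, "评论数据.csv", row.names = FALSE, fileEncoding = "UTF-8")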

The scraped data ends up looking like this:

                  时间                               内容
403 2014-05-27 11:09:20                           有线控么
404 2014-05-24 20:55:29             烧水一次断电保温多久?
405 2014-05-23 12:50:41                         几级能效的
406 2014-05-23 12:00:30                       几级能效的?
407 2014-05-20 14:48:49                   此款有木有防电墙
408 2014-05-13 09:54:47 热水器以后是京东负责维修还是海尔?

