A Hands-On Web Scraper with RCurl

The environment for this article is CentOS Linux release 7.1.1503 (Core) with RStudio Version 0.98.1091.
Installing both RCurl and XML ran into problems.

The errors were as follows:

> install.packages("RCurl")
Installing package into ‘/home/steven/R/x86_64-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/RCurl_1.95-4.1.tar.gz'
Content type 'application/x-gzip' length 870915 bytes (850 Kb)
opened URL
==================================================
downloaded 850 Kb

* installing *source* package ‘RCurl’ ...
** package ‘RCurl’ successfully unpacked and MD5 sums checked
checking for curl-config... no
Cannot find curl-config
ERROR: configuration failed for package ‘RCurl’
* removing ‘/home/steven/R/x86_64-pc-linux-gnu-library/3.0/RCurl’
Warning in install.packages :
  installation of package ‘RCurl’ had non-zero exit status

The downloaded source packages are in
    ‘/tmp/RtmpUwBkbS/downloaded_packages’
> install.packages("XML")
Installing package into ‘/home/steven/R/x86_64-pc-linux-gnu-library/3.0’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/XML_3.98-1.1.tar.gz'
Content type 'application/x-gzip' length 1582216 bytes (1.5 Mb)
opened URL
==================================================
downloaded 1.5 Mb

* installing *source* package ‘XML’ ...
** package ‘XML’ successfully unpacked and MD5 sums checked
checking for gcc... gcc
checking for C compiler default output file name... 
rm: cannot remove 'a.out.dSYM': Is a directory
a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking how to run the C preprocessor... gcc -E
checking for sed... /bin/sed
checking for pkg-config... /usr/bin/pkg-config
checking for xml2-config... no
Cannot find xml2-config
ERROR: configuration failed for package ‘XML’
* removing ‘/home/steven/R/x86_64-pc-linux-gnu-library/3.0/XML’
Warning in install.packages :
  installation of package ‘XML’ had non-zero exit status

The downloaded source packages are in
    ‘/tmp/RtmpUwBkbS/downloaded_packages’
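The cause turned out to be missing system libraries: the configure scripts could not find curl-config and xml2-config. For RCurl the fix is: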

sudo yum -y install curl
sudo yum -y install libcurl libcurl-devel

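The XML package could not be fixed the same way, because two different versions of the libxml2 package were installed on the system. yum reported: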

Error: Multilib version problems found. This often means that the root
      cause is something else and multilib version checking is just
      pointing out that there is a problem. Eg.:
      
        1. You have an upgrade for libxml2 which is missing some
           dependency that another package requires. Yum is trying to
           solve this by installing an older version of libxml2 of the
           different architecture. If you exclude the bad architecture
           yum will tell you what the root cause is (which package
           requires what). You can try redoing the upgrade with
           --exclude libxml2.otherarch ... this should give you an error
           message showing the root cause of the problem.
      
        2. You have multiple architectures of libxml2 installed, but
           yum can only see an upgrade for one of those architectures.
           If you don't want/need both architectures anymore then you
           can remove the one with the missing update and everything
           will work.
      
        3. You have duplicate versions of libxml2 installed already.
           You can use "yum check" to get yum show these errors.
      
      ...you can also use --setopt=protected_multilib=false to remove
      this checking, however this is almost never the correct thing to
      do as something else is very likely to go wrong (often causing
      much more problems).
      
      Protected multilib versions: libxml2-2.9.1-5.el7_1.2.x86_64 != libxml2-2.9.1-5.el7_0.1.i686
Error: Protected multilib versions: xz-libs-5.1.2-9alpha.el7.i686 != xz-libs-5.1.2-8alpha.el7.x86_64

Running

yum list libxml2

showed two installed versions:

Installed Packages
libxml2.x86_64                     2.9.1-5.el7_0.1                     @updates 
libxml2.x86_64                     2.9.1-5.el7_1.2                     installed

So I removed the older duplicate version and then installed libxml2 together with libxml2-devel:

sudo yum remove libxml2-2.9.1-5.el7_0.1.x86_64

sudo yum -y install libxml2 libxml2-devel

Back in RStudio, everything now worked normally.
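Both failures came from a configure script not finding a helper program (curl-config for RCurl, xml2-config for XML), so it can be handy to check for them from R before reinstalling. The following is only a minimal sketch using base R's Sys.which; the function name missing_config_tools is made up for illustration and is not part of RCurl or XML.

```r
# Hypothetical helper (not from RCurl/XML): report which configure-time
# tools are NOT found on the PATH of the current R session.
missing_config_tools <- function(tools = c("curl-config", "xml2-config")) {
  found <- nzchar(Sys.which(tools))  # Sys.which returns "" for tools it cannot find
  tools[!found]
}

missing <- missing_config_tools()
if (length(missing) > 0) {
  message("Not found on PATH: ", paste(missing, collapse = ", "))
} else {
  message("curl-config and xml2-config are both available")
}
```

If either name is reported as missing, the corresponding -devel package is probably still not installed.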

The scraper code is as follows:

# Print the R version in use
version
library(RCurl)
library(XML)
# Lashou (拉手) Beijing movie listings
start_url <- "http://beijing.lashou.com/movies/"
# Build the request headers
cust_header <- c("User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0",
                 "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                 "Accept-Language" = "en-us",
                 "Connection" = "keep-alive")
# Fetch the page source
pagesource <- getURL(start_url, httpheader = cust_header, .encoding = "utf-8")
# Parse the total number of pages (the second-to-last link in the pager)
parseTotalPage <- function(pagesource){
  doc <- htmlParse(pagesource, asText = TRUE)
  as.numeric(sapply(getNodeSet(doc, '//div[@class="page"]/a[last()-1]/text()'), xmlValue))
}
# Parse the page content: cinema name, description, discounted price, original price
parseContent <- function(pagesource){
  doc <- htmlParse(pagesource, asText = TRUE)
  goods_name <- sapply(getNodeSet(doc, '//div[contains(@class,"goods")]//a[@class="goods-name"]//text()'), xmlValue)
  goods_text <- sapply(getNodeSet(doc, '//div[contains(@class,"goods")]//a[@class="goods-text"]//text()'), xmlValue)
  price <- sapply(getNodeSet(doc, '//div[contains(@class,"goods")]//span[@class="price"]/text()'), xmlValue)
  org_price <- sapply(getNodeSet(doc, '//div[contains(@class,"goods")]//span[@class="money"]/del/text()'), xmlValue)
  # Return the data frame explicitly and keep the text columns as character
  data.frame(goods_name, goods_text, price, org_price, stringsAsFactors = FALSE)
}
# Get the total page count and the contents of the first page
total_page <- parseTotalPage(pagesource)
pageresults <- parseContent(pagesource)
# Build the URLs for pages 2..n (empty when there is only one page,
# because 1:(total_page - 1) would count down to 0 in that case)
url_list <- if (isTRUE(total_page > 1)) {
  paste0("http://beijing.lashou.com/movies/page", 2:total_page)
} else {
  character(0)
}
# Loop over the URLs, downloading and parsing each page
for (url in url_list){
  pagesource <- getURL(url, httpheader = cust_header, .encoding = "utf-8")
  pageresult <- parseContent(pagesource)
  pageresults <- rbind(pageresults, pageresult)
}
# Write the results to a file
write.table(pageresults, "bjmovie.txt", row.names = FALSE)
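One detail in the "pages 2..n" step deserves a note: in R, an index such as 1:(n - 1) counts down to 0 when n is 1, so a listing with a single page needs a guard. Below is a small sketch of that step in isolation. The helper name build_page_urls is hypothetical, and the URL pattern is simply the one used in the scraper.

```r
# Hypothetical helper: URLs for pages 2..total_page.
# Returns an empty character vector when there is only one page
# or when the page count could not be parsed.
build_page_urls <- function(total_page,
                            base = "http://beijing.lashou.com/movies/page") {
  if (!isTRUE(total_page >= 2)) {
    return(character(0))
  }
  paste0(base, seq(2, total_page))
}

print(build_page_urls(3))  # URLs for page2 and page3
print(build_page_urls(1))  # character(0): nothing more to fetch

# Why the guard matters: this is c(1, 0), not an empty range
print(1:(1 - 1))
```

Looping over an empty vector simply does nothing, so the download loop needs no extra special case.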

The resulting bjmovie.txt looks like this:

"goods_name" "goods_text" "price" "org_price"
"【旧宫】嘉美国际影城" "单人电影票/卖品套餐,可观看2D/3D,免费停车" "26" "100"
"【朝外大街】紫光影城" "单人电影票,可观看2D/3D/VIP" "21" "100"
"【北太平庄】国安剧院" "单人电影票,可观看2D/3D" "22" "80"
"【64店通用】中影星美院线电影" "单人兑换券A券,兑换规则以影城为准" "45" "100"
"【天通苑】鲁信影城" "单人电影票,可观看2D/3D" "25" "80"
"【三里屯】美嘉欢乐影城(三里屯店)" "单人周末2D电影票,可观看2D" "66" "90"
"【劲松】劲松电影院" "单人电影票,可观看2D/3D" "26" "70"
"【60店通用】中影星美院线电影" "单人兑换券B券,兑换规则以影城为准" "35" "80"
"【顺义城区】大地数字影院(北京顺义店)" "单人电影票,可观看2D/3D" "30.5" "80"
"【68店通用】网票网" "单人3D电影票,高端设备" "61.5" "100"
"【方庄】博纳国际影城(方庄店)" "单人电影票,可观看2D/3D" "39" "120"
"【北苑家园】K酷国际影城" "单人电影票,可观看2D/3D" "35.5" "120"
"【石景山】古城电影院" "单人电影票,可观看2D/3D" "30" "60"
"【万寿路】博纳国际影城(万寿路店)" "单人电影票,可观看2D/3D" "31" "120"
"【宋家庄】正华影城" "单人电影票,所有影片无需补差价" "32" "80"
"【宋家庄】正华影城" "单人电影票,所有影片无需补差价" "32" "80"
"【亚运村】北京剧院" "单人电影票,可观看2D/3D,送可乐1杯" "25" "80"
"【王府井】横店电影城" "单人电影票/卖品套餐2选1,可看2D/3D" "27" "80"
"【24店通用】觅影连连看" "单人电影票,可观看2D" "41" "100"
"【延庆】大地数字影院(北京延庆金锣湾店)" "单人电影票,可观看2D/3D" "35.5" "50"
"【新华大街】通州电影院" "单人电影票,可观看2D/3D" "26.5" "80"
"【方庄】博纳国际影城(方庄店)" "单人卖品套餐,香脆可口" "25" "26"
"【7店通用】票友网" "单人电影票,可观看2D/3D" "22.5" "100"
"【78店通用】网票网" "单人2D电影票,高端设备" "41.5" "100"
"【北太平庄】中影电影院(小西天店)" "单人电影票,可观看2D/3D" "29.9" "50"
"【门头沟】幸福蓝海国际影城(门头沟店)" "单人电影票,可观看2D/3D" "29.9" "80"
"【21店通用】网票网" "单人2D电影票,高端设备" "31.5" "80"
"【常营】沃美影城(常营店)" "单人电影票,可观看2D/3D" "40" "75"
"【梨园】佳合时光影城" "单人电影票,可观看2D/3D" "30" "100"
"【立水桥】星环影城" "单人电影票,可观看2D/3D/限价片" "38" "120"
"【马驹桥】百尚影城" "单人电影票,可观看2D/3D" "22" "80"
"【回龙观】保利国际影城(龙旗广场店)" "单人电影票,可观看2D/3D" "40" "100"
"【13店通用】觅影连连看" "单人电影票,可观看2D" "31" "100"
"【团结湖】朝阳剧场" "单人电影票,可观看2D/3D" "21" "60"
"【朝阳】枫花园汽车影院" "平日电影票,一车一票" "61" "100"
"【新华联】大地数字影院(米拉影城)" "单人电影票,可观看2D/3D" "32" "60"
"【三里屯】美嘉欢乐影城(三里屯店)" "单人周末电影票,可观看2D/3D" "92" "130"
"【房山】良乡影剧院" "单人观影票,可观看2D/3D" "28" "40"
"【望京】大地数字影院(望京麒麟社店)" "单人电影票,可观看2D/3D" "26.5" "70"
"【房山】北京市燕山影剧院" "单人电影票,可观看2D/3D" "35" "100"
"【回龙观】沃美影城(回龙观店)" "单人电影票,2D/3D可兑" "40" "85"
"【西三旗】大地数字影院(西三旗物美店)" "单人卖品套餐,奶茶/咖啡/中可/雪碧/芬达+中爆" "15" "28"
"【欢乐谷】大地数字影院(垡头永辉店)" "单人电影票,可观看2D/3D" "32" "60"
"【平谷】平谷区影剧院" "单人电影票,可观看2D/3D" "31" "40"
"【西四】北京地质礼堂" "单人电影票,可观看2D/3D" "27" "80"
"【西四】北京地质礼堂" "单人电影票,可观看2D/3D" "27" "80"
"【望京】保利DMC影城望京店" "单人电影票,2D/3D可兑,非黄金时段专享" "27" "100"
"【马驹桥】百尚影城" "单人电影票,可观看2D/3D" "22" "80"
"【三里屯】美嘉欢乐影城(三里屯店)" "单人平日电影票,可观看2D/3D" "66" "130"
"【望京】保利DMC影城望京店" "单人电影票,2D/3D可兑,黄金时段专享" "37" "100"
"【9店通用】E票网" "单人2D电影票,高端设备" "36" "100"
"【4店通用】今典院线" "单人电影票,可观看2D/3D" "30" "70"
"【十里河】大地数字影院(十里河铭泽店)" "单人卖品餐,美味可口" "16" "25"
"【望京】大地数字影院(望京麒麟社店)" "单人卖品套餐,香脆可口" "16" "26"
"【顺义城区】大地数字影院(北京顺义店)" "单人卖品套餐,中爆+中杯汇源果汁/冰奶茶/咖啡" "15.9" "26"
"【梨园】佳合时光影城" "单人电影票,可观看2D/3D" "30" "100"
"【5店通用】E票网" "单人2D电影票,高端设备" "35" "80"
"【十里河】大地数字影院(十里河铭泽店)" "单人电影票,可观看2D/3D" "32" "60"
"【平谷】平谷区影剧院" "单人电影票,可观看2D/3D" "31" "40"
"【欢乐谷】大地数字影院(垡头永辉店)" "单人卖品套餐,香脆可口" "16" "26"
"【新华联】大地数字影院(米拉影城)" "单人卖品套餐,香脆可口" "16" "26"
"【望京】今典院线" "单人电影票,可观看2D/3D" "39" "70"
"【方庄】博纳国际影城(方庄店)" "双人卖品套餐,香脆可口" "39" "46"
"【新华联】大地数字影院(米拉影城)" "单人卖品餐,美味可口" "16" "25"
"【石景山】古城电影院" "单人电影票,可观看2D/3D" "30" "60"
"【十里河】大地数字影院(十里河铭泽店)" "单人卖品套餐,香脆可口" "16" "26"
"【欢乐谷】大地数字影院(垡头永辉店)" "单人卖品餐,美味可口" "16" "25"
"【昌平镇】大地数字影院(北京昌平菓岭假日店)" "单人卖品套餐,1中爆+1咖啡/奶茶,香脆可口" "16" "28"
"【望京】大地数字影院(望京麒麟社店)" "单人卖品套餐,奶茶/咖啡/中雪碧/可乐/芬达+中爆" "15" "28"
"【常营】沃美影城(常营店)" "单人电影票,可观看2D/3D" "40" "75"
"【立水桥】星环影城" "单人电影票,可观看2D/3D" "26" "60"
"【三里屯】美嘉欢乐影城(三里屯店)" "单人平日2D电影票,可观看2D" "46" "90"
"【延庆】大地数字影院(北京延庆金锣湾店)" "单人卖品套餐,1桶中爆+1杯冰咖啡/冰奶茶" "15" "28"
"【知春路】17.5影城" "单人电影票,可观看2D/3D" "31" "90"
"【宋家庄】正华影城" "单人夜场电影票,2D/3D可兑,3部连放无需补差价" "80" "240"

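The file is in write.table's default layout: a header row, space-separated fields, and quoted strings. That means the prices are stored as quoted text. As a rough sketch of reading such a file back and turning the prices into numbers, the example below round-trips three price pairs taken from the first rows of the output through a temporary file. The names sample_deals and discount are illustrative only.

```r
# Three price pairs copied from the first rows of the output.
sample_deals <- data.frame(price = c("26", "21", "22"),
                           org_price = c("100", "100", "80"),
                           stringsAsFactors = FALSE)

# Write with the same call shape as the scraper, then read the file back.
path <- tempfile(fileext = ".txt")
write.table(sample_deals, path, row.names = FALSE)
deals <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

# Make the price columns numeric explicitly and compute the discount rate.
deals$price <- as.numeric(deals$price)
deals$org_price <- as.numeric(deals$org_price)
deals$discount <- 1 - deals$price / deals$org_price
print(deals)
```

For the real bjmovie.txt, which contains Chinese text, an encoding argument such as fileEncoding = "UTF-8" may also be needed, depending on the locale.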



