Web Scraping in R: Crawling cnblog Blog Posts

Source: Web Scraping in R: Crawling cnblog Blog Posts

a). Load the R packages used

## Load the packages needed in this case
library(proto)
library(gsubfn)
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
##   Reason: image not found
## Could not load tcltk.  Will use slower R code instead.
library(bitops)
library(rvest)
library(stringr)
library(DBI)
library(RSQLite)
library(sqldf)
library(RCurl)
library(ggplot2)
library(sp)
library(raster)
## Our computers usually run in a Chinese locale, but I want weekday names such as Monday and Tuesday,
## so this setting tells the system to use the English (North American) conventions.
Sys.setlocale("LC_TIME", "C")
## [1] "C"
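To see what the locale setting changes, here is a minimal sketch that computes a weekday name under the "C" time locale. The date is an arbitrary illustration, not scraped data, and the variable names are made up for this example.

```r
# Minimal sketch: force English weekday names regardless of the system language.
# The date below is an arbitrary illustration, not taken from the scraped data.
old_time_locale <- Sys.getlocale("LC_TIME")   # remember the current setting
Sys.setlocale("LC_TIME", "C")                 # "C" yields English day and month names

example_date <- as.POSIXlt("2017-07-03 09:30", tz = "UTC")
example_weekday <- weekdays(example_date)     # English name under the "C" locale
print(example_weekday)

Sys.setlocale("LC_TIME", old_time_locale)     # restore the original setting
```

Restoring the previous locale at the end is optional; the original code simply leaves "C" in place for the rest of the session.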

b). Define a custom function that will be used later to scrape the information.

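Note from my own run: entering www.cnblogs.com/p exactly as in the original code works, but changing it to www.cnblogs.com produces an error.
As for the package installation problem, simply repeating the step once more loaded the package successfully.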

The downloaded binary packages are in
	/var/folders/mn/sbfbhyln5rdddk4rp42mflpw0000gn/T//RtmpQ3p1jK/downloaded_packages

Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found

> library(data.table)   # needed for the rbindlist function
data.table 1.10.4
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com

> install.packages("data.table")
trying URL 'https://mirror.lzu.edu.cn/CRAN/bin/macosx/el-capitan/contrib/3.4/data.table_1.10.4.tgz'
Content type 'application/octet-stream' length 1436950 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
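Rather than retrying by hand, one option is to check which packages are missing before calling library(). The sketch below is only an illustration: missing_packages is a hypothetical helper that is not part of the original post, and the package list is a small sample of the ones loaded above.

```r
# Hypothetical helper (not part of the original post): report which packages
# from a character vector are not installed yet.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE or FALSE instead of raising an error
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("rvest", "stringr", "data.table")
to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install; needs network access
}
```

The install call is left commented out so the check itself has no side effects beyond loading namespaces that are already present.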

》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》》






## Create a function; the parameter 'i' is the page number.
getdata <- function(i){
  url <- paste0("www.cnblogs.com/p",i) ## generate the url
  combined_info <- url%>%html_session()%>%html_nodes("div.post_item div.post_item_foot")%>%html_text()%>%strsplit(split="\r\n")
  post_date <- sapply(combined_info, function(v) return(v[3]))%>%str_sub(9,24)%>%as.POSIXlt() ## get the date
  post_year <- post_date$year+1900
  post_month <- post_date$mon+1
  post_day <- post_date$mday
  post_hour <- post_date$hour
  post_weekday <- weekdays(post_date)
  title <- url%>%html_session()%>%html_nodes("div.post_item h3")%>%html_text()%>%as.character()%>%trim()
  link <- url%>%html_session()%>%html_nodes("div.post_item a.titlelnk")%>%html_attr("href")%>%as.character()
  author <- url%>%html_session()%>%html_nodes("div.post_item a.lightblue")%>%html_text()%>%as.character()%>%trim()
  author_hp <- url%>%html_session()%>%html_nodes("div.post_item a.lightblue")%>%html_attr("href")%>%as.character()
  recommendation <- url%>%html_session()%>%html_nodes("div.post_item span.diggnum")%>%html_text()%>%trim()%>%as.numeric()
  article_view <- url%>%html_session()%>%html_nodes("div.post_item span.article_view")%>%html_text()%>%str_sub(4,20)
  article_view <- gsub(")","",article_view)%>%trim()%>%as.numeric()
  article_comment <- 
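The date handling inside getdata can be tried without any network access. The sketch below walks through the same steps (split on "\r\n", take the third piece, keep characters 9 to 24, convert to POSIXlt) on a made-up footer string; the string's content and its 8-character prefix are assumptions chosen only so that the positions line up, and base R's substr stands in for str_sub.

```r
# Made-up stand-in for html_text() of one "div.post_item_foot" node.
# The real page text may differ; only the character positions matter here.
footer_text <- "\r\nexample_author\r\nposted: 2017-07-03 21:45\r\ncomments(0) views(12)"

pieces <- strsplit(footer_text, split = "\r\n")   # list holding one character vector
raw_date <- sapply(pieces, function(v) v[3])      # third piece holds the timestamp
date_text <- substr(raw_date, 9, 24)              # base R stand-in for str_sub(9, 24)
post_date <- as.POSIXlt(date_text, tz = "UTC")

post_year  <- post_date$year + 1900  # $year counts years since 1900
post_month <- post_date$mon + 1      # $mon runs from 0 (January) to 11
post_day   <- post_date$mday         # day of the month
post_hour  <- post_date$hour         # hour of the day, 0 to 23

print(c(year = post_year, month = post_month, day = post_day, hour = post_hour))
```

The offsets 1900 and 1 are needed because POSIXlt stores years relative to 1900 and months starting from zero.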
