Using the rvest Package in R

Several R packages can scrape web pages, but rvest combined with CSS selectors is the most convenient.


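The browser's element inspector shows at once that the table data sits in nodes such as td:nth-child(1) and td:nth-child(3), so it can be extracted directly in code.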
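To make the selector idea concrete, here is a minimal sketch that runs offline. The inline HTML table is a made-up stand-in for the real page (an assumption, not its actual markup); it only illustrates how td:nth-child(n) picks the n-th cell of every row.

```r
library(rvest)

# Hypothetical miniature table; the real page's markup may differ.
page <- read_html(paste0(
  '<table>',
  '<tr><td>1</td><td>(5)</td><td>Silver Linings Playbook</td></tr>',
  '<tr><td>2</td><td>(back)</td><td>Argo (DVDscr)</td></tr>',
  '</table>'))

# td:nth-child(1) matches every <td> that is the first child of its row
ranks <- page %>% html_nodes("td:nth-child(1)") %>% html_text()

# td:nth-child(3) matches every <td> that is the third child of its row
titles <- page %>% html_nodes("td:nth-child(3)") %>% html_text()
```

Each selector returns one value per matching row, so a column of the table becomes a character vector.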

library(rvest)

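First, take a look at what the page returns. (In rvest 1.0.0 and later, html_session() is deprecated in favor of session(); the call below is kept as originally written.)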

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak

<session> http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/

 Status: 200

 Type:   text/html; charset=UTF-8

 Size:   24983

 

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

[1] "Silver Linings Playbook "          

[2] "The Hobbit: An Unexpected Journey "

[3] "Life of Pi (DVDscr/DVDrip)"        

[4] "Argo (DVDscr)"                    

[5] "Identity Thief "                  

[6] "Red Dawn "                        

[7] "Rise Of The Guardians (DVDscr)"    

[8] "Django Unchained (DVDscr)"        

[9] "Lincoln (DVDscr)"                  

[10] "Zero Dark Thirty "  

 

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
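The rank column is indexed with .[2:11] rather than .[1:10]. Judging from the html_table() output further down in this post, the table seems to begin with a single-cell banner row, which td:nth-child(1) also matches. The sketch below reproduces that offset on a made-up table; the markup is an assumption, not the page's real HTML.

```r
library(rvest)

# Hypothetical table with a one-cell banner row ahead of the data rows.
page <- read_html(paste0(
  '<table>',
  '<tr><td>torrentfreak.com</td></tr>',
  '<tr><td>1</td><td>(5)</td><td>Silver Linings Playbook</td></tr>',
  '<tr><td>2</td><td>(back)</td><td>The Hobbit: An Unexpected Journey</td></tr>',
  '</table>'))

# The banner cell is a first child too, so it shows up in this column
first_col <- page %>% html_nodes("td:nth-child(1)") %>% html_text()

# The banner row has no third cell, so this column has no extra entry
third_col <- page %>% html_nodes("td:nth-child(3)") %>% html_text()

# Dropping the first element realigns the ranks with the titles
ranks <- first_col[-1]
```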

 

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

[1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer"

[5] "8.2 / trailer" "5.3 / trailer" "7.5 / trailer" "8.8 / trailer"

[9] "8.2 / trailer" "7.6 / trailer"

 

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

[1] "http://www.imdb.com/title/tt1045658/"

[2] "http://www.imdb.com/title/tt0903624/"

[3] "http://www.imdb.com/title/tt0454876/"

[4] "http://www.imdb.com/title/tt1024648/"

[5] "http://www.imdb.com/title/tt2024432/"

[6] "http://www.imdb.com/title/tt1234719/"

[7] "http://www.imdb.com/title/tt1446192/"

[8] "http://www.imdb.com/title/tt1853728/"

[9] "http://www.imdb.com/title/tt0443272/"

[10] "http://www.imdb.com/title/tt1790885/?"
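The selector a[href*='imdb'] keeps only links whose href contains the substring "imdb", and html_attr("href") reads the attribute instead of the link text. A small offline sketch, assuming a cell that holds both an IMDb link and a trailer link (the trailer URL is a placeholder invented for illustration):

```r
library(rvest)

# Hypothetical row: the fourth cell contains two links.
page <- read_html(paste0(
  '<table><tr>',
  '<td>1</td><td>(5)</td><td>Silver Linings Playbook</td>',
  '<td><a href="http://www.imdb.com/title/tt1045658/">7.4</a> / ',
  '<a href="http://example.com/trailer">trailer</a></td>',
  '</tr></table>'))

# html_text() flattens the whole cell, link text included
cell_text <- page %>% html_nodes("td:nth-child(4)") %>% html_text()

# [href*='imdb'] filters on a substring of the href attribute
links <- page %>%
  html_nodes("td:nth-child(4) a[href*='imdb']") %>%
  html_attr("href")
```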

# Build the data frame

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],stringsAsFactors=FALSE)

                               movie rank        rating                              imdb.url

1            Silver Linings Playbook     1 7.4 / trailer  http://www.imdb.com/title/tt1045658/

2  The Hobbit: An Unexpected Journey     2 8.2 / trailer  http://www.imdb.com/title/tt0903624/

3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer  http://www.imdb.com/title/tt0454876/

4                       Argo (DVDscr)    4 8.2 / trailer  http://www.imdb.com/title/tt1024648/

5                     Identity Thief     5 8.2 / trailer  http://www.imdb.com/title/tt2024432/

6                           Red Dawn     6 5.3 / trailer  http://www.imdb.com/title/tt1234719/

7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer  http://www.imdb.com/title/tt1446192/

8           Django Unchained (DVDscr)    8 8.8 / trailer  http://www.imdb.com/title/tt1853728/

9                    Lincoln (DVDscr)    9 8.2 / trailer  http://www.imdb.com/title/tt0443272/

10                  Zero Dark Thirty    10 7.6 / trailer  http://www.imdb.com/title/tt1790885/?
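The same data frame can be assembled with less repetition. The sketch below is one possible tidy-up, not the original author's code: col_text() is a helper invented here, the two-row table is a made-up stand-in for the real page, and trimws() removes the trailing spaces visible in some titles above.

```r
library(rvest)

# Hypothetical two-row stand-in for the real table (banner row included).
page <- read_html(paste0(
  '<table>',
  '<tr><td>torrentfreak.com</td></tr>',
  '<tr><td>1</td><td>(5)</td><td>Silver Linings Playbook </td>',
  '<td><a href="http://www.imdb.com/title/tt1045658/">7.4</a> / trailer</td></tr>',
  '<tr><td>2</td><td>(back)</td><td>The Hobbit: An Unexpected Journey </td>',
  '<td><a href="http://www.imdb.com/title/tt0903624/">8.2</a> / trailer</td></tr>',
  '</table>'))

# col_text() is a helper invented for this sketch, not part of rvest.
col_text <- function(doc, css) {
  doc %>% html_nodes(css) %>% html_text()
}

movies <- data.frame(
  movie    = trimws(col_text(page, "td:nth-child(3)")),         # strip trailing spaces
  rank     = as.integer(col_text(page, "td:nth-child(1)")[-1]), # skip the banner cell
  rating   = col_text(page, "td:nth-child(4)"),
  imdb.url = page %>%
    html_nodes("td:nth-child(4) a[href*='imdb']") %>%
    html_attr("href"),
  stringsAsFactors = FALSE
)
```

On the live page, `page` would be the `freak` session from above, and each column would still need the `.[1:10]` style subsetting if the table has more rows than wanted.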

 

If you don't need the URLs, there is an even simpler way:

 

freak %>% html_nodes("table") %>% html_table()

[[1]]

           Ranking (last week)                             Movie IMDb Rating / Trailer

1  torrentfreak.com        <NA>                              <NA>                  <NA>

2                 1         (5)           Silver Linings Playbook         7.4 / trailer

3                 2      (back) The Hobbit: An Unexpected Journey         8.2 / trailer

4                 3         (9)        Life of Pi (DVDscr/DVDrip)         8.3 / trailer

5                 4      (back)                     Argo (DVDscr)         8.2 / trailer

6                 5         (…)                    Identity Thief         8.2 / trailer

7                 6         (1)                          Red Dawn         5.3 / trailer

8                 7         (2)    Rise Of The Guardians (DVDscr)         7.5 / trailer

9                 8         (4)         Django Unchained (DVDscr)         8.8 / trailer

10                9         (6)                  Lincoln (DVDscr)         8.2 / trailer

11               10      (back)                  Zero Dark Thirty         7.6 / trailer
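Because html_table() is applied to a set of nodes, it returns a list with one data frame per matched table, which is why the output starts with [[1]]. A minimal offline sketch, using a made-up table with a proper header row (an assumption; the real page's markup may differ):

```r
library(rvest)

# Hypothetical table with <th> header cells.
page <- read_html(paste0(
  '<table>',
  '<tr><th>Ranking</th><th>Movie</th><th>Rating</th></tr>',
  '<tr><td>1</td><td>Silver Linings Playbook</td><td>7.4</td></tr>',
  '<tr><td>2</td><td>The Hobbit: An Unexpected Journey</td><td>8.2</td></tr>',
  '</table>'))

# One list element per <table> found on the page
tables <- page %>% html_nodes("table") %>% html_table()

# Pick the first (here, the only) table as a data frame
top <- tables[[1]]
```

For the real page, where the first row holds the site banner, something like `tables[[1]][-1, ]` would drop that row.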
