Reading Website Files in R

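1. readLines()
    readLines() reads a web page as lines of text. The example below reads the first ten lines of the HTML home page of Université Paris Diderot (Paris 7).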
> urlinternetaddr <- 'http://www.univ-paris-diderot.fr/sc/site.php?bc=accueil&np=accueil'
> dlist1 <- readLines(urlinternetaddr, n = 10)
> dlist1
 [1] "
 [2] "\"http://www.w3.org/TR/html4/loose.dtd\">"
 [3] ""                                                                                                                                           
 [4] ""                                                                                                                                                       
 [5] ""                                         
 [6] "\t"                                                                                                                
 [7] "\t"                                                                                                                                                
 [9] "\t        "
[10] "\t  "
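Clearly, the content we want, such as admissions information, is not in these ten lines. The simplest remedy is to read more lines by changing n=10 to n=50.
> dlist2 <- readLines(urlinternetaddr, n = 50)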
> dlist2
 [1] "
 [2] "\"http://www.w3.org/TR/html4/loose.dtd\">"
 [3] ""                                                                                                                                                                                                                                            
 [4] ""                                                                                                                                                                                                                                                        
 [5] ""                                                                                                                                          
 [6] "\t"                                                                                                                                                                                                                 
 [7] "\t"                                                                                                                                                                                                                                                 
 [9] "\t        "                                                                                                
[10] "\t  "                                                                                                                                                                               
[11] "        "                                                                                                                                                                          
[12] "        "                                                                                                                                                                                     
[13] "       "                                                                                                                                                                            
[14] "       "                                                                                                                                                                         
[15] "      "                                                                                                                                                                                       
[16] "       "                                                                                                                                                                                                                               
[17] ""                                                                                                                                                                       
[18] ""                                                                                                                                                                  
[19] ""                                                                                                                                                                                                                                                              
[20] "  "                                                                                                                                                                                                                                    
[21] ""                                                                                                                                                                                                                                                     
[36] ""                                                                                                                                                                                                                                                              
[37] "      "                                                                                                                                                                                                                                  
[38] ""                                                                                                                                                                                     
[39] ""                                                                                                                                                                                         
[40] ""                                                                                                                                                                                           
[41] ""                                                                                                                                                                                
[42] ""                                                                                                                                                                             
[43] ""                                                                                                                                                                           
[44] ""                                                                                                                                                                                           
[45] ""
[46] "Universit\xe9 Paris Diderot-Paris 7"
[47] ""                                                                                                                                                                                      
[48] ""                                                                                                                                                                                    
[49] ""                                                                                                                                                                                                 
[50] " "                                                                           
>
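Line [46] of the output above contains the escape `\xe9`, which suggests the page is encoded in Latin-1 rather than UTF-8. That encoding is an assumption; the source does not state it. Under that assumption, a minimal sketch of converting such a line with iconv(), using the string copied from the transcript:

```r
# String as it appears in the transcript; byte 0xE9 is "é" in Latin-1.
raw_line <- "Universit\xe9 Paris Diderot-Paris 7"

# Re-encode from Latin-1 (assumed) to UTF-8 so the accent prints correctly.
fixed_line <- iconv(raw_line, from = "latin1", to = "UTF-8")
print(fixed_line)
```

If the page really is Latin-1, the same conversion can be applied to the whole vector returned by readLines(), since iconv() is vectorised.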
Is there a way to determine how many lines the HTML has?
One approach is to download the HTML file.
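A minimal sketch of that approach: save the page with download.file() and count the lines with length(readLines()). The helper name count_html_lines and the small local demo file are hypothetical and exist only for illustration; the demo uses a file:// URL so it runs without network access.

```r
# Hypothetical helper: download a page to a temporary file and count its lines.
count_html_lines <- function(url) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp))                      # remove the temporary copy afterwards
  download.file(url, destfile = tmp, quiet = TRUE)
  length(readLines(tmp, warn = FALSE))      # no n argument: read every line
}

# Offline demonstration: a small local file stands in for the real page.
demo_file <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>", "<p>jobs</p>", "</body>", "</html>"), demo_file)

# Build a file:// URL (Windows paths need an extra leading slash).
demo_path <- normalizePath(demo_file, winslash = "/")
prefix <- if (.Platform$OS.type == "windows") "file:///" else "file://"
n_lines <- count_html_lines(paste0(prefix, demo_path))
print(n_lines)
```

With a real page, the call would be count_html_lines(urlinternetaddr), and the result could then be passed as n to readLines().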
readLines() treats the file as plain text, so grep() can be used to filter the lines.
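A minimal sketch of this filtering step. The vector page_lines is invented sample data standing in for the output of readLines(); the keywords are the two entries that survive in the keyword table.

```r
# Invented sample lines standing in for readLines() output.
page_lines <- c(
  "<html>",
  "<a href=\"/jobs\">Jobs</a>",
  "<p>Welcome</p>",
  "<a href=\"/vacancies\">Current vacancies</a>",
  "</html>"
)

# Combine the keywords into one regular expression: "vacancies|jobs".
keywords <- c("vacancies", "jobs")
pattern <- paste(keywords, collapse = "|")

# grep() returns the indices of matching lines; value = TRUE returns the text.
hit_index <- grep(pattern, page_lines, ignore.case = TRUE)
hit_lines <- grep(pattern, page_lines, ignore.case = TRUE, value = TRUE)
print(hit_index)
print(hit_lines)
```

The indices are useful for inspecting neighbouring lines, for example page_lines[hit_index + 1].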
     Chinese-French keyword table
                    vacancies
                    jobs
                          

     Attractions
· Louvre Museum (Musée du Louvre)
· Eiffel Tower (Tour Eiffel)
· Triumphal Arch (Arc de Triomphe)
· Grand Palais National Galleries (Galeries nationales du Grand Palais)
· Petit Palais Museum (Petit Palais)
· Concorde Square (Place de la Concorde)
· Orangerie Museum (Musée de l’Orangerie)
· Palace of Discovery (Palais de la Découverte)
· Madeleine Church (La Madeleine)
· Gustave Moreau Museum (Musée Gustave Moreau)
· Paris Mosque (Mosquée de Paris)
· Botanical Garden (Jardin des Plantes)
· Pantheon (Panthéon)
· Luxembourg Gardens (Palais et Jardin du Luxembourg)
· Sorbonne University (La Sorbonne)
· Church of Saint-Séverin (St-Séverin)
· Cluny Museum (Musée de Cluny)
· Notre-Dame Cathedral (Notre Dame de Paris)
· Holy Chapel (Sainte Chapelle)
· Old Paris Prison (Conciergerie)
· Sacred Heart Basilica (Basilique du Sacré-Cœur)
· Dalí Museum (Espace Montmartre Salvador Dalí)
· Rodin Museum (Musée Rodin)
· Paris Museum of Modern Art (Musée d’Art Moderne de la Ville de Paris)
· Chaillot Palace (Palais de Chaillot)
· Les Invalides (Hôtel des Invalides)
· Orsay Museum (Musée d’Orsay)
· Bastille Opera House (L’Opéra de la Bastille)
· Winter Circus (Le Cirque d’Hiver)
· Pompidou Centre (Pompidou)
· Marmottan Monet Museum (Musée Marmottan-Monet)
· Grande Arche of La Défense (La Grande Arche de la Défense)
· Palace of Versailles (Château de Versailles)
· Palace of Fontainebleau (Fontainebleau)
· Disneyland Europe (Euro Disney Resort)
· Carrousel Square (Place du Carrousel)
· Arab World Institute Museum (Musée de l’Institut du Monde Arabe)
· City Hall (Hôtel de Ville)
· Élysée Palace (Le Palais de l’Élysée)
· Tuileries Garden (Jardin des Tuileries)
· Bourbon Palace (Palais Bourbon)
· Picasso Museum (Musée Picasso)
· Luxembourg Palace (Palais du Luxembourg)
· Alexandre III Bridge (Pont Alexandre III)
· Carrousel Triumphal Arch (Arc de Triomphe du Carrousel)


Lyon attractions

· Saint John's Cathedral (Cathédrale St-Jean)
· Gallo-Roman Museum (Musée Gallo-Romain)
· Basilica of Notre-Dame de Fourvière (Basilique de Notre Dame de Fourvière)
· Ancient Roman Theatre (Roman theater)
· Lyon Opera House (The National Opera of Lyon)
· Lyon City Hall (Hôtel de Ville)
· Tête d’Or Park
· Textile Museum (Musée des Tissus)
· Museum of Fine Arts (Musée des Beaux-Arts)
· Lyon Decorative Arts Museum (Musée des Arts Décoratifs)
· Museum of Printing

City culture
· Old Lyon Music Festival (Festival de Musique de Vieux Lyon)
· Vienne Jazz Festival (Festival du Jazz à Vienne)

City cuisine
· Café des Fédérations
· Bistro Plaza
· Chez Chabert
· Paul Bocuse

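Free admission days at Paris museums

Arc de Triomphe (Triumphal Arch)
Entrée gratuite le premier dimanche de chaque mois du 1.10 au 31.5 inclus
Free on the first Sunday of each month from October 1 to May 31

Musée de l’Armée (Tombeau de Napoléon): Army Museum (Napoleon's Tomb)
No discount days or free days

Musée d’Art moderne (Museum of Modern Art)
Entrée gratuite le dimanche de 10h à 13h
Free every Sunday from 10:00 to 13:00

Musée national d’Art moderne, Centre Georges Pompidou (Pompidou Centre)
Entrée gratuite le premier dimanche de chaque mois
Free on the first Sunday of each month

Maison de Balzac (Balzac's House)
Entrée gratuite le dimanche de 10h à 13h
Free every Sunday from 10:00 to 13:00

Maison de Victor Hugo (Victor Hugo's House)
Entrée gratuite le dimanche de 10h à 13h
Free every Sunday from 10:00 to 13:00

Musée du Louvre (Louvre Museum)
Entrée gratuite le premier dimanche de chaque mois
Free on the first Sunday of each month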


