Sina Weibo hot search list crawler (two approaches: BeautifulSoup and XPath)

import requests
from bs4 import BeautifulSoup
link='https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6'
hd={'cookie':'SINAGLOBAL=8338198557772.616.1566808641899; un=18801349425; UOR=,,www.baidu.com; wvr=6; login_sid_t=9a2c04c2a85dadc52ae9c185b86ff5fe; cross_origin_proto=SSL; _s_tentry=passport.weibo.com; Apache=1028918497690.1122.1569980882356; ULV=1569980882364:13:1:2:1028918497690.1122.1569980882356:1569828413813; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWR1P.g8IzEbevCzU0_6a5l5JpX5K2hUgL.Fo-cS0nXeh-cS0e2dJLoIpiz9gL09g4b9PirTHUDdspDqgXt; ALF=1601516889; SSOLoginState=1569980889; SCF=AraMfDPW3QszFC7UpkLAhrJ51ndjabfGqXxYlID6YutYkR-DCWQJuO1poOkRl1q9BlJlNjWER1Hx2TXrqv3HXo0.; SUB=_2A25wkHGKDeRhGeNI7FoV8CvKzD-IHXVT5ORCrDV8PUNbmtANLRHAkW9NSDpPy47d2Z5GpdMD7ihoqHh5vGesZgkp; SUHB=0XDlwN_Qhmep-L; WBStorage=384d9091c43a87a5|undefined; webim_unReadCount=%7B%22time%22%3A1569980982285%2C%22dm_pub_total%22%3A0%2C%22chat_group_client%22%3A0%2C%22allcountNum%22%3A1%2C%22msgbox%22%3A0%7D',
   'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
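# Fetch the hot search page, sending the cookie and a browser User-Agent.
r=requests.get(link,headers=hd)
soup=BeautifulSoup(r.text,'lxml')
# Each entry of the hot search list is one table row.
topic_list=soup.find_all('tr')
for each in topic_list:
    # Only ranked rows contain a 'td-01 ranktop' cell; rows without it are skipped.
    index=each.find('td',class_='td-01 ranktop')
    if index is not None:
        topic=each.find('td',class_='td-02').text.strip().replace('\n',' ')
        print(index.text,topic)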
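
The same row filtering can be tried offline, without a network request or a cookie. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up fragment that imitates the table layout the code above relies on, `parse_hot_topics` is a hypothetical helper name, and the built-in `html.parser` is swapped in for `lxml` so that only `bs4` is needed.

```python
from bs4 import BeautifulSoup

# Made-up sample that imitates the hot search table (an assumption, not real Weibo output).
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
<table>
<tbody>
<tr><td class="td-01">Rank</td><td class="td-02">Keyword</td></tr>
<tr><td class="td-01 ranktop">1</td><td class="td-02"><a href="#">Topic A</a>
<span>12345</span></td></tr>
<tr><td class="td-01 ranktop">2</td><td class="td-02"><a href="#">Topic B</a>
<span>6789</span></td></tr>
</tbody>
</table>
</div>
"""

def parse_hot_topics(html_text):
    """Return a list of (rank, topic) tuples for every ranked row (hypothetical helper)."""
    soup = BeautifulSoup(html_text, 'html.parser')
    results = []
    for row in soup.find_all('tr'):
        # A class_ string containing a space is matched against the exact class attribute value.
        rank_cell = row.find('td', class_='td-01 ranktop')
        topic_cell = row.find('td', class_='td-02')
        if rank_cell is None or topic_cell is None:
            continue  # header row or other unranked row
        # Collapse newlines and repeated spaces inside the cell into single spaces.
        topic = ' '.join(topic_cell.get_text().split())
        results.append((rank_cell.get_text(strip=True), topic))
    return results

for rank, topic in parse_hot_topics(SAMPLE_HTML):
    print(rank, topic)
```

Returning a list instead of printing inside the loop makes the parsing step easy to check against a saved page before pointing it at the live site.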


Using XPath is simpler

import requests
from lxml import etree
link='https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6'
hd={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
r=requests.get(link,headers=hd)
html=etree.HTML(r.text)
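# The row number is filled in on each iteration; the title and the hot index share this row prefix.
row_path='//*[@id="pl_top_realtimehot"]/table/tbody/tr[{}]'
# The ranked topics are read from rows 2 to 51, so the printed rank is i-1.
for i in range(2,52):
    title_list=html.xpath(row_path.format(i)+'/td[2]/a/text()')
    hot_index=html.xpath(row_path.format(i)+'/td[2]/span/text()')
    # xpath() returns an empty list when nothing matches; skip such rows instead of raising IndexError.
    if title_list and hot_index:
        print(i-1,title_list[0],hot_index[0])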
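
Instead of building one absolute XPath per row number, the rows can be selected once and then queried with relative paths. This is only a sketch under assumptions: `SAMPLE_HTML` is a made-up fragment with the same `pl_top_realtimehot` table layout that the XPath above expects, and `parse_hot_topics_xpath` is a hypothetical helper name.

```python
from lxml import etree

# Made-up sample that imitates the hot search table (an assumption, not real Weibo output).
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
<table>
<tbody>
<tr><td class="td-01">Rank</td><td class="td-02">Keyword</td></tr>
<tr><td class="td-01 ranktop">1</td><td class="td-02"><a href="#">Topic A</a>
<span>12345</span></td></tr>
<tr><td class="td-01 ranktop">2</td><td class="td-02"><a href="#">Topic B</a>
<span>6789</span></td></tr>
</tbody>
</table>
</div>
"""

def parse_hot_topics_xpath(html_text):
    """Return (rank, title, hot_index) tuples for each complete row (hypothetical helper)."""
    html = etree.HTML(html_text)
    # Select every row once, then use paths relative to each row ('./...').
    rows = html.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
    results = []
    for row in rows:
        rank = row.xpath('./td[1]/text()')
        title = row.xpath('./td[2]/a/text()')
        hot = row.xpath('./td[2]/span/text()')
        # Rows missing a link or a hot index (such as the header row) are skipped.
        if rank and title and hot:
            results.append((rank[0].strip(), title[0].strip(), hot[0].strip()))
    return results

for rank, title, hot in parse_hot_topics_xpath(SAMPLE_HTML):
    print(rank, title, hot)
```

With this form the loop no longer depends on the list having exactly 50 entries, since it processes however many rows the page contains.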

Updated 2019.10.7
