【Python】Parsing Web Pages with bs4

1. Parsing web pages with bs4

1.1 Notes on web pages:


A class attribute may be shared by many elements, but an id must be unique within a page;
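For example, selecting by class can match several elements while an id pins down exactly one (a minimal sketch with hypothetical markup):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: both divs share a class, but each id appears only once
html = '''
<div class="item" id="first">A</div>
<div class="item" id="second">B</div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.find_all('div', class_='item')))  # 2 -- the class repeats
print(soup.find('div', id='first').text)         # A -- the id matches one element
```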

str.split() returns a list.

lis1 = [i for i in lis1 if i != '']
drops the empty strings from the list.
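A small illustration of both points (variable names here are illustrative): splitting on an explicit separator keeps empty strings, which the comprehension then filters out; a no-argument split would avoid them in the first place.

```python
s = 'a  b   c'
parts = s.split(' ')     # explicit separator keeps the empty strings
print(parts)             # ['a', '', 'b', '', '', 'c']
cleaned = [i for i in parts if i != '']
print(cleaned)           # ['a', 'b', 'c']
print(s.split())         # ['a', 'b', 'c'] -- no-argument split collapses whitespace
```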


import requests
from bs4 import BeautifulSoup

url = r'https://www.sac.net.cn/ljxh/jgsz/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
           'authority': 'www.sac.net.cn',
           'cookie': 'acw_tc=7cc1f4b316733131136096468ed100f23189482e29de3d5f2ae19f91ce'}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, 'html.parser')
main = soup.find_all('span', class_='jgsz_1224')
items = []  # avoid shadowing the built-in name `list`
for u in main:
    items.append(u.text.strip('\n'))
rows = []
for i in items:
    i = i.replace('\t\xa0\xa0', '')
    i = i.replace('\n', '')
    rows.append(i.split('              '))  # the page pads entries with runs of spaces
fp = open(r'D:\test\a.txt', 'w', encoding='utf-8')
for m in rows[1]:
    if m != '':
        fp.write(m + '\n')
fp.close()
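The strip/replace/split chain above can often be avoided: bs4's `stripped_strings` generator yields each text fragment with surrounding whitespace already removed. A minimal sketch on an inline snippet (the markup here is hypothetical, standing in for the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the whitespace-heavy structure of the real page
html = '<span class="jgsz_1224">\n <a>Dept A</a>\t\n <a>Dept B</a>\n</span>'
soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', class_='jgsz_1224')
# stripped_strings skips whitespace-only nodes and strips the rest
names = list(span.stripped_strings)
print(names)  # ['Dept A', 'Dept B']
```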
# General workflow for scraping several static pages: collect the links first, then visit each one in turn
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time

start = time.time()
html = urlopen('https://www.sac.net.cn/ljxh/jgsz/')   # fetch the page
soup = BeautifulSoup(html, 'html.parser')             # parse it
main = soup.find_all('a', class_='jgsz_1224')
# With the tag 'span' you get all the strings but no links: the href attribute
# lives on the <a> tag, so the tag value must be 'a'
link_list = []
for u in main:
    uu = u.get('href')
    if uu and uu.startswith('http'):  # guard against tags without an href
        print(uu)
        link_list.append(uu)
fp = open(r'D:\test\a.txt', 'a', encoding='utf-8')
for link in link_list:
    soup1 = BeautifulSoup(urlopen(link), 'html.parser')
    for i in soup1.find_all('td', class_='xl_cen'):
        try:
            fp.write(i.text)
        except Exception as e:
            print(e)
fp.close()
end = time.time()
print(end - start)
print('finish')

The data scraped here is quite irregular, which makes it feel of limited use; it also contains many blank lines that a simple replace does not remove, and I have not figured out why yet. I will come back to this later;
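One likely cause: the "blank" lines are not truly empty but hold invisible characters such as \xa0, \t, or \r, so replacing a single pattern misses them. A sketch of a filter that drops any whitespace-only line, whatever characters it contains (the sample string is illustrative):

```python
# str.strip() treats \xa0, \t, \r etc. as whitespace, so a strip-based test
# catches lines that a targeted replace('\n', '') would leave behind
raw = 'Line 1\n\xa0\xa0\n   \nLine 2\r\n\nLine 3'
cleaned = '\n'.join(line for line in raw.splitlines() if line.strip())
print(cleaned)  # Line 1 / Line 2 / Line 3, with no blank lines between
```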

One more caveat: when scraping links, pay attention to which tag actually carries the attribute you need.
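Concretely, asking the wrapping tag for href returns None; only the tag that owns the attribute yields it (hypothetical markup, mirroring the span-vs-a mixup above):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the link lives on <a>, not on the <span> around it
html = '<span class="jgsz_1224"><a class="jgsz_1224" href="https://example.com/page">Dept</a></span>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('span', class_='jgsz_1224').get('href'))  # None -- the span has no href
print(soup.find('a', class_='jgsz_1224').get('href'))     # https://example.com/page
```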
