Python Web Scraping Notes (2): Extraction (1)

These notes cover some basic concepts and usage of BeautifulSoup.

Getting started with BeautifulSoup

Basic elements of the BeautifulSoup library

  1. Parsing page markup

For example:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')  # use the HTML parser bundled with Python
print(soup.prettify())  # print the parsed document with indentation
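
The same parsing step can be tried without a network request. The sketch below is only an illustration: the `html` string is a hypothetical stand-in, not the content of the demo page, and `title_text` is a name introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration (not the real demo page).
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Build the parse tree with the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string  # text inside the <title> tag
print(title_text)
print(soup.p.string)
```

Parsing a local string like this is a convenient way to experiment with the API before pointing the code at a live page.
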
  2. Basic elements of the BeautifulSoup class

For example:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
#print(soup.prettify())
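print(soup.title)  # the <title> tag
print(soup.a)  # the first <a> tag in the document
print(soup.a.name)  # the tag's name
print(soup.a.parent.name)  # the name of the tag that contains it
print(soup.a.attrs)  # the tag's attributes, as a dict
print(soup.a.string)  # the text inside the tag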
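
To make the element types more concrete, here is a minimal sketch that inspects them on a small, hypothetical fragment (the markup and variable names are assumptions made for this example). It shows a `Tag`, its name and attributes, a `NavigableString`, and a `Comment`.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment: a tag with an attribute, nested text, and a comment.
html = '<p class="title"><b>Bold text</b><!--a comment--></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(isinstance(tag, Tag))   # a parsed element is a Tag
print(tag.name)               # the tag's name
print(tag.attrs)              # attributes as a dict; class is stored as a list

text = soup.b.string          # text inside <b>, a NavigableString
comment = tag.contents[1]     # the second child of <p> is the comment node
print(isinstance(text, NavigableString))
print(isinstance(comment, Comment))
```

Because `Comment` is a subclass of `NavigableString`, checking the type is the way to tell comment text apart from ordinary text.
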
  3. Understanding the BeautifulSoup library

For example:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
#print(soup)
#print(soup.a)
print(soup.a.attrs)  # attributes of the first <a> tag, as a dict
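
Since `.attrs` is a dictionary, individual attributes can be read in the usual dictionary ways. The following sketch assumes a hypothetical `<a>` tag; the URL and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical link; the URL is a placeholder.
html = '<a href="http://example.com/a" class="py1" id="link1">Basic</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]          # indexing raises KeyError if the attribute is missing
target = link.get("target")  # .get() returns None if the attribute is missing
classes = link["class"]      # class can hold several values, so it is a list
link_id = link.attrs["id"]   # same data, accessed through the attrs dict

print(href, target, classes, link_id)
```

Using `.get()` is the safer choice when an attribute may be absent on some tags.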

Traversing HTML content with bs4

Basic structure of HTML

  1. Downward traversal of the tag tree

For example:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')

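#print(soup.head)  # the <head> tag
#print(soup.head.contents)  # its child nodes, as a list

print(soup.body.contents)  # child nodes of the <body> tag
print(len(soup.body.contents))  # number of direct child nodes of <body> (tags and text nodes)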
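
One thing that is easy to miss is that `.contents` counts text nodes, including the newlines between tags, not only tags. The sketch below uses a hypothetical `<body>` fragment to show the difference; `all_nodes` and `tag_children` are names introduced for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with newlines between the tags.
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

all_nodes = soup.body.contents  # every direct child, text nodes included
tag_children = [node for node in all_nodes if isinstance(node, Tag)]  # tags only

print(len(all_nodes))     # 5: three newline strings plus two <p> tags
print(len(tag_children))  # 2
```

Filtering with `isinstance(node, Tag)` is one simple way to skip the whitespace strings.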

The .children attribute, used for looping over child nodes

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')

for child in soup.body.children:
    print(child)
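
As a rough comparison, `.children` yields only direct children, while bs4 also provides `.descendants`, which walks the whole subtree. The sketch below assumes a hypothetical fragment and collects tag names from both; `.descendants` is not used in the code above and is included here only as an illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with two levels of nesting.
html = '<body><p class="title"><b>Heading</b></p><p>Text <a href="#">link</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body>, keeping only tags.
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# Every tag below <body>, in document order.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_names)       # ['p', 'p']
print(descendant_names)  # ['p', 'b', 'p', 'a']
```
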
  2. Upward traversal of the tag tree

For example:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')

#print(soup.title.parent)  # the parent tag of <title>
for parent in soup.a.parents:  # walk up through every ancestor of the first <a>
    if parent is None:
        print(parent)
    else:
        print(parent.name)  # the last ancestor is the soup object itself, named "[document]"
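
The ancestor chain is easier to see on a small document. This sketch uses a hypothetical page and gathers the names of all ancestors of the `<a>` tag; `ancestor_names` is a name introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single link.
html = "<html><body><p>See <a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object itself has no parent.
print(soup.parent)     # None
```

The chain stops at the soup object, so a `None` value only appears when asking for `soup.parent` directly.
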
  3. Sibling traversal of the tag tree

Condition for sibling traversal: it only takes place among nodes that share the same parent.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')

print(soup.a)
print(soup.a.next_sibling.next_sibling)  # the node two positions after the first <a>

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')

for sibling in soup.a.next_siblings:  # iterate over every sibling node after the first <a>
    print(sibling)
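
Siblings are not always tags: the text between two tags is a sibling node as well. The sketch below, on a hypothetical paragraph, shows the siblings before and after the first `<a>`, and uses `find_next_sibling()` to jump straight to the next tag. All names and markup here are assumptions made for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical paragraph: text, a link, more text, another link, a full stop.
html = '<p>Intro <a id="link1" href="#">Basic</a> and <a id="link2" href="#">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

after = list(first.next_siblings)       # nodes after the first <a>: text, tag, text
before = list(first.previous_siblings)  # nodes before it: a single text node
next_tag = first.find_next_sibling("a")  # skips text nodes and returns the next <a>

print(after)
print(before)
print(next_tag["id"])
```

This is why the earlier example calls `.next_sibling` twice: the first call can land on a text node rather than a tag.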

HTML formatting and encoding with bs4

This comes down to using prettify(), which returns the parsed document as an indented string.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
#print(demo)
soup = bs(demo, 'html.parser')
print(soup.prettify())  # print the document in a more readable, indented form
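
As a small illustration of both formatting and encoding, the sketch below runs `prettify()` on a hypothetical document containing non-ASCII text. Calling it without arguments returns a `str`; passing `encoding="utf-8"` returns `bytes`. The variable names are introduced here for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical document containing non-ASCII text.
html = "<html><body><p>中文</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()                    # indented text, one node per line
as_bytes = soup.prettify(encoding="utf-8")  # the same output encoded as UTF-8 bytes

# Strip the indentation to look at the line structure alone.
lines = [line.strip() for line in pretty.splitlines()]
print(pretty)
print(lines)
```

The text survives the round trip unchanged because bs4 works with Unicode strings internally.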
