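These are my notes from the course "Python爬虫零基础入门到进阶实战", taught by 日月光华 on NetEase Cloud Classroom (网易云课堂). If you find them useful, the video course itself is well worth watching.
There is nothing original here; it is just a set of notes.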
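# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Access the first <ul> tag by name
print(soup.ul)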
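# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Chain tag names to reach the first <li> inside the <ul>
print(soup.ul.li)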
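# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# .string returns the text of the first <a> inside that <li>
print(soup.ul.li.a.string)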
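# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Read an attribute of the first <a> with dictionary-style indexing
print(soup.a['href'])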
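# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Read the same attribute with .get()
print(soup.a.get('href'))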
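The two ways of reading an attribute shown above differ when the attribute is missing. The following is a minimal sketch of that difference; the markup is a hypothetical snippet, and 'html.parser' is used here only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one without.
doc = '<a href="link1.html">first item</a><a>no link</a>'
soup = BeautifulSoup(doc, 'html.parser')
first, second = soup.find_all('a')

href_by_index = first['href']       # dictionary-style access
href_by_get = first.get('href')     # same value via .get()
missing = second.get('href')        # None when the attribute is absent
fallback = second.get('href', '#')  # or an explicit default

# Dictionary-style access raises KeyError for a missing attribute.
try:
    second['href']
    raised = False
except KeyError:
    raised = True

print(href_by_index, href_by_get, missing, fallback, raised)
```

When scraping pages where some links may lack an href, .get() tends to be the more forgiving choice.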
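# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# find_all returns every matching tag
print(soup.find_all('a'))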
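# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Index the results to get the text of the third link
print(soup.find_all('a')[2].string)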
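# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Calling soup(...) is shorthand for soup.find_all(...)
print(soup('a')[2].string)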
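To make the relationship between these calls explicit, here is a small sketch comparing soup.find_all(...), the soup(...) shorthand, and soup.find(...). The markup and href values are made up for illustration, and 'html.parser' is an assumption chosen to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links.
doc = '<ul><li><a href="a.html">one</a></li><li><a href="b.html">two</a></li></ul>'
soup = BeautifulSoup(doc, 'html.parser')

via_find_all = [a.string for a in soup.find_all('a')]
via_call = [a.string for a in soup('a')]          # same as find_all('a')
first_only = soup.find('a').string                # first match only
limited = soup.find_all('a', limit=1)             # stop after one result

print(via_find_all, via_call, first_only, len(limited))
```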
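# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Filter by CSS class; class_ is used because class is a Python keyword
print(soup(class_='item-0'))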
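# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Text of the first tag whose class is item-0
print(soup(class_='item-0')[0].string)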
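# Import BeautifulSoup
from bs4 import BeautifulSoup
# Import regular expressions
import re
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# A compiled regex passed as class_ matches every class containing 'item-'
print(soup(class_=re.compile('item-'))[3].string)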
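As a rough illustration of how class_ filters behave, the sketch below compares a plain string with a compiled regular expression. The markup and class names are hypothetical, and 'html.parser' is assumed so the example needs no extra parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the second <li> carries two classes.
doc = ('<ul><li class="item-0">a</li>'
       '<li class="item-1 active">b</li>'
       '<li class="other">c</li></ul>')
soup = BeautifulSoup(doc, 'html.parser')

# A string matches any tag that has that class among its classes.
exact = [li.string for li in soup.find_all(class_='item-1')]

# A compiled regex is tested against each class of each tag.
by_regex = [li.string for li in soup.find_all(class_=re.compile('^item-'))]

print(exact, by_regex)
```

Anchoring the pattern with ^ keeps it from matching classes that merely contain 'item-' somewhere in the middle.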
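# Import BeautifulSoup
from bs4 import BeautifulSoup
# NOTE: the tags in this string were lost in extraction. The <div>/<ul>/<li>/<a>
# structure, class names, and href values are an assumed reconstruction.
html = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">else item</a></li>
another item
</ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# Collect the non-empty lines of text inside the <ul>
print([x.strip() for x in soup.ul.get_text().split('\n') if x.strip()])
# print(soup.ul.get_text())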
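The split-and-strip list comprehension above can probably be replaced by helpers that Beautiful Soup already provides. This is a sketch under that assumption, using a made-up snippet and 'html.parser' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with whitespace between the tags.
doc = "<ul>\n<li><a>first item</a></li>\n<li><a>second item</a></li>\nanother item\n</ul>"
soup = BeautifulSoup(doc, 'html.parser')

# stripped_strings yields each text fragment with surrounding whitespace
# removed and skips fragments that are empty after stripping.
items = list(soup.ul.stripped_strings)

# get_text can also join the stripped fragments with a separator.
joined = soup.ul.get_text('|', strip=True)

print(items)
print(joined)
```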
The re module in the Python standard library provides full regular expression support.
import re
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
# re.match only matches at the start of the string
print(re.match('Beautiful', text))
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.match('Beautiful', text).span())
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.match('Beautiful', text).group())
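One thing worth keeping in mind with the calls above: re.match returns None when the pattern does not match at the start, so chaining .group() or .span() directly can raise AttributeError. A minimal sketch of guarding against that, using a shortened version of the text:

```python
import re

text = "Beautiful is better than ugly"

hit = re.match('Beautiful', text)   # matches at position 0
miss = re.match('better', text)     # 'better' is not at the start

span = hit.span()
matched = hit.group()
# Check the result before calling methods on it.
safe = miss.group() if miss else None

print(span, matched, safe)
```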
Only the parts of the pattern wrapped in parentheses count as groups.
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.match(r'(\w+) is (\w+)', text).group(1))
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.match(r'\w+ is \w+', text).group())
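To show how group numbering works in practice, here is a short sketch. The named groups subject and quality are hypothetical names introduced only for illustration.

```python
import re

text = "Beautiful is better than ugly"

m = re.match(r'(\w+) is (\w+)', text)
whole = m.group(0)      # the entire match, same as m.group()
first = m.group(1)      # first parenthesized group
second = m.group(2)     # second parenthesized group
both = m.groups()       # all groups as a tuple

# Groups can also be named with (?P<name>...).
named = re.match(r'(?P<subject>\w+) is (?P<quality>\w+)', text)
quality = named.group('quality')

print(whole, first, second, both, quality)
```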
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.search('ugly', text).group())
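The difference between re.match and re.search can be sketched as follows: match only tries the beginning of the string, while search scans forward for the first position where the pattern fits.

```python
import re

text = "Beautiful is better than ugly"

at_start = re.match('ugly', text)    # None: 'ugly' is not at position 0
anywhere = re.search('ugly', text)   # finds the first occurrence

found = anywhere.group()
where = anywhere.span()

print(at_start, found, where)
```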
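re.sub takes the pattern to replace as its first argument, the replacement as the second, the string to operate on as the third, and the maximum number of replacements (count) as the fourth.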
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.sub('better', '666', text, count=1))
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.sub(', dou.*', '', text))
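A few more variations on re.sub, as a sketch on a shortened sample string: limiting the number of replacements, referring back to captured groups in the replacement, and using re.subn to also get the replacement count.

```python
import re

text = "better than ugly, better than implicit"

once = re.sub('better', '666', text, count=1)    # replace only the first hit
every = re.sub('better', '666', text)            # replace all hits

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+) than (\w+)', r'\2 than \1', text)

# subn returns the new string together with the number of replacements.
new_text, n = re.subn('better', 'worse', text)

print(once)
print(every)
print(swapped)
print(new_text, n)
```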
The simplest split:
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.split(', ', text))
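Splitting on digits:
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
# This pattern needs a space after the digits, so it does not match the trailing 666 and the text comes back unsplit
print(re.split(r'\d+ ', text))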
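Since the sample text only has digits at the very end, splitting on numbers is easier to see on a string with digits in the middle. The string below is a made-up example, not from the course.

```python
import re

# Hypothetical sample with numbers between the words.
text = "one 1 two 22 three 333"

plain = re.split(r'\s*\d+\s*', text)               # split on each number
limited = re.split(r'\s*\d+\s*', text, maxsplit=1) # split at most once
# A capturing group keeps the separators in the result.
kept = re.split(r'\s*(\d+)\s*', text)

print(plain)
print(limited)
print(kept)
```

The trailing empty string appears because the text ends with a separator.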
re.findall returns all non-overlapping matches as a list.
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.findall(r'is \w+',text))
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
print(re.findall(r'is (\w+)',text))
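How findall shapes its result depends on the number of groups in the pattern. The sketch below shows the multi-group case, plus re.finditer for when the match positions are needed as well.

```python
import re

text = "Beautiful is better than ugly, Explicit is better than implicit"

# With two groups, findall returns a list of tuples.
pairs = re.findall(r'(\w+) is (\w+)', text)

# finditer yields match objects, which also carry positions.
spans = [m.span() for m in re.finditer(r'\bis\b', text)]

print(pairs)
print(spans)
```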
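Precompiling a pattern that will be matched repeatedly can make matching faster. The compiled pattern object can be searched directly.
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
pat = re.compile(r'is (\w+)')
print(pat.findall(text))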
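Precompiling pays off mainly when the same pattern is applied many times. A minimal sketch of that usage, with a hypothetical list of lines standing in for real input:

```python
import re

# Compile once, reuse for every line.
pattern = re.compile(r'is (\w+)')

# Hypothetical input lines.
lines = [
    "Beautiful is better than ugly",
    "Explicit is better than implicit",
    "no match here",
]

results = [pattern.findall(line) for line in lines]
first_word = pattern.search(lines[0]).group(1)

print(results, first_word)
```

The compiled object exposes the same operations as the module-level functions (match, search, findall, sub, split), without the pattern argument.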
Common regular expression patterns
import re
text = "Beautiful is better than ugly, Explicit is better than implicit, double click 666"
# [] is a character class: it matches any one of the listed characters
pat = re.compile('[une]').findall(text)
print(pat)
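Character classes support ranges and negation as well as plain lists of characters. A short sketch on the tail of the sample text:

```python
import re

text = "double click 666"

listed = re.findall('[une]', text)      # any one of u, n, e
ranged = re.findall('[a-c]', text)      # any character from a to c
digits = re.findall('[0-9]+', text)     # one or more digits in a row
negated = re.findall('[^a-z ]', text)   # anything except a-z and space

print(listed, ranged, digits, negated)
```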
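import re
# NOTE: the markup was lost in extraction; everything except the title text,
# including the <br /> at the end of the pattern, is an assumed reconstruction.
html = """
<title>Example website</title>
<a href='image1.html'>Name: My image 1 <br /></a>
<a href='image2.html'>Name: My image 2 <br /></a>
"""
# Build the pattern from the text before and after the target
print(re.compile(r"image1\.html'>(.*)<br />").findall(html))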
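import re
# NOTE: the markup was lost in extraction; everything except the title text,
# including the <br /> at the end of the pattern, is an assumed reconstruction.
html = """
<title>Example website</title>
<a href='image1.html'>Name: My image 1 <br /></a>
<a href='image2.html'>Name: My image 2 <br /></a>
"""
# Use the part that all target texts have in common
print(re.compile(r"html'>(.*)<br />").findall(html))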
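import re
# NOTE: the markup was lost in extraction; everything except the title text is assumed.
html = """
<title>Example website</title>
<a href='image1.html'>Name: My image 1 <br /></a>
<a href='image2.html'>Name: My image 2 <br /></a>
"""
# Capture each href value; the dot is escaped so it only matches a literal "."
print(re.compile(r"a href='(\w+\.\w+)").findall(html))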
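When building patterns from surrounding text like this, the greediness of .* matters as soon as several targets share a line. The following sketch uses a made-up one-line snippet to contrast the greedy .* with the non-greedy .*?.

```python
import re

# Hypothetical markup with two links on the same line.
html = "<a href='image1.html'>one</a><a href='image2.html'>two</a>"

# Greedy: .* runs to the last "'>" on the line, swallowing both links.
greedy = re.findall(r"<a href='(.*)'>", html)

# Non-greedy: .*? stops at the first "'>" after each href.
lazy = re.findall(r"<a href='(.*?)'>", html)

print(greedy)
print(lazy)
```

For anything beyond simple extractions, an HTML parser such as Beautiful Soup is usually the more robust tool.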