Fetching a website's HTML with the requests library
import requests
response = requests.get("https://jingyan.baidu.com/article/17bd8e52c76b2bc5ab2bb8a2.html#:~:text=1.%E6%89%93%E5%BC%80%E6%B5%8F%E8%A7%88%E5%99%A8F12%202.%E6%89%BE%E5%88%B0headers%E9%87%8C%E9%9D%A2%E7%9A%84cookie,3.%E5%A6%82%E6%9E%9C%E8%A6%81%E6%89%BE%E5%88%B0%E5%AF%B9%E5%BA%94%E7%9A%84%E7%82%B9%E5%87%BBcookie%204.%E8%BF%9E%E7%BB%AD%E4%B8%89%E6%AC%A1%E7%82%B9%E5%87%BB%E5%8F%B3%E9%94%AE%E5%A4%8D%E5%88%B6")
print(response)
print(response.status_code)
# Branch on the status code range
if 200 <= response.status_code < 400:
    ...
elif 400 <= response.status_code < 500:
    print("request failed: client error")
elif response.status_code >= 500:
    print("request failed: server error")
# response.ok is True when the status code is below 400
if response.ok:
    print(response.text)
    ...
else:
    print("request failed")
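The status code ranges checked above can be wrapped in a small helper. This is only a minimal sketch: the function name `describe_status` and the labels it returns are illustrative assumptions, not part of requests.

```python
def describe_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse category (labels are illustrative)."""
    if 200 <= status_code < 400:
        return "success"        # 2xx success and 3xx redirects, as in the check above
    if 400 <= status_code < 500:
        return "client error"
    if status_code >= 500:
        return "server error"
    return "informational"      # anything below 200, such as 1xx responses


for code in (200, 404, 503):
    print(code, describe_status(code))
```

With a real response, `describe_status(response.status_code)` gives the category. requests also provides `response.raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx codes.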
# Same kind of request against another site, without custom headers
import requests
response = requests.get("https://movie.douban.com/top250")
print(response)
print(response.text)
# Send a browser-like User-Agent header with the request
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
print(response)
print(response.text)
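A brief introduction to HTML structure
HTML defines the structure and content of a web page; save the file as xxx.html and open it in a browser
CSS defines the page's styling
JavaScript defines the interaction logic between the user and the page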
<!DOCTYPE html>
<html>
<body>
<h1>title</h1>
<p>some texts</p>
</body>
</html>
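To see how the tags in a page like this nest, the standard-library `html.parser` module can walk the markup. The sketch below is only an illustration: the class name `OutlineParser` is hypothetical, and the depth counter assumes every tag is closed (no void elements such as `<br>`).

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


page = """<!DOCTYPE html>
<html>
<body>
<h1>title</h1>
<p>some texts</p>
</body>
</html>"""

parser = OutlineParser()
parser.feed(page)
for depth, tag in parser.outline:
    print("  " * depth + tag)  # indent each tag by its depth
```

The printed outline shows `html` at the top level, `body` inside it, and `h1` and `p` side by side inside `body`.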
Headings
<h1></h1>
<h2></h2>
<h3></h3>
<h4></h4>
<h5></h5>
<h6></h6>
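Paragraphs
<p></p>
Line break <br>
Bold <b></b>
Italic <i></i>
Underline <u></u>
Image <img src="..." width="" height="">
Link <a href="https://..." target="_self">text</a> (target sets how the link opens, e.g. _self for the current page or _blank for a new page)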
Containers: div is a block-level element that takes up a whole line; span is an inline element
<div>
...
</div>
<span>
...
</span>
Lists: ordered list ol, unordered list ul
<ol>
<li>chinese</li>
<li>math</li>
</ol>
<ul>
<li>chinese</li>
<li>math</li>
</ul>
Tables: td is a data cell
<table border="1"> border is one of the table attributes; it displays the borders
<table border="1">
<thead>
<tr>
<td>tableheader1</td>
<td>tableheader2</td>
</tr>
</thead>
<tbody>
<tr>
<td>111</td>
<td>2222</td>
</tr>
<tr>
<td>333</td>
<td>444</td>
</tr>
</tbody>
</table>
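When scraping, a table like this is usually turned into rows of cell values. The following is a minimal sketch using only the standard library; `TableParser` is a hypothetical name, and the code assumes every td sits inside a tr, as in the sample.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:              # ignore whitespace between tags
            self.rows[-1][-1] += data


table_html = """
<table border="1">
<thead><tr><td>tableheader1</td><td>tableheader2</td></tr></thead>
<tbody>
<tr><td>111</td><td>2222</td></tr>
<tr><td>333</td><td>444</td></tr>
</tbody>
</table>
"""

parser = TableParser()
parser.feed(table_html)
header, *body = parser.rows  # first row is the header, the rest is data
print(header)
print(body)
```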
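The class attribute: helps group elements into categories
<p class="content">Give civilization to the years</p>
<p class="content">rather than years to civilization</p>
<p class="review">Five stars!</p>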
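Because elements of the same kind share a class, a scraper can pick them out by that value. The sketch below groups paragraph text by class with the standard-library parser; `ClassTextParser` is a hypothetical name, and an attribute holding several space-separated classes is treated as one string here.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Group the text of <p> elements by the value of their class attribute."""

    def __init__(self):
        super().__init__()
        self.by_class = defaultdict(list)
        self.current = None  # class of the <p> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.current = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "p":
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.by_class[self.current].append(data)


snippet = """
<p class="content">Give civilization to the years</p>
<p class="content">rather than years to civilization</p>
<p class="review">Five stars!</p>
"""

parser = ClassTextParser()
parser.feed(snippet)
print(parser.by_class["content"])
print(parser.by_class["review"])
```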
Scraping book prices and titles from a web page
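from bs4 import BeautifulSoup
import requests
# .text is an attribute of the response, not a method
content = requests.get("http://books.toscrape.com/").text
soup = BeautifulSoup(content, "html.parser")
# Every price sits in a <p class="price_color"> element
all_prices = soup.find_all("p", attrs={"class": "price_color"})
for price in all_prices:
    # Skip the first two characters (the currency prefix) and keep the number
    print(price.string[2:])
# Each <h3> wraps an <a> whose text is the displayed book title
all_titles = soup.find_all("h3")
for title in all_titles:
    all_links = title.find_all("a")
    for link in all_links:
        print(link.string)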
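Slicing with `[2:]` assumes the price text always starts with exactly two non-numeric characters, which depends on how the page was decoded. A more tolerant alternative is sketched below; `parse_price` is a hypothetical helper, and the sample strings are assumed examples of what the price element might contain.

```python
import re


def parse_price(text: str) -> float:
    """Return the first decimal number found in a price string."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


# Works regardless of how many prefix characters precede the digits
for raw in ("£51.77", "Â£53.74"):
    print(parse_price(raw))
```

In the loop above, `parse_price(price.string)` could replace `price.string[2:]` when a numeric value is needed rather than a string.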