(Based on *Practical Data Analysis*, 2nd edition. The website has since been updated, so the book's original code no longer works; I wrote this small crawler myself.)
Use the browser's developer tools (F12) and the built-in element picker
to locate the two tags that contain the price.
from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime
1. Beautiful Soup is a Python library for extracting data from HTML and XML files.
Chinese documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
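As a minimal sketch of what Beautiful Soup does, here it parses a small hand-written HTML fragment (hypothetical markup, for illustration only; the stdlib-backed 'html.parser' is used here, though 'lxml' works the same way if installed):

```python
from bs4 import BeautifulSoup

# A tiny hypothetical HTML fragment.
html = '<div class="asset"><span class="value">1,234.56</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag; .text gives its inner text.
value = soup.find('span', class_='value')
print(value.text)  # → 1,234.56
```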
2. urllib.request
The request module in urllib makes it very easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.
url = "https://www.gold.org/"
req = urllib.request.urlopen(url)
page = req.read()
Running this raised an error.
A quick search showed the site blocks automated crawlers. The fix is simple: add a header to the request so the crawler looks like a normal browser.
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
url = "https://www.gold.org/"
req = urllib.request.Request(url,None,headers)
response = urllib.request.urlopen(req)
page = response.read()
soup = BeautifulSoup(page,'lxml')
price = soup.find("div",class_=["asset","ask"]).find_next(class_="value")
print(price.text)
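To see what `class_=["asset","ask"]` and `find_next` are doing, here is a small sketch against hypothetical markup (an assumption mimicking the price widget, not the real gold.org page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely imitating the price widget.
html = '''
<div class="asset ask">Ask
  <span class="value">2,345.67</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Passing a list to class_ matches a tag carrying any of the listed classes;
# find_next() then walks forward in document order to the next class="value" tag.
div = soup.find('div', class_=['asset', 'ask'])
price = div.find_next(class_='value')
print(price.text)  # → 2,345.67
```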
with open("goldPrice.out","w") as f:
    sNow = datetime.now().strftime("%I:%M:%S%p")
    f.write("{0},{1} \n".format(sNow, getGoldPrice()))
In "%I:%M:%S%p", %I is the hour (12-hour clock), %M the minutes, %S the seconds, and %p A.M. or P.M.
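A quick check of the format string on a fixed timestamp (the date here is arbitrary, chosen only so the output is deterministic):

```python
from datetime import datetime

# %I = 12-hour clock hour, %M = minute, %S = second, %p = AM/PM.
t = datetime(2024, 1, 15, 14, 30, 5)
print(t.strftime("%I:%M:%S%p"))  # → 02:30:05PM
```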
from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime

def getGoldPrice():
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    url = "https://www.gold.org/"
    req = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(req)
    page = response.read()
    soup = BeautifulSoup(page, 'lxml')
    price = soup.find("div", class_=["asset","ask"]).find_next(class_="value")
    return price.text

with open("goldPrice.out","w") as f:
    for x in range(0, 60):
        sNow = datetime.now().strftime("%I:%M:%S%p")
        # Fetch once per iteration, then both log and print the same value
        # (calling getGoldPrice() twice would make two HTTP requests).
        line = "{0},{1} \n".format(sNow, getGoldPrice())
        f.write(line)
        print(line)
        sleep(59)
A loop was added so the price is fetched roughly once per minute (sleep(59) plus the time the request itself takes).
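sleep(59) assumes each fetch takes about one second. A sketch of a variant (hypothetical helper `sample_every_minute`, not from the book) that keeps samples aligned to fixed ticks no matter how long a fetch takes:

```python
from time import monotonic, sleep

def sample_every_minute(fetch_price, samples=60, interval=60.0):
    """Call fetch_price once per interval, compensating for fetch time."""
    start = monotonic()
    results = []
    for i in range(samples):
        results.append(fetch_price())
        # Sleep until the next scheduled tick rather than a fixed 59 s.
        delay = start + (i + 1) * interval - monotonic()
        if delay > 0 and i < samples - 1:
            sleep(delay)
    return results
```

With `fetch_price=getGoldPrice` this reproduces the loop above, but a slow response no longer pushes every later sample off schedule.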