The following modules are involved: webbrowser, pyperclip, requests, and bs4.
The webbrowser module's open() function launches a browser and points it at a specific URL.
>>> import webbrowser
>>> webbrowser.open('http://inventwithpython.com/')
True
These lines launch a browser and open the given URL. That is just about the only thing the webbrowser module can do. Even so, the open() function makes some interesting things possible. For example, copying an address to the clipboard, then opening Google Maps and typing it in by hand, is tedious. You can write a script that automates these steps: copy the address to the clipboard, run the script, and the map opens automatically.
The program looks like this:
#! /usr/bin/python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.
import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)
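To see why the ' '.join(sys.argv[1:]) step is needed, note that the shell splits a multi-word address into separate sys.argv entries. A minimal sketch (the argv list below simulates a hypothetical command line, e.g. python3 mapIt.py 870 Valencia St):

```python
# Simulated sys.argv for the hypothetical invocation: python3 mapIt.py 870 Valencia St
argv = ['mapIt.py', '870', 'Valencia', 'St']

# Joining everything after the script name rebuilds the address as one string.
address = ' '.join(argv[1:])
print(address)  # 870 Valencia St
```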
The requests module must be installed first: sudo pip install requests.
#! /usr/bin/python3
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(type(res))
print(res.status_code == requests.codes.ok)
print(len(res.text))
print(res.text[:250])
# Output:
<class 'requests.models.Response'>
True
178978
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Projec
A simple way to check whether the download succeeded is to call the Response object's raise_for_status() method. If something went wrong during the download, it raises an exception; if the download succeeded, it does nothing.
>>> import requests
>>> res = requests.get('http://inventwithpython.com/page_not_exist')
>>> res.raise_for_status()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/requests/models.py", line 840, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://inventwithpython.com/page_not_exist
>>>
If a failed download is not fatal to your program, you can wrap the raise_for_status() line in try and except statements to handle the error without crashing.
import requests
res = requests.get('http://localhost:8080/index.html')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
#!/usr/bin/python3
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status()
playFile = open('RomeoAndJuliet.txt', 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)
playFile.close()
On each iteration of the loop, iter_content() returns a "chunk" of the content. Each chunk is of the bytes data type, and you specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass 100000 as the argument to iter_content(). When the code finishes, a RomeoAndJuliet.txt file will exist in the current directory. The write() method returns the number of bytes written to the file. To review, the complete process for downloading and saving a file is:
1. Call requests.get() to download the file.
2. Call raise_for_status() to check for errors.
3. Call open() with 'wb' to create a new file in write binary mode.
4. Loop over the Response object's iter_content() method.
5. Call write() on each iteration to write the chunk to the file.
6. Call close() to close the file.
Getting Started with HTML
Introduction to HTML
The Beautiful Soup module can extract information from HTML pages. The module's package name is bs4 (for Beautiful Soup, version 4). Install it with: sudo pip3 install beautifulsoup4 . Note that you use the name beautifulsoup4 when installing, but when importing you need: import bs4
#! /usr/bin/python3
import bs4
exampleFile = open('example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('#author')
print(type(elems))
print(str(len(elems)))
print(type(elems[0]))
print(elems[0].getText())
print(str(elems[0]))
print(elems[0].attrs)
# Output:
<class 'list'>
1
<class 'bs4.element.Tag'>
Al Sweigart
<span id="author">Al Sweigart</span>
{'id': 'author'}
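Beyond the #author id selector used above, select() understands other common CSS selector patterns. A small sketch using an inline HTML string (the page content below is made up for illustration, standing in for example.html):

```python
import bs4

# A tiny hypothetical page with a mix of tags, classes, and ids.
html = '''<html><body>
<p class="intro">Hello</p>
<span id="author">Al Sweigart</span>
<p class="intro">World</p>
</body></html>'''
soup = bs4.BeautifulSoup(html, 'html.parser')

print(len(soup.select('p')))                     # all <p> elements: 2
print(soup.select('.intro')[0].getText())        # elements with class "intro": Hello
print(soup.select('span#author')[0].getText())   # <span> with id "author": Al Sweigart
```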
Download the comic image from http://xkcd.com, then locate the Prev button and follow it to keep downloading earlier images.
#! /usr/bin/python3
# downloadXkcd.py - Downloads every single XKCD comic.
import requests, os, bs4

url = 'http://xkcd.com'  # starting url
os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'http:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd.
        print('save the image ' + os.path.basename(comicUrl) + ' into xkcd')
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

print('Done.')
Beautiful Soup reference documentation