Using the BeautifulSoup library

Install the library from a command prompt with administrator privileges

pip install beautifulsoup4
If the following error appears:
    OSError: raw write() returned invalid length 2 (should have been between 0 and 1)

install this package first:
pip install win_unicode_console

Then run the install again.
    A successful install looks like this:
C:\WINDOWS\system32>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/cb/a1/c698cf319e9cfed6b17376281bd0efc6bfc8465698f54170ef60a485ab5d/beautifulsoup4-4.8.2-py3-none-any.whl (106kB)
    100% |████████████████████████████████| 112kB 15kB/s
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out. (read timeout=15)",)': /simple/soupsieve/
  Downloading https://files.pythonhosted.org/packages/05/cf/ea245e52f55823f19992447b008bcbb7f78efc5960d77f6c34b5b45b36dd/soupsieve-2.0-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.2 soupsieve-2.0
You are using pip version 9.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
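
To confirm that the install worked, a quick check along these lines can help. It is a minimal sketch that assumes the package imports as bs4 (the name used in the rest of this post); the one-line test markup is made up for illustration.

```python
# Sanity check: import the package and parse a tiny document.
import bs4
from bs4 import BeautifulSoup

# Version string of the installed beautifulsoup4 package.
print(bs4.__version__)

# Parse a made-up snippet with the parser bundled with Python.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```

If the import fails, the package was most likely installed for a different Python interpreter than the one being run.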

Quick test:

Source of the test page (http://python123.io/ws/demo.html), shown as rendered text:
This is a python demo page

The demo python introduces several python courses.

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: Basic Python and Advanced Python.

Test run

>>> import requests
>>> r = requests.get(http://python123.io/ws/demo.html)
SyntaxError: invalid syntax
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'This is a python demo page\r\n\r\nThe demo python introduces several python courses.\r\nPython is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\nBasic Python and Advanced Python.\r\n'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo , "html.parser")
>>> print(soup.prettify())
This is a python demo page

The demo python introduces several python courses.

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: Basic Python and Advanced Python .
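
The same parse-and-prettify step can be tried without a network connection. The sketch below uses an assumed markup string modeled on the visible text of the demo page; the tags and layout of the real page may differ.

```python
from bs4 import BeautifulSoup

# Assumed markup, modeled on the demo page's visible text (not the real source).
demo = (
    "<html><head><title>This is a python demo page</title></head>"
    "<body><p><b>The demo python introduces several python courses.</b></p>"
    "</body></html>"
)

soup = BeautifulSoup(demo, "html.parser")

# prettify() puts each tag and string on its own line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```

Printing the prettified string is mainly useful for reading the structure of a page before deciding which tags to extract.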

BeautifulSoup syntax

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo , "html.parser")

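bs4: the library name
BeautifulSoup: the class name
html.parser: the parser
soup: a variable name
demo: a previously defined variable holding the HTML; an open file can be passed instead:
    soup = BeautifulSoup(open("D://demo.html") , "html.parser")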
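
When parsing from a file, it may be safer to open it in a with-block so the handle is closed after parsing. The sketch below writes a small stand-in file to a temporary location (an assumption made so the example runs anywhere; in this post the file would be D://demo.html) and parses it.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Create a stand-in HTML file; the content is made up for illustration.
html = "<html><body><p>Saved demo page</p></body></html>"
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write(html)
    path = f.name

# The with-block closes the file as soon as parsing is done.
with open(path, encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")

os.remove(path)  # clean up the temporary file
print(soup.p.string)
```

Passing an explicit encoding to open() avoids depending on the platform default, which on Windows is often not UTF-8.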

BeautifulSoup parsers

bs4's HTML parser; usage: BeautifulSoup(mk,'html.parser'); requires: the bs4 library
lxml's HTML parser; usage: BeautifulSoup(mk,'lxml'); requires: pip install lxml
lxml's XML parser; usage: BeautifulSoup(mk,'xml'); requires: pip install lxml
html5lib's parser; usage: BeautifulSoup(mk,'html5lib'); requires: pip install html5lib
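
Because lxml and html5lib are optional installs, one possible pattern is to try a preferred parser and fall back to html.parser. The helper name make_soup and the preference order below are assumptions for illustration; the sketch relies on BeautifulSoup raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    Hypothetical helper, not part of bs4.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: every parser repairs them in its own way.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.string)
```

The parsers can build slightly different trees for broken markup, so naming the parser explicitly keeps results reproducible across machines.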

Basic elements of the BeautifulSoup class

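Element            Description
Tag                A tag, the basic unit of information; opened with <> and closed with </>
Name               The tag's name, e.g. the name of <p>...</p> is 'p'; accessed as <tag>.name
Attributes         The tag's attributes, organized as a dictionary; accessed as <tag>.attrs
NavigableString    The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string
Comment            The comment part of a tag's string; a special Comment type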
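
The elements listed above can be inspected on a small snippet. This is a sketch using an assumed one-link document; the attribute values are copied from the transcript in this post, while the surrounding markup is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumed snippet; only the attribute values come from the post's transcript.
html = (
    '<p><a id="link1" class="py1" '
    'href="http://www.icourse163.org/course/BIT-268001">Basic Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                      # first <a> tag in the document
print(isinstance(tag, Tag))
print(tag.name)                   # the tag's name
print(tag.attrs["class"])         # class is multi-valued, so it is a list
print(tag.get("title"))           # .get() returns None for a missing attribute
print(isinstance(tag.string, NavigableString))
print(tag.parent.name)            # name of the enclosing tag
```

Using tag.get() instead of tag.attrs[...] avoids a KeyError when an attribute might be absent.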

Continuing in the same session, experiment:

>>> soup.a
Basic Python
>>> soup.a.string
'Basic Python'
>>> soup.b
The demo python introduces several python courses.
>>> soup.b
The demo python introduces several python courses.
>>> soup.b.string
'The demo python introduces several python courses.'
>>> tag = soup.a
>>> tag.attrs
{'id': 'link1', 'class': ['py1'], 'href': 'http://www.icourse163.org/course/BIT-268001'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>> soup.a.name
'a'
>>> soup.a,parent.name
Traceback (most recent call last):
  File "", line 1, in 
    soup.a,parent.name
NameError: name 'parent' is not defined
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> tag
Basic Python
>>> soup/a
Traceback (most recent call last):
  File "", line 1, in 
    soup/a
NameError: name 'a' is not defined
>>> soup.a
Basic Python
>>> soup.title
This is a python demo page
>>> soup.p

The demo python introduces several python courses.

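>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Comment experiment:

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>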
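
Since .string returns text for both a comment and ordinary content, telling them apart takes a type check. The sketch below assumes markup matching the comment experiment; the helper name describe is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed markup: a <b> holding only a comment, a <p> holding plain text.
newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def describe(tag):
    """Classify what tag.string holds (hypothetical helper)."""
    s = tag.string
    # Comment subclasses NavigableString, so test for Comment first.
    if isinstance(s, Comment):
        return "comment"
    if s is not None:
        return "text"
    return "none"


kinds = {name: describe(newsoup.find(name)) for name in ("b", "p")}
print(kinds)
```

A check like this is worth adding when extracting text, so that comment content is not mistaken for visible page text.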

Takeaways from the experiments

.....

Getting a tag ..... ; .string returns the tag's non-attribute string or comment
