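Getting Started with Web Scraping

Parsing HTML

  • Traversing the tag tree
    • Downward traversal
    • Upward traversal
    • Sibling traversal
  • Extracting information
    • Finding b tags
    • Finding p tags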

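>>> import requests
>>> url = "https://python123.io/ws/demo.html"
>>> r = requests.get(url)
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup  # parses HTML and XML
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.title
<title>This is a python demo page</title>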
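The session above assumes the request succeeds and that the response encoding is guessed correctly. A slightly more defensive fetch could look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; they do not come from the original notes.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the body
        return r.text
    except requests.RequestException:
        # Covers malformed URLs, connection errors, timeouts and HTTP errors.
        return None
```

Calling `fetch_html("https://python123.io/ws/demo.html")` should return the same text as `r.text` above when the site is reachable, and `None` otherwise.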

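Traversing the tag tree

>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent  # the parent of <html> is the whole document
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent  # the document itself has no parent, so this is None
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling  # the next node
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling  # the previous node
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling  # nothing comes before that string, so this is None
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>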

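Downward traversal

.contents: a list of the tag's direct children
.children: an iterator over the direct children
.descendants: an iterator over all descendants (children, grandchildren, and so on)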
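To make the difference between the three downward attributes concrete, here is a minimal sketch. The inline `HTML` string is a trimmed stand-in for the demo page (an assumption, used so the example runs without a network request).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed inline stand-in for the demo page (assumption, not the real page).
HTML = (
    "<html><head><title>This is a python demo page</title></head>\n"
    "<body>\n"
    "<p><b>The demo python introduces several python courses.</b></p>\n"
    '<p class="course">Courses:\n'
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>\n'
    "</body></html>"
)
soup = BeautifulSoup(HTML, "html.parser")
body = soup.body

# Children and descendants include text nodes, so keep only Tag objects here.
direct_children = [node.name for node in body.children if isinstance(node, Tag)]
all_descendants = [node.name for node in body.descendants if isinstance(node, Tag)]

print(len(body.contents))  # counts the tags and the newline strings between them
print(direct_children)     # tags exactly one level below <body>
print(all_descendants)     # tags at every level below <body>
```

Filtering with `isinstance(node, Tag)` matters because the newline strings between tags are also nodes, which is why `len(soup.body.contents)` is larger than the number of `p` tags.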

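Upward traversal

.parent: the tag's parent node
.parents: an iterator over all ancestors, from the parent up to the document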
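As a small illustration of upward traversal, the sketch below walks from the first `a` tag to the top of the tree. The one-line `HTML` string is a made-up stand-in, not the demo page itself.

```python
from bs4 import BeautifulSoup

# Minimal stand-in document (assumption for illustration).
HTML = '<html><body><p>See <a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(HTML, "html.parser")

# .parents yields each ancestor in turn; the last one is the BeautifulSoup
# object itself, whose name is "[document]".
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The document has no parent, so the loop above ends there.
print(soup.parent)
```

Because the last ancestor is the document object rather than a real tag, code that expects only tags may want to stop at `html`.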

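Sibling traversal

.next_sibling: the next node under the same parent
.previous_sibling: the previous node under the same parent
.next_siblings: an iterator over all following siblings
.previous_siblings: an iterator over all preceding siblings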
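Siblings can be strings as well as tags, as the `' and '` result earlier shows. The sketch below, using a trimmed stand-in paragraph (an assumption), shows one way to skip text nodes when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed stand-in for the "course" paragraph (assumption for illustration).
HTML = (
    '<p class="course">Courses:\n'
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(HTML, "html.parser")
first_link = soup.a

# All nodes after the first link: a string, a tag, then another string.
following = list(first_link.next_siblings)

# Keep only the tags, either by filtering or with find_next_sibling().
tag_siblings = [node for node in following if isinstance(node, Tag)]
next_link = first_link.find_next_sibling("a")

print([str(node) for node in following])
print(next_link["id"])
```

`find_next_sibling("a")` is usually the shorter choice when the goal is the next tag of a known name rather than the next node of any kind.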
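Line breaks (prettify)

>>> soup.prettify  # without parentheses this only shows the bound method
<bound method Tag.prettify of <html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>>
>>> soup.prettify()  # returns a string with a '\n' after every tag and string
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">\n    Basic Python\n   </a>\n   and\n   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
    Basic Python
   </a>
   and
   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> soup = BeautifulSoup("<p>中文</p>", "html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>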

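The sketch below restates the last step as a script: `prettify()` can be called on a single tag as well as on the whole document, and it returns an ordinary string. Splitting the result into lines is only a convenience chosen here for inspection.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>", "html.parser")

compact = str(soup.p)       # the tag on a single line
pretty = soup.p.prettify()  # the same tag with line breaks and indentation
lines = pretty.splitlines() # one entry per output line

print(compact)
print(lines)
```

Non-ASCII text such as `中文` passes through unchanged, since Beautiful Soup works with Unicode strings internally.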
Extracting information

>>> import requests
>>> url = "https://python123.io/ws/demo.html"
>>> r = requests.get(url)
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
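The loop above uses `link.get('href')` rather than `link['href']`. A short sketch of why that can matter follows; the third anchor without an `href` is hypothetical and was added only to show the difference.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the last <a> (no href) is a hypothetical addition.
HTML = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a> and '
    '<a id="nolink">an anchor without href</a></p>'
)
soup = BeautifulSoup(HTML, "html.parser")

# .get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a")]
valid_hrefs = [h for h in hrefs if h is not None]

# Alternatively, ask find_all() for only the tags that have an href at all.
links_with_href = soup.find_all("a", href=True)

print(hrefs)
print(len(links_with_href))
```

Either approach avoids a `KeyError` on pages where some anchors are used as named targets rather than links.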

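>>> soup.find_all(['a', 'b'])  # find all a and b tags
[<b>The demo python introduces several python courses.</b>, <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>, <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):  # True matches every tag; print each tag's name
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a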

Finding b tags

>>> import re
>>> for tag in soup.find_all(re.compile('b')):  # matches every tag whose name contains "b"
...     print(tag.name)
...
body
b
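The regular expression above matches `body` as well as `b`, because a compiled pattern passed to `find_all()` is tested against each tag name with a search rather than an exact comparison. The sketch below contrasts three ways of asking for "b"; the small `HTML` string with a `table` is a made-up stand-in chosen so the results differ.

```python
import re
from bs4 import BeautifulSoup

# Made-up stand-in; <table> is included because its name contains "b".
HTML = (
    "<html><body><p><b>bold</b></p>"
    "<table><tr><td>x</td></tr></table></body></html>"
)
soup = BeautifulSoup(HTML, "html.parser")

contains_b = [t.name for t in soup.find_all(re.compile("b"))]     # "b" anywhere in the name
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]  # name begins with "b"
exactly_b = [t.name for t in soup.find_all("b")]                   # plain string: exact name

print(contains_b)
print(starts_with_b)
print(exactly_b)
```

When only `b` tags are wanted, passing the plain string `'b'` is the simplest choice; the regular expression is useful when a family of names is the target.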

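Finding p tags

>>> soup.find_all('p', 'course')  # p tags whose class is "course"
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')  # the id must equal "link" exactly, so nothing matches
[]
>>> soup.find_all(id=re.compile('link'))  # ids that contain "link"
[<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>, <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]
>>> soup.find_all('a')  # find the link tags
[<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>, <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)  # only direct children of the document are searched
[]

The arguments are find_all(name, attrs, recursive, string, ...): name is the tag name, attrs filters on attribute values, recursive controls whether all descendants are searched (default True), and string searches the text between tags.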
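>>> soup  # the complete HTML
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>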

>>> soup.find_all(string="Basic Python")  # strings that are exactly "Basic Python"
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))  # strings that contain "python" (case-sensitive)
['This is a python demo page', 'The demo python introduces several python courses.']
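The regular expression search above is case-sensitive, which is why strings such as "Basic Python" are not returned. A minimal sketch of the difference, on a shortened stand-in document (an assumption), using the standard `re.IGNORECASE` flag:

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (assumption for illustration).
HTML = (
    "<html><head><title>This is a python demo page</title></head>"
    "<body><p><b>The demo python introduces several python courses.</b></p>"
    '<p class="course">You can learn Python from these courses: '
    "<a>Basic Python</a> and <a>Advanced Python</a>.</p></body></html>"
)
soup = BeautifulSoup(HTML, "html.parser")

exact = soup.find_all(string="Basic Python")                          # whole string must match
lowercase = soup.find_all(string=re.compile("python"))                # contains "python"
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))  # "python" or "Python"

print([str(s) for s in exact])
print(len(lowercase), len(any_case))
```

The results are string nodes rather than tags; use `.parent` on a result to get back to the tag that contains it.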
