Reading Notes on 《计算传播基础》 (Fundamentals of Computational Communication): Chapter 3, Data Scraping

Contents

Chapter 3: Data Scraping

Basic Principles

A First Crawler

The select method

The select method: finding by tag name

The select method: finding by class name

The select method: finding by id

The select method: combined selectors

The select method: child selectors

The find_all method

Chapter 3: Data Scraping

Basic Principles

A crawler is an automated program that requests websites and extracts data; requesting, extracting, and automating are its three key ingredients. The basic workflow of a crawler:

  • Send a request

    • Use an HTTP library to issue a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.

  • Get the response content

    • If the server responds normally, you receive a Response whose body is the page content you asked for. It may be HTML, a JSON string, or binary data such as an image or a video.

  • Parse the content

    • HTML can be parsed with a page-parsing library or with regular expressions; a JSON string can be converted directly into a JSON object; binary data can be saved as-is or processed further.

  • Save the data

    • Storage can take many forms: plain text, a database, or files in a specific format.

When the browser sends a message to the server that hosts a URL, that step is the HTTP Request; when the server receives the message, processes it according to its content, and sends a message back to the browser, that step is the HTTP Response.
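The four steps above can be sketched as a few small helpers. This is a minimal illustration, not the book's code: the User-Agent header, the timeout, and the function names are assumptions made for the example.

```python
import requests
from bs4 import BeautifulSoup

def fetch(url, timeout=10):
    """Steps 1-2: send an HTTP Request and wait for the Response."""
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=timeout)
    resp.raise_for_status()  # fail loudly if the server did not respond normally
    return resp

def parse(resp):
    """Step 3: branch on the content type of the Response."""
    ctype = resp.headers.get('Content-Type', '')
    if 'json' in ctype:
        return resp.json()  # a JSON string converts directly to Python objects
    if 'html' in ctype:
        return BeautifulSoup(resp.text, 'html.parser')  # HTML -> parse tree
    return resp.content  # binary data (images, video): keep the raw bytes

def save_text(text, path):
    """Step 4: the simplest storage form, a plain-text file."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
```

A call chain like `save_text(parse(fetch(url)).get_text(), 'page.txt')` then walks the whole pipeline for an HTML page.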

A First Crawler

import requests
from bs4 import BeautifulSoup

url = 'https://vp.fact.qq.com/home'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')

help(requests.get)

print(content.text)

(The output is too long to show here.)

print(content.encoding)
utf-8

import requests
from bs4 import BeautifulSoup

url = 'http://socratesacademy.github.io/bigdata/data/test.html'
content = requests.get(url)
content = content.text
print(content)
soup = BeautifulSoup(content, 'html.parser')
print(soup)
print(soup.prettify())

The select method

  • Tag names take no prefix

  • Class names are prefixed with a dot (.)

  • id names are prefixed with a hash (#)

We can exploit these CSS selector conventions with the soup.select() method, which returns a list of matching elements.
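The three forms can be shown in one self-contained snippet; the one-line HTML document here is invented for the illustration.

```python
from bs4 import BeautifulSoup

# A tiny document made up for this demonstration.
html = '<p class="story" id="intro"><b>Once</b> upon a time</p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('b'))        # tag name: no prefix
print(soup.select('.story'))   # class name: prefixed with a dot
print(soup.select('#intro'))   # id: prefixed with a hash
```

Each call returns a list; here the last two both match the same `<p>` element.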

The select method: finding by tag name

import requests
from bs4 import BeautifulSoup

url = 'http://socratesacademy.github.io/bigdata/data/test.html'
content = requests.get(url)
content = content.text
print(content)
soup = BeautifulSoup(content, 'html.parser')
# print(soup)
# print(soup.prettify())

print(soup.select('title')[0].text)
print(soup.select('a'))
print(soup.select('b'))

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>

The Dormouse's story
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<b>The Dormouse's story</b>]

The select method: finding by class name

print(soup.select('.title'))
print(soup.select('.sister'))
print(soup.select('.story'))

[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

The select method: finding by id

print(soup.select('#link1'))
print(soup.select('#link1')[0]['href'])

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
http://example.com/elsie

The select method: combined selectors

Selectors can be combined: a space between two selectors matches descendants, so 'p #link1' finds the element with id link1 anywhere inside a <p> tag.

print(soup.select('p #link1'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
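A tag can also be fused with a class or id in one selector: writing them with no space puts both conditions on the same element. A self-contained sketch, with the markup rebuilt inline to mirror the structure of test.html:

```python
from bs4 import BeautifulSoup

# Inline markup mirroring test.html, assumed for this example.
html = ('<p class="story">'
        '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
        '</p>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('p.story'))   # <p> elements that also carry class "story"
print(soup.select('a#link1'))   # <a> elements whose id is "link1"
```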

The select method: child selectors

The > combinator matches direct children only: 'head > title' finds <title> elements whose parent is <head>.

print(soup.select('head > title'))
print(soup.select('body > p'))

[<title>The Dormouse's story</title>]
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
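select also supports genuine attribute lookups with the [attr=value] syntax, which the examples above do not cover. A minimal sketch, with the markup rebuilt inline for the example:

```python
from bs4 import BeautifulSoup

# Inline markup assumed for this example.
html = ('<p class="story">'
        '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
        '</p>')
soup = BeautifulSoup(html, 'html.parser')

# [attr=value] filters on an attribute's exact value; quote the value.
print(soup.select('a[href="http://example.com/elsie"]'))
# [attr] alone matches any element that has the attribute at all.
print(soup.select('a[href]'))
```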

The find_all method

print(soup.find_all('p'))
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
for i in soup('p'):
    print(i.text)
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
for tag in soup.find_all(True):
    print(tag.name)
html
head
title
body
p
b
p
a
a
a
p
print(soup('head'))
[<head><title>The Dormouse's story</title></head>]
print(soup('body'))
[<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>]
print(soup('title'))
[<title>The Dormouse's story</title>]
print(soup('p'))
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
print(soup.p)
<p class="title"><b>The Dormouse's story</b></p>
print(soup.title.name)
title
print(soup.title.string)
The Dormouse's story
print(soup.title.text)
The Dormouse's story
print(soup.title.parent.name)
head
print(soup.get_text())
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
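find_all also accepts keyword filters that select lacks, such as class_ (since class is a reserved word in Python), attrs, and limit. A hedged sketch with inline markup invented for the example:

```python
from bs4 import BeautifulSoup

# Inline markup made up for this demonstration.
html = ('<p class="title"><b>A title</b></p>'
        '<p class="story">First story</p>'
        '<p class="story">Second story</p>')
soup = BeautifulSoup(html, 'html.parser')

# class is a Python keyword, so BeautifulSoup spells the filter class_.
print(soup.find_all('p', class_='story'))
# attrs takes a dict of attribute constraints.
print(soup.find_all('p', attrs={'class': 'title'}))
# limit caps the number of results, which helps on very large pages.
print(soup.find_all('p', limit=1))
```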
