Introduction to Web Scraping with Python

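json library: converts Python lists or dicts to strings and saves them as JSON text, or reads JSON text back into Python lists or dicts
requests library: sends HTTP requests and returns the response data
lxml library: parses the tags of an HTML document and returns the selected content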

import json
import requests
from lxml import etree

Fetching page text and images

url = 'https://www.baidu.com/'

r = requests.get(url)  # send the HTTP request; returns a Response object
r
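
A hedged sketch of a slightly more defensive request: it sets a timeout and raises on HTTP error statuses before the body is used. The helper name `fetch_bytes`, its `get` parameter, and the 10-second default are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_bytes(url, timeout=10, get=requests.get):
    """Return the raw response body, failing loudly on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be exercised without a network connection (an assumption
    of this sketch).
    """
    response = get(url, timeout=timeout)  # give up instead of waiting forever
    response.raise_for_status()           # raises requests.HTTPError on 4xx/5xx
    return response.content               # bytes, not yet decoded


# Usage (performs a real network request, so it is left commented out):
# body = fetch_bytes('https://www.baidu.com/')
```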

Fetching text data

x = r.content
x

x2 = r.content.decode()
x2

'\r\n 百度一下，你就知道         
        
 
 
  新闻 hao123 地图 视频 贴吧   更多产品 
 
 
    关于百度 About Baidu 
 ©2017 Baidu 使用百度前必读  意见反馈 京ICP证030173号   
 
 
   \r\n'
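
To make the difference between `r.content` and `r.content.decode()` concrete, here is a small sketch that needs no network access. The sample bytes are made up for illustration; `decode()` with no argument assumes UTF-8, which is an assumption about the page's encoding.

```python
# Made-up sample standing in for r.content (an assumption, not real output).
raw = '百度一下，你就知道'.encode('utf-8')

print(type(raw))   # <class 'bytes'>
print(len(raw))    # 27: each of the 9 characters takes 3 bytes in UTF-8

text = raw.decode()  # same as raw.decode('utf-8')
print(type(text))  # <class 'str'>
print(len(text))   # 9

# Decoding with the wrong codec does not always fail; it can silently garble.
garbled = raw.decode('latin-1')
print(garbled == text)  # False
```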

Saving the page as an HTML text file

with open("pachong/baidu.html", "w", encoding='utf-8') as f:
    f.write(x2)
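
The `open()` call above fails with `FileNotFoundError` if the `pachong` directory does not exist yet. A minimal sketch of creating it first with `pathlib`; the temporary base directory and the sample HTML string are assumptions, used so the example does not touch the working directory.

```python
import tempfile
from pathlib import Path

# Hypothetical page text standing in for x2.
page = '<html><body>百度一下</body></html>'

base = Path(tempfile.mkdtemp())             # throwaway location for this sketch
out_dir = base / 'pachong'
out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

out_file = out_dir / 'baidu.html'
out_file.write_text(page, encoding='utf-8')

print(out_file.read_text(encoding='utf-8') == page)  # True
```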

Fetching images or other binary data

img = 'https://www.baidu.com/img/bd_logo1.png?where=super'

r = requests.get(img).content  # pure binary data does not need decoding
r

# Save the binary data as an image file
with open('pachong/baidu.png', 'wb') as f:
    f.write(r)

Simple ways around anti-scraping measures

Pose as a browser

User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36

Access the site as a logged-in user:

Cookie:

# Request with headers

headers = {
    # Copy these values from the browser
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Cookie":"BAIDUID=7EC1D75354F6C7C565DEF73796283C53:FG=1; BIDUPSID=7EC1D75354F6C7C565DEF73796283C53; PSTM=1514274634; MCITY=131-131%3A; BDSFRCVID=92CsJeCCxG37sLJ71wuc0gflDkFOeQZRddMu3J; H_BDCLCKID_SF=tR30WJbHMTrDHJTg5DTjhPrMeH5mbMT-027OKK85-hRkMIj65t-2bPKjQMRlW-QIyHrb0p6athF0HPonHjtBDjbP; BD_CK_SAM=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; delPer=0; ZD_ENTRY=baidu; pgv_pvi=7431016448; pgv_si=s8918307840; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[tox4WRQ4-Km]=mk3SLVN4HKm; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; LOCALGX=%u957F%u6625%7C%31%37%38%34%7C%u957F%u6625%7C%31%37%38%34; Hm_lvt_e9e114d958ea263de46e080563e254c4=1540026478,1540446397,1540885527; Hm_lpvt_e9e114d958ea263de46e080563e254c4=1540886723; PSINO=2; BDSVRTM=358; H_PS_PSSID=",
}

url = 'https://www.baidu.com/'

# rh = requests.get(url).content.decode()  # without headers
rh = requests.get(url, headers=headers).content.decode()  # with headers, to get past possible anti-scraping checks

# Save to a file
with open("pachong/baidunews.html", "w", encoding='utf-8') as f:
    f.write(rh)

rh
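
When several requests share the same headers, a `requests.Session` can carry them so they need not be passed on every call. The sketch below only prepares a request and inspects it, so nothing is sent over the network; using a session here is a suggestion, not something the original notebook does.

```python
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
})

# Build the request without sending it, to see which headers would go out.
prepared = session.prepare_request(requests.Request('GET', 'https://www.baidu.com/'))
print(prepared.headers['User-Agent'])

# Sending it would be: response = session.send(prepared, timeout=10)
```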

Parsing the page structure with lxml

html = etree.HTML(rh, etree.HTMLParser())
html

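XPath syntax: selecting nodes

Three kinds of HTML nodes:

Element nodes
Attribute nodes
Text nodes

/   Selects from the root node.
//  Selects matching nodes anywhere in the document below the current node, regardless of their position.
.   Selects the current node.
..  Selects the parent of the current node.

nodename  Selects the elements with this name, e.g. div selects the div elements (together with their children)
@  Selects attributes, e.g. @href returns attribute values
text()  Selects the text content of an element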
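
The expressions listed above can be tried on a small document. This is a minimal sketch; the HTML snippet, its ids, and its class names are made up for illustration and are not the real Baidu markup.

```python
from lxml import etree

# Hypothetical snippet, not the real Baidu markup.
snippet = '''
<div id="nav">
  <a class="item" href="/news">News</a>
  <a class="item" href="/map">Map</a>
  <a class="other" href="/login">Log in</a>
</div>
'''
doc = etree.HTML(snippet)

links = doc.xpath('//div[@id="nav"]/a')              # element nodes
hrefs = doc.xpath('//a[@class="item"]/@href')        # attribute nodes
texts = doc.xpath('//a[@class="item"]/text()')       # text nodes
parent_id = doc.xpath('//a[@class="other"]/../@id')  # .. steps up to the parent

print(len(links))  # 3
print(hrefs)       # ['/news', '/map']
print(texts)       # ['News', 'Map']
print(parent_id)   # ['nav']
```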

An excerpt of the fetched Baidu home page HTML


   新闻
   hao123
   地图
   视频
   贴吧
   学术
   登录
   设置
   更多产品

href = html.xpath('//div[@id="u1"]//a[@class="mnav"]//@href')  # extract the href attribute of each link
href

['http://news.baidu.com',
 'https://www.hao123.com',
 'http://map.baidu.com',
 'http://v.baidu.com',
 'http://tieba.baidu.com',
 'http://xueshu.baidu.com']

text = html.xpath('//div[@id="u1"]//a[@class="mnav"]//text()')  # extract the text of each link
text

['新闻', 'hao123', '地图', '视频', '贴吧', '学术']
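
Two separate XPath queries return two parallel lists, which stay aligned only if every matched link has both an `href` and text. A hedged alternative is to select the `<a>` elements once and read both values from each element. The markup below is a made-up stand-in shaped after the query above, not the real page.

```python
from lxml import etree

# Made-up markup; the second link deliberately has no text.
sample = '''
<div id="u1">
  <a class="mnav" href="http://news.baidu.com">新闻</a>
  <a class="mnav" href="http://map.baidu.com"></a>
  <a class="mnav" href="http://tieba.baidu.com">贴吧</a>
</div>
'''
tree = etree.HTML(sample)

pairs = []
for a in tree.xpath('//div[@id="u1"]//a[@class="mnav"]'):
    pairs.append({
        'href': a.get('href'),           # None if the attribute is missing
        'text': (a.text or '').strip(),  # '' if the link has no text
    })

print(pairs)
```

With the two-list approach, the same sample would yield three hrefs but only two text values, so pairing them by index would attach the wrong label to the third link.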

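In day-to-day work, scraped and parsed data usually needs some string cleaning before it becomes a tidy list

For example, clean the data with Python string methods such as join(), replace() and split(), and list methods such as remove() and append()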
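
A minimal sketch of that kind of cleanup. The raw strings are hypothetical examples of what `text()` queries often return (stray whitespace, empty strings, non-breaking spaces); they are not taken from the page above.

```python
# Hypothetical raw strings as they might come back from text() queries.
raw_items = ['\r\n  新闻 ', 'hao123\t', '', '  地图\n', '更多产品\xa0']

cleaned = []
for item in raw_items:
    item = item.replace('\xa0', ' ')  # non-breaking space -> ordinary space
    item = ' '.join(item.split())     # collapse runs of whitespace, trim the ends
    if item:                          # skip strings that were only whitespace
        cleaned.append(item)

print(cleaned)            # ['新闻', 'hao123', '地图', '更多产品']
print(','.join(cleaned))  # 新闻,hao123,地图,更多产品
```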

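Working with JSON

The json family of functions

Converting between strings and dicts
- json.loads(): str to dict
- json.dumps(): dict to str (commonly used)
Reading and writing files
- json.load(): read JSON data from a file object (commonly used)
- json.dump(): save Python data to a JSON file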
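
A short round trip through the two string functions listed above, as a sketch. The sample record is an assumption chosen to match the kind of data scraped in this notebook.

```python
import json

record = {'href': 'http://news.baidu.com', 'text': '新闻'}

as_text = json.dumps(record, ensure_ascii=False)  # dict -> str
print(type(as_text).__name__)  # str

back = json.loads(as_text)                        # str -> dict
print(type(back).__name__)     # dict
print(back == record)          # True
```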

Building a list or dict from the scraped data

c = []

# Pair each link with its text
for h, t in zip(href, text):
    c.append({
        'href': h,
        'text': t,
    })

c

[{'href': 'http://news.baidu.com', 'text': '新闻'},
 {'href': 'https://www.hao123.com', 'text': 'hao123'},
 {'href': 'http://map.baidu.com', 'text': '地图'},
 {'href': 'http://v.baidu.com', 'text': '视频'},
 {'href': 'http://tieba.baidu.com', 'text': '贴吧'},
 {'href': 'http://xueshu.baidu.com', 'text': '学术'}]

json.dumps() converts a list or dict to a string

js = json.dumps(c)
js

'[{"href": "http://news.baidu.com", "text": "\\u65b0\\u95fb"}, {"href": "https://www.hao123.com", "text": "hao123"}, {"href": "http://map.baidu.com", "text": "\\u5730\\u56fe"}, {"href": "http://v.baidu.com", "text": "\\u89c6\\u9891"}, {"href": "http://tieba.baidu.com", "text": "\\u8d34\\u5427"}, {"href": "http://xueshu.baidu.com", "text": "\\u5b66\\u672f"}]'

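# This JSON is easier to read, but it takes more space
js = json.dumps(c, ensure_ascii=False, indent=2)  # arguments: keep non-ASCII characters as-is, pretty-print with a 2-space indent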
js

'[\n  {\n    "href": "http://news.baidu.com",\n    "text": "新闻"\n  },\n  {\n    "href": "https://www.hao123.com",\n    "text": "hao123"\n  },\n  {\n    "href": "http://map.baidu.com",\n    "text": "地图"\n  },\n  {\n    "href": "http://v.baidu.com",\n    "text": "视频"\n  },\n  {\n    "href": "http://tieba.baidu.com",\n    "text": "贴吧"\n  },\n  {\n    "href": "http://xueshu.baidu.com",\n    "text": "学术"\n  }\n]'

type(js)

str

with open("pachong/a.json", "w", encoding='utf-8') as f:
    f.write(js)
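
Writing the string from `json.dumps()` works, but `json.dump()` can serialize straight to the file object in one step. A minimal sketch, using a temporary directory (an assumption, so the example does not depend on `pachong/` existing) and a shortened sample list:

```python
import json
import os
import tempfile

links = [
    {'href': 'http://news.baidu.com', 'text': '新闻'},
    {'href': 'http://map.baidu.com', 'text': '地图'},
]

path = os.path.join(tempfile.mkdtemp(), 'a.json')

with open(path, 'w', encoding='utf-8') as f:
    json.dump(links, f, ensure_ascii=False, indent=2)  # serialize and write

with open(path, 'r', encoding='utf-8') as f:
    loaded = json.load(f)                              # parse back into a list

print(loaded == links)  # True
```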

json.load() reads JSON data from a file

with open("pachong/a.json", "r", encoding='utf-8') as f:
    j = json.load(f)

j

[{'href': 'http://news.baidu.com', 'text': '新闻'},
 {'href': 'https://www.hao123.com', 'text': 'hao123'},
 {'href': 'http://map.baidu.com', 'text': '地图'},
 {'href': 'http://v.baidu.com', 'text': '视频'},
 {'href': 'http://tieba.baidu.com', 'text': '贴吧'},
 {'href': 'http://xueshu.baidu.com', 'text': '学术'}]

type(j)

list

j[0]

{'href': 'http://news.baidu.com', 'text': '新闻'}

j[0]['text']

'新闻'

Data analysis: collecting your own data