Python Web Crawlers, Part 1

Crawler study notes:

1. How does Python access the Internet?

URL: a web page address, made up of protocol://hostname[:port]/path/[;parameters][?query]#fragment
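
To see how the pieces of that template map onto an actual address, the standard library's urllib.parse.urlparse() can split a URL apart. This is only a small sketch; the address below is a made-up example, not one used elsewhere in these notes.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen only to exercise every component of the template.
url = "http://example.com:8080/path/page.html;params?key=value#section"

parts = urlparse(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # hostname
print(parts.port)      # port (an int, or None if the URL has no port)
print(parts.path)      # path
print(parts.params)    # ;parameters
print(parts.query)     # ?query
print(parts.fragment)  # #fragment
```

The port, parameters, query and fragment are all optional; when one is missing, urlparse() returns None for the port and an empty string for the others.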

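The urllib package: Python 3 merged Python 2's urllib and urllib2 into a single urllib package.

The urllib.request module: covers sending requests to a server, following redirects, authentication, and proxies.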

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

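Only the first argument is required; all the others have default values.

If data is not None, the request is submitted as a POST instead of a GET.

data usually starts out as a dictionary of form fields, but what urlopen() receives must be a bytes object.

The content must follow the application/x-www-form-urlencoded format.

urllib.parse.urlencode() converts a dictionary to a string in this format; calling .encode() on that string produces the bytes.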
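
The two conversion steps, and the GET/POST switch that data triggers, can be sketched without touching the network by building Request objects and asking which method they would use. The form fields and the example.com address are placeholders for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields, used only for illustration.
form = {"q": "hello world", "lang": "en"}

# Step 1: dict -> application/x-www-form-urlencoded text (a str)
encoded = urllib.parse.urlencode(form)

# Step 2: str -> bytes, which is what urlopen() and Request expect for data
body = encoded.encode("utf-8")

# Building a Request does not send anything; it only describes the call.
get_req = urllib.request.Request("http://example.com/")
post_req = urllib.request.Request("http://example.com/", data=body)

print(encoded)
print(get_req.get_method())   # method used when data is None
print(post_req.get_method())  # method used when data is supplied
```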

Example:

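import urllib.request

# Open the page; urlopen() returns a response object

response = urllib.request.urlopen("http://fishc.com")

# Read the page content as a bytes object

html = response.read()

# Decode the bytes as UTF-8 to get a str

html = html.decode("utf-8")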

2. Example: downloading a cat picture from placeKitten.com:

import urllib.request

# Simpler version, passing the URL string directly:

#response = urllib.request.urlopen("http://placeKitten.com/g/500/600")

#cat_img = response.read()

#with open("cat_500_600.jpg",'wb') as f:

#    f.write(cat_img)

req = urllib.request.Request('http://placeKitten.com/g/500/600')

response = urllib.request.urlopen(req) # the first argument of urlopen() can be either a URL string or a Request object

cat_img = response.read()

with open("cat_500_600.jpg", 'wb') as f:
    f.write(cat_img)

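The object returned by urlopen() also provides three methods:

geturl(): the URL that was actually retrieved (after any redirects)

info(): the response headers

getcode(): the HTTP status code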
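
To try geturl(), info() and getcode() without depending on an outside website, one option is to start a throwaway HTTP server on the local machine and request a page from it. The HelloHandler class and its "hello" reply are invented for this sketch, and the empty ProxyHandler is only there so that any proxy settings in the environment do not intercept the local request. Recent Python versions also expose the same values as the url, headers and status attributes.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server so the example does not depend on a real site."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Ignore proxy settings from the environment for this local request.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

url = "http://127.0.0.1:%d/" % server.server_address[1]
with urllib.request.urlopen(url) as response:
    final_url = response.geturl()   # URL actually retrieved
    headers = response.info()       # response headers
    status = response.getcode()     # HTTP status code
    body = response.read()

server.shutdown()
server.server_close()

print(final_url)
print(headers["Content-Type"])
print(status)
print(body)
```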

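Second example: translating with Youdao Dictionary

In the browser, open Inspect Element and go to the Network tab.

POST: submits data to the specified server for processing

GET: requests data from the server

Preview: the body of the response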

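Headers:

Request Headers: sent by the client (the browser); the server commonly uses them to judge whether a visit comes from a non-human client

User-Agent: tells the server whether the visit comes from a browser or from code

Form Data: the data submitted by the POST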
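
By default urllib identifies itself with a User-Agent of the form Python-urllib/<version>, which is easy for a server to spot. A Request object can carry a different value. The sketch below only builds the requests and inspects them, so nothing is sent; the browser-like string and the example.com address are placeholders, not values taken from these notes.

```python
import urllib.request

# Hypothetical browser-like User-Agent string, for illustration only.
browser_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Option 1: pass a headers dict when constructing the Request.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": browser_ua},
)

# Option 2: add the header to an existing Request.
req2 = urllib.request.Request("http://example.com/")
req2.add_header("User-Agent", browser_ua)

# Request stores header names capitalized, so the key becomes "User-agent".
print(req.get_header("User-agent"))
print(req2.get_header("User-agent"))
```

Either Request can then be passed to urllib.request.urlopen() in place of a URL string.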

The json module: a standard module that takes a hierarchy of Python data and converts it to a string representation; this process is called serialization.

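json.loads() does the reverse: it parses a JSON string back into Python objects. The code below stores the result in the variable target.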
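
A short round trip shows both directions: json.dumps() for serialization and json.loads() for the reverse. The record below is invented sample data.

```python
import json

# Hypothetical nested Python data: a dict holding a list, a number and a bool.
record = {"word": "cat", "meanings": ["a small feline"], "count": 1, "checked": True}

# Serialization: Python objects -> JSON text (a str)
text = json.dumps(record)

# Deserialization: JSON text -> Python objects
restored = json.loads(text)

print(type(text).__name__)
print(restored["meanings"][0])
print(restored == record)
```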

Code:

import urllib.request

import urllib.parse

import json

comtext = input("Enter the text to translate: ")

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=https://www.baidu.com/link"

data = {}

data['doctype'] = "json"

data['i'] = comtext

data['keyfrom'] = "fanyi.web"

data['type'] = "AUTO"

data['typoResult']= "true"

data['ue'] = "UTF-8"

data['xmlVersion'] = "1.8"

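data = urllib.parse.urlencode(data).encode('utf-8') # URL-encode the form, then encode it to UTF-8 bytes

response = urllib.request.urlopen(url, data)

html = response.read().decode('utf-8') # decode the response bytes to a str

# The response body is a JSON string; json.loads() turns it into a dict

target = json.loads(html) # JSON can represent nested lists and dictionaries

print("Translation: %s" % (target['translateResult'][0][0]['tgt']))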
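
The last line above assumes the reply is valid JSON with a translateResult list nested two levels deep. A remote service can return something else, such as an error page, so a slightly more defensive way to pull out the result might look like the sketch below. The extract_translation helper and all three sample replies are hypothetical; only the translateResult[0][0]['tgt'] path comes from the code above.

```python
import json


def extract_translation(raw):
    """Return the first translated string from a JSON reply, or None.

    Assumes the nesting used above: translateResult -> list -> list -> dict
    with a 'tgt' key.
    """
    try:
        target = json.loads(raw)
        return target["translateResult"][0][0]["tgt"]
    except (ValueError, KeyError, IndexError, TypeError):
        # ValueError covers json.JSONDecodeError (e.g. an HTML error page).
        return None


# Hypothetical reply text, shaped like the structure the code above indexes.
sample = '{"translateResult": [[{"src": "hello", "tgt": "bonjour"}]]}'

print(extract_translation(sample))
print(extract_translation("<html>blocked</html>"))  # not JSON at all
print(extract_translation('{"errorCode": 50}'))     # JSON, but no result
```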
