python爬虫学习-day1

目录

  1. python爬虫学习-day1
  2. python爬虫学习-day2正则表达式
  3. python爬虫学习-day3-BeautifulSoup
  4. python爬虫学习-day4-使用lxml+xpath提取内容
  5. python爬虫学习-day5-selenium
  6. python爬虫学习-day6-ip池
  7. python爬虫学习-day7-实战

查考python爬虫学习-day1

环境变量

python 3.7 
pycharm 2018.3

1.1 学习get与post请求

通过requests实现

requests的get与post函数:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response ` object
    :rtype: requests.Response
    
 post(url, data=None, json=None, **kwargs)
    Sends a POST request.
    
    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response ` object
    :rtype: requests.Response   
    
python爬虫学习-day1_第1张图片
image.png

示例


import requests
# get请求
url = 'https://www.baidu.com'
response = requests.get(url)
print(response.text)

print('--------------分割线--------------')
# post 请求
data = {
    "name": "aa",
    "school": 'linan'
}

response = requests.post(url, data=data)
print(response.text)

结果
get结果

python爬虫学习-day1_第2张图片
02.png

post结果
python爬虫学习-day1_第3张图片
03.png

json的处理模块 json

json

json模块提供了一种简单的方式来编码和解码JSON数据,其主要函数是json.dumps()和json.loads()

import json

data = {
    "name": "aa",
    "school": 'linan'
}
#  将json对象转换成json字符串
json_str = json.dumps(data)
print(json_str)
# 将json字符串转换成json对象
data = json.loads(json_str)
print(data)

结果

{"name": "aa", "school": "linan"}
{'name': 'aa', 'school': 'linan'}

解释:print输出json字符串,没问题,注意是的print(data),与输出字符串结果一致,因为python解释器底层实现的(这块猜测,无法得知)

通过urllib实现

函数

urlopen(url, data=None, timeout=, *, cafile=None, capath=None, cadefault=False, context=None)
    Open the URL url, which can be either a string or a Request object.
    
    *data* must be an object specifying additional data to be sent to
    the server, or None if no such data is needed.  See Request for
    details.
    
    urllib.request module uses HTTP/1.1 and includes a "Connection:close"
    header in its HTTP requests.
    
    The optional *timeout* parameter specifies a timeout in seconds for
    blocking operations like the connection attempt (if not specified, the
    global default timeout setting will be used). This only works for HTTP,
    HTTPS and FTP connections.
    
    If *context* is specified, it must be a ssl.SSLContext instance describing
    the various SSL options. See HTTPSConnection for more details.
    
    The optional *cafile* and *capath* parameters specify a set of trusted CA
    certificates for HTTPS requests. cafile should point to a single file
    containing a bundle of CA certificates, whereas capath should point to a
    directory of hashed certificate files. More information can be found in
    ssl.SSLContext.load_verify_locations().
    
    The *cadefault* parameter is ignored.
    
    This function always returns an object which can work as a context
    manager and has methods such as
    
    * geturl() - return the URL of the resource retrieved, commonly used to
      determine if a redirect was followed
    
    * info() - return the meta-information of the page, such as headers, in the
      form of an email.message_from_string() instance (see Quick Reference to
      HTTP Headers)
    
    * getcode() - return the HTTP status code of the response.  Raises URLError
      on errors.
    
    For HTTP and HTTPS URLs, this function returns a http.client.HTTPResponse
    object slightly modified. In addition to the three new methods above, the
    msg attribute contains the same information as the reason attribute ---
    the reason phrase returned by the server --- instead of the response
    headers as it is specified in the documentation for HTTPResponse.
    
    For FTP, file, and data URLs and requests explicitly handled by legacy
    URLopener and FancyURLopener classes, this function returns a
    urllib.response.addinfourl object.
    
    Note that None may be returned if no handler handles the request (though
    the default installed global OpenerDirector uses UnknownHandler to ensure
    this never happens).
    
    In addition, if proxy settings are detected (for example, when a *_proxy
    environment variable like http_proxy is set), ProxyHandler is default
    installed and makes sure the requests are handled through the proxy.
 
 

urlopen可以进行get和post访问

示例

import urllib.request as ur
print('--------get')
# get
url = 'https://www.baidu.com'
res = ur.urlopen(url)
#  read first line
fristline = res.readline()
print(fristline)

print('-------post')
# post
req = ur.Request(url=url, data=b'the first day of web crawler')
res_data = ur.urlopen(req)
res = res_data.read()
print(res)

结果

python爬虫学习-day1_第4张图片
04.png

示例2

import urllib.request, urllib
from urllib.request import URLError

basic_url = 'https://www.cnblogs.com/billyzh/p/5819957.html'
# 由str 转换成byte
"".encode('utf-8')


# 获取URLs
# 简单的使用
# get_url_1
def get_url():
    response = urllib.request.urlopen('https://www.cnblogs.com/billyzh/p/5819957.html')
    html = response.read().decode('utf-8')
    print(html)

# HTTP是基于请求和响应---客户端发出请求和服务器端发送响应。Urllib2 对应Request对象表示你做出HTTP请求,最简单的形式,创建一个指定要获取的网址的Request对象。这个Request对象调用urlopen,返回URL请求的Response对象。Response对象是一个类似于文件对象,你可以在Response中使用 .read()。
# get_url_2
def get_url_2():
    request = urllib.request.Request('https://www.cnblogs.com/billyzh/p/5819957.html')
    response = urllib.request.urlopen(request)
    html = response.read().decode('utf-8')
    print(html)   
    
# url_info
# 查看请求的信息
def url_info():
    request = urllib.request.Request('https://www.cnblogs.com/billyzh/p/5819957.html')
    response = urllib.request.urlopen(request)
    info = response.info()
    url = response.geturl()
    print(info)
    print('---------url---------')
    print(url)
    html = response.read().decode('utf-8')
    # print(html)
    
 # Data  -post
def data_post():
    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {}
    values['name'] = 'Alison'
    values['password'] = 'Alison'

    data = urllib.parse.urlencode(values)
    request = urllib.request.Request(url, data)
    response = urllib.request.urlopen(request)
    this_page = response.read().decode('utf-8')
    print(this_page)
 
# Data -get
def data_post():
    url = 'http://user.51sole.com/'
    values = {}
    values['txtUserName'] = '1'
    values['txtPwd'] = '1'

    data = urllib.parse.urlencode(values).encode('utf-8')
    request = urllib.request.Request(url, data)
    response = urllib.request.urlopen(request)
    this_page = response.read().decode('utf-8')
    print(this_page)
  

断开网络后发出申请

这里继承上面的urllib.request.urlopen()

发送请求之后,出现如下图的结果


python爬虫学习-day1_第5张图片
05.png

请求头

请求头的作用,通俗来讲,就是能够告诉被请求的服务器需要传送什么样的格式的信息。由于时间关系,这里就贴一下从百度百科看来的一些我认为比较重要的请求头类型:

Accept:浏览器可接受的MIME类型。
Accept-Charset:浏览器可接受的字符集。
Accept-Language:浏览器所希望的语言种类,当服务器能够提供一种以上的语言版本时要用到。
Authorization:授权信息,通常出现在对服务器发送的WWW-Authenticate头的应答中。
Connection:表示是否需要持久连接。
Content-Length:表示请求消息正文的长度。
Cookie:这是最重要的请求头信息之一。
User-Agent:浏览器类型,如果Servlet返回的内容与浏览器类型有关则该值非常有用。
…  

如何添加请求头?

在爬虫的时候,如果不添加请求头,可能网站会阻止一个用户的登陆,此时我们就需要添加请求头来进行模拟伪装,使用python添加请求头方法如下。

import urllib.request, urllib
from urllib.request import URLError
from io import BytesIO
import gzip
# Headers
# 用户代理(User-Agent)头
def header_demo():
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
    values = {}
    values['name'] = 'alison'
    values['passwd'] = '123'
    data = urllib.parse.urlencode(values).encode('utf-8')
    headers = {'user-agent': user_agent}
    headers[
        'accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
    headers['accept-encoding'] = 'gzip, deflate, br'
    headers['accept-language'] = 'zh-CN,zh;q=0.9,en;q=0.8'
    headers['Connection'] = 'keep-alive'
    req = urllib.request.Request(basic_url, data, headers)
    response = urllib.request.urlopen(req)
    print('---------response.geturl')
    print(response.geturl())
    # 以“b'\x1f\x8b\x08”开头的数据是经过gzip压缩过的数据,这里当然需要进行解压了
    buff = BytesIO(response.read())
    f = gzip.GzipFile(fileobj=buff)
    page = f.read().decode('UTF-8')
    print('----------page')
    print(page)
    print('------------info')
    print(response.info())
    print('------------req.data')
    print(req.data)
    

后期调用的话,直接调用方法即可

PS: 若你觉得可以、还行、过得去、甚至不太差的话,可以“关注或点赞”一下,就此谢过!

你可能感兴趣的:(python爬虫学习-day1)