Web scraping with requests + BeautifulSoup, in detail

Introduction

The Python standard library provides urllib, urllib2, httplib and other modules for making HTTP requests, but their APIs are clumsy. They were built for another era and another internet, and even the simplest tasks take a surprising amount of work, sometimes including overriding methods.

Requests is an Apache2-licensed HTTP library written in Python. It is a high-level wrapper around the built-in modules, which makes network requests from Python far more pleasant; with Requests you can easily do anything a browser can do.


The essence of a crawler: imitate the behaviour of a browser and extract information from web pages.
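Since a crawler only imitates what a browser does, the smallest possible crawler has two steps: send a request that looks like it came from a browser, then parse the HTML that comes back. The URL and User-Agent below are placeholders, so treat this as a sketch of the overall shape rather than a recipe for any particular site.

import requests
from bs4 import BeautifulSoup

# Pretend to be a browser by sending a typical User-Agent header.
response = requests.get(
    'https://example.com/',   # placeholder URL
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'},
)

# Parse the returned HTML and pull out whatever we care about.
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)
print(soup.title.string if soup.title else 'no <title> found')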

Request methods

1. GET requests

# 1. Without parameters

import requests

ret = requests.get('https://github.com/timeline.json')

print(ret.url)
print(ret.text)



# 2. With parameters

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)

print(ret.url)
print(ret.text)

2. POST requests

# 1. Basic POST

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload)

print(ret.text)


# 2. Sending request headers and data

import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)

print(ret.text)
print(ret.cookies)
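requests can also do the JSON encoding itself: passing json= serializes the dict and sets the Content-Type header for you, so the second example above can be written without json.dumps. A small sketch against the same placeholder endpoint:

import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

# requests serializes the dict and sets Content-Type: application/json itself
ret = requests.post(url, json=payload)

print(ret.status_code)
print(ret.request.headers.get('Content-Type'))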

3. Other requests

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)
  
# all of the methods above are built on top of this one
requests.request(method, url, **kwargs)
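Because every helper above just forwards to requests.request, the calls are interchangeable; httpbin.org is used here only as a convenient echo service.

import requests

# these two calls are equivalent: get() simply fills in method='GET'
r1 = requests.get('http://httpbin.org/get', params={'k': 'v'})
r2 = requests.request(method='GET', url='http://httpbin.org/get', params={'k': 'v'})

print(r1.url == r2.url)   # True: both become http://httpbin.org/get?k=v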

Request parameters

Common parameters

1  url
2  headers
3  cookies
4  params
5  data, pass the request body
         
        requests.post(
            ...,
            data={'user':'alex','pwd':'123'}
        )
         
        POST /index HTTP/1.1\r\nhost:c1.com\r\n\r\nuser=alex&pwd=123
         
6  json, pass the request body
        requests.post(
            ...,
            json={'user':'alex','pwd':'123'}
        )
         
        POST /index HTTP/1.1\r\nhost:c1.com\r\nContent-Type:application/json\r\n\r\n{"user":"alex","pwd":"123"}
7  proxies, use a proxy
    # without authentication
        proxie_dict = {
            "http": "61.172.249.96:80",
            "https": "http://61.185.219.126:3128",
        }
        ret = requests.get("https://www.proxy360.cn/Proxy", proxies=proxie_dict)
         
     
    # authenticated proxy
        from requests.auth import HTTPProxyAuth
         
        proxyDict = {
            'http': '77.75.105.165',
            'https': '77.75.106.165'
        }
        auth = HTTPProxyAuth('username', 'password')
         
        r = requests.get("http://www.google.com", data={'xxx': 'ffff'}, proxies=proxyDict, auth=auth)
        print(r.text)
-----------------------------------------------------------------------------------------
8  files, file upload
    # send a file
        file_dict = {
            'f1': open('xxxx.log', 'rb')
        }
        requests.request(
            method='POST',
            url='http://127.0.0.1:8000/test/',
            files=file_dict
        )
         
9  auth, authentication
 
    Internally:
        the username and password are encoded and placed in a request header that is sent to the backend.
         
            - "user:password"
            - base64("user:password")
            - "Basic base64("user:password")"
            - request header:
                Authorization: "Basic base64("user:password")"
         
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth
 
    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)
     
10 timeout
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)
 
    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
     
11 allow_redirects, whether to follow redirects
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)
     
12 stream, downloading large files
    from contextlib import closing
    with closing(requests.get('http://httpbin.org/get', stream=True)) as r1:
        # process the response here, chunk by chunk
        for i in r1.iter_content():
            print(i)
         
13 cert, client certificates
    - Baidu, Tencent, etc. => no certificate needed on your side (the system handles it)
    - custom certificate
        requests.get('http://127.0.0.1:8000/test/', cert="xxxx/xxx/xxx.pem")
        requests.get('http://127.0.0.1:8000/test/', cert=("xxxx/xxx/xxx.pem","xxx.xxx.xx.key"))
14 verify=False, skip HTTPS certificate verification (see the sketch after this list)
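Item 14 only names the switch, so here is a minimal sketch of how verify and cert are typically passed; the URL and certificate paths are placeholders.

import requests

# skip certificate verification entirely (e.g. a self-signed test server)
r = requests.get('https://127.0.0.1:8000/test/', verify=False)
print(r.status_code)

# or present a client certificate; the .pem/.key paths are placeholders
# r = requests.get('https://127.0.0.1:8000/test/',
#                  cert=('client.pem', 'client.key'))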

About auth

Authentication (auth)
 The browser's pop-up login dialog.
    Internally:
        the username and password are encoded and placed in a request header that is sent to the backend.
         
            - "user:password"
            - base64("user:password")
            - "Basic base64("user:password")"
            - request header:
                Authorization: "Basic base64("user:password")"
    requests' HTTPBasicAuth does all of the above for you:
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth
 
    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)
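HTTPBasicAuth is nothing more than the base64 construction described above. As a sketch, the same Authorization header can be built by hand (the credentials are the same throwaway placeholders used above):

import base64
import requests

user, password = 'wupeiqi', 'sdfasdfasdf'
token = base64.b64encode(('%s:%s' % (user, password)).encode('utf-8')).decode('utf-8')

# equivalent to auth=HTTPBasicAuth(user, password)
ret = requests.get(
    'https://api.github.com/user',
    headers={'Authorization': 'Basic ' + token},
)
print(ret.status_code)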

 

def param_method_url():
    ret=requests.request(method='get', url='http://127.0.0.1:8000/test/')
    ret=requests.request(method='post', url='http://127.0.0.1:8000/test/')
method
import requests

requests.get(url='http://127.0.0.1:8000/test/',
             params={'k1': 'v1', 'k2': 'v2'})

# essentially the same as requests.get(url='xxxxx?k1=v1&k2=v2')
params
  # can be a dict
    # can be a string
    # can be bytes
    # can be a file object
    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data="k1=v1; k2=v2; k3=v3; k3=v4"
    # )

    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data="k1=v1;k2=v2;k3=v3;k3=v4",
    # headers={'Content-Type': 'application/x-www-form-urlencoded'}
    # )

    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # data=open('data_file.py', mode='r', encoding='utf-8'), # file content: k1=v1;k2=v2;k3=v3;k3=v4
    # headers={'Content-Type': 'application/x-www-form-urlencoded'}
    # )
data
# if the request body should be a JSON payload, pass it via the json argument
requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'})
json
ret1 = requests.get(
    url='https://dig.chouti.com/',
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }   )
ret1_cookies = ret1.cookies.get_dict()
# ret1.cookies is the cookie jar returned when visiting this URL
# get_dict() converts it into a plain dict
cookies
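The dict returned by get_dict() can be sent back on a later request through the cookies argument, which is exactly what the "auto-login" examples further down rely on. A self-contained sketch (the second URL is a placeholder follow-up page):

import requests

ret1 = requests.get('https://dig.chouti.com/', headers={'User-Agent': 'Mozilla/5.0'})
ret1_cookies = ret1.cookies.get_dict()

# send the same cookies back on the next request
ret2 = requests.get(
    'https://dig.chouti.com/all/hot/recent/1',   # placeholder follow-up URL
    headers={'User-Agent': 'Mozilla/5.0'},
    cookies=ret1_cookies,
)
print(ret2.status_code)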
  # send request headers to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )
    # which headers are needed depends on the server
header
 # send a file
    # file_dict = {
    # 'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # files=file_dict)

    # send a file with a custom filename
    # file_dict = {
    # 'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # files=file_dict)

    # send a file with a custom filename and literal content
    # file_dict = {
    # 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    # url='http://127.0.0.1:8000/test/',
    # files=file_dict)

    # send a file with a custom filename, content type and extra headers
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    pass
files
 Set a timeout: if the request takes longer than this, it is aborted.
# ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
    pass
timeout
# whether to follow redirects; the default is True
ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
print(ret.text)
allow_redirects

More parameters and examples: official docs: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id4
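With allow_redirects=False the 3xx response is returned as-is, so the new address has to be read from the Location header yourself; the lagou.com example further down uses exactly this pattern. A small sketch against httpbin's redirect endpoint:

import requests

r = requests.get('http://httpbin.org/redirect-to?url=/get', allow_redirects=False)
print(r.status_code)              # 302
print(r.headers.get('Location'))  # '/get'

# follow the redirect by hand (httpbin returns a relative Location here)
r2 = requests.get('http://httpbin.org' + r.headers['Location'])
print(r2.status_code)             # 200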

BeautifulSoup

BeautifulSoup is a module that takes an HTML or XML string, parses it into a tree, and then lets you quickly locate elements through the methods it provides, which makes finding a given element in HTML or XML straightforward.

from bs4 import BeautifulSoup
 
html_doc = """
The Dormouse's story

asdf
    
The Dormouse's story总共

f

Once upon a time there were three little sisters; and their names were Elsfie, Lacie and Tillie; and they lived at the bottom of a well.
ad
sf

...

""" soup = BeautifulSoup(html_doc, 'html.parse') # 找到第一个a标签 tag1 = soup.find(name='a') # 找到所有的a标签 tag2 = soup.find_all(name='a') # 找到id=link2的标签 tag3 = soup.select('#link2')

 Installation:

pip3 install beautifulsoup4

 1. name, the tag name

# tag = soup.find('a')
# name = tag.name # get
# print(name)
# tag.name = 'span' # set
# print(soup)

 2. attrs, the tag's attributes

# tag = soup.find('a')
# attrs = tag.attrs    # get
# print(attrs)
# tag.attrs = {'ik':123} # set
# tag.attrs['id'] = 'iiiii' # set
# print(soup)

3. children, all direct child tags

# body = soup.find('body')
# v = body.children

 4. descendants, all descendant tags (children, grandchildren, and so on)

# body = soup.find('body')
# v = body.descendants

 5. clear, empty out everything inside the tag (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)

 6. decompose, recursively destroy the tag and everything inside it

# body = soup.find('body')
# body.decompose()
# print(soup)

 7. extract, recursively remove the tag (with its children) and return what was removed

# body = soup.find('body')
# v = body.extract()
# print(soup)

 8. decode, serialize to a string (including the current tag); decode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)

 9. encode, serialize to bytes (including the current tag); encode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)

 10. find, get the first matching tag

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

 11. find_all, get all matching tags

# tags = soup.find_all('a')
# print(tags)
 
# tags = soup.find_all('a',limit=1)
# print(tags)
 
# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)
 
 
# ####### lists #######
# v = soup.find_all(name=['a','div'])
# print(v)
 
# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)
 
# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))
 
 
# v = soup.find_all(id=['link1','link2'])
# print(v)
 
# v = soup.find_all(href=['link1','link2'])
# print(v)
 
# ####### regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)
 
# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)
 
# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)
 
# ####### filtering with a function #######
# def func(tag):
# return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)
 
 
# ## get, read a tag attribute
# tag = soup.find('a')
# v = tag.get('id')
# print(v)

12. has_attr, check whether the tag has a given attribute

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)

 13. get_text, get the text inside the tag

# tag = soup.find('a')
# v = tag.get_text('id')
# print(v)

14. index, get the position of a child within a tag

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)
 
# tag = soup.find('body')
# for i,v in enumerate(tag):
# print(i,v)

 

   

15. is_empty_element, whether the tag is an empty or self-closing element,

     i.e. one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)

 16. The current tag's related tags

# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings
 
#
# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings
 
#
# tag.parent
# tag.parents

 17. Searching for a tag's related tags

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)
 
# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)
 
# tag.find_parent(...)
# tag.find_parents(...)
 
# same parameters as find_all

 18. select, select_one: CSS selectors

soup.select("title")
 
soup.select("p nth-of-type(3)")
 
soup.select("body a")
 
soup.select("html head title")
 
tag = soup.select("span,a")
 
soup.select("head > title")
 
soup.select("p > a")
 
soup.select("p > a:nth-of-type(2)")
 
soup.select("p > #link1")
 
soup.select("body > a")
 
soup.select("#link1 ~ .sister")
 
soup.select("#link1 + .sister")
 
soup.select(".sister")
 
soup.select("[class~=sister]")
 
soup.select("#link1")
 
soup.select("a#link2")
 
soup.select('a[href]')
 
soup.select('a[href="http://example.com/elsie"]')
 
soup.select('a[href^="http://example.com/"]')
 
soup.select('a[href$="tillie"]')
 
soup.select('a[href*=".com/el"]')
 
 
from bs4.element import Tag
 
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child
 
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)
 
from bs4.element import Tag
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child
 
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
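select() always returns a list, while select_one() returns only the first match (or None), which saves the [0] indexing. A small sketch, reusing the soup object built from html_doc above:

# select() -> list of matches, select_one() -> first match or None
tags = soup.select('a.sister')
first = soup.select_one('a.sister')
print(len(tags), first is (tags[0] if tags else None))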

 19. The tag's content

# tag = soup.find('span')
# print(tag.string)          # get
# tag.string = 'new content' # set
# print(soup)
 
# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)
 
# tag = soup.find('body')
# v = tag.stripped_strings  # recursively yields the stripped text of all nested tags
# print(v)

 20. append: append a tag inside the current tag

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i',attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

 21. insert: insert a tag at a given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

 22. insert_after, insert_before: insert after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

 23. replace_with: replace the current tag with another tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)

 24. Creating relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

 25. wrap: wrap the current tag inside another tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一个新来的'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)
 
# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

 26. unwrap: remove the current tag but keep what it wraps

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)

 More parameters: official docs: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
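Putting the two halves of this post together, a typical fetch-and-parse round trip looks like the sketch below; the URL is a placeholder and the selectors depend entirely on the page you are scraping.

import requests
from bs4 import BeautifulSoup

response = requests.get(
    'https://example.com/',   # placeholder URL
    headers={'User-Agent': 'Mozilla/5.0'},
)
soup = BeautifulSoup(response.text, 'html.parser')

# collect the text and href of every link on the page
for a in soup.find_all('a'):
    print(a.get_text(strip=True), a.get('href'))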

 

一大波"自动登陆"示例

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests


# ############## Way 1 ##############
"""
# ## 1. First visit any page to obtain cookies
i1 = requests.get(url="http://dig.chouti.com/help/service")
i1_cookies = i1.cookies.get_dict()

# ## 2. Log in carrying the previous cookies; the backend authorizes the gpsd value inside them
i2 = requests.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxooxxoo",
        'oneMonth': ""
    },
    cookies=i1_cookies
)

# ## 3. Upvote a post (carrying the authorized gpsd cookie is enough)
gpsd = i1_cookies['gpsd']
i3 = requests.post(
    url="http://dig.chouti.com/link/vote?linksId=8589523",
    cookies={'gpsd': gpsd}
)

print(i3.text)
"""


# ############## Way 2 ##############
"""
import requests

session = requests.Session()
i1 = session.get(url="http://dig.chouti.com/help/service")
i2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxooxxoo",
        'oneMonth': ""
    }
)
i3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=8589523"
)
print(i3.text)

"""

抽屉新热榜
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

# ############## Way 1 ##############
#
# # 1. Visit the login page and grab the authenticity_token
# i1 = requests.get('https://github.com/login')
# soup1 = BeautifulSoup(i1.text, features='lxml')
# tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
# authenticity_token = tag.get('value')
# c1 = i1.cookies.get_dict()
# i1.close()
#
# # 2. Send the credentials together with the authenticity_token
# form_data = {
#     "authenticity_token": authenticity_token,
#     "utf8": "",
#     "commit": "Sign in",
#     "login": "[email protected]",
#     'password': 'xxoo'
# }
#
# i2 = requests.post('https://github.com/session', data=form_data, cookies=c1)
# c2 = i2.cookies.get_dict()
# c1.update(c2)
# i3 = requests.get('https://github.com/settings/repositories', cookies=c1)
#
# soup3 = BeautifulSoup(i3.text, features='lxml')
# list_group = soup3.find(name='div', class_='listgroup')
#
# from bs4.element import Tag
#
# for child in list_group.children:
#     if isinstance(child, Tag):
#         project_tag = child.find(name='a', class_='mr-1')
#         size_tag = child.find(name='small')
#         temp = "project: %s (%s); path: %s" % (project_tag.get('href'), size_tag.string, project_tag.string, )
#         print(temp)


# ############## Way 2 ##############
# session = requests.Session()
# # 1. Visit the login page and grab the authenticity_token
# i1 = session.get('https://github.com/login')
# soup1 = BeautifulSoup(i1.text, features='lxml')
# tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
# authenticity_token = tag.get('value')
# c1 = i1.cookies.get_dict()
# i1.close()
#
# # 2. Send the credentials together with the authenticity_token
# form_data = {
#     "authenticity_token": authenticity_token,
#     "utf8": "",
#     "commit": "Sign in",
#     "login": "[email protected]",
#     'password': 'xxoo'
# }
#
# i2 = session.post('https://github.com/session', data=form_data)
# c2 = i2.cookies.get_dict()
# c1.update(c2)
# i3 = session.get('https://github.com/settings/repositories')
#
# soup3 = BeautifulSoup(i3.text, features='lxml')
# list_group = soup3.find(name='div', class_='listgroup')
#
# from bs4.element import Tag
#
# for child in list_group.children:
#     if isinstance(child, Tag):
#         project_tag = child.find(name='a', class_='mr-1')
#         size_tag = child.find(name='small')
#         temp = "project: %s (%s); path: %s" % (project_tag.get('href'), size_tag.string, project_tag.string, )
#         print(temp)
github
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

session = requests.Session()

i1 = session.get(
    url='https://www.zhihu.com/#signin',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

soup1 = BeautifulSoup(i1.text, 'lxml')
xsrf_tag = soup1.find(name='input', attrs={'name': '_xsrf'})
xsrf = xsrf_tag.get('value')

current_time = time.time()
i2 = session.get(
    url='https://www.zhihu.com/captcha.gif',
    params={'r': current_time, 'type': 'login'},
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    })

with open('zhihu.gif', 'wb') as f:
    f.write(i2.content)

captcha = input('Open zhihu.gif, look at the image and enter the captcha: ')
form_data = {
    "_xsrf": xsrf,
    'password': 'xxooxxoo',
    "captcha": captcha,
    'email': '[email protected]'
}
i3 = session.post(
    url='https://www.zhihu.com/login/email',
    data=form_data,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

i4 = session.get(
    url='https://www.zhihu.com/settings/profile',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

soup4 = BeautifulSoup(i4.text, 'lxml')
tag = soup4.find(id='rename-section')
nick_name = tag.find('span',class_='name').string
print(nick_name)

知乎
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
import requests

all_cookie = {}

# ############### 1. Load the login page ###############
r1 = requests.get(
    url='https://passport.lagou.com/login/login.html',
    headers={
        'Host': 'passport.lagou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    }
)

all_cookie.update(r1.cookies.get_dict())

X_Anti_Forge_Token = re.findall(r"window.X_Anti_Forge_Token = '(.*?)'", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall(r"window.X_Anti_Forge_Code = '(.*?)'", r1.text, re.S)[0]

# ############### 2. Log in with username and password ###############
r2 = requests.post(
    url='https://passport.lagou.com/login/login.json',
    headers={
        'Host': 'passport.lagou.com',
        'Referer': 'https://passport.lagou.com/login/login.html',
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    },
    data={
        'isValidate': True,
        'username': '15131255089',
        'password': 'ab18d270d7126ea65915cc22c0d',
        'request_form_verifyCode': '',
        'submit': '',

    },
    cookies=r1.cookies.get_dict()
)

all_cookie.update(r2.cookies.get_dict())

# ############### 3. User authorization ###############
r3 = requests.get(
    url='https://passport.lagou.com/grantServiceTicket/grant.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

    },
    allow_redirects=False,
    cookies=all_cookie

)

all_cookie.update(r3.cookies.get_dict())

# ############### 4. User authentication (follow the redirect chain) ###############
r4 = requests.get(
    url=r3.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

    },
    allow_redirects=False,
    cookies=all_cookie
)

all_cookie.update(r4.cookies.get_dict())

r5 = requests.get(
    url=r4.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

    },
    allow_redirects=False,
    cookies=all_cookie
)
all_cookie.update(r5.cookies.get_dict())
r6 = requests.get(
    url=r5.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

    },
    allow_redirects=False,
    cookies=all_cookie
)

all_cookie.update(r6.cookies.get_dict())
r7 = requests.get(
    url=r6.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

    },
    allow_redirects=False,
    cookies=all_cookie
)

all_cookie.update(r7.cookies.get_dict())

# ############### 5. View the personal resume page ###############
r5 = requests.get(
    url='https://www.lagou.com/resume/myresume.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

    },
    cookies=all_cookie
)
print('武沛齐' in r5.text)

# ############### 6. Fetch account info ###############
r6 = requests.get(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'X-L-REQ-HEADER': "{deviceType:1}",
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
    },
    cookies=all_cookie

)
r6_json = r6.json()
all_cookie.update(r6.cookies.get_dict())

# ############### 7. Update personal info ###############
r7 = requests.put(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
        'X-Anit-Forge-Code': r6_json['submitCode'],
        'X-Anit-Forge-Token': r6_json['submitToken'],
        'X-L-REQ-HEADER': "{deviceType:1}",
    },
    cookies=all_cookie,
    json={"userName": "wupeiqi888", "sex": "MALE", "portrait": "images/myresume/default_headpic.png",
          "positionName": '...', "introduce": '....'}
)
print(r7.text)
拉勾网

Preventing XSS attacks

from bs4 import BeautifulSoup

class XSSFilter(object):
    __instance = None

    def __init__(self):
        # XSS whitelist: tag name -> allowed attributes
        self.valid_tags = {
            "font": ['color', 'size', 'face', 'style'],
            'b': [],
            'div': [],
            "span": [],
            "table": [
                'border', 'cellspacing', 'cellpadding'
            ],
            'th': [
                'colspan', 'rowspan'
            ],
            'td': [
                'colspan', 'rowspan'
            ],
            "a": ['href', 'target', 'name'],
            "img": ['src', 'alt', 'title'],
            'p': ['align'],
            "pre": ['class'],
            "hr": ['class'],
            'strong': []
        }

    def __new__(cls, *args, **kwargs):
        # singleton: only ever build one filter instance
        if not cls.__instance:
            obj = object.__new__(cls)
            cls.__instance = obj
        return cls.__instance

    def process(self, content):
        soup = BeautifulSoup(content, 'html.parser')
        # walk every tag in the document
        for tag in soup.find_all():
            # drop tags whose name is not on the whitelist
            if tag.name not in self.valid_tags:
                tag.hidden = True
                if tag.name not in ['html', 'body']:
                    tag.hidden = True
                    tag.clear()
                continue
            # strip attributes that are not whitelisted for this tag
            attr_rules = self.valid_tags[tag.name]
            keys = list(tag.attrs.keys())
            for key in keys:
                if key not in attr_rules:
                    del tag[key]
        return soup.decode()  # the filtered content

content="""

asdfaasdfasdfsdf

asdf

asdf

""" content = XSSFilter().process(content) print('content',content)

Summary:

  1. If the target site has anti-scraping measures, make your requests imitate a browser when talking to the server.
  2. If you need to carry information along with a request:
    1. Look for it in what the server returned; if it is there, parse it into a dict (or similar) and keep it in your session.
    2. A value that looks like 159900098 is usually a timestamp, but you have to check the number of digits yourself.
    3. If the key is not in the server's response body, look for it in the page's HTML or JavaScript.
    4. The next step often has to carry a key (or other value) the server sent in the previous response.
  3. Status codes:
    1. A 3xx status code means an automatic redirect; cookie-based authorization may happen during those redirects.
    2. Pay attention to the Set-Cookie headers in each response.
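The points above boil down to one reusable pattern: keep a Session, pull whatever token the login page hides in its HTML, post the credentials together with that token, and let the session carry the resulting cookies. The sketch below only illustrates that shape; the URLs, form field names and token name are placeholders that differ from site to site.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# 1. fetch the login page and dig the hidden token out of the HTML
login_page = session.get('https://example.com/login')             # placeholder URL
soup = BeautifulSoup(login_page.text, 'html.parser')
token_tag = soup.find('input', attrs={'name': 'csrf_token'})      # placeholder field name
token = token_tag.get('value') if token_tag else ''

# 2. post the credentials plus the token; the session keeps the cookies
resp = session.post(
    'https://example.com/session',                                # placeholder URL
    data={'username': 'user', 'password': 'pass', 'csrf_token': token},
    allow_redirects=False,
)
print(resp.status_code, resp.headers.get('Location'))             # a 3xx here usually means success

# 3. later requests reuse the authorized cookies automatically
profile = session.get('https://example.com/settings/profile')     # placeholder URL
print(profile.status_code)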

 

   

Reposted from: https://www.cnblogs.com/weidaijie/p/10441382.html
