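Python Web Scraping

Runtime environment: Python 3

The BeautifulSoup4 parsing library

Documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

BeautifulSoup4 is an HTML/XML parser; its main job is to parse HTML/XML and extract data from it.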

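Python has three basic approaches/modules for scraping static web pages: regular expressions, BeautifulSoup, and lxml. Their characteristics are roughly as follows:

[Figure 1: comparison of the three approaches]

BeautifulSoup does much the same job as lxml, but lxml only traverses the data locally, whereas BeautifulSoup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree. As a result, BeautifulSoup is generally slower than lxml.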

Installing BeautifulSoup4:

To install BeautifulSoup4 under Python 3:

pip3 install beautifulsoup4

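Using BeautifulSoup4

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. The lxml parser is more powerful and faster, so installing it is recommended.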


from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

# bs.find_all(tagName, tagAttributes) returns every matching tag on the page
nameList = bs.find_all('span', {'class': 'green'})
title = bs.body.h1
print(title)

head = bs.find_all(['h1', 'h2'])
print(head)

# The text argument is different: it matches against a tag's text content, not its attributes
nameList1 = bs.find_all(text='the prince')
print(len(nameList1))

for name in nameList:
    print(name.get_text())

bs.find_all(tagName, tagAttributes) returns every matching tag on the page.

find() and find_all() in BeautifulSoup

The BeautifulSoup documentation defines the two as follows:

  find_all(tag, attributes, recursive, text, limit, keywords)
  
  find(tag, attributes, recursive, text, keywords)
  

Regular expressions and BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# A raw string keeps the backslashes intact and avoids invalid-escape warnings
images = bs.find_all('img',
                     {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])
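
To see what the pattern in the example above accepts, it can be tried on a few src values without fetching the page. This is only a minimal sketch: the sample paths are made up for illustration and are not taken from the actual page, and it assumes the pattern is applied to the attribute value with a search, as re.search does.

```python
import re

# Same pattern as in the example above, written as a raw string.
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# Hypothetical src values, invented for this illustration.
samples = [
    '../img/gifts/img1.jpg',   # relative path, "img" prefix, .jpg extension
    '../img/gifts/logo.jpg',   # file name does not start with "img"
    '../img/gifts/img3.png',   # wrong extension
    '/img/gifts/img2.jpg',     # missing the leading "../"
]

# Keep only the values the pattern finds a match in.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```

Only the first sample satisfies all three conditions at once, which is why the pattern is a convenient way to pick product images out of a page that also contains logos and other pictures.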

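Writing a web crawler

A common way to crawl a site exhaustively is to start from a top-level page (such as the home page) and collect every internal link on that page into a list. Then fetch each page those links lead to, collect the links found on each of those pages into a new list, and run the next round of crawling.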
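
The round-by-round procedure described above can be sketched without any network access. In the sketch below, the SITE dictionary and the get_internal_links helper are hypothetical stand-ins for "download a page and extract its internal links"; a real crawler would replace them with urlopen and BeautifulSoup calls.

```python
from collections import deque

# Hypothetical site: each page maps to the internal links found on it.
SITE = {
    '/': ['/about', '/blog'],
    '/about': ['/'],
    '/blog': ['/blog/post-1', '/blog/post-2', '/'],
    '/blog/post-1': ['/blog'],
    '/blog/post-2': ['/blog/post-1'],
}

def get_internal_links(page):
    """Stand-in for downloading a page and extracting its internal links."""
    return SITE.get(page, [])

def crawl(start):
    """Visit every page reachable from start, one round of links at a time."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_internal_links(page):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order

visited = crawl('/')
print(visited)
```

Keeping a set of pages already seen is what stops the crawl from looping forever when pages link back to each other.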

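1. Finding every link to another article in the Wikipedia entry for Kevin Bacon

  • A function getLinks that takes a Wikipedia article URL of the form /wiki/<article name> and returns a list of all linked article URLs in the same form.

  • A main routine that calls getLinks with a starting article, picks a random article link from the returned list, and calls getLinks again, until you stop the program or the new page has no article links.

    The complete code is as follows: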

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

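# Seed with a float timestamp; newer Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())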

def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id': 'bodyContent'}).find_all('a',
                                                          href=re.compile('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
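
The regular expression passed to find_all in getLinks keeps only links that start with /wiki/ and contain no colon, which filters out namespace pages such as Category: or File:. A quick way to check this is to run the same pattern over a few href values; the ones below are hypothetical examples written for this illustration, not scraped data.

```python
import re

# Same pattern as in getLinks above.
article_link = re.compile('^(/wiki/)((?!:).)*$')

# Hypothetical href values, invented for this illustration.
hrefs = [
    '/wiki/Kevin_Bacon',                   # ordinary article link
    '/wiki/Category:American_actors',      # contains a colon
    '/wiki/File:Example.jpg',              # contains a colon
    '//en.wikipedia.org/wiki/Main_Page',   # does not start with /wiki/
    '/w/index.php?title=Kevin_Bacon',      # does not start with /wiki/
]

# Keep only the links the pattern accepts.
kept = [h for h in hrefs if article_link.search(h)]
print(kept)
```

The negative lookahead (?!:) is applied before every character after /wiki/, so a colon anywhere in the rest of the path rejects the link.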

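2. Collecting data across the site

Looking at a few Wikipedia pages, both article pages and non-article pages such as the privacy policy page, yields the following rules.

  • All titles (on every page, whether it is an article page, an edit-history page, or any other page) are inside h1 → span tags, and each page has only one h1 tag.

  • As mentioned earlier, all body text is inside the div#bodyContent tag. If we only want the first paragraph, though, div#mw-content-text → p may work better (selecting only the first paragraph tag). This rule holds for all content pages except file pages (for example, https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg), which have no content text section.

  • Edit links appear only on article pages. When present, they sit under the li#ca-edit tag, at li#ca-edit → span → a.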

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span')
              .find('a').attrs['href'])
    except AttributeError:
        print("This page is missing some attributes! No need to worry, though!")
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:  # We have encountered a new page
                newPage = link.attrs['href']
                print('-' * 20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

# Start from the Wikipedia home page
getLinks('')

Crawling the URLs of ChakraCore issues labeled Bug:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen('https://github.com/chakra-core/ChakraCore/labels/Bug{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile(r'^(/chakra-core/ChakraCore/issues/)[0-9]+')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:  # We have encountered a new issue URL
                newPage = link.attrs['href']
                print('-' * 20)
                print(newPage)
                pages.add(newPage)
                # Do not recurse with newPage: appending an issue path to the
                # label URL above would produce an address that does not exist.

getLinks('')

Scrapy

1. Installing Scrapy:

 conda install -c conda-forge scrapy
  • A spider is a Scrapy project; as the name suggests, it crawls the web (fetches pages).

  • "Crawler" means "any program that scrapes web pages, with or without Scrapy".

https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

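2. Writing your first Spider

A Spider is a class you write to scrape data from a single website (or a group of websites).

It defines the initial URLs to download, how to follow links in the pages, and how to parse page content to extract items.

To create a Spider, you must subclass scrapy.Spider and define the following three attributes:

  • name: identifies the Spider. It must be unique; you cannot give the same name to different Spiders.
  • start_urls: the list of URLs the Spider starts crawling from. The first pages downloaded will therefore be among these; subsequent URLs are extracted from the data retrieved from the initial URLs.
  • parse(): a method of the spider. It is called with the Response object produced for each downloaded start URL as its only argument. It is responsible for parsing the response data, extracting scraped data (as items), and producing Request objects for further URLs to process.

Creating a project

Before you start scraping, you must create a new Scrapy project. Go to the directory where you want to store your code and run: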

scrapy startproject tutorial

This command creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

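These files are:

  • scrapy.cfg: the project configuration file
  • tutorial/: the project's Python module. You will add your code here later.
  • tutorial/items.py: the project's items file.
  • tutorial/pipelines.py: the project's pipelines file.
  • tutorial/settings.py: the project's settings file.
  • tutorial/spiders/: the directory where spider code goes.

Defining an Item

An Item is a container for scraped data. It is used much like a Python dictionary, and it adds a safeguard against undefined-field errors caused by typos.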
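
The idea of a dictionary-like container that rejects misspelled field names can be illustrated in plain Python. The sketch below is not Scrapy's implementation and does not use the Scrapy API; FieldCheckedItem and SiteItem are hypothetical names introduced only to show the kind of protection described above.

```python
class FieldCheckedItem(dict):
    """Hypothetical dict subclass that only accepts declared field names."""

    fields = ()  # subclasses list the field names they allow

    def __setitem__(self, key, value):
        # Refuse any key that was not declared, so typos fail loudly.
        if key not in self.fields:
            raise KeyError('undeclared field: {!r}'.format(key))
        super().__setitem__(key, value)


class SiteItem(FieldCheckedItem):
    fields = ('title', 'link', 'desc')


item = SiteItem()
item['title'] = 'Example'        # declared field: accepted

try:
    item['tilte'] = 'typo'       # misspelled field: rejected
    rejected = False
except KeyError:
    rejected = True

print(dict(item), rejected)
```

With an ordinary dict the misspelled key would be stored silently and the mistake would only surface later, when the expected field turned out to be missing.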

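Extracting Items

Introduction to Selectors

There are many ways to extract data from a web page. Scrapy uses a mechanism based on XPath and CSS expressions: Scrapy Selectors. For more on selectors and other extraction mechanisms, see the Selector documentation.

Here are some example XPath expressions and what they mean: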

  • /html/head/title: 选择HTML文档中 标签内的 </code> 元素</li> <li><code>/html/head/title/text()</code>: 选择上面提到的 <code><title></code> 元素的文字</li> <li><code>//td</code>: 选择所有的 <code><td></code> 元素</li> <li><code>//div[@class="mine"]</code>: 选择所有具有 <code>class="mine"</code> 属性的 <code>div</code> 元素</li> </ul> <hr> <p>为了配合XPath,Scrapy除了提供了 <code>Selector</code> 之外,还提供了方法来避免每次从response中提取数据时生成selector的麻烦。</p> <p>Selector有四个基本的方法(点击相应的方法可以看到详细的API文档):</p> <ul> <li><code>xpath()</code>: 传入xpath表达式,返回该表达式所对应的所有节点的selector list列表 。</li> <li><code>css()</code>: 传入CSS表达式,返回该表达式所对应的所有节点的selector list列表.</li> <li><code>extract()</code>: 序列化该节点为unicode字符串并返回list。</li> <li><code>re()</code>: 根据传入的正则表达式对数据进行提取,返回unicode字符串list列表。</li> </ul> <p>在查看了网页的源码后,您会发现网站的信息是被包含在 <em>第二个</em> <code><ul></code> 元素中。</p> <p>我们可以通过这段代码选择该页面中网站列表里所有 <code><li></code> 元素:</p> <pre><code>response.xpath('//ul/li') </code></pre> <p>网站的描述:</p> <pre><code>response.xpath('//ul/li/text()').extract() </code></pre> <p>网站的标题:</p> <pre><code>response.xpath('//ul/li/a/text()').extract() </code></pre> <p>以及网站的链接:</p> <pre><code>response.xpath('//ul/li/a/@href').extract() </code></pre> <p>之前提到过,每个 <code>.xpath()</code> 调用返回selector组成的list,因此我们可以拼接更多的 <code>.xpath()</code> 来进一步获取某个节点。我们将在下边使用这样的特性:</p> <pre><code class="prism language-python"><span class="token keyword">for</span> sel <span class="token keyword">in</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//ul/li'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> title <span class="token operator">=</span> sel<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'a/text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span> link <span class="token operator">=</span> 
sel<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'a/@href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span> desc <span class="token operator">=</span> sel<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span> title<span class="token punctuation">,</span> link<span class="token punctuation">,</span> desc </code></pre> <h1>mysql数据库</h1> <h3>1.启动:</h3> <pre><code class="prism language-sql">mysql <span class="token operator">-</span>u root </code></pre> <p>密码为:12345678</p> <h3>2.<strong>显示所有数据库</strong></h3> <p>输入show databases;命令,显示所有数据库</p> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> show databases<span class="token punctuation">;</span> </code></pre> <h3>3.创建数据库:</h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> create database studb<span class="token punctuation">;</span> </code></pre> <h3><strong>4. 使用数据库</strong></h3> <p>在上面显示的数据库中,实例中使用studb数据库,输入下面命令:</p> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> use studb<span class="token punctuation">;</span> </code></pre> <h3>5.创建表</h3> <pre><code class="prism language-mysql">mysql> create table test -> ( -> sid varchar(20) not null primary key, -> sname varchar(20) not null, -> sddress varchar(40) -> ); </code></pre> <h3><strong>6. 
打印表结构</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> desc t_stu<span class="token punctuation">;</span> </code></pre> <p>打印结果:</p> <pre><code class="prism language-javascript"><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span> <span class="token operator">|</span> Field <span class="token operator">|</span> Type <span class="token operator">|</span> Null <span class="token operator">|</span> Key <span class="token operator">|</span> Default <span class="token operator">|</span> Extra <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token 
operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span> <span class="token operator">|</span> sid <span class="token operator">|</span> <span class="token function">varchar</span><span class="token punctuation">(</span><span class="token number">20</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token constant">NO</span> <span class="token operator">|</span> <span class="token constant">PRI</span> <span class="token operator">|</span> <span class="token constant">NULL</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token operator">|</span> sname <span class="token operator">|</span> <span class="token function">varchar</span><span class="token punctuation">(</span><span class="token number">20</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token constant">NO</span> <span class="token 
operator">|</span> <span class="token operator">|</span> <span class="token constant">NULL</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token operator">|</span> address <span class="token operator">|</span> <span class="token function">varchar</span><span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token constant">YES</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token constant">NULL</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span> <span class="token 
number">3</span> rows <span class="token keyword">in</span> <span class="token function">set</span> <span class="token punctuation">(</span><span class="token number">0.02</span> sec<span class="token punctuation">)</span> </code></pre> <h3><strong>7. 表中增加数据</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> insert into t_stu <span class="token operator">-</span><span class="token operator">></span> select <span class="token string">'s001'</span> <span class="token punctuation">,</span> <span class="token string">'jin'</span> <span class="token punctuation">,</span> <span class="token string">'changzhou'</span> <span class="token operator">-</span><span class="token operator">></span> union <span class="token operator">-</span><span class="token operator">></span> select <span class="token string">'s002'</span> <span class="token punctuation">,</span> <span class="token string">'tom'</span> <span class="token punctuation">,</span> <span class="token string">'yangzhou'</span> <span class="token operator">-</span><span class="token operator">></span> union <span class="token operator">-</span><span class="token operator">></span> select <span class="token string">'s003'</span> <span class="token punctuation">,</span> <span class="token string">'kate'</span> <span class="token punctuation">,</span> <span class="token string">'suzhou'</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token punctuation">;</span> </code></pre> <h3><strong>8. 
查看表数据</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> select <span class="token operator">*</span> <span class="token keyword">from</span> t_stu<span class="token punctuation">;</span> </code></pre> <p>查看结果:</p> <pre><code class="prism language-javascript"><span class="token operator">|</span> sid <span class="token operator">|</span> sname <span class="token operator">|</span> address <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span> <span class="token operator">|</span> s001 <span class="token operator">|</span> jin <span class="token operator">|</span> wuhan <span class="token operator">|</span> <span class="token operator">|</span> s002 <span class="token operator">|</span> tom <span class="token operator">|</span> shanghai <span class="token operator">|</span> <span class="token operator">|</span> s003 <span class="token operator">|</span> kate <span class="token operator">|</span> suzhou <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token 
operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span> <span class="token number">3</span> rows <span class="token keyword">in</span> <span class="token function">set</span> <span class="token punctuation">(</span><span class="token number">0.01</span> sec<span class="token punctuation">)</span> </code></pre> <h3><strong>9. 修改表中数据</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> update t_stu <span class="token keyword">set</span> sname <span class="token operator">=</span> <span class="token string">"fby"</span> where sid <span class="token operator">=</span> <span class="token string">"s001"</span><span class="token punctuation">;</span> </code></pre> <h3><strong>10. 删除表中数据</strong></h3> <p>删除表中sid = “s002”的数据</p> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> <span class="token keyword">delete</span> <span class="token keyword">from</span> t_stu where sid <span class="token operator">=</span> <span class="token string">"s002"</span><span class="token punctuation">;</span> </code></pre> <h1>读csv文件</h1> <pre><code class="prism language-python"><span class="token keyword">from</span> urllib<span class="token punctuation">.</span>request <span class="token keyword">import</span> urlopen <span class="token keyword">from</span> io <span class="token keyword">import</span> StringIO <span class="token keyword">import</span> csv data <span class="token operator">=</span> urlopen<span class="token punctuation">(</span><span class="token string">'http://pythonscraping.com/files/MontyPythonAlbums.csv'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>decode<span class="token 
punctuation">(</span><span class="token string">'ascii'</span><span class="token punctuation">,</span> <span class="token string">'ignore'</span><span class="token punctuation">)</span> dataFile <span class="token operator">=</span> StringIO<span class="token punctuation">(</span>data<span class="token punctuation">)</span> csvReader <span class="token operator">=</span> csv<span class="token punctuation">.</span>reader<span class="token punctuation">(</span>dataFile<span class="token punctuation">)</span> <span class="token keyword">for</span> row <span class="token keyword">in</span> csvReader<span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>row<span class="token punctuation">)</span> </code></pre> <h1>Python使用pandas处理CSV文件</h1> <p>https://blog.csdn.net/atnanyang/article/details/70832257</p> <p>Python中有许多方便的库可以用来进行数据处理,尤其是Numpy和Pandas,再搭配matplot画图专用模块,功能十分强大。</p> <p>CSV(Comma-Separated Values)格式的文件是<strong>指以纯文本形式存储的表格数据</strong>,这意味着不能简单的使用Excel表格工具进行处理,而且Excel表格处理的数据量十分有限,而<strong>使用Pandas来处理数据量巨大的CSV文件</strong>就容易的多了。</p> <ul> <li><strong>Pandas读取本地CSV文件并设置Dataframe(数据格式)</strong></li> </ul> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd <span class="token keyword">import</span> numpy <span class="token keyword">as</span> np df<span class="token operator">=</span>pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'filename'</span><span class="token punctuation">,</span>header<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span>sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">)</span> <span class="token comment">#filename可以直接从盘符开始,标明每一级的文件夹直到csv文件,header=None表示头部为空,sep=' '表示数据间使用空格作为分隔符,如果分隔符是逗号,只需换成 ‘,’即可。</span> <span 
class="token keyword">print</span> df<span class="token punctuation">.</span>head<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span> df<span class="token punctuation">.</span>tail<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment">#作为示例,输出CSV文件的前5行和最后5行,这是pandas默认的输出5行,可以根据需要自己设定输出几行的值</span> </code></pre> <ul> <li><strong>使用pandas直接读取本地的csv文件后,csv文件的列索引默认为从0开始的数字,重定义列索引的语句如下:</strong></li> </ul> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd <span class="token keyword">import</span> numpy <span class="token keyword">as</span> np df<span class="token operator">=</span>pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'filename'</span><span class="token punctuation">,</span>header<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span>sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">,</span>names<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"week"</span><span class="token punctuation">,</span><span class="token string">'month'</span><span class="token punctuation">,</span><span class="token string">'date'</span><span class="token punctuation">,</span><span class="token string">'time'</span><span class="token punctuation">,</span><span class="token string">'year'</span><span class="token punctuation">,</span><span class="token string">'name1'</span><span class="token punctuation">,</span><span class="token string">'freq1'</span><span class="token punctuation">,</span><span class="token string">'name2'</span><span class="token punctuation">,</span><span class="token string">'freq2'</span><span class="token 
punctuation">,</span><span class="token string">'name3'</span><span class="token punctuation">,</span><span class="token string">'data1'</span><span class="token punctuation">,</span><span class="token string">'name4'</span><span class="token punctuation">,</span><span class="token string">'data2'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token keyword">print</span> df </code></pre> <h2>使用pandas按列合并CSV文件</h2> <p>1.列合并两个csv文件</p> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd df1 <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'dataset/easy29.csv'</span><span class="token punctuation">)</span> df2 <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'dataset/easy210.csv'</span><span class="token punctuation">)</span> frames <span class="token operator">=</span> <span class="token punctuation">[</span>df1<span class="token punctuation">,</span> df2<span class="token punctuation">]</span> all_csv <span class="token operator">=</span> pd<span class="token punctuation">.</span>concat<span class="token punctuation">(</span>frames<span class="token punctuation">)</span> </code></pre> <p><a href="http://img.e-com-net.com/image/info8/d0cf5b8dc58a40fab1783fe6d2cfab9c.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/d0cf5b8dc58a40fab1783fe6d2cfab9c.jpg" alt="python网络爬虫_第3张图片" width="430" height="272" style="border:1px solid black;"></a></p> <p>2.通过追加的方式合并csv文件。</p> <pre><code class="prism language-python"><span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'1.csv'</span><span class="token punctuation">,</span><span 
class="token string">'ab'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'2.csv'</span><span class="token punctuation">,</span><span class="token string">'rb'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token comment">#将2.csv内容追加到1.csv的后面</span> </code></pre> <p>3.在将多个csv文件拼接到一起的时候,可以用Python通过pandas包的read_csv和to_csv两个方法来完成。这里不采用pandas.merge()来进行csv的拼接,而只是通过简单的文件的读取和附加方式的写入来完成拼接。</p> <p>3.1</p> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd <span class="token keyword">for</span> inputfile <span class="token keyword">in</span> os<span class="token punctuation">.</span>listdir<span class="token punctuation">(</span>inputfile_dir<span class="token punctuation">)</span><span class="token punctuation">:</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span>inputfile<span class="token punctuation">,</span> header<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">)</span>                    <span class="token comment">#header=None表示原始文件数据没有列索引,这样的话read_csv会自动加上列索引</span> pd<span class="token punctuation">.</span>to_csv<span class="token punctuation">(</span>outputfile<span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'a'</span><span class="token punctuation">,</span> index<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">,</span> 
header<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>      <span class="token comment">#header=0表示不保留列名,index=False表示不保留行索引,mode='a'表示附加方式写入,文件原有内容不会被清除</span> </code></pre> <p>3.2</p> <pre><code class="prism language-python"><span class="token comment"># 将该文件夹下的所有文件名存入列表</span> csv_name_list <span class="token operator">=</span> os<span class="token punctuation">.</span>listdir<span class="token punctuation">(</span><span class="token string">'E:\jupyternotebook_space\yimiaodatas'</span><span class="token punctuation">)</span> <span class="token comment"># 获取列表的长度</span> length <span class="token operator">=</span> <span class="token builtin">len</span><span class="token punctuation">(</span>csv_name_list<span class="token punctuation">)</span> <span class="token comment"># 读取第一个CSV文件并包含表头,用于后续的csv文件拼接</span> f<span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span>csv_name_list<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">,</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span><span class="token punctuation">)</span> df <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span> f<span class="token punctuation">)</span> <span class="token comment"># 读取第一个CSV文件并保存</span> df<span class="token punctuation">.</span>to_csv<span class="token punctuation">(</span> <span class="token string">"E:\jupyternotebook_space\Alldatas.csv"</span><span class="token punctuation">,</span>index<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token comment"># 循环遍历列表中各个CSV文件名,并完成文件拼接</span> <span class="token keyword">for</span> i <span class="token keyword">in</span> 
<span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span>length<span class="token punctuation">)</span><span class="token punctuation">:</span> f<span class="token operator">=</span> <span class="token builtin">open</span><span class="token punctuation">(</span>csv_name_list<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">,</span>encoding <span class="token operator">=</span> <span class="token string">"utf-8"</span><span class="token punctuation">)</span> df <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span> f <span class="token punctuation">)</span> df<span class="token punctuation">.</span>to_csv<span class="token punctuation">(</span><span class="token string">"E:\jupyternotebook_space\Alldatas.csv"</span><span class="token punctuation">,</span>index<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">,</span> header<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'a+'</span><span class="token punctuation">)</span> </code></pre> <h2>pandas在dataframe最左侧新增一个自增列</h2> <p>有如下表格,需要在最左侧新增一列为“序号”,编号从1开始</p> <p><a href="http://img.e-com-net.com/image/info8/ce467e8caf9b4bc5bd1bd1ea4d3fefea.png" target="_blank"><img src="http://img.e-com-net.com/image/info8/ce467e8caf9b4bc5bd1bd1ea4d3fefea.png" alt="python网络爬虫_第4张图片" width="382" height="370" style="border:1px solid black;"></a></p> <p>代码如下:</p> <pre><code class="prism language-python"><span class="token comment">#打开文件</span> <span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd df <span class="token operator">=</span> pd<span class="token 
punctuation">.</span>read_excel<span class="token punctuation">(</span><span class="token string">r'test.xlsx'</span><span class="token punctuation">)</span>
<span class="token comment"># The '序号' (serial number) column counts up from 1; a new column is added at the far right of the DataFrame by default</span>
df<span class="token punctuation">[</span><span class="token string">'序号'</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span><span class="token builtin">len</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span><span class="token operator">+</span><span class="token number">1</span><span class="token punctuation">)</span>
<span class="token comment"># Reorder the columns so that the auto-increment column comes first</span>
df <span class="token operator">=</span> df<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token string">'序号'</span><span class="token punctuation">,</span><span class="token string">'seats'</span><span class="token punctuation">,</span><span class="token string">'price'</span><span class="token punctuation">,</span><span class="token string">'price-sign'</span><span class="token punctuation">]</span><span class="token punctuation">]</span>
<span class="token comment"># Write the result</span>
df<span class="token punctuation">.</span>to_excel<span class="token punctuation">(</span><span class="token string">'test_new.xlsx'</span><span class="token punctuation">,</span>index<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>
</code></pre>
<p><a href="http://img.e-com-net.com/image/info8/381bd055cd514f90b68d5f2328a8e72a.png" target="_blank"><img src="http://img.e-com-net.com/image/info8/381bd055cd514f90b68d5f2328a8e72a.png" alt="Python web crawler, image 5" width="480" height="372" style="border:1px solid black;"></a></p>
<h1>Crawling the issues of GitHub projects</h1> 
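<p>The crawler in this section turns the issue counter shown on a repository page into a number of listing pages before it fetches them. The following is a minimal sketch of just that step. It assumes, as the crawler code does, 25 issues per listing page and a fallback of 1000 when the counter is not plain digits (for example "1.2k"). The helper names <code>parse_issue_count</code> and <code>page_count</code> are hypothetical and exist only for this illustration.</p>

```python
import math


def parse_issue_count(text, fallback=1000):
    """Turn the counter text taken from the page into an int.

    Text that is not plain digits (e.g. '1.2k') falls back to `fallback`,
    an assumption that mirrors the 1000-issue cap used in this section.
    """
    text = (text or '0').strip()
    return int(text) if text.isdigit() else fallback


def page_count(total, per_page=25):
    """Number of listing pages needed for `total` issues (ceiling division)."""
    return math.ceil(total / per_page)


if __name__ == '__main__':
    for raw in ['0', '25', '26', '1.2k']:
        total = parse_issue_count(raw)
        print(raw, '->', total, 'issues,', page_count(total), 'page(s)')
```

<p>Ceiling division gives the same page number as an explicit if/else on the remainder of the division by 25, in a single expression.</p>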
<h5>Using etree.HTML() and etree.tostring() in lxml</h5> <p>https://blog.csdn.net/qq_38410428/article/details/82792730</p> <ul> <li>etree.HTML(): parses HTML text into an element tree that can be queried with XPath, and repairs the markup automatically (for example, it fills in missing opening or closing tags).</li> <li>etree.tostring(): serializes the repaired tree; the result is of type bytes.</li> </ul> <pre><code class="prism language-python"><span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree
<span class="token keyword">import</span> requests


<span class="token comment"># Get the list of projects that match the keywords</span>
<span class="token keyword">def</span> <span class="token function">get_repos_list</span><span class="token punctuation">(</span>key_words<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Initialize the list</span>
    repos_list <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token comment"># Default: request search result pages 1 to 99</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">100</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        url <span class="token operator">=</span> <span class="token string">'https://github.com/search?p='</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>i<span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">'&q='</span> <span class="token operator">+</span> key_words <span class="token operator">+</span> <span class="token string">'&type=repositories'</span>
        response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
        <span class="token comment"># Get the page source</span>
        page_source <span 
class="token operator">=</span> response<span class="token punctuation">.</span>text
        <span class="token comment"># print(page_source)</span>
        <span class="token comment"># etree.HTML(): parses the HTML text into a tree that supports XPath and repairs the markup automatically</span>
        tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_source<span class="token punctuation">)</span>
        <span class="token comment"># Get the project links</span>
        arr <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//*[@class="f4 text-normal"]/a/@href'</span><span class="token punctuation">)</span>
        repos_list <span class="token operator">+=</span> arr
    <span class="token keyword">return</span> repos_list


<span class="token comment"># Get the list of issues of one project</span>
<span class="token keyword">def</span> <span class="token function">get_issues_list</span><span class="token punctuation">(</span>repo_name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    issues_list <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    url <span class="token operator">=</span> <span class="token string">'https://github.com'</span> <span class="token operator">+</span> repo_name <span class="token operator">+</span> <span class="token string">'/issues'</span>
    <span class="token comment"># print(url)</span>
    response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    <span class="token comment"># Get the page source</span>
    page_source <span class="token operator">=</span> response<span class="token punctuation">.</span>text
    tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_source<span class="token punctuation">)</span>
    <span 
class="token comment"># Get the number of issues</span>
    number <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//*[@id="js-repo-pjax-container"]/div[1]/nav/ul/li[2]/a/span[2]'</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> <span class="token builtin">len</span><span class="token punctuation">(</span>number<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        number <span class="token operator">=</span> <span class="token string">'0'</span>
    <span class="token keyword">else</span><span class="token punctuation">:</span>
        number <span class="token operator">=</span> number<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">.</span>text
    <span class="token comment"># If the counter is over 1K (not plain digits), crawl 1000 issues, which is enough</span>
    <span class="token keyword">if</span> number<span class="token punctuation">.</span>isdigit<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        number <span class="token operator">=</span> <span class="token builtin">int</span><span class="token punctuation">(</span>number<span class="token punctuation">)</span>
    <span class="token keyword">else</span><span class="token punctuation">:</span>
        number <span class="token operator">=</span> <span class="token number">1000</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>number<span class="token punctuation">)</span>
    <span class="token comment"># Compute the number of pages, at 25 issues per page</span>
    page <span class="token operator">=</span> <span class="token number">0</span>
    <span class="token keyword">if</span> number <span class="token operator">%</span> <span class="token number">25</span> <span class="token operator">==</span> <span 
class="token number">0</span><span class="token punctuation">:</span>
        page <span class="token operator">=</span> <span class="token builtin">int</span><span class="token punctuation">(</span>number <span class="token operator">/</span> <span class="token number">25</span><span class="token punctuation">)</span>
    <span class="token keyword">else</span><span class="token punctuation">:</span>
        page <span class="token operator">=</span> <span class="token builtin">int</span><span class="token punctuation">(</span>number <span class="token operator">/</span> <span class="token number">25</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token number">1</span>
    <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> page <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        url <span class="token operator">=</span> <span class="token string">'https://github.com'</span> <span class="token operator">+</span> repo_name <span class="token operator">+</span> <span class="token string">'/issues?page='</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>i<span class="token punctuation">)</span>
        response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
        <span class="token comment"># Get the page source</span>
        page_source <span class="token operator">=</span> response<span class="token punctuation">.</span>text
        tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_source<span class="token 
punctuation">)</span>
        <span class="token comment"># Get the issue links</span>
        arr <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//*[@class="d-block d-md-none position-absolute top-0 bottom-0 left-0 right-0"]/@href'</span><span class="token punctuation">)</span>
        issues_list <span class="token operator">+=</span> arr  <span class="token comment"># /combust/mleap/issues/716</span>
    <span class="token comment"># Return the number of issues and the list</span>
    <span class="token keyword">return</span> number<span class="token punctuation">,</span> issues_list


<span class="token comment"># Get the content of one issue (only the text of the first td on the page is extracted)</span>
<span class="token keyword">def</span> <span class="token function">get_issue_content</span><span class="token punctuation">(</span>issue_name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Build the issue URL</span>
    url <span class="token operator">=</span> <span class="token string">'https://github.com'</span> <span class="token operator">+</span> issue_name
    <span class="token comment"># print(url)</span>
    response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    page_source <span class="token operator">=</span> response<span class="token punctuation">.</span>text
    tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_source<span class="token punctuation">)</span>
    <span class="token comment"># Get the issue content</span>
    issue_content <span class="token operator">=</span> tree<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//table//td'</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token 
punctuation">]</span><span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'string(.)'</span><span class="token punctuation">)</span>
    <span class="token keyword">return</span> issue_content


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    <span class="token comment"># Tests</span>
    <span class="token comment"># get_repos_list('ML pipeline')</span>
    <span class="token comment"># get_issues_list('/combust/mleap')</span>
    <span class="token comment"># get_issue_content('/combust/mleap/issues/716')</span>
    <span class="token triple-quoted-string string">'''
    issue="/rust-lang/rust/issues/76833"
    content=get_issue_content(issue)
    print(content)
    '''</span>
    <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">r'result.md'</span><span class="token punctuation">,</span> <span class="token string">'w+'</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">'utf-8'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
        key_words <span class="token operator">=</span> <span class="token builtin">input</span><span class="token punctuation">(</span><span class="token string">'please input a keyword:'</span><span class="token punctuation">)</span>
        <span class="token comment"># Get the list of projects</span>
        repos_list <span class="token operator">=</span> get_repos_list<span class="token punctuation">(</span>key_words<span class="token punctuation">)</span>
        <span class="token comment"># Format: /combust/mleap</span>
        <span class="token keyword">for</span> repo <span class="token keyword">in</span> repos_list<span class="token punctuation">:</span>
            <span class="token comment"># Build the project URL</span>
            repos_url <span 
class="token operator">=</span> <span class="token string">'https://github.com'</span> <span class="token operator">+</span> repo
            <span class="token keyword">print</span><span class="token punctuation">(</span>repos_url<span class="token punctuation">)</span>
            f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'\n\n'</span><span class="token punctuation">)</span>
            f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>repos_url<span class="token punctuation">)</span>
            f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span>
            <span class="token comment"># Get the project's list of issues</span>
            number<span class="token punctuation">,</span> issues_list <span class="token operator">=</span> get_issues_list<span class="token punctuation">(</span>repo<span class="token punctuation">)</span>
            f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token builtin">str</span><span class="token punctuation">(</span>number<span class="token punctuation">)</span><span class="token punctuation">)</span>
            f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span>
            <span class="token comment"># Format: /combust/mleap/issues/716</span>
            <span class="token keyword">for</span> issue <span class="token keyword">in</span> issues_list<span class="token punctuation">:</span>
                <span class="token comment"># Get the content of the issue</span>
                issue_url <span class="token operator">=</span> <span class="token string">'https://github.com'</span> <span class="token operator">+</span> issue
                content <span class="token operator">=</span> get_issue_content<span class="token punctuation">(</span>issue<span class="token punctuation">)</span>
                <span class="token comment"># content=filter_emoji(content)</span>
                <span class="token keyword">print</span><span class="token punctuation">(</span>issue_url<span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>issue_url<span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'>'</span> <span class="token operator">*</span> <span class="token number">100</span><span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token builtin">str</span><span class="token punctuation">(</span>content<span class="token punctuation">)</span><span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'<'</span> <span class="token operator">*</span> <span class="token number">100</span><span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span>
                f<span class="token punctuation">.</span>flush<span class="token punctuation">(</span><span class="token punctuation">)</span>
                <span class="token comment"># print(content)</span> 
            <span class="token comment"># print(issue)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'The end!'</span><span class="token punctuation">)</span>
</code></pre> <hr> <h1>Crawling commit information</h1> <h3>Getting the URL of each commits page</h3> <pre><code class="prism language-python"><span class="token keyword">import</span> re
<span class="token keyword">from</span> urllib<span class="token punctuation">.</span>request <span class="token keyword">import</span> urlopen
<span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup
<span class="token keyword">from</span> urllib <span class="token keyword">import</span> request
<span class="token keyword">import</span> time
<span class="token keyword">import</span> os
<span class="token keyword">from</span> urllib<span class="token punctuation">.</span>parse <span class="token keyword">import</span> urlparse

<span class="token triple-quoted-string string">'''
This script collects the URL of every commits page.
Next step: crawl the commit history on each page, including each commit_url, the commit time, etc.
'''</span>


<span class="token comment"># Request helper</span>
<span class="token keyword">def</span> <span class="token function">get_html</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    req <span class="token operator">=</span> request<span class="token punctuation">.</span>Request<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    response <span class="token operator">=</span> request<span class="token punctuation">.</span>urlopen<span class="token punctuation">(</span>req<span class="token punctuation">)</span>
    html <span class="token operator">=</span> response<span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>decode<span class="token punctuation">(</span><span class="token string">'utf-8'</span><span class="token 
punctuation">)</span>
    <span class="token keyword">return</span> html


<span class="token keyword">def</span> <span class="token function">get_sha</span><span class="token punctuation">(</span>user<span class="token punctuation">,</span> repo_name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Each repo of a user yields one commit SHA: that of the first commit link found on its master commits page</span>
    url <span class="token operator">=</span> <span class="token string">"https://github.com/{user}/{repo_name}/commits/master"</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>user<span class="token operator">=</span>user<span class="token punctuation">,</span> repo_name<span class="token operator">=</span>repo_name<span class="token punctuation">)</span>
    html<span class="token operator">=</span>urlopen<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    bs<span class="token operator">=</span>BeautifulSoup<span class="token punctuation">(</span>html<span class="token punctuation">,</span><span class="token string">'html.parser'</span><span class="token punctuation">)</span>
    link<span class="token operator">=</span>bs<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">,</span>href<span class="token operator">=</span>re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">"https://github.com/.*commit/(.*?)"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    commit_url<span class="token operator">=</span>link<span class="token punctuation">.</span>attrs<span class="token punctuation">[</span><span class="token string">'href'</span><span class="token punctuation">]</span>
    <span class="token comment">#print(type(commit_url)) <class 'str'></span>
    <span 
class="token comment">#print(commit_url)</span>
    <span class="token comment">#req=urlparse(commit_url)</span>
    <span class="token comment">#print(req)</span>
    list_commit<span class="token operator">=</span>commit_url<span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">'/'</span><span class="token punctuation">)</span>
    <span class="token comment">#print(list_commit[6]) (the element at index 6 is the hash)</span>
    commit_sha<span class="token operator">=</span>list_commit<span class="token punctuation">[</span><span class="token number">6</span><span class="token punctuation">]</span>
    <span class="token comment">#print(commit_sha)</span>
    <span class="token keyword">return</span> commit_sha


<span class="token keyword">def</span> <span class="token function">single_repo_commits</span><span class="token punctuation">(</span>user<span class="token punctuation">,</span> repo_name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    num <span class="token operator">=</span> <span class="token number">0</span>
    page_flag <span class="token operator">=</span> <span class="token number">66</span>  <span class="token comment"># Initial page flag, intended for detecting the last page</span>
    page_num <span class="token operator">=</span> <span class="token number">0</span>
    data_num <span class="token operator">=</span> <span class="token number">0</span>
    commit_sha <span class="token operator">=</span> get_sha<span class="token punctuation">(</span>user<span class="token punctuation">,</span> repo_name<span class="token punctuation">)</span>
    all_date <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>  <span class="token comment"># Stores the time data</span>
    url_data<span class="token operator">=</span><span class="token punctuation">[</span><span class="token punctuation">]</span>  <span class="token comment"># Stores the URL of each page</span>
    <span class="token keyword">while</span> <span class="token 
punctuation">(</span>page_flag <span class="token keyword">and</span> page_num<span class="token operator"><</span><span class="token number">5</span><span class="token punctuation">)</span><span class="token punctuation">:</span>  <span class="token comment"># Test with the first five pages</span>
        url <span class="token operator">=</span> <span class="token string">"https://github.com/{user}/{repo_name}/commits/master?after={commit_sha}+{num}&branch=master"</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>user<span class="token operator">=</span>user<span class="token punctuation">,</span> repo_name<span class="token operator">=</span>repo_name<span class="token punctuation">,</span> commit_sha<span class="token operator">=</span>commit_sha<span class="token punctuation">,</span> num<span class="token operator">=</span>num<span class="token punctuation">)</span>  <span class="token comment"># Build the link</span>
        html <span class="token operator">=</span> get_html<span class="token punctuation">(</span>url<span class="token punctuation">)</span>  <span class="token comment"># Get the page content</span>
        url_data<span class="token punctuation">.</span>append<span class="token punctuation">(</span>url<span class="token punctuation">)</span>  <span class="token comment"># The URL of each page; commit_url and commit time are then searched on this page</span>
        time_data <span class="token operator">=</span> re<span class="token punctuation">.</span>findall<span class="token punctuation">(</span><span class="token string">r'<relative-time datetime=(.*)</relative-time>'</span><span class="token punctuation">,</span>html<span class="token punctuation">)</span>  <span class="token comment"># Match the time elements with re</span>
        <span class="token comment">#page_flag = len(time_data)</span>
        page_num <span class="token operator">=</span> page_num <span class="token operator">+</span> <span class="token number">1</span>
        num <span class="token operator">=</span> num <span class="token operator">+</span><span 
class="token number">35</span>  <span class="token comment"># Move on to the next page</span>
        data_num <span class="token operator">=</span> data_num<span class="token operator">+</span><span class="token builtin">len</span><span class="token punctuation">(</span>time_data<span class="token punctuation">)</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"page %d is ok\n get %d date"</span> <span class="token operator">%</span> <span class="token punctuation">(</span>page_num<span class="token punctuation">,</span> <span class="token builtin">len</span><span class="token punctuation">(</span>time_data<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token comment">#print(time_data[0]) shows the full text of the first time_data element</span>
        <span class="token keyword">for</span> date <span class="token keyword">in</span> time_data<span class="token punctuation">:</span>
            all_date<span class="token punctuation">.</span>append<span class="token punctuation">(</span>date<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">:</span><span class="token number">20</span><span class="token punctuation">]</span><span class="token punctuation">)</span>  <span class="token comment"># Characters 1:20 hold the date; what follows are other attributes</span>
        time<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>  <span class="token comment"># Pause briefly between requests (unit: seconds)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"the repo <%s> totally get %d commits'date"</span> <span class="token operator">%</span> <span class="token punctuation">(</span>repo_name<span class="token punctuation">,</span> data_num<span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token 
keyword">print</span><span class="token punctuation">(</span>url_data<span class="token punctuation">)</span>
    <span class="token keyword">return</span> all_date


user<span class="token operator">=</span><span class="token string">'chakra-core'</span>
repo_name<span class="token operator">=</span><span class="token string">'ChakraCore'</span>
<span class="token comment">#get_sha(user,repo_name)</span>
all_data<span class="token operator">=</span>single_repo_commits<span class="token punctuation">(</span>user<span class="token punctuation">,</span>repo_name<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>all_data<span class="token punctuation">)</span>
</code></pre> <h3>The get_data function: collecting every commit_url on a given page</h3> <pre><code class="prism language-python"><span class="token keyword">from</span> urllib<span class="token punctuation">.</span>request <span class="token keyword">import</span> urlopen
<span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup
<span class="token keyword">import</span> re

<span class="token triple-quoted-string string">'''
The get_data function collects every commit_url on the given page.
Still to do: work out how to extract each commit's content (title, issue, etc.) and whether to store it as Excel.
'''</span>


<span class="token keyword">def</span> <span class="token function">get_data</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Note: these headers are defined but not passed to urlopen() below</span>
    headers <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'</span><span class="token punctuation">}</span>
    html <span class="token operator">=</span> urlopen<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
    baseurl <span class="token operator">=</span> <span class="token string">'https://github.com'</span>
    bs <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>html<span class="token punctuation">,</span> <span class="token string">'html.parser'</span><span class="token punctuation">)</span>
    pages <span class="token operator">=</span> <span class="token builtin">set</span><span class="token punctuation">(</span><span class="token punctuation">)</span>  <span class="token comment"># A set holds no duplicates</span>
    <span class="token comment"># print(bs.contents)</span>
    commit_url <span class="token operator">=</span> bs<span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">,</span> href<span class="token operator">=</span>re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">'^(/chakra-core/ChakraCore/commit/).*$'</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token comment"># print(commit_url)</span>
    fp<span class="token operator">=</span><span class="token 
builtin">open</span><span class="token punctuation">(</span><span class="token string">'commit_url.txt'</span><span class="token punctuation">,</span> <span class="token string">'w+'</span><span class="token punctuation">)</span>
    <span class="token keyword">for</span> link <span class="token keyword">in</span> commit_url<span class="token punctuation">:</span>
        <span class="token keyword">if</span> <span class="token string">'href'</span> <span class="token keyword">in</span> link<span class="token punctuation">.</span>attrs<span class="token punctuation">:</span>
            <span class="token keyword">if</span> link<span class="token punctuation">.</span>attrs<span class="token punctuation">[</span><span class="token string">'href'</span><span class="token punctuation">]</span> <span class="token keyword">not</span> <span class="token keyword">in</span> pages<span class="token punctuation">:</span>
                <span class="token comment"># We have found a new page</span>
                newPage <span class="token operator">=</span> link<span class="token punctuation">.</span>attrs<span class="token punctuation">[</span><span class="token string">'href'</span><span class="token punctuation">]</span>
                pages<span class="token punctuation">.</span>add<span class="token punctuation">(</span>newPage<span class="token punctuation">)</span>
                fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span>newPage<span class="token punctuation">)</span>  <span class="token comment"># Write the string to the file</span>
                fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">"\n"</span><span class="token punctuation">)</span>  <span class="token comment"># Line break</span>
                <span class="token keyword">print</span><span class="token punctuation">(</span>newPage<span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token 
punctuation">)</span><span class="token punctuation">)</span>
    fp<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>

get_data<span class="token punctuation">(</span><span class="token string">'https://github.com/chakra-core/ChakraCore/commits/master'</span><span class="token punctuation">)</span>
</code></pre>
<h1>Reading a file</h1>
<ol> <li>Read the data in members.txt into dictionaries, one per data row, keyed by the header fields (Name, age). The fields in the file are separated by tabs ('\t'), and the first line holds the header.</li> </ol>
<pre><code class="prism language-python">content <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'members.txt'</span><span class="token punctuation">,</span> <span class="token string">'r'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    <span class="token keyword">for</span> line <span class="token keyword">in</span> f<span class="token punctuation">.</span>readlines<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        line_list <span class="token operator">=</span> line<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">'\t'</span><span class="token punctuation">)</span>  <span class="token comment"># strip the newline, then split on tabs</span>
        content<span class="token punctuation">.</span>append<span class="token punctuation">(</span>line_list<span class="token punctuation">)</span>
keys <span class="token operator">=</span> content<span class="token punctuation">[</span><span class="token number">0</span><span class="token 
punctuation">]</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token builtin">len</span><span class="token punctuation">(</span>content<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    content_dict <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
    <span class="token keyword">for</span> k<span class="token punctuation">,</span> v <span class="token keyword">in</span> <span class="token builtin">zip</span><span class="token punctuation">(</span>keys<span class="token punctuation">,</span> content<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        content_dict<span class="token punctuation">[</span>k<span class="token punctuation">]</span> <span class="token operator">=</span> v
    <span class="token keyword">print</span><span class="token punctuation">(</span>content_dict<span class="token punctuation">)</span>

<span class="token triple-quoted-string string">'''
result:
{'Name': 'Andy', 'age': '32'}
{'Name': 'Bob', 'age': '20'}
{'Name': 'Jenny', 'age': '43'}
{'Name': 'Holly', 'age': '48'}
{'Name': 'Danie', 'age': '27'}
'''</span>
</code></pre>
<h1>What the functions do</h1>
<h3>etree.HTML(), etree.tostring()</h3>
<pre><code class="prism language-python"><span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree
<span class="token keyword">import</span> requests

url <span class="token operator">=</span> <span class="token string">'https://github.com/chakra-core/ChakraCore/issues'</span>
response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token 
punctuation">(</span>url<span class="token punctuation">)</span>
<span class="token comment"># Get the page source</span>
page_source <span class="token operator">=</span> response<span class="token punctuation">.</span>text
<span class="token comment"># print(page_source)</span>
tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_source<span class="token punctuation">)</span>
result<span class="token operator">=</span>etree<span class="token punctuation">.</span>tostring<span class="token punctuation">(</span>tree<span class="token punctuation">)</span>
<span class="token comment"># etree.HTML(): builds an element tree that can be queried with XPath and automatically repairs the HTML text.</span>
<span class="token comment"># etree.tostring(): outputs the repaired result; the return type is bytes</span>
</code></pre>
<h1>Luffycity (路飞学城) Web Crawler Tutorial</h1>
<h3>Chapter 1: Crawler basics</h3>
<pre><code class="prism language-text">Are crawlers legal or illegal?
  - They are not prohibited by law
  - They carry legal risk
  - Benign crawlers vs. malicious crawlers

The risks of crawling show up in two ways:
  - The crawler interferes with the normal operation of the site it visits
  - The crawler collects specific types of data or information that are protected by law

How do you avoid legal trouble when writing and using crawlers?
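  - Keep optimizing your program so that it does not interfere with the normal operation of the site being visited
  - When using or distributing scraped data, review what was scraped; if it contains sensitive content such as user privacy or trade secrets, stop scraping or distributing it immediately

Types of crawlers by use case
  - General-purpose crawler: an important part of a fetching system. It fetches whole pages.
  - Focused crawler: built on top of the general-purpose crawler. It extracts specific parts of a page.
  - Incremental crawler: detects updates on a site and fetches only the newly updated data.

The spear and shield of crawling
  Anti-crawling mechanisms
    Portal sites can adopt policies or technical measures to keep crawlers from scraping their data.
  Anti-anti-crawling strategies
    Crawlers can adopt policies or technical measures to defeat a site's anti-crawling mechanisms and obtain its data.

robots.txt protocol:
  A gentlemen's agreement. It states which data on a site may be crawled and which may not.

HTTP protocol
  - Concept: a form of data exchange between a server and a client.
Common request headers
  - User-Agent: identifies the client making the request
  - Connection: whether to close or keep the connection once the request completes
Common response headers
  - Content-Type: the type of data the server sends back to the client

HTTPS protocol:
  - Secure Hypertext Transfer Protocol
Encryption methods
  - Symmetric-key encryption
  - Asymmetric-key encryption
  - Certificate-based key encryption
</code></pre>
<h3>Chapter 2: requests basics</h3>
<pre><code>The requests module
  - urllib module
  - requests module

requests module: a third-party Python module for sending network requests (installed with pip, see below); powerful, simple and convenient to use.
Purpose: simulate a browser sending requests.

How to use it (the requests coding workflow):
  - Specify the URL
  - Spoof the User-Agent (UA)
  - Prepare the request parameters
  - Send the request
  - Get the response data
  - Persist the data

Installation:
  pip install requests

Hands-on coding:
  - Task: scrape the page data of the Sogou home page

Practice:
  - Task: scrape the Sogou search results page for a given term (a simple page collector)
    - UA detection
    - UA spoofing
  - Task: crack Baidu Translate
    - POST request (carries parameters)
    - The response is a set of JSON data
  - Task: scrape the movie detail data from the Douban movie category rankings at https://movie.douban.com/
  - Homework: scrape the restaurant data for a given location from the KFC restaurant finder at http://www.kfc.com.cn/kfccda/index.aspx
  - Task: scrape the cosmetics production licence data of the People's Republic of China from the National Medical Products Administration (国家药品监督管理总局) site http://125.35.6.84:81/xk/
    - Dynamically loaded data
    - The company information on the home page is requested dynamically via AJAX.
      http://125.35.6.84:81/xk/itownet/portal/dzpz.jsp?id=e6c1aa332b274282b04659a6ea30430a
      http://125.35.6.84:81/xk/itownet/portal/dzpz.jsp?id=f63f61fe04684c46a016a45eac8754fe
    - Observing the detail-page URLs shows:
      - The domain is always the same; only the id parameter differs
      - The id values can be taken from the JSON returned by the home page's AJAX request
      - Joining the domain and an id value gives the full detail-page URL of a company
    - The company details on the detail page are also loaded dynamically
      - http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
      - http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
      - Observation shows:
        - All the POST requests use the same URL; only the id parameter differs.
        - Once the ids of many companies have been collected in bulk, each id can be combined with the URL to form the AJAX request for that company's detail data

Data parsing:
  Focused crawler
  Regular expressions
  bs4
  xpath
</code></pre>
<h3>Chapter 3: Data parsing</h3>
<pre><code>Focused crawler: scrapes specified content from a page.
  - Coding workflow:
    - Specify the URL
    - Send the request
    - Get the response data
    - Parse the data
    - Persist the data

Types of data parsing:
  - Regular expressions
  - bs4
  - xpath (***)

Overview of how data parsing works:
  - The text fragments to be extracted are stored either between tags or in tag attributes
  - 1. Locate the target tag
  - 2. Extract (parse) the data stored in the tag or in its attributes

Regex parsing:
<div class="thumb"> <a href="/article/121721100" 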
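target="_blank"> <img src="//pic.qiushibaike.com/system/pictures/12172/121721100/medium/DNXDX9TZ8SDU6OK2.jpg" alt="指引我有前进的方向"> </a> </div>
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

Data parsing with bs4
  - How data parsing works:
    - 1. Locate the tag
    - 2. Extract the data stored in the tag or in its attributes
  - How bs4 parsing works:
    - 1. Instantiate a BeautifulSoup object and load the page source into it
    - 2. Call the object's attributes and methods to locate tags and extract data
  - Installation:
    - pip install bs4
    - pip install lxml
  - How to instantiate a BeautifulSoup object:
    - from bs4 import BeautifulSoup
    - Instantiation:
      - 1. Load the data from a local HTML document into the object
            fp = open('./test.html','r',encoding='utf-8')
            soup = BeautifulSoup(fp,'lxml')
      - 2. Load page source fetched from the internet into the object
            page_text = response.text
            soup = BeautifulSoup(page_text,'lxml')
  - Methods and attributes for data parsing:
    - soup.tagName: returns the first tagName tag that appears in the document
    - soup.find():
      - find('tagName'): equivalent to soup.tagName (find('div') is the same as soup.div)
      - Locating by attribute:
        - soup.find('div',class_/id/attr='song')
    - soup.find_all('tagName'): returns all matching tags (a list)
    - select:
      - select('a selector (id, class, tag, ...)'): returns a list.
      - Hierarchical selectors:
        - soup.select('.tang > ul > li > a'): > denotes a single level
        - soup.select('.tang > ul a'): a space denotes multiple levels
    - Getting the text between tags:
      - soup.a.text/string/get_text()
      - text/get_text(): returns all the text inside a tag
      - string: returns only the text directly inside the tag
    - Getting an attribute value:
      - soup.a['href']

xpath parsing: the most commonly used, most convenient and efficient parsing method, and a general-purpose one.
  - How xpath parsing works:
    - 1. Instantiate an etree object and load the page source to be parsed into it.
    - 2. Call the etree object's xpath method with an xpath expression to locate tags and capture content.
  - Installation:
    - pip install lxml
  - How to instantiate an etree object: from lxml import etree
    - 1. Load the source of a local HTML document into an etree object:
          etree.parse(filePath)
    - 2. Load source fetched from the internet into the object:
          etree.HTML(page_text)
  - xpath('xpath expression')
  - xpath expressions:
    - /: starts locating from the root node; denotes a single level.
    - //: denotes multiple levels; can start locating from any position.
    - Locating by attribute: //div[@class='song']   tag[@attrName="attrValue"]
    - Locating by index: //div[@class="song"]/p[3]   indexes start at 1.
    - Getting text:
      - /text(): the text directly inside the tag
      - //text(): all the text inside the tag, including text that is not directly inside it
    - Getting an attribute:
          /@attrName   ==> img/@src

Homework: scrape the free résumé templates from Chinaz Sucai (站长素材)
</code></pre>
<h3>Chapter 4: CAPTCHAs</h3>
<pre><code>CAPTCHA recognition

The love-hate relationship between CAPTCHAs and crawlers?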
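  Anti-crawling mechanism: CAPTCHAs. Recognize the data in the CAPTCHA image so it can be used for simulated login.

Ways to recognize a CAPTCHA:
  - Manually, by eye (not recommended)
  - Automatically, with a third-party service (recommended)
    - Yundama (云打码): http://www.yundama.com/demo.html
Yundama workflow:
  - Register: a regular account and a developer account
  - Log in:
    - Regular user login: check whether the account still has credits left
    - Developer login:
      - Create a software entry: My Software -> Add New Software -> enter the software name -> Submit (yields a software id and key)
      - Download the sample code: Developer Docs -> Click here to download: Yundama API DLL -> PythonHTTP sample download

Hands-on: recognize the CAPTCHA on the login page of gushiwen (古诗文网).
Coding workflow for recognizing a CAPTCHA with a CAPTCHA-solving platform:
  - Download the CAPTCHA image locally
  - Call the platform's sample code to recognize the image data
</code></pre>
<h3>Chapter 5: Advanced requests</h3>
<pre><code>Simulated login:
  - Scrape information that belongs to a particular user.
Task: simulate logging in to Renren (人人网).
  - Clicking the login button sends a POST request
  - The POST request carries the login information entered beforehand (user name, password, CAPTCHA, ...)
  - CAPTCHA: changes on every request

Task: scrape the current user's information (the user information shown on the personal home page)

A property of HTTP/HTTPS: statelessness.
Why the expected page data was not returned:
  When the second request, for the personal home page, is sent, the server does not know that it comes from a logged-in session.

cookie: lets the server record the client's state.
  - Manual handling: capture the cookie value with a packet-capture tool and put it into headers (not recommended)
  - Automatic handling:
    - Where does the cookie value come from?
      - The server creates it after the simulated-login POST request.
    session object:
      - Purpose:
          1. It can send requests.
          2. If a cookie is produced during a request, it is automatically stored in and carried by the session object.
      - Create a session object: session = requests.Session()
      - Send the simulated-login POST request with the session object (the cookie is stored in the session)
      - Send the GET request for the personal home page with the session object (it carries the cookie)

Proxies: defeat the anti-crawling mechanism of IP banning.
What is a proxy:
  - A proxy server.
What a proxy does:
  - Gets around access restrictions on your own IP.
  - Hides your real IP
Proxy-related sites:
  - Kuaidaili (快代理)
  - Xici Daili (西祠代理)
  - www.goubanjia.com
Proxy IP types:
  - http: for URLs that use the http protocol
  - https: for URLs that use the https protocol
Proxy anonymity levels:
  - Transparent: the server knows the request used a proxy and also knows the real IP
  - Anonymous: the server knows a proxy was used but not the real IP
  - High anonymity: the server knows neither that a proxy was used nor the real IP
</code></pre>
<h3>Chapter 6: High-performance asynchronous crawlers</h3>
<pre><code>High-performance asynchronous crawlers
Goal: use asynchrony in a crawler to scrape data with high performance.

Approaches to asynchronous crawling:
  - 1. Multithreading / multiprocessing (not recommended):
      Pro: a separate thread or process can be started for each blocking operation, so blocking operations run asynchronously.
      Con: threads or processes cannot be started without limit.
  - 2. Thread pools / process pools (use in moderation):
      Pro: reduces how often the system creates and destroys threads or processes, which lowers system overhead.
      Con: the number of threads or processes in the pool is capped.
  - 3. Single thread + asynchronous coroutines (recommended):
      event_loop: the event loop, effectively an infinite loop. Functions can be registered on it,
          and when certain conditions are met the loop executes them.
      coroutine: a coroutine object. It can be registered on the event loop, which will then call it.
          A method defined with the async keyword is not executed immediately when called; it returns
          a coroutine object instead.
      task: a further wrapper around a coroutine object that tracks the task's states.
      future: represents a task that will run in the future or has not run yet; essentially the same as a task.
      async: defines a coroutine.
      await: suspends execution at a blocking call.
</code></pre>
<h3>Chapter 7: Handling dynamically loaded data</h3>
<pre><code>Basic use of the selenium module

Question: how does the selenium module relate to crawlers?
  - It makes it easy to get a site's dynamically loaded data
  - It makes simulated login easy

What is the selenium module?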
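  - A module for browser automation.

How to use selenium:
  - Installation: pip install selenium
  - Download a browser driver (Google Chrome)
    - Download: http://chromedriver.storage.googleapis.com/index.html
    - Driver-to-browser version mapping: http://blog.csdn.net/huilan_same/article/details/51896672
  - Instantiate a browser object
  - Write the browser-automation code
    - Send a request: get(url)
    - Locate tags: the find family of methods
    - Interact with tags: send_keys('xxx')
    - Run JavaScript: execute_script('jsCode')
    - Back and forward: back(), forward()
    - Close the browser: quit()
  - Handling iframes with selenium
    - If the target tag is inside an iframe, you must call switch_to.frame(id) first
    - Action chains (dragging): from selenium.webdriver import ActionChains
      - Instantiate an action chain object: action = ActionChains(bro)
      - click_and_hold(div): press and hold on the element
      - move_by_offset(x,y)
      - perform(): executes the action chain immediately
      - action.release(): releases the held mouse button

12306 simulated login
  - Chaojiying (超级鹰): http://www.chaojiying.com/about.html
    - Register: regular user
    - Log in: regular user
      - Credits: check the balance and top up
      - Create a software entry (id)
      - Download the sample code
  - Coding workflow for the 12306 simulated login:
    - Open the login page with selenium
    - Take a screenshot of the page selenium currently has open
    - Crop the CAPTCHA region out of the screenshot
      - Benefit: the CAPTCHA image corresponds exactly to this login attempt.
    - Recognize the CAPTCHA image with Chaojiying (it returns coordinates)
    - Use an action chain to click at those coordinates
    - Enter the user name and password, then click the login button to log in
</code></pre>
<h3>Chapter 8: The scrapy framework</h3>
<pre><code>The scrapy framework
  - What is a framework?
    - A project template that bundles many features and is highly general-purpose.
  - How do you learn a framework?
    - Learn in detail how to use the features it wraps.
  - What is scrapy?
    - A well-known, ready-made crawling framework. Features: high-performance persistent storage, asynchronous downloading, high-performance data parsing, distributed crawling
  - Basic use of scrapy
    - Installation:
      - mac or linux: pip install scrapy
      - windows:
        - pip install wheel
        - Download twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        - Install twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
        - pip install pywin32
        - pip install scrapy
        Test: type the scrapy command in a terminal; if no error is reported, the installation succeeded!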
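    - Create a project: scrapy startproject xxxPro
    - cd xxxPro
    - Create a spider file in the spiders subdirectory
      - scrapy genspider spiderName www.xxx.com
    - Run the project:
      - scrapy crawl spiderName
  - Data parsing in scrapy
  - Persistent storage in scrapy
    - Via terminal command:
      - Requirement: only the return value of the parse method can be stored to a local file
      - Note: the output file type can only be one of: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
      - Command: scrapy crawl xxx -o filePath
      - Pro: concise, efficient and convenient
      - Con: quite limited (data can only be stored in files with the listed extensions)
      https://www.bilibili.com/video/BV1ha4y1H7sx?p=64&amp;spm_id_from=pageDriver
    - Via pipelines:
      - Coding workflow:
        - Parse the data
        - Define the relevant fields in the item class
        - Wrap the parsed data in an item object
        - Submit the item object to the pipeline for persistent storage
        - In the pipeline class's process_item, persist the data stored in the received item object
        - Enable the pipeline in the settings file
      - Pro:
        - Highly general-purpose.
    - Interview question: how do you store one copy of the scraped data locally and another in a database?
      - Each pipeline class in the pipelines file stores data to one kind of destination
      - An item submitted by the spider is received only by the first pipeline class to run
      - return item in process_item passes the item on to the next pipeline class to run
  - Whole-site crawling based on Spider
    - That is, scraping the pages for every page number under one section of a site
    - Task: scrape the names of the photos on 校花网
    - Approaches:
      - Add the URLs of all pages to the start_urls list (not recommended)
      - Send the requests manually (recommended)
        - Sending a request manually:
          - yield scrapy.Request(url,callback): callback is used specifically for data parsing
  - The five core components
      Engine (Scrapy)
          Handles the data flow of the whole system and triggers events (the core of the framework)
      Scheduler
          Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses or links of pages to fetch): it decides which URL to fetch next and also removes duplicate URLs
      Downloader
          Downloads page content and returns it to the spiders (the Scrapy downloader is built on twisted, an efficient asynchronous model)
      Spiders
          Spiders do the main work: they extract the information you need, the so-called items, from specific pages. You can also extract links from them so that Scrapy goes on to fetch the next page
      Item Pipeline
          Processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. Once a page has been parsed by a spider, the items are sent to the item pipeline and processed in a defined order.
  - Passing data between requests
    - Use case: the data to be parsed is not all on one page (deep crawling)
    - Task: scrape job titles and job descriptions from boss
  - Scraping image data with ImagesPipeline
    - How does scraping string data differ from scraping image data in scrapy?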
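      - Strings: parse with xpath and submit to a pipeline for persistent storage
      - Images: xpath yields the image's src attribute value; a separate request to that address fetches the image's binary data
    - ImagesPipeline:
      - Just parse the img src attribute value and submit it to the pipeline; the pipeline requests the src, gets the image's binary data, and also persists it for us.
    - Task: scrape the high-resolution images on Chinaz Sucai (站长素材)
    - Workflow:
      - Parse the data (the image addresses)
      - Submit the item holding the image address to the designated pipeline class
      - In the pipelines file, define a custom pipeline class based on ImagesPipeline
        - get_media_requests
        - file_path
        - item_completed
      - In the settings file:
        - Set the image storage directory: IMAGES_STORE = './imgs_bobo'
        - Enable the pipeline: the custom pipeline class
  - Middleware
    - Downloader middleware
      - Position: between the engine and the downloader
      - Purpose: intercept all requests and responses in the project
      - Intercepting requests:
        - UA spoofing: process_request
        - Proxy IP: process_exception: return request
      - Intercepting responses:
        - Tamper with the response data or the response object
        - Task: scrape news data (titles and content) from NetEase News (网易新闻)
          - 1. Parse the detail-page URLs of the five sections from the NetEase News home page (not dynamically loaded)
          - 2. The news titles in each section are dynamically loaded
          - 3. Parse each article's detail-page URL, fetch the detail page's source, and parse out the news content
  - CrawlSpider: a class, a subclass of Spider
    - Ways to crawl a whole site
      - With Spider: manual requests
      - With CrawlSpider
    - Using CrawlSpider:
      - Create a project
      - cd XXX
      - Create the spider file (CrawlSpider):
        - scrapy genspider -t crawl xxx www.xxxx.com
        - Link extractor:
          - Purpose: extract links according to the given rule (allow)
        - Rule parser:
          - Purpose: parse the links extracted by the link extractor with the given rule (callback)
      # Task: scrape the number, news title and news content from the sun site
        - Analysis: the data to scrape is not all on one page.
        - 1. Use a link extractor to extract all the page-number links
        - 2. Use a link extractor to extract all the news detail-page links
  - Distributed crawling
    - Concept: set up a cluster of machines and have them jointly crawl one set of resources.
    - Purpose: speed up data scraping
    - How is it implemented?
      - Install the scrapy-redis component
      - Plain scrapy cannot do distributed crawling; it has to be combined with the scrapy-redis component.
      - Why can't plain scrapy do distributed crawling?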
        - The scheduler cannot be shared across the cluster
        - The pipeline cannot be shared across the cluster
      - What scrapy-redis does:
        - Gives plain scrapy a pipeline and a scheduler that can be shared
    - Implementation steps
      - Create a project
      - Create a CrawlSpider-based spider file
      - Modify the spider file:
        - Import: from scrapy_redis.spiders import RedisCrawlSpider
        - Comment out start_urls and allowed_domains
        - Add a new attribute: redis_key = 'sun', the name of the shared scheduler queue
        - Write the data-parsing code
        - Change the spider class's parent class to RedisCrawlSpider
      - Modify the settings file
        - Use the shareable pipeline:
            ITEM_PIPELINES = {
                'scrapy_redis.pipelines.RedisPipeline': 400
            }
        - Specify the scheduler:
            # Adds a dedup-filter class that stores request fingerprints in a Redis set, so request deduplication is persistent
            DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
            # Use scrapy-redis's own scheduler
            SCHEDULER = "scrapy_redis.scheduler.Scheduler"
            # Whether the scheduler persists: when the crawl ends, should the request queue and the fingerprint set in Redis be kept? True keeps the data; otherwise it is cleared
            SCHEDULER_PERSIST = True
        - Specify the redis server:
      - redis configuration:
        - Edit the redis configuration file:
          - linux or mac: redis.conf
          - windows: redis.windows.conf
          - Open the configuration file and change:
            - Delete bind 127.0.0.1
            - Turn off protected mode: change protected-mode yes to no
        - Start the redis server with the configuration file
          - redis-server configuration-file
        - Start the client:
          - redis-cli
      - Run the project:
        - scrapy runspider xxx.py
      - Put a start URL into the scheduler's queue:
        - The scheduler's queue is accessed from the redis client
        - lpush xxx www.xxx.com
      - The scraped data is stored in redis under the proName:items data structure
</code></pre>
<h3>Chapter 9: Incremental crawlers</h3>
<pre><code>Incremental crawlers
  - Concept: monitor a site for data updates and scrape only the newly updated data.
  - Analysis:
    - Specify a start URL
    - Get the other page-number links with CrawlSpider
    - Request those page-number links via Rule
    - Parse each movie's detail-page URL from the source of each page
    - Core: check whether a movie detail-page URL has been requested before
      - Store the URLs of the detail pages already scraped
        - Store them in a redis set
    - Request the detail-page URL, then parse out the movie's title and synopsis
    - Persist the data
</code></pre>
<h1>Analysing dynamically loaded pages, POST request parameters and content scraping</h1>
<p>https://blog.csdn.net/Strive_0902/article/details/88972722</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree
<span class="token keyword">import</span> time
<span class="token keyword">import</span> os
<span class="token keyword">import</span> sys
<span class="token keyword">import</span> json

ua <span class="token operator">=</span> <span class="token string">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"</span>
cookie1 <span class="token operator">=</span> <span class="token string">"trs_uv=jtz38ebv_373_14pv; BIGipServerjigou=1079027904.20480.0000; JSESSIONID=gyDbm3t9JVAlnN7VBkEH7Gk9CrEcAsd65-YfiCCqMLv-IkyP53TY!499435313"</span>
host1 <span class="token operator">=</span> <span class="token string">"jg.sac.net.cn"</span>
origin1 <span class="token operator">=</span> <span class="token string">"http://jg.sac.net.cn"</span>
data1 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"filter_EQS_O#otc_id"</span><span class="token punctuation">:</span><span class="token string">"01"</span><span class="token punctuation">,</span><span class="token string">"filter_EQS_O#sac_id"</span><span class="token punctuation">:</span><span class="token string">""</span><span class="token punctuation">,</span><span class="token string">"filter_LIKES_aoi_name"</span><span class="token punctuation">:</span><span class="token string">""</span><span class="token punctuation">,</span><span class="token string">"sqlkey"</span><span class="token punctuation">:</span> <span class="token string">"publicity"</span><span class="token punctuation">,</span><span class="token string">"sqlval"</span><span class="token punctuation">:</span> <span class="token string">"ORG_BY_TYPE_INFO"</span><span class="token punctuation">}</span>
headers1 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">'User-agent'</span><span class="token punctuation">:</span> ua<span class="token punctuation">,</span><span class="token string">'Cookie'</span><span class="token punctuation">:</span>cookie1<span class="token punctuation">,</span><span class="token string">'Host'</span><span class="token punctuation">:</span>host1<span class="token punctuation">,</span><span class="token string">'Origin'</span><span class="token punctuation">:</span>origin1<span class="token 
punctuation">}</span>
Base_url <span class="token operator">=</span> <span class="token string">"http://jg.sac.net.cn/pages/publicity/resource!search.action"</span>
page_url <span class="token operator">=</span> <span class="token string">"http://jg.sac.net.cn/pages/publicity/resource!search.action"</span>
req <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>Base_url<span class="token punctuation">,</span>data <span class="token operator">=</span> data1<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers1<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>req<span class="token punctuation">.</span>text<span class="token punctuation">)</span>
res <span class="token operator">=</span> req<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment">#print(res[0]['AOI_ID'])</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>res<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    page_data1 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"filter_EQS_aoi_id"</span><span class="token punctuation">:</span> res<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'AOI_ID'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token string">"sqlkey"</span><span class="token punctuation">:</span> <span class="token string">"publicity"</span><span 
class="token punctuation">,</span> <span class="token string">"sqlval"</span><span class="token punctuation">:</span> <span class="token string">"SELECT_ZQ_REG_INFO"</span><span class="token punctuation">}</span>
    page_data2 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"filter_EQS_aoi_id"</span><span class="token punctuation">:</span> res<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'AOI_ID'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token string">"sqlkey"</span><span class="token punctuation">:</span> <span class="token string">"publicity"</span><span class="token punctuation">,</span> <span class="token string">"sqlval"</span><span class="token punctuation">:</span> <span class="token string">"SEARCH_ZQGS_QUALIFATION"</span><span class="token punctuation">}</span>
    company_info <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
    page_req1 <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>page_url<span class="token punctuation">,</span> data<span class="token operator">=</span>page_data1<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers1<span class="token punctuation">)</span><span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
    page_req2 <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>page_url<span class="token punctuation">,</span> data<span class="token operator">=</span>page_data2<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers1<span class="token 
punctuation">)</span><span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Chinese_Name"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_CHINESE_NAME'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Info_Reg"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_INFO_REG'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Legal_Represent"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_LEGAL_REPRESENTATIVE'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"License_Code"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_LICENSE_CODE'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Reg_Capital"</span><span 
class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_REG_CAPITAL'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Office_Address"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_OFFICE_ADDRESS'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Office_Post_Code"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_OFFICE_ZIP_CODE'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Com_Website"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_COM_WEBSITE'</span><span class="token punctuation">]</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Customer_Service_Tel"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token 
punctuation">[</span><span class="token string">'MRI_CUSTOMER_SERVICE_TEL'</span><span class="token punctuation">]</span>
    <span class="token comment"># print(page_req2)</span>
    <span class="token comment"># exit()</span>
    con <span class="token operator">=</span> <span class="token string">""</span>
    <span class="token keyword">for</span> j <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>page_req2<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        con <span class="token operator">+=</span> page_req2<span class="token punctuation">[</span>j<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'PTSC_NAME'</span><span class="token punctuation">]</span><span class="token operator">+</span><span class="token string">","</span>
    company_info<span class="token punctuation">[</span><span class="token string">"Qualification_info"</span><span class="token punctuation">]</span> <span class="token operator">=</span> con
    <span class="token keyword">try</span><span class="token punctuation">:</span>
        <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"2.json"</span><span class="token punctuation">,</span> <span class="token string">'a+'</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> fp<span class="token punctuation">:</span>
            fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span>json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>company_info<span class="token punctuation">,</span> 
ensure_ascii<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">"\n"</span><span class="token punctuation">)</span>
    <span class="token keyword">except</span> IOError <span class="token keyword">as</span> err<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'error'</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>err<span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre>
<h3>HTTP status codes explained:</h3>
<p>https://blog.csdn.net/ithomer/article/details/10240351</p>
<p>When a user clicks a link or a search engine sends a request to a web server, the server returns a status code in the HTTP response header. The most common ones are:</p>
<p>1. HTTP/1.1 200 OK: normal access<br> The request succeeded; this is the status of a site that can be reached normally.</p>
<p>2. HTTP/1.1 301 Moved Permanently: permanent redirect<br> A redirect that is relatively friendly to search engines. When a site changes its domain, the old domain can be 301-redirected to the new one so that the old domain's ranking weight carries over. It is also common to 301-redirect the domain without www to the one with www, for example xxx.com to www.xxx.com.</p>
<p>3. HTTP/1.1 302 Found: temporary redirect<br> Easily judged as cheating by search engines; examples include response.Redirect() in ASP programs, JavaScript redirects and static HTTP redirects.</p>
<p>4. HTTP/1.1 400 Bad Request: malformed request (often a domain-binding error)<br> The server cannot process the request as sent. On hosted sites it usually appears when the domain has not been bound successfully on the server or has not been registered.</p>
<p>5. HTTP/1.1 403 Forbidden: no permission to access the site<br> Typical causes: your IP has been blacklisted; too many users are connected and you can retry later; or the domain resolves to the hosting space but the space has not been bound to the domain.</p>
<p>6. HTTP/1.1 404 Not Found: the file or directory does not exist<br> The requested file or directory does not exist or has been deleted. When setting up a 404 error page, make sure it actually returns status 404; a misconfigured error page that returns something other than 404 for missing pages often leads to search-engine penalties.</p>
<p>7. HTTP/1.1 500 Internal Server Error: program or server error<br> An internal server program error. It usually means the page's program contains an error, such as a small syntax error or a failed data connection.</p>
<h3>curl</h3>
<table> <thead> <tr> <th>Option</th> <th>Description</th> <th>Example</th> </tr> </thead> <tbody> <tr> <td>-A</td> <td>Set the user-agent</td> <td>curl -A "Chrome" 
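http://www.baidu.com</td> </tr> <tr> <td>-X</td> <td>Send the request with the given method</td> <td>curl -X POST http://httpbin.org/post</td> </tr> <tr> <td>-I</td> <td>Return only the response headers</td> <td></td> </tr> <tr> <td>-d</td> <td>Send a POST request to the url with the given parameters</td> <td>-d a=1 -d b=2 -d c=3 | -d "a=1&amp;b=2&amp;c=3" | -d @filename</td> </tr> <tr> <td>-O</td> <td>Download a file and save it under its remote file name</td> <td></td> </tr> <tr> <td>-o</td> <td>Download a file and save it under the given file name</td> <td></td> </tr> <tr> <td>-H</td> <td>Set a request header</td> <td></td> </tr> <tr> <td>-k</td> <td>Allow insecure SSL requests</td> <td></td> </tr> </tbody> </table>
<p>https://www.ruanyifeng.com/blog/2019/09/curl-reference.html</p>
<h1>AJAX: Atguigu (尚硅谷) Tutorial</h1>
<p>https://www.wrysmile.cn/Learn-AJAX.html</p>
<h3>I. Basics</h3>
<h4>1.AJAX</h4>
<ul> <li>AJAX stands for Asynchronous JS and XML; it lets the browser send asynchronous requests to the server</li> <li>Advantages: <ul> <li>Communicates with the server without refreshing the page</li> <li>Lets part of the page be updated in response to user events</li> </ul> </li> <li>Disadvantages: <ul> <li>No browsing history, so the user cannot go back</li> <li>Subject to cross-origin restrictions (same-origin policy)</li> <li>Not SEO-friendly</li> </ul> </li> </ul>
<h4>2.XML</h4>
<ul> <li> <p>XML is designed to transport and store data</p> </li> <li> <p>Differences between XML and HTML:</p> <ul> <li>XML has no predefined tags; every tag is custom and is used to represent data</li> <li>HTML uses only predefined tags</li> </ul> </li> <li> <p>It has now been replaced by JSON</p> </li> </ul>
<p>Express server-side framework: basic usage</p>
<h4>3.HTTP</h4>
<ul> <li>The Hypertext Transfer Protocol; it specifies in detail the rules by which browsers and World Wide Web servers communicate</li> </ul>
<h5>(1). Request message</h5>
<ul> <li> <p>Request line: GET or POST / URL / HTTP version</p> </li> <li> <p>Request headers: key-value pairs</p> <ul> <li>Host:xxxx</li> <li>Cookie:name=wrysmile</li> </ul> </li> <li> <p>Blank line: always present</p> </li> <li> <p>Request body:</p> <ul> <li>For a GET request, the body is empty</li> <li>For a POST request, the body may be non-empty</li> </ul> </li> </ul>
<h5>(2). Response message</h5>
<ul> <li>Status line: HTTP version / status code / status text <ul> <li>1xx: informational; the server has received the request and the requester should continue</li> <li>2xx: success; the request was received and processed successfully</li> <li>3xx: redirection; further action is needed to complete the request</li> <li>4xx: client error; the request contains a syntax error or cannot be fulfilled</li> <li>5xx: server error; the server failed while processing the request</li> </ul> </li> <li>Response headers: <ul> <li>Content-Type:text/html;charset=utf-8</li> </ul> </li> <li>Blank line: always present</li> <li>Response body: for example, the entire HTML content</li> </ul>
</div> </div> </div> </div> </div> <!--PC和WAP自适应版--> <div id="SOHUCS" 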
sid="1529245259433799680"></div> <script type="text/javascript" src="/views/front/js/chanyan.js"></script> <!-- 文章页-底部 动态广告位 --> <div class="youdao-fixed-ad" id="detail_ad_bottom"></div> </div> <div class="col-md-3"> <div class="row" id="ad"> <!-- 文章页-右侧1 动态广告位 --> <div id="right-1" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_1"> </div> </div> <!-- 文章页-右侧2 动态广告位 --> <div id="right-2" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_2"></div> </div> <!-- 文章页-右侧3 动态广告位 --> <div id="right-3" class="col-lg-12 col-md-12 col-sm-4 col-xs-4 ad"> <div class="youdao-fixed-ad" id="detail_ad_3"></div> </div> </div> </div> </div> </div> </div> <div class="container"> <h4 class="pt20 mb15 mt0 border-top">你可能感兴趣的:(爬虫,python,爬虫,pycharm)</h4> <div id="paradigm-article-related"> <div class="recommend-post mb30"> <ul class="widget-links"> <li><a href="/article/1882817355252297728.htm" title="python execjs库_python3调用js的库之execjs" target="_blank">python execjs库_python3调用js的库之execjs</a> <span class="text-muted">一盏Online</span> <a class="tag" taget="_blank" href="/search/python/1.htm">python</a><a class="tag" taget="_blank" href="/search/execjs%E5%BA%93/1.htm">execjs库</a> <div>针对现在大部分的网站都是使用js加密,js加载的,并不能直接抓取出来,这时候就不得不适用一些三方类库来执行js语句执行JS的类库:execjs,PyV8,selenium,node这里主要讲一下execjs,一个比较好用且容易上手的类库(支持py2,与py3),支持JSruntime。(一)安装:pipinstallPyExecJSoreasy_installPyExecJS(二)运行时环境exe</div> </li> <li><a href="/article/1882817229007941632.htm" title="Python 执行 javascript PyExecJS 模块" target="_blank">Python 执行 javascript PyExecJS 模块</a> <span class="text-muted">weixin_30376083</span> <a class="tag" taget="_blank" href="/search/python/1.htm">python</a><a class="tag" taget="_blank" href="/search/javascript/1.htm">javascript</a><a class="tag" taget="_blank" href="/search/json/1.htm">json</a><a class="tag" taget="_blank" href="/search/ViewUI/1.htm">ViewUI</a> 
<div>PyExecJS安装pipinstallPyExecJSPyExecJS的基本使用:>>>importexecjs>>>execjs.eval("'redyellowblue'.split('')")['red','yellow','blue']>>>ctx=execjs.compile("""...functionadd(x,y){...returnx+y;...}...""")>>>ctx.c</div> </li> <li><a href="/article/3613.htm" title="windows命令行设置wifi" target="_blank">windows命令行设置wifi</a> <span class="text-muted">surfingll</span> <a class="tag" taget="_blank" href="/search/windows/1.htm">windows</a><a class="tag" taget="_blank" href="/search/wifi/1.htm">wifi</a><a class="tag" taget="_blank" href="/search/%E7%AC%94%E8%AE%B0%E6%9C%ACwifi/1.htm">笔记本wifi</a> <div>还没有讨厌无线wifi的无尽广告么,还在耐心等待它慢慢启动么 教你命令行设置 笔记本电脑wifi: 1、开启wifi命令 netsh wlan set hostednetwork mode=allow ssid=surf8 key=bb123456 netsh wlan 
start hostednetwork pause 其中pause是等待输入,可以去掉 2、</div> </li> <li><a href="/article/3740.htm" title="Linux(Ubuntu)下安装sysv-rc-conf" target="_blank">Linux(Ubuntu)下安装sysv-rc-conf</a> <span class="text-muted">wmlJava</span> <a class="tag" taget="_blank" href="/search/linux/1.htm">linux</a><a class="tag" taget="_blank" href="/search/ubuntu/1.htm">ubuntu</a><a class="tag" taget="_blank" href="/search/sysv-rc-conf/1.htm">sysv-rc-conf</a> <div>安装:sudo apt-get install sysv-rc-conf 使用:sudo sysv-rc-conf 操作界面十分简洁,你可以用鼠标点击,也可以用键盘方向键定位,用空格键选择,用Ctrl+N翻下一页,用Ctrl+P翻上一页,用Q退出。     背景知识 sysv-rc-conf是一个强大的服务管理程序,群众的意见是sysv-rc-conf比chkconf</div> </li> <li><a href="/article/3867.htm" title="svn切换环境,重发布应用多了javaee标签前缀" target="_blank">svn切换环境,重发布应用多了javaee标签前缀</a> <span class="text-muted">zengshaotao</span> <a class="tag" taget="_blank" href="/search/javaee/1.htm">javaee</a> <div>更换了开发环境,从杭州,改变到了上海。svn的地址肯定要切换的,切换之前需要将原svn自带的.svn文件信息删除,可手动删除,也可通过废弃原来的svn位置提示删除.svn时删除。   然后就是按照最新的svn地址和规范建立相关的目录信息,再将原来的纯代码信息上传到新的环境。然后再重新检出,这样每次修改后就可以看到哪些文件被修改过,这对于增量发布的规范特别有用。   检出</div> </li> </ul> </div> </div> </div> <div> <div class="container"> <div class="indexes"> <strong>按字母分类:</strong> <a href="/tags/A/1.htm" target="_blank">A</a><a href="/tags/B/1.htm" target="_blank">B</a><a href="/tags/C/1.htm" target="_blank">C</a><a href="/tags/D/1.htm" target="_blank">D</a><a href="/tags/E/1.htm" target="_blank">E</a><a href="/tags/F/1.htm" target="_blank">F</a><a href="/tags/G/1.htm" target="_blank">G</a><a href="/tags/H/1.htm" target="_blank">H</a><a href="/tags/I/1.htm" target="_blank">I</a><a href="/tags/J/1.htm" target="_blank">J</a><a href="/tags/K/1.htm" target="_blank">K</a><a href="/tags/L/1.htm" target="_blank">L</a><a href="/tags/M/1.htm" target="_blank">M</a><a href="/tags/N/1.htm" target="_blank">N</a><a href="/tags/O/1.htm" target="_blank">O</a><a href="/tags/P/1.htm" target="_blank">P</a><a href="/tags/Q/1.htm" target="_blank">Q</a><a href="/tags/R/1.htm" target="_blank">R</a><a 
href="/tags/S/1.htm" target="_blank">S</a><a href="/tags/T/1.htm" target="_blank">T</a><a href="/tags/U/1.htm" target="_blank">U</a><a href="/tags/V/1.htm" target="_blank">V</a><a href="/tags/W/1.htm" target="_blank">W</a><a href="/tags/X/1.htm" target="_blank">X</a><a href="/tags/Y/1.htm" target="_blank">Y</a><a href="/tags/Z/1.htm" target="_blank">Z</a><a href="/tags/0/1.htm" target="_blank">其他</a> </div> </div> </div> <footer id="footer" class="mb30 mt30"> <div class="container"> <div class="footBglm"> <a target="_blank" href="/">首页</a> - <a target="_blank" href="/custom/about.htm">关于我们</a> - <a target="_blank" href="/search/Java/1.htm">站内搜索</a> - <a target="_blank" href="/sitemap.txt">Sitemap</a> - <a target="_blank" href="/custom/delete.htm">侵权投诉</a> </div> <div class="copyright">版权所有 IT知识库 CopyRight © 2000-2050 E-COM-NET.COM , All Rights Reserved. <!-- <a href="https://beian.miit.gov.cn/" rel="nofollow" target="_blank">京ICP备09083238号</a><br>--> </div> </div> </footer> <!-- 代码高亮 --> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shCore.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shLegacy.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shAutoloader.js"></script> <link type="text/css" rel="stylesheet" href="/static/syntaxhighlighter/styles/shCoreDefault.css"/> <script type="text/javascript" src="/static/syntaxhighlighter/src/my_start_1.js"></script> </body> </html>